On the detection and refinement of transcription factor binding sites using ChIP-Seq data

Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA.
Nucleic Acids Research (Impact Factor: 9.11). 04/2010; 38(7):2154-67. DOI: 10.1093/nar/gkp1180
Source: PubMed


Coupling chromatin immunoprecipitation (ChIP) with recently developed massively parallel sequencing technologies has enabled genome-wide detection of protein-DNA interactions with unprecedented sensitivity and specificity. This new technology, ChIP-Seq, presents opportunities for in-depth analysis of transcription regulation. In this study, we explore the value of using ChIP-Seq data to better detect and refine transcription factor binding sites (TFBS). We introduce a novel computational algorithm named Hybrid Motif Sampler (HMS), specifically designed for TFBS motif discovery in ChIP-Seq data. We propose a Bayesian model that incorporates sequencing depth information to aid motif identification. Our model also allows intra-motif dependency to describe more accurately the underlying motif pattern. Our algorithm combines stochastic sampling and deterministic 'greedy' search steps into a novel hybrid iterative scheme. This combination accelerates the computation process. Simulation studies demonstrate favorable performance of HMS compared to other existing methods. When applying HMS to real ChIP-Seq datasets, we find that (i) the accuracy of existing TFBS motif patterns can be significantly improved; and (ii) there is significant intra-motif dependency inside all the TFBS motifs we tested; modeling these dependencies further improves the accuracy of these TFBS motif patterns. These findings may offer new biological insights into the mechanisms of transcription factor regulation.
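The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: a one-site-per-sequence motif search that runs stochastic (Gibbs-style) sweeps first and then deterministic greedy sweeps, with a position prior derived from sequencing depth. It is not the authors' HMS code. The function names, the choice of a prior proportional to summed read coverage, the split into a stochastic phase followed by a greedy phase, and the parameters `n_stochastic` and `pseudo` are all assumptions made for illustration. Intra-motif dependency is not modeled here; columns are treated as independent.

```python
import random

BASES = "ACGT"  # sequences are assumed to contain only these four letters

def build_pwm(seqs, starts, width, skip=None, pseudo=0.5):
    """Column-wise base frequencies from the current sites, leaving out sequence `skip`."""
    counts = [[pseudo] * 4 for _ in range(width)]
    for i, (s, a) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][BASES.index(s[a + j])] += 1
    return [[c / sum(col) for c in col] for col in counts]

def position_weights(seq, depth, pwm, width, background=0.25):
    """Unnormalised weight of each start: motif/background ratio times a depth prior."""
    weights = []
    for a in range(len(seq) - width + 1):
        ratio = 1.0
        for j in range(width):
            ratio *= pwm[j][BASES.index(seq[a + j])] / background
        # Assumption: prior proportional to read coverage over the window.
        prior = sum(depth[a:a + width]) + 1e-9
        weights.append(ratio * prior)
    return weights

def hybrid_motif_search(seqs, depths, width, n_iter=50, n_stochastic=10, seed=0):
    """Stochastic sweeps for the first `n_stochastic` iterations, greedy sweeps after."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for it in range(n_iter):
        for i, (s, d) in enumerate(zip(seqs, depths)):
            pwm = build_pwm(seqs, starts, width, skip=i)
            w = position_weights(s, d, pwm, width)
            if it < n_stochastic:
                # Stochastic step: sample a start in proportion to its weight.
                starts[i] = rng.choices(range(len(w)), weights=w)[0]
            else:
                # Greedy step: take the highest-weight start deterministically.
                starts[i] = max(range(len(w)), key=w.__getitem__)
    return starts, build_pwm(seqs, starts, width)
```

Calling `hybrid_motif_search(seqs, depths, width=8)` with one per-base coverage list per sequence returns the chosen start positions and the resulting position weight matrix. The greedy sweeps stop the positions from fluctuating once the sampler has settled, which is one plausible reading of how the deterministic steps could accelerate the computation.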


Available from: Zhaohui Qin,

  • Source
    • "Kulakovskiy compares the efficiency of Weeder Pavesi et al. [9], Gibbs Sampler Lawrence et al. [7] and MEME Suite Bailey et al. [10]. Kulakovskiy also discusses the efficiency of cERMIT [11] another algorithm which takes advantage of the properties of ChIPSeq and HMS [12] which reduces the stochastic sampling set size and selects the alignment variable chauvinistically. Kulakovskiy provides ChIPMunk which is suited for work on significantly larger scales than many of the previous Gibbs Sampling algorithms."
    ABSTRACT: DNA@Home is a volunteer computing project that aims to use Gibbs sampling for the identification and location of DNA control signals on full genome-scale datasets. A fault-tolerant and asynchronous implementation of Gibbs sampling using the Berkeley Open Infrastructure for Network Computing (BOINC) was used to identify the location of binding sites of the SNAI1 (Snail) and SNAI2 (Slug) transcription factors across the human genome. A set of genes that are regulated by Slug but not Snail, and a set of genes that are regulated by Snail but not Slug, were used to provide two datasets with known motifs. These datasets contained up to 994 DNA sequences, which to our knowledge is the largest-scale use of Gibbs sampling for discovery of binding sites. These genomic regions were analyzed using datasets containing various numbers of intergenomic regions. 1,000 parallel sampling walks were used to search for the presence of 1, 2 or 3 possible motifs. These runs were performed over a period of two months using over 1,500 volunteered computing hosts, and generated over 2.2 terabytes of sampling data. High-performance computing resources were used for post-processing of the Gibbs sampler output. This paper presents how intra- and inter-walk analyses can aid in determining overall walk convergence. The results were validated against current biological knowledge of the Snail and Slug promoter regions, and present potential avenues for further biological study.
    11th IEEE International Conference on eScience, Munich, Germany; 08/2015
  • Source
    • "These TF-based ChIP-seq experiments can define potential TFBRs that are enriched with the binding of this TF on the genome scale, the so-called ChIP-seq peak regions. With several hundreds to thousands of potential TFBRs, computational methods can then be applied to discover TFBSs in these regions [6] [7] [8] [9] [10] [11] [12]. Although ChIP-seq experiments with a specific antibody against a TF can be a powerful means to define potential TFBRs and subsequently discover TFBSs, the challenge is that TF-specific antibodies remain unknown for the majority of TFs [13]."
    ABSTRACT: Histone modification (HM) patterns are widely applied to identify transcription factor binding regions (TFBRs). However, how frequently TFBRs overlap with genomic regions enriched with certain types of HMs, and which HM marker is more effective at pinpointing TFBRs, have not been systematically investigated. To address these problems, we studied 149 transcription factor (TF) ChIP-seq datasets and 33 HM ChIP-seq datasets in three cell lines. We found that on average about 90% of TFBRs overlap with the H3K4me2-enriched regions. Moreover, the H3K4me2-enriched regions with stronger signals of H3K4me2 enrichment are more likely to overlap with TFBRs than those with weaker signals. In addition, we showed that H3K4me2-enriched regions together with H3K27ac-enriched regions can greatly reduce false positive predictions of TFBRs. Our study sheds light on comprehensive discovery of TFBRs using H3K4me2-enriched regions, especially when no good antibody to a TF exists.
    Genomics 02/2014; 103(2-3). DOI:10.1016/j.ygeno.2014.02.002 · 2.28 Impact Factor
  • Source
    • "An unusually large number of EVI1 binding sites were identified within 1.5 kb of annotated genes, indicating binding within promoter regions and raising the possibility of interactions with other transcription factors (Figure 9). To determine if other transcription factors might bind within the ±1.5 kb regions centered about the annotated EVI1 DNA binding sites, we performed an analysis using the MATCH program and TRANSFAC database [57]. In DA-1 leukemic cells, 79 transcription factors were found to share binding within the promoter regions of EVI1 target genes (p<0.05)."
    ABSTRACT: The ecotropic virus integration site 1 (EVI1) transcription factor is associated with human myeloid malignancy of poor prognosis and is overexpressed in 8-10% of adult AML and, strikingly, up to 27% of pediatric MLL-rearranged leukemias. For the first time, we report comprehensive genome-wide EVI1 binding and whole-transcriptome gene deregulation in leukemic cells using a combination of ChIP-Seq and RNA-Seq expression profiling. We found disruption of terminal myeloid differentiation and cell cycle regulation to be prominent in EVI1-induced leukemogenesis. Specifically, we found that EVI1 directly binds to and downregulates the master myeloid differentiation gene Cebpe and several of its downstream gene targets critical for terminal myeloid differentiation. We also found that EVI1 binds to and downregulates Serpinb2 as well as numerous genes involved in the Jak-Stat signaling pathway. Finally, we identified decreased expression of several ATP-dependent P2X purinoreceptor genes involved in apoptosis mechanisms. These findings provide a foundation for future study of potential therapeutic gene targets for EVI1-induced leukemia.
    PLoS ONE 06/2013; 8(6):e67134. DOI:10.1371/journal.pone.0067134 · 3.23 Impact Factor