On the detection and refinement of transcription factor binding sites using ChIP-Seq data

Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA.
Nucleic Acids Research (Impact Factor: 9.11). 04/2010; 38(7):2154-67. DOI: 10.1093/nar/gkp1180
Source: PubMed

ABSTRACT Coupling chromatin immunoprecipitation (ChIP) with recently developed massively parallel sequencing technologies has enabled genome-wide detection of protein-DNA interactions with unprecedented sensitivity and specificity. This new technology, ChIP-Seq, presents opportunities for in-depth analysis of transcription regulation. In this study, we explore the value of using ChIP-Seq data to better detect and refine transcription factor binding sites (TFBS). We introduce a novel computational algorithm named Hybrid Motif Sampler (HMS), specifically designed for TFBS motif discovery in ChIP-Seq data. We propose a Bayesian model that incorporates sequencing depth information to aid motif identification. Our model also allows intra-motif dependency to describe more accurately the underlying motif pattern. Our algorithm combines stochastic sampling and deterministic 'greedy' search steps into a novel hybrid iterative scheme. This combination accelerates the computation process. Simulation studies demonstrate favorable performance of HMS compared to other existing methods. When applying HMS to real ChIP-Seq datasets, we find that (i) the accuracy of existing TFBS motif patterns can be significantly improved; and (ii) there is significant intra-motif dependency inside all the TFBS motifs we tested; modeling these dependencies further improves the accuracy of these TFBS motif patterns. These findings may offer new biological insights into the mechanisms of transcription factor regulation.
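The hybrid scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' HMS implementation: it assumes exactly one site per sequence, an independent-column position weight matrix (no intra-motif dependency), a uniform background, and a position prior proportional to mean read depth over the candidate window. The names `hybrid_sampler`, `site_weights`, `build_pwm` and the `greedy_every` parameter are hypothetical.

```python
import random
from math import exp, log

ALPHABET = "ACGT"

def build_pwm(seqs, starts, width, skip=None, pseudo=0.5):
    """Column-wise base frequencies from current sites, leaving out sequence `skip`."""
    counts = [[pseudo] * 4 for _ in range(width)]
    for i, (seq, s) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][ALPHABET.index(seq[s + j])] += 1
    return [[c / sum(col) for c in col] for col in counts]

def site_weights(seq, depth, pwm, width, background=0.25):
    """Unnormalised weight of each start: depth prior times motif/background ratio."""
    weights = []
    for s in range(len(seq) - width + 1):
        log_ratio = sum(
            log(pwm[j][ALPHABET.index(seq[s + j])] / background)
            for j in range(width)
        )
        # Assumed prior: mean read depth over the window, plus 1 to avoid zeros.
        prior = 1.0 + sum(depth[s:s + width]) / width
        weights.append(prior * exp(log_ratio))
    return weights

def hybrid_sampler(seqs, depths, width, sweeps=50, greedy_every=5, seed=0):
    """Alternate stochastic sampling sweeps with deterministic greedy sweeps."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for sweep in range(1, sweeps + 1):
        greedy = sweep % greedy_every == 0
        for i, (seq, depth) in enumerate(zip(seqs, depths)):
            pwm = build_pwm(seqs, starts, width, skip=i)
            w = site_weights(seq, depth, pwm, width)
            if greedy:
                # Deterministic step: take the highest-weight start.
                starts[i] = max(range(len(w)), key=w.__getitem__)
            else:
                # Stochastic step: draw a start in proportion to its weight.
                starts[i] = rng.choices(range(len(w)), weights=w)[0]
    return starts, build_pwm(seqs, starts, width)
```

Here `depths[i][p]` is the read coverage at position `p` of sequence `i`. The greedy sweeps are one simple way to speed convergence once the sampler is near a mode; the published algorithm may combine the two steps differently.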

    • "The final collection of datasets contained 191 GEO series containing a total of 917 ChIP-seq and 292 control libraries. Except for a limited number of cases in which a GEO series was associated with multiple publications, two or three GEO series were associated with the same publication, or a GEO series has not yet been used in a publication, and there is a one-to-one relationship between GEO series and published articles in the literature (Robertson et al. 2007; Chen et al. 2008; Marson et al. 2008; Bilodeau et al. 2009; Cheng et al. 2009; De Santa et al. 2009; Lister et al. 2009; Nishiyama et al. 2009; Visel et al. 2009; Welboren et al. 2009; Wilson et al. 2009; Yu et al. 2009; Yuan et al. 2009; Barish et al. 2010; Blow et al. 2010; Blow et al. 2010; Cao et al. 2010; Chi et al. 2010; Chia et al. 2010; Chicas et al. 2010; Corbo et al. 2010; Cuddapah et al. 2009; Durant et al. 2010; Fortschegger et al. 2010; Gotea et al. 2010; Gu et al. 2010; Han et al. 2010; Heinz et al. 2010; Heng et al. 2010; Ho et al. 2009; Hollenhorst et al. 2009; Hu et al. 2010; Johannes et al. 2010; Jung et al. 2010; Kagey et al. 2010; Kassouf et al. 2010; Kim et al. 2010; Kong et al. 2010; Kouwenhoven et al. 2010; Krebs et al. 2010; Kunarso et al. 2010; Kwon et al. 2009; Law et al. 2010; Lee et al. 2010; Lefterova et al. 2010; Li et al. 2010; Lin et al. 2010; Liu et al. 2010; Ma et al. 2010; MacIsaac et al. 2010; Mahony et al. 2010; Martinez et al. 2010; Palii et al. 2010; Qi et al. 2010; Rada-Iglesias et al. 2010; Rahl et al. 2010; Ramagopalan et al. 2010; Ramos et al. 2010; Schlesinger et al. 2010; Schnetz et al. 2010; Sehat et al. 2010; Steger et al. 2010; Tallack et al. 2010; Tang et al. 2010; Vermeulen et al. 2010; Verzi et al. 2010; Vivar et al. 2010; Wei et al. 2010; Woodfield et al. 2010; Yang et al. 2010; Yao et al. 2010; Yu et al. 2010; An et al. 2011; Ang et al. 2011; Bergsland et al. 2011; Bernt et al. 2011; Botcheva et al. 2011; Brown et al. 2011; Bugge et al. 2011; Ceol et al. 2011; Ceschin et al. 2011; Costessi et al. 2011; Ebert et al. 2011; Fang et al. 2011; Handoko et al. 2011; He et al. 2011; Heikkinen et al. 2011; Holmstrom et al. 2011; Horiuchi et al. 2011; Hu et al. 2011; Joseph et al. 2010; Kim et al. 2011; Klisch et al. 2011; Koeppel et al. 2011; Kong et al. 2011; Little et al. 2011; Liu et al. 2011; Lo et al. 2011; Marban et al. 2011; Mazzoni et al. 2011; McManus et al. 2011; Mendoza-Parra et al. 2011; Meyer et al. 2012; Miyazaki et al. 2011; Mullen et al. 2011; Mullican et al. 2011; Nakayamada et al. 2011; Nitzsche et al. 2011; Norton et al. 2011; Novershtern et al. 2011; Quenneville et al. 2011; Rao et al. 2011; Rey et al. 2011; Sahu et al. 2011; Schmitz et al. 2011; Seitz et al. 2011; Shen et al. 2011; Shukla et al. 2011; Siersbaek et al. 2011; Smeenk et al. 2011; Smith et al. 2011; Soccio et al. 2011; Stadler et al. 2011; Sun et al. 2011; Tan et al. 2011a; Tan et al. 2011b; Teo et al. 2011; Tijssen et al. 2011; Tiwari et al. 2011a; Tiwari et al. 2011b; Trompouki et al. 2011; van Heeringen et al. 2011; Verzi et al. 2011; Wang et al. 2011a; Wang et al. 2011b; Wei et al. 2011; Whyte et al. 2011; Wu et al. 2011a; Wu et al. 2011b; Xu et al. 2011; Yang et al. 2011; Yildirim et al. 2011; Yoon et al. 2011; Zhang et al. 2011; Zhao et al. 2011a; Zhao et al. 2011b; Avvakumov et al. 2012; Barish et al. 2012; Boergesen et al. 2012; Bugge et al. 2012; Canella et al. 2012; Cardamone et al. 2012; Cheng et al. 2012; Chlon et al. 2012; Cho et al. 2012; Doré et al. 2012; Fan et al. 2012; Feng et al. 2011; Fong et al. 2012; Gao et al. 2012; Gowher et al. 2012; Hunkapiller et al. 2012; Hutchins et al. 2012; Li et al. 2012; Lu et al. 2012; Miller et al. 2011; Ntziachristos et al. 2012; Pehkonen et al. 2012; Ptasinska et al. 2012; Remeseiro et al. 2012; Sadasivam et al. 2012; Sakabe et al. 2012; Schödel et al. 2012; Trowbridge et al. 2012; Vilagos et al. 2012; Wu et al. 2012; Xiao et al. 2012; Yu et al. 2012; unpublished at the time of completion of this manuscript are the following GEO accession numbers: "
    ABSTRACT: ChIP-seq has become the primary method for identifying in vivo protein-DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique in PubMed as of December 2012. Individually and in aggregate these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1st 2012. The majority (55%) of datasets scored as highly successful, but a substantial minority (20%) were of apparently poor quality, and another ~25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e. no-immunoprecipitation and mock-immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high quality comprise a large group that users can draw on for large-scale integrated analysis. In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.
    G3-Genes Genomes Genetics 12/2013; 4(2). DOI:10.1534/g3.113.008680 · 2.51 Impact Factor
    • "The use of motif discovery methods has dramatically increased over the last few years due to the rise in sequencing capacity and the advancement of other high-throughput methods. These methods are routinely used to identify and predict transcription factor binding sites (Hu et al., 2010), protein phosphorylation sites (Schwartz and Church, 2010), microRNAs targets (Linhart et al., 2008) and alternative splicing locations (Suyama et al., 2010). However, these high-throughput methods have also led to new requirements from motif search algorithms. "
    ABSTRACT: Motif discovery is now routinely used in high-throughput studies including large-scale sequencing and proteomics. These datasets present new challenges. The first is speed. Many motif discovery methods do not scale well to large datasets. Another issue is identifying discriminative rather than generative motifs. Such discriminative motifs are important for identifying co-factors and for explaining changes in behavior between different conditions. To address these issues we developed a method for DECOnvolved Discriminative motif discovery (DECOD). DECOD uses a k-mer count table and so its running time is independent of the size of the input set. By deconvolving the k-mers DECOD considers context information without using the sequences directly. DECOD outperforms previous methods both in speed and in accuracy when using simulated and real biological benchmark data. We performed new binding experiments for p53 mutants and used DECOD to identify p53 co-factors, suggesting new mechanisms for p53 activation. The source code and binaries for DECOD are available at CONTACT: Supplementary data are available at Bioinformatics online.
    Bioinformatics 09/2011; 27(17):2361-7. DOI:10.1093/bioinformatics/btr412 · 4.62 Impact Factor
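The idea of working from a k-mer count table rather than from the sequences themselves can be sketched as follows. This is only an illustration under simple assumptions, not the DECOD method: it omits the deconvolution step entirely and scores each k-mer by the difference between its relative frequency in a positive and a negative set. The function names `kmer_counts` and `discriminative_kmers` are hypothetical.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def discriminative_kmers(pos, neg, k, top=5):
    """Rank k-mers by relative frequency in `pos` minus relative frequency in `neg`."""
    pc, nc = kmer_counts(pos, k), kmer_counts(neg, k)
    pt = sum(pc.values()) or 1  # guard against empty input
    nt = sum(nc.values()) or 1
    scores = {kmer: pc[kmer] / pt - nc[kmer] / nt for kmer in set(pc) | set(nc)}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
```

Once the two tables are built, the ranking step touches at most 4^k distinct k-mers, so its cost no longer grows with the number of input sequences.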
    • "The Gibbs sample strategy, first described for sequence analysis by Lawrence et al. (1993) and refined by Liu et al. (1995), has been implemented in many motif discovery tools such as AlignACE (Roth et al., 1998), BioProspector (Liu et al., 2001), Motif Sampler (Thijs et al., 2001), GLAM (Frith et al., 2004), NestedMICA (Down and Hubbard, 2005), A-GLAM (Kim et al., 2008), BayesMD (Tang et al., 2008), GIMSAN (Ng and Keich, 2008), info-gibbs (Defrance and van Helden, 2009) and HMS (Hu et al., 2010). The EM algorithm, first applied to motif discovery by Lawrence and Reilly (1990), also has many implementations including the widely used MEME (Bailey and Elkan, 1995) as well as GreedyEM (Blekas et al., 2003), the discriminative PSSM approach (Segal et al., 2003), fdrMotif (Li et al., 2008) and GADEM (Li, 2009). "
    ABSTRACT: ChIP-seq data are enriched in binding sites for the protein immunoprecipitated. Some sequences may also contain binding sites for a coregulator. Biologists are interested in knowing which coregulatory factor motifs may be present in the sequences bound by the protein ChIP'ed. We present a finite mixture framework with an expectation-maximization algorithm that considers two motifs jointly and simultaneously determines which sequences contain both motifs, either one or neither of them. Tested on 10 simulated ChIP-seq datasets, our method performed better than repeated application of MEME in predicting sequences containing both motifs. When applied to a mouse liver Foxa2 ChIP-seq dataset involving ~ 12 000 400-bp sequences, coMOTIF identified co-occurrence of Foxa2 with Hnf4a, Cebpa, E-box, Ap1/Maf or Sp1 motifs in ~6-33% of these sequences. These motifs are either known as liver-specific transcription factors or have an important role in liver function. Freely available at Supplementary data are available at Bioinformatics online.
    Bioinformatics 07/2011; 27(19):2625-32. DOI:10.1093/bioinformatics/btr397 · 4.62 Impact Factor
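The four-way classification in this abstract (both motifs, either one, or neither) can be illustrated with a small expectation-maximization sketch for the mixture weights alone. It is not the coMOTIF implementation: it assumes the two motif models are fixed, that each sequence has already been summarised by hypothetical motif-versus-background likelihood ratios `lr_a` and `lr_b`, and that the two motifs contribute independently when both are present. The function name `em_class_weights` is also hypothetical.

```python
def em_class_weights(lr_a, lr_b, iters=100):
    """EM for the proportions of sequences with both motifs, one, or neither."""
    classes = ("both", "a_only", "b_only", "neither")
    pi = {c: 0.25 for c in classes}  # start from uniform mixture weights
    for _ in range(iters):
        totals = {c: 0.0 for c in classes}
        for a, b in zip(lr_a, lr_b):
            # Class likelihoods relative to the background-only model
            # (assumption: independent motif contributions).
            lik = {"both": a * b, "a_only": a, "b_only": b, "neither": 1.0}
            joint = {c: pi[c] * lik[c] for c in classes}
            norm = sum(joint.values())
            # E-step: accumulate the posterior class responsibilities.
            for c in classes:
                totals[c] += joint[c] / norm
        # M-step: new weights are the average responsibilities.
        pi = {c: totals[c] / len(lr_a) for c in classes}
    return pi
```

A full method would also re-estimate the motif models in each M-step; holding them fixed keeps the example short while still showing how sequences are softly assigned to the four classes.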