Peak identification for ChIP-seq data with no controls

State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, the Chinese Academy of Sciences, Kunming 650223, China.
Zoological Research 12/2012; 33(E5-6):E121-E128. DOI: 10.3724/SP.J.1141.2012.E120-06E121
Source: PubMed


Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is increasingly used for genome-wide profiling of transcriptional regulation, as the technique enables dissection of gene regulatory networks. With input as a control, a variety of statistical methods have been proposed for identifying enriched regions in the genome, i.e., transcription factor binding sites and chromatin modifications. However, whether peak calling remains reliable when no controls are available has not been systematically evaluated. To address this question, we used a Bayesian framework to show the effectiveness of peak calling without controls (PCWC). Using several different types of ChIP-seq data, we demonstrated that PCWC achieves relatively high accuracy, with a false discovery rate (FDR) below 5%. Compared with previously published methods, e.g., the model-based analysis of ChIP-seq (MACS), PCWC is reliable and has a lower FDR. Furthermore, to interpret the biological significance of the called peaks, we combined them with microarray gene expression data, gene ontology annotation and subsequent motif discovery; the results indicate that PCWC is highly efficient. Additionally, only a small number of peaks were identified in in silico data, suggesting that the FDR of PCWC is very low.
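To make the idea of control-free peak calling at a fixed FDR concrete, here is a minimal sketch. It is not the Bayesian PCWC method described above, whose details the abstract does not give. Instead it substitutes a simple assumption: read counts in fixed-size windows are tested against a genome-wide Poisson background estimated from the ChIP sample itself, and the resulting p-values are adjusted with the Benjamini-Hochberg procedure. All function names, the window size and the 5% threshold are illustrative choices, not taken from the paper.

```python
import math
import random
from collections import Counter


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the lower tail.

    Adequate for a sketch; very large lam would underflow exp(-lam).
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def call_peaks(read_starts, genome_length, window=200, fdr=0.05):
    """Call enriched windows without a control sample (assumed model).

    The background rate is the mean read count per window in the ChIP
    sample itself. Returns (start, end, count, q-value) tuples.
    """
    n_windows = math.ceil(genome_length / window)
    counts = Counter(pos // window for pos in read_starts)
    lam = len(read_starts) / n_windows
    pvals = [poisson_sf(counts.get(w, 0), lam) for w in range(n_windows)]
    qvals = benjamini_hochberg(pvals)
    return [
        (w * window, min((w + 1) * window, genome_length),
         counts.get(w, 0), qvals[w])
        for w in range(n_windows)
        if qvals[w] <= fdr
    ]


def simulate_uniform_reads(n_reads, genome_length, seed=0):
    """Hypothetical in silico data: read starts drawn uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(genome_length) for _ in range(n_reads)]
```

Running `call_peaks` on the output of `simulate_uniform_reads` gives a rough analogue of the in silico check mentioned above: under this toy model, uniformly placed reads should yield few or no called windows, whereas a window carrying a strong pile-up of reads should pass the threshold.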


Available from: Bing Su
  • Source
    ABSTRACT: ChIP-seq has become a major tool for the genome-wide identification of transcription factor binding or histone modification sites. Most peak-calling algorithms require input control datasets to model the occurrence of background reads and to account for local sequencing and GC bias. However, the GC-content of reads in Input-seq datasets deviates significantly from that in ChIP-seq datasets. Moreover, we observed that a commonly used peak-calling program performed equally well with a simulated uniform background set as with an Input-seq dataset. This contradicts the assumption that input control datasets are necessary to faithfully reflect the background read distribution. Because the GC-content of the abundant single reads in ChIP-seq datasets is similar to that of randomly sampled regions, we designed a peak-calling algorithm with a background model based on overlapping single reads. The application, OccuPeak, uses the abundant low-frequency tags present in each ChIP-seq dataset to model the background, thereby avoiding the need for additional datasets. Analysis of the performance of OccuPeak showed robust model parameters. Its measure of peak significance, the excess ratio, depends only on the tag density of a peak and the global noise levels. Compared to the commonly used peak-calling applications MACS and CisGenome, OccuPeak had the highest sensitivity in an enhancer identification benchmark test, and performed similarly in overlap tests of transcription factor occupation with DNase I hypersensitive sites and H3K27ac sites. Moreover, peaks called by OccuPeak were significantly enriched with cardiac disease-associated SNPs. OccuPeak runs as a standalone application and does not require extensive tweaking of parameters, making its use straightforward and user friendly. Availability:
    Full-text · Article · Jun 2014 · PLoS ONE