False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions

Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
Bioinformatics (Impact Factor: 4.98). 06/2011; 27(15):2144-6. DOI: 10.1093/bioinformatics/btr354
Source: PubMed


Sequencing-based assays such as ChIP-seq, DNase-seq and MNase-seq have become important tools for genome annotation. In these assays, short sequence reads enriched for loci of interest are mapped to a reference genome to determine their origin. Here, we consider whether false positive peak calls can be caused by a particular type of error in the reference genome: multicopy sequences that have been incorrectly assembled and collapsed into a single copy.
Using sequencing data from the 1000 Genomes Project, we systematically scanned the human genome for regions of high sequencing depth. These regions are highly enriched for erroneously inferred transcription factor binding sites, positions of nucleosomes and regions of open chromatin. We suggest a simple masking procedure to remove these regions and reduce false positive calls.
Files for masking out these regions are available at
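
As a rough illustration of the masking idea described in the abstract, the following minimal sketch flags windows of unusually high sequencing depth and then drops peak calls that overlap them. It is not the authors' procedure: the function names, the fixed window size, the mean-depth threshold and the toy data are all assumptions made for this example, and intervals are assumed to be 0-based and half-open, as in BED files.

```python
from typing import List, Tuple

# (chrom, start, end), 0-based and half-open; an assumed convention for this sketch.
Interval = Tuple[str, int, int]


def high_depth_regions(chrom: str, depth: List[float],
                       window: int, threshold: float) -> List[Interval]:
    """Return merged runs of fixed-size windows whose mean depth exceeds `threshold`.

    `depth` holds per-base read depth for one chromosome. The window size and
    threshold are hypothetical parameters, not values taken from the paper.
    """
    regions: List[Interval] = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        end = start + len(chunk)
        if sum(chunk) / len(chunk) > threshold:
            if regions and regions[-1][2] == start:
                # Adjacent to the previous flagged window: extend it.
                regions[-1] = (chrom, regions[-1][1], end)
            else:
                regions.append((chrom, start, end))
    return regions


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_peaks(peaks: List[Interval], mask: List[Interval]) -> List[Interval]:
    """Keep only the peaks that do not overlap any masked region."""
    return [p for p in peaks if not any(overlaps(p, m) for m in mask)]


# Toy data: a stretch of 10x depth, a collapsed-repeat-like stretch at 100x, then 10x again.
toy_depth = [10.0] * 4 + [100.0] * 4 + [10.0] * 4
mask = high_depth_regions("chr1", toy_depth, window=2, threshold=50.0)

toy_peaks = [("chr1", 0, 3), ("chr1", 5, 7), ("chr1", 8, 10), ("chr2", 5, 7)]
kept = filter_peaks(toy_peaks, mask)
```

The pairwise overlap check is quadratic and only suitable for small inputs; for genome-scale peak sets, a sorted sweep or an interval-tree library would be the more usual choice.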


Available from: PubMed Central · License: CC BY-NC
    • "In the literature, this effect has been appreciated in peak calling (Pickrell et al., 2011; Rashid et al., 2011; Ashoor et al., 2013), and in differential epigenome (ChIP-seq and RNA-seq) analyses (Robinson et al., 2012). As of today, none of the multi-read mapping methods has considered the potential effect of CNV on multi-read allocation and the power it might provide for discriminating the mapping locations of multi-reads. "
    ABSTRACT: In chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) and other short read sequencing experiments, a considerable fraction of the short reads align to multiple locations on the reference genome (multi-reads). Inferring the origin of multi-reads is critical for accurately mapping reads to repetitive regions. Current state-of-the-art multi-read allocation algorithms rely on the read counts in the local neighbourhood of the alignment locations and ignore the variation in the copy-numbers of these regions. Copy-number variation (CNV) can directly affect the read densities and, therefore, bias allocation of multi-reads.
    Full-text · Article · Jun 2014 · Bioinformatics
    • "Sequences that fully or partially overlapped problematic regions were discarded. We defined problematic regions as those with known mapability issues, (for example, repetitive sequences (from the UCSC genome browser microsatellite track (downloaded July 8, 2011))) and genomic coordinates with high false positive rates of enrichments, as identified by [82]. All remaining mapped tags were extended to 200 bp in the 3’ direction to account of the expected length of nucleosome-bound DNA. "
    ABSTRACT: The epithelial-mesenchymal transition (EMT) is a de-differentiation process required for wound healing and development. In tumors of epithelial origin, aberrant induction of EMT contributes to cancer progression and metastasis. Studies have begun to implicate epigenetic reprogramming in EMT; however, the relationship between reprogramming and the coordination of cellular processes is largely unexplored. We have previously developed a system to study EMT in a canonical non-small cell lung cancer (NSCLC) model. In this system we have shown that the induction of EMT results in constitutive NF-kappaB activity. We hypothesized a role for chromatin remodeling in the sustained deregulation of cellular signaling pathways. We mapped sixteen histone modifications and two variants for epithelial and mesenchymal states. Combinatorial patterns of epigenetic changes were quantified at gene and enhancer loci. We found a distinct chromatin signature among genes in well-established EMT pathways. Strikingly, these genes are only a small minority of those that are differentially expressed. At putative enhancers of genes with the 'EMT-signature' we observed highly coordinated epigenetic activation or repression. Furthermore, enhancers that are activated are bound by a set of transcription factors that is distinct from those that bind repressed enhancers. Upregulated genes with the 'EMT-signature' are upstream regulators of NF-kappaB, but are also bound by NF-kappaB at their promoters and enhancers. These results suggest a chromatin-mediated positive feedback as a likely mechanism for sustained NF-kappaB activation. There is highly specific epigenetic regulation at genes and enhancers across several pathways critical to EMT. The sites of these changes in chromatin state implicate several inducible transcription factors with critical roles in EMT (NF-kappaB, AP-1 and MYC) as targets of this reprogramming. Furthermore, we find evidence suggesting that these transcription factors are in chromatin-mediated transcriptional feedback loops that regulate critical EMT genes. In sum, we establish an important link between chromatin remodeling and shifts in cellular reprogramming.
    Full-text · Article · Sep 2013 · Epigenetics & Chromatin
    • "Although our data showed that the total number of TFBRs changes little between these closely related species, Caroli/EiJ was found to have overall fewer bound locations, most likely due to differences in the genome qualities (Figure S1C). For each data set, we estimated our false positive rate to be less than 1% by comparing our ChIP experiments to either a mock ChIP lacking the specific antibody or input DNA from the livers; this false positive rate is similar to prior studies (ENCODE, 2012; Pickrell et al., 2011). TFBRs were found to almost always center on a sequence match for the known TF binding motif (Figure S1D); similarly, computational analyses of the sets of TFBRs with either highest or lowest ChIP intensities readily produced the known position weight matrix (PWM) when subjected to de novo motif discovery (Figure S1D). "
    ABSTRACT: To mechanistically characterize the microevolutionary processes active in altering transcription factor (TF) binding among closely related mammals, we compared the genome-wide binding of three tissue-specific TFs that control liver gene expression in six rodents. Despite an overall fast turnover of TF binding locations between species, we identified thousands of TF regions of highly constrained TF binding intensity. Although individual mutations in bound sequence motifs can influence TF binding, most binding differences occur in the absence of nearby sequence variations. Instead, combinatorial binding was found to be significant for genetic and evolutionary stability; cobound TFs tend to disappear in concert and were sensitive to genetic knockout of partner TFs. The large, qualitative differences in genomic regions bound between closely related mammals, when contrasted with the smaller, quantitative TF binding differences among Drosophila species, illustrate how genome structure and population genetics together shape regulatory evolution.
    Full-text · Article · Aug 2013 · Cell