False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions

Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
Bioinformatics (Impact Factor: 4.62). 06/2011; 27(15):2144-6. DOI: 10.1093/bioinformatics/btr354
Source: PubMed

ABSTRACT Sequencing-based assays such as ChIP-seq, DNase-seq and MNase-seq have become important tools for genome annotation. In these assays, short sequence reads enriched for loci of interest are mapped to a reference genome to determine their origin. Here, we consider whether false positive peak calls can be caused by a particular type of error in the reference genome: multicopy sequences which have been incorrectly assembled and collapsed into a single copy.
Using sequencing data from the 1000 Genomes Project, we systematically scanned the human genome for regions of high sequencing depth. These regions are highly enriched for erroneously inferred transcription factor binding sites, positions of nucleosomes and regions of open chromatin. We suggest a simple masking procedure to remove these regions and reduce false positive calls.
Files for masking out these regions are available at
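
The scan-and-mask idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the window size, the fold-over-median cutoff, and the function names `high_depth_windows` and `filter_peaks` are all assumptions made for illustration, and real data would be read from alignment and peak files rather than from in-memory lists.

```python
from statistics import median


def high_depth_windows(depth, window=1000, fold=4.0):
    """Return merged (start, end) half-open intervals of unusually deep windows.

    depth  : per-base sequencing depth for one chromosome (list of numbers)
    window : window size in bases (assumed value, not from the paper)
    fold   : a window is flagged when its mean depth exceeds
             fold * median of all window means (assumed rule)
    """
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    if not means:
        return []

    cutoff = fold * median(m for _, _, m in means)

    merged = []
    for start, end, mean_depth in means:
        if mean_depth <= cutoff:
            continue
        if merged and merged[-1][1] == start:
            # Extend the previous interval when flagged windows are adjacent.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def filter_peaks(peaks, mask):
    """Drop every peak (start, end) that overlaps any masked interval."""
    return [
        (p_start, p_end)
        for p_start, p_end in peaks
        if not any(p_start < m_end and m_start < p_end for m_start, m_end in mask)
    ]


# Toy example: one 1 kb region with ten times the background depth.
depth = [10] * 3000 + [100] * 1000 + [10] * 2000
mask = high_depth_windows(depth)
peaks = [(100, 200), (3500, 3600), (3990, 4010), (4000, 4100)]
kept = filter_peaks(peaks, mask)
```

In this toy example only the window spanning positions 3000 to 4000 is flagged, so the two peaks overlapping it are discarded and the other two are kept. Using the median as the background estimate keeps a few very deep windows from inflating the cutoff.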

  • Source
    • "In the literature, this effect has been appreciated in peak calling (Pickrell et al., 2011; Rashid et al., 2011; Ashoor et al., 2013), and in differential epigenome (ChIP-seq and RNA-seq) analyses (Robinson et al., 2012). As of today, none of the multi-read mapping methods has considered the potential effect of CNV on multi-read allocation and the power it might provide for discriminating the mapping locations of multi-reads. "
    ABSTRACT: In chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) and other short read sequencing experiments, a considerable fraction of the short reads align to multiple locations on the reference genome (multi-reads). Inferring the origin of multi-reads is critical for accurately mapping reads to repetitive regions. Current state-of-the-art multi-read allocation algorithms rely on the read counts in the local neighbourhood of the alignment locations and ignore the variation in the copy-numbers of these regions. Copy-number variation (CNV) can directly affect the read densities and, therefore, bias allocation of multi-reads.
    Bioinformatics 06/2014; 30(20). DOI:10.1093/bioinformatics/btu402 · 4.62 Impact Factor
  • Source
    • "And it is well appreciated that characteristics of the reference genome influence the mapping results, for example, some sequences in the genome are present in multiple copies, leading to ambiguity when determining the origin of sequencing reads [12]. Some sequences which present in a single copy on the available reference genome are present in multiple copies in all or some individuals in reality [13]. So the sequence structure of human centromeres may have an important impact on the generation of EHPs. "
    ABSTRACT: Next-generation sequencing (NGS) and its applications are widely used in studying gene regulation and epigenetic mechanisms due to their decreasing cost and high throughput. Here we used MNase-seq technology to determine the nucleosome positions in human erythroleukemia K562 cells by direct sequencing of nucleosome ends with the SOLiD high-throughput sequencing technique. However, during the read mapping and data pre-analysis steps, only 40% of the sequenced reads could be mapped to the reference genome hg19, and there were some extremely high peaks (EHPs) in the profiles of mapped reads on the reference genome. Mathematical models were developed to analyze the unmapped reads, and nearly 25.3% of the unmapped reads were found to be due to genome variants, base-calling errors and gaps in the reference genome. We also investigated EHPs and proposed methods to deal with them in downstream data analysis.
    Biomedical Engineering and Informatics (BMEI), 2012 5th International Conference on; 01/2012
  • Source
    ABSTRACT: Single nucleotide polymorphisms (SNPs) have been associated with many aspects of human development and disease, and many non-coding SNPs associated with disease risk are presumed to affect gene regulation. We have previously shown that SNPs within transcription factor binding sites can affect transcription factor binding in an allele-specific and heritable manner. However, such analysis has relied on prior whole-genome genotypes provided by large external projects such as HapMap and the 1000 Genomes Project. This requirement limits the study of allele-specific effects of SNPs in primary patient samples from diseases of interest, where complete genotypes are not readily available. In this study, we show that we are able to identify SNPs de novo and accurately from ChIP-seq data generated in the ENCODE Project. Our de novo identified SNPs from ChIP-seq data are highly concordant with published genotypes. Independent experimental verification of more than 100 sites estimates our false discovery rate at less than 5%. Analysis of transcription factor binding at de novo identified SNPs revealed widespread heritable allele-specific binding, confirming previous observations. SNPs identified from ChIP-seq datasets were significantly enriched for disease-associated variants, and we identified dozens of allele-specific binding events in non-coding regions that could distinguish between disease and normal haplotypes. Our approach combines SNP discovery, genotyping and allele-specific analysis, but is selectively focused on functional regulatory elements occupied by transcription factors or epigenetic marks, and will therefore be valuable for identifying the functional regulatory consequences of non-coding SNPs in primary disease samples.
    BMC Genetics 09/2012; 13:46. DOI:10.1186/1471-2156-13-46 · 2.36 Impact Factor