A strand specific high resolution normalization method for chip-sequencing data employing multiple experimental control measurements

The Linnaeus Centre for Bioinformatics, Department of Cell and Molecular Biology, Science for Life Laboratory, Biomedical Center, Uppsala University, Box 598, SE-75124 Uppsala, Sweden.
Algorithms for Molecular Biology (Impact Factor: 1.46). 01/2012; 7(1):2. DOI: 10.1186/1748-7188-7-2
Source: PubMed


High-throughput sequencing is becoming the standard tool for investigating protein-DNA interactions and epigenetic modifications. However, the data generated will always contain noise arising from, for example, repetitive regions or non-specific antibody interactions. The noise appears as a background distribution of reads that must be taken into account in downstream analysis, for example when detecting enriched regions (peak-calling). Several reported peak-callers can take experimental measurements of the background tag distribution into account when analysing a data set. Unfortunately, the background is then used only to adjust peak calling, not in a pre-processing step that aims to discern the signal from the background noise. A normalization procedure that extracts the signal of interest would be of universal use when investigating genomic patterns.
We formulated such a normalization method based on linear regression and made a proof-of-concept implementation in R and C++. It was tested on simulated data as well as on publicly available ChIP-seq data on binding sites for two transcription factors, MAX and FOXA1, with two control samples, Input and IgG. We applied three different peak-callers to (i) raw (un-normalized) data using statistical background models, (ii) raw data with control samples as background, and (iii) normalized data without additional control samples as background. The fraction of called regions containing the expected transcription factor binding motif was largest for the normalized data. Evaluation with qPCR data for FOXA1 suggested higher sensitivity and specificity for normalized data than for raw data with experimental background.
The proposed method can handle several control samples, allowing multiple sources of bias to be corrected simultaneously. Our evaluation on both synthetic and experimental data suggests that the method is successful in removing background noise.
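
As a rough illustration of the idea of regressing a ChIP signal on several control tracks at once, the sketch below fits ordinary least squares on binned read counts and keeps the positive residual as the estimated signal. It is not the authors' implementation: the abstract does not give the model details, so the binning, the intercept term, the clipping at zero, and all function names (`solve_linear_system`, `fit_background`, `normalize`) are assumptions made for this example, and strand-specific handling is omitted.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not mutated.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (are the controls collinear?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_background(chip, controls):
    """Least-squares fit chip ~ b0 + sum_k b_k * controls[k].

    chip: per-bin read counts of the ChIP sample (assumed binning).
    controls: list of per-bin count lists, e.g. [input_counts, igg_counts].
    Returns the coefficients [b0, b1, ..., bK].
    """
    rows = [[1.0] + [float(c[i]) for c in controls] for i in range(len(chip))]
    p = len(controls) + 1
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, chip)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def normalize(chip, controls):
    """Subtract the fitted background; clipping at zero is an assumption."""
    coef = fit_background(chip, controls)
    signal = []
    for i, y in enumerate(chip):
        background = coef[0] + sum(b * c[i] for b, c in zip(coef[1:], controls))
        signal.append(max(0.0, y - background))
    return signal


if __name__ == "__main__":
    # Toy data: two control tracks and a ChIP track with one enriched bin.
    input_counts = [3, 5, 2, 8, 6, 4, 7, 5, 3, 6]
    igg_counts = [1, 2, 1, 3, 2, 2, 3, 1, 1, 2]
    chip_counts = [2 + 1.5 * a + 0.5 * b for a, b in zip(input_counts, igg_counts)]
    chip_counts[4] += 50  # simulated binding site
    print(normalize(chip_counts, [input_counts, igg_counts]))
```

Because every control enters the same regression, each one receives its own coefficient, which is one simple way to correct for several bias sources simultaneously. A plain least-squares fit is pulled towards true peaks, so a real analysis would likely need a more robust fitting strategy than this sketch uses.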

Available from: Robin Andersson
  • Source
    ABSTRACT: Histone modifications are a key epigenetic mechanism to activate or repress the transcription of genes. Data sets of matched transcription data and histone modification data obtained by ChIP-seq exist, but methods for integrative analysis of both data types are still rare. Here, we present a novel bioinformatics approach to detect genes that show different transcript abundances between two conditions putatively caused by alterations in histone modification. We introduce a correlation measure for integrative analysis of ChIP-seq and gene transcription data measured by RNA-seq or microarrays and demonstrate that a proper normalisation of ChIP-seq data is crucial. We suggest applying Bayesian mixture models of different types of distributions to further study the distribution of the correlation measure. The implicit classification of the mixture models is used to detect genes with differences between two conditions in both gene transcription and histone modification. The method is applied to different data sets and its superiority to a naive separate analysis of both data types is demonstrated. R/Bioconductor package epigenomix CONTACT:
    Bioinformatics 01/2014; 30(8). DOI:10.1093/bioinformatics/btu003 · 4.98 Impact Factor
  • Source
    ABSTRACT: Fluctuations in nutrient availability profoundly impact gene expression. Previous work revealed postrecruitment regulation of RNA polymerase II (Pol II) during starvation and recovery in Caenorhabditis elegans, suggesting that promoter-proximal pausing promotes rapid response to feeding. To test this hypothesis, we measured Pol II elongation genome wide by two complementary approaches and analyzed elongation in conjunction with Pol II binding and expression. We confirmed bona fide pausing during starvation and also discovered Pol II docking. Pausing occurs at active stress-response genes that become downregulated in response to feeding. In contrast, "docked" Pol II accumulates without initiating upstream of inactive growth genes that become rapidly upregulated upon feeding. Beyond differences in function and expression, these two sets of genes have different core promoter motifs, suggesting alternative transcriptional machinery. Our work suggests that growth and stress genes are both regulated postrecruitment during starvation but at initiation and elongation, respectively, coordinating gene expression with nutrient availability.
    Cell Reports 01/2014; 6(3). DOI:10.1016/j.celrep.2014.01.008 · 8.36 Impact Factor
  • Source
    ABSTRACT: Chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) experiments are widely used to determine, within entire genomes, the occupancy sites of any protein of interest including, for example, transcription factors, RNA polymerases, or histones with or without various modifications. In addition to allowing the determination of occupancy sites within one cell type and under one condition, the method allows, in principle, the establishment and comparison of occupancy maps in various cell types, tissues, and conditions. Such comparisons require, however, that samples be normalized. Widely used normalization methods that include a quantile normalization step perform well when factor occupancy varies at a subset of sites, but may miss uniform genome-wide increases or decreases in site occupancy. We describe a spike adjustment procedure (SAP) that, unlike commonly used normalization methods intervening at the analysis stage, entails an experimental step prior to immunoprecipitation. A constant, low amount from a single batch of chromatin of a foreign genome is added to the experimental chromatin. This 'spike' chromatin then serves as an internal control to which the experimental signals can be adjusted. We show that the method improves similarity between replicates and reveals biological differences, including global and largely uniform changes.
    Genome Research 04/2014; 24(7). DOI:10.1101/gr.168260.113 · 14.63 Impact Factor