Uniform, optimal signal processing of mapped deep-sequencing data

Computational and Systems Biology, Genome Institute of Singapore, Singapore.
Nature Biotechnology (Impact Factor: 39.08). 06/2013; 31(7). DOI: 10.1038/nbt.2596
Source: PubMed

ABSTRACT Despite their apparent diversity, many problems in the analysis of high-throughput sequencing data are merely special cases of two general problems, signal detection and signal estimation. Here we adapt formally optimal solutions from signal processing theory to analyze signals of DNA sequence reads mapped to a genome. We describe DFilter, a detection algorithm that identifies regulatory features in ChIP-seq, DNase-seq and FAIRE-seq data more accurately than assay-specific algorithms. We also describe EFilter, an estimation algorithm that accurately predicts mRNA levels from as few as 1-2 histone profiles (R ∼0.9). Notably, the presence of regulatory motifs in promoters correlates more with histone modifications than with mRNA levels, suggesting that histone profiles are more predictive of cis-regulatory mechanisms. We show by applying DFilter and EFilter to embryonic forebrain ChIP-seq data that regulatory protein identification and functional annotation are feasible despite tissue heterogeneity. The mathematical formalism underlying our tools facilitates integrative analysis of data from virtually any sequencing-based functional profile.

  • ABSTRACT: Chromatin modifiers and histone modifications are components of a chromatin-signaling network involved in transcription and its regulation. The interactions between chromatin modifiers and histone modifications are often unknown, are based on the analysis of few genes or are studied in vitro. Here, we apply computational methods to recover interactions between chromatin modifiers and histone modifications from genome-wide ChIP-Seq data. These interactions provide a high-confidence backbone of the chromatin-signaling network. Many recovered interactions have literature support; others provide hypotheses about yet unknown interactions. We experimentally verified two of these predicted interactions, leading to a link between H4K20me1 and members of the Polycomb Repressive Complexes 1 and 2. Our results suggest that our computationally derived interactions are likely to lead to novel biological insights required to establish the connectivity of the chromatin-signaling network involved in transcription and its regulation. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
    Nucleic Acids Research 11/2014; 42(22). DOI:10.1093/nar/gku1234 · 8.81 Impact Factor
  • ABSTRACT: Next-generation sequencing (NGS) technologies have been used in diverse ways to investigate various aspects of chromatin biology by identifying genomic loci that are bound by transcription factors, occupied by nucleosomes or accessible to nuclease cleavage, or loci that physically interact with remote genomic loci. However, reaching sound biological conclusions from such NGS enrichment profiles requires many potential biases to be taken into account. In this Review, we discuss common ways in which biases may be introduced into NGS chromatin profiling data, approaches to diagnose these biases and analytical techniques to mitigate their effect.
    Nature Reviews Genetics 09/2014; 15(11). DOI:10.1038/nrg3788 · 39.79 Impact Factor
  • ABSTRACT: We present MUSIC, a signal processing approach for identification of enriched regions in ChIP-Seq data. MUSIC first filters the ChIP-Seq read-depth signal for systematic noise from non-uniform mappability, which fragments enriched regions. Then it performs a multiscale decomposition, using median filtering, identifying enriched regions at multiple length scales. This is useful given the wide range of scales probed in ChIP-Seq assays. MUSIC performs favorably in terms of accuracy and reproducibility compared with other methods. In particular, analysis of RNA Polymerase II data reveals a clear distinction between the stalled and elongating forms of the polymerase.
    Genome Biology 10/2014; 15(10):474. DOI:10.1186/PREACCEPT-9116006401338101 · 10.47 Impact Factor

