Uniform, optimal signal processing of mapped deep-sequencing data

Computational and Systems Biology, Genome Institute of Singapore, Singapore.
Nature Biotechnology (Impact Factor: 41.51). 06/2013; 31(7). DOI: 10.1038/nbt.2596
Source: PubMed


Despite their apparent diversity, many problems in the analysis of high-throughput sequencing data are merely special cases of two general problems, signal detection and signal estimation. Here we adapt formally optimal solutions from signal processing theory to analyze signals of DNA sequence reads mapped to a genome. We describe DFilter, a detection algorithm that identifies regulatory features in ChIP-seq, DNase-seq and FAIRE-seq data more accurately than assay-specific algorithms. We also describe EFilter, an estimation algorithm that accurately predicts mRNA levels from as few as 1-2 histone profiles (R ∼0.9). Notably, the presence of regulatory motifs in promoters correlates more with histone modifications than with mRNA levels, suggesting that histone profiles are more predictive of cis-regulatory mechanisms. We show by applying DFilter and EFilter to embryonic forebrain ChIP-seq data that regulatory protein identification and functional annotation are feasible despite tissue heterogeneity. The mathematical formalism underlying our tools facilitates integrative analysis of data from virtually any sequencing-based functional profile.
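To make the signal-detection framing concrete, the sketch below applies a matched filter to binned read counts. The matched filter is a classical detector that maximizes signal-to-noise ratio for a known signal shape in additive white noise. This is not DFilter's implementation: the Gaussian template, the bin layout, the edge handling, the threshold and all function names are assumptions chosen for illustration.

```python
import math


def gaussian_template(half_width, sigma):
    """Zero-mean, unit-norm Gaussian bump (assumed peak shape, hypothetical)."""
    raw = [math.exp(-0.5 * (i / sigma) ** 2)
           for i in range(-half_width, half_width + 1)]
    mean = sum(raw) / len(raw)
    centered = [v - mean for v in raw]  # zero mean: flat background scores 0
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def matched_filter(counts, template):
    """Correlate binned read counts with the template.

    Bins outside the array are replaced by the nearest edge bin so that a
    constant background still gives a zero response at the boundaries.
    """
    n = len(counts)
    half = len(template) // 2
    scores = []
    for i in range(n):
        s = 0.0
        for k, w in enumerate(template):
            j = min(max(i + k - half, 0), n - 1)  # clamp to valid bins
            s += w * counts[j]
        scores.append(s)
    return scores


def call_peaks(scores, threshold):
    """Return indices of local maxima whose score exceeds the threshold."""
    peaks = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if s > threshold and s >= left and s > right:
            peaks.append(i)
    return peaks


# Synthetic example: flat background of 2 reads per bin plus one bump at bin 25.
template = gaussian_template(half_width=6, sigma=2.0)
counts = [2 + round(20 * math.exp(-0.5 * ((j - 25) / 2.0) ** 2))
          for j in range(50)]
scores = matched_filter(counts, template)
print(call_peaks(scores, threshold=10.0))
```

Because the template has zero mean, a uniform background contributes nothing to the score, so only regions whose shape resembles the template rise above the threshold. A real analysis would also need a noise model and a control sample, which this sketch omits.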

Available from: Petra Kraus
  • Source
    • "Reads that map uniquely to the genome, with MAPQ quality score above 20, were used for the analysis. FAIRE-seq and ChIP-seq peaks were called with two algorithms, MACS 1.4 [5] and DFilter 1.0 [6], against mixed input controls corresponding to each group. MACS was run with default parameters, except for p = 10⁻⁷ for ChIP-seq data."
    ABSTRACT: Prostate cancer (PCa) is the second most common cancer in men. The Androgen Receptor (AR) is the major driver of PCa and the main target of therapy in the advanced setting. AR is a nuclear receptor that binds the chromatin and regulates transcription of genes involved in cancer cell proliferation and survival. In a study by Stelloo et al. (1) we explored prostate cancer on the level of transcriptional regulation by means of Formaldehyde-Assisted Isolation of Regulatory Elements and Chromatin Immunoprecipitation coupled with massive parallel sequencing (FAIRE-seq and ChIP-seq, respectively). We employed these data for the assessment of differences in transcriptional regulation at distinct stages of PCa progression and to construct a prognostic gene expression classifier. Genomics data includes FAIRE-seq data from normal prostate tissue as well as primary, hormone therapy resistant and metastatic PCa. Furthermore, ChIP-seq data from primary and resistant PCa were generated, along with multiple input controls. The data are publicly available through NCBI GEO database with accession number GSE65478. Here we describe the genomics and clinical data in detail and provide comparative analysis of FAIRE-seq and ChIP-seq data.
    Full-text · Article · Mar 2016 · Genomics Data
  • Source
    • "are specifically designed for peak finding, the DFilter method performed equally well on other HTS technology data such as DNase-seq and FAIRE-seq to detect NFRs (Kumar et al., 2013). This suggest that methods based on the concept of read profiles can be both robust as well as general for the analysis of a wide range of HTS data. Indeed another recent study showed high performance in predicting CRE (enhancers) using read profiles generated from CAGE data across a wide range of human tissues and cell types (An"
    ABSTRACT: Functional annotation of the genome is important to understand the phenotypic complexity of various species. The road toward functional annotation involves several challenges ranging from experiments on individual molecules to large-scale analysis of high-throughput sequencing (HTS) data. HTS data is typically a result of the protocol designed to address specific research questions. The sequencing results in reads, which when mapped to a reference genome often leads to the formation of distinct patterns (read profiles). Interpretation of these read profiles is essential for their analysis in relation to the research question addressed. Several strategies have been employed at varying levels of abstraction ranging from a somewhat ad hoc to a more systematic analysis of read profiles. These include methods which can compare read profiles, e.g., from direct (non-sequence based) alignments to classification of patterns into functional groups. In this review, we highlight the emerging applications of read profiles for the annotation of non-coding RNA and cis-regulatory elements (CREs) such as enhancers and promoters. We also discuss the biological rationale behind their formation.
    Full-text · Article · May 2015 · Frontiers in Genetics
  • Source
    • "Among them, SICER and RSEG are comparable for experiments with controls [8]. SICER is one of the best programs showing high accuracy in detecting broad binding regions from H3K36me3 [9]. "
    ABSTRACT: Chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-Seq) has been widely used to identify genomic loci of transcription factor (TF) binding and histone modifications. ChIP-Seq data analysis involves multiple steps from read mapping and peak calling to data integration and interpretation. It remains challenging and time-consuming to process large amounts of ChIP-Seq data derived from different antibodies or experimental designs using the same approach. To address this challenge, there is a need for a comprehensive analysis pipeline with flexible settings to accelerate the utilization of this powerful technology in epigenetics research. We have developed a highly integrative pipeline, termed HiChIP for systematic analysis of ChIP-Seq data. HiChIP incorporates several open source software packages selected based on internal assessments and published comparisons. It also includes a set of tools developed in-house. This workflow enables the analysis of both paired-end and single-end ChIP-Seq reads, with or without replicates for the characterization and annotation of both punctate and diffuse binding sites. The main functionality of HiChIP includes: (a) read quality checking; (b) read mapping and filtering; (c) peak calling and peak consistency analysis; and (d) result visualization. In addition, this pipeline contains modules for generating binding profiles over selected genomic features, de novo motif finding from transcription factor (TF) binding sites and functional annotation of peak associated genes. HiChIP is a comprehensive analysis pipeline that can be configured to analyze ChIP-Seq data derived from varying antibodies and experiment designs. Using public ChIP-Seq data we demonstrate that HiChIP is a fast and reliable pipeline for processing large amounts of ChIP-Seq data.
    Full-text · Article · Aug 2014 · BMC Bioinformatics