A reexamination of information theory-based methods for DNA-binding site identification

Article in BMC Bioinformatics 10(1):57 · March 2009
DOI: 10.1186/1471-2105-10-57 · Source: PubMed
Abstract
Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods. Our results reveal that conventional benchmarking against artificial sequence data leads frequently to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and therefore must be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions on information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results. We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than any other alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
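The two quantities the abstract contrasts, information content and Relative Entropy, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy sites, and the skewed background frequencies are all hypothetical, and information content is taken here as the background entropy minus the per-column entropy, summed over the columns of an aligned collection of sites.

```python
import math
from collections import Counter

BASES = "ACGT"

def column_frequencies(sites):
    """Per-position base frequencies for aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: counts.get(b, 0) / len(sites) for b in BASES})
    return freqs

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def information_content(freqs, background):
    """Sum over columns of (background entropy - column entropy), in bits."""
    h_background = entropy(background)
    return sum(h_background - entropy(col) for col in freqs)

def relative_entropy(freqs, background):
    """Sum over columns of the Kullback-Leibler divergence from background."""
    return sum(
        p * math.log2(p / background[b])
        for col in freqs
        for b, p in col.items()
        if p > 0
    )

# Hypothetical backgrounds: a uniform genome and an AT-rich (skewed) one.
UNIFORM = {b: 0.25 for b in BASES}
SKEWED = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}

# Toy, perfectly conserved sites: one follows the skew, one opposes it.
with_skew = column_frequencies(["AAAA", "AAAA"])
against_skew = column_frequencies(["CCCC", "CCCC"])

for name, cols in (("with skew", with_skew), ("against skew", against_skew)):
    print(name,
          round(information_content(cols, SKEWED), 3),
          round(relative_entropy(cols, SKEWED), 3))
```

Under a uniform background the two measures coincide. Under the skewed background, this definition of information content scores both toy sites identically, whereas Relative Entropy scores the site composed of rare bases higher. That is the assumption the abstract questions when it reports that binding sites tend to evolve towards genomic skew.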
    • "We again assumed that the probability of any amino acid that does not exist in the window is zero. The RE was used in previous studies to identify the conserved position [37, 38]. "
    Full-text · Dataset · Mar 2016 · PLoS Genetics
    • "We again assumed that the probability of any amino acid that does not exist in the window is zero. The RE was used in previous studies to identify the conserved position [37, 38]. "
    ABSTRACT: Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite (RF-Phos 2.0) predictor, to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular phosphosite prediction methods, such as PhosphoSVM, GPS2.1 and Musite. RF-Phos 2.0 is freely available at http://bcb.ncat.edu/RF_Phos/.
    Full-text · Article · Jan 2016
    • "of SNP allele frequencies at differentially methylated CpG dinucleotides SNP allele bias for alleles at differentially methylated CpGs was examined by calculating the information content [49] for the observed SNP allele frequencies at the five base pairs up-and downstream of CpGs showing increased or decreased methylation in the SHR and BN strains. Calculations and visualisation were carried out with the Bioconductor seqLogo package implemented in R. The seqLogo code was modified to take into account nucleotide usage in the rat genome (29% A, 29% T, 21% C, 21% G). "
    ABSTRACT: Epigenetic marks such as cytosine methylation are important determinants of cellular and whole-body phenotypes. However, the extent of, and reasons for inter-individual differences in cytosine methylation, and their association with phenotypic variation are poorly characterised. Here we present the first genome-wide study of cytosine methylation at single-nucleotide resolution in an animal model of human disease. We used whole-genome bisulfite sequencing in the spontaneously hypertensive rat (SHR), a model of cardiovascular disease, and the Brown Norway (BN) control strain, to define the genetic architecture of cytosine methylation in the mammalian heart and to test for association between methylation and pathophysiological phenotypes. Analysis of 10.6 million CpG dinucleotides identified 77,088 CpGs that were differentially methylated between the strains. In F1 hybrids we found 38,152 CpGs showing allele-specific methylation and 145 regions with parent-of-origin effects on methylation. Cis-linkage explained almost 60% of inter-strain variation in methylation at a subset of loci tested for linkage in a panel of recombinant inbred (RI) strains. Methylation analysis in isolated cardiomyocytes showed that in the majority of cases methylation differences in cardiomyocytes and non-cardiomyocytes were strain-dependent, confirming a strong genetic component for cytosine methylation. We observed preferential nucleotide usage associated with increased and decreased methylation that is remarkably conserved across species, suggesting a common mechanism for germline control of inter-individual variation in CpG methylation. In the RI strain panel, we found significant correlation of CpG methylation and levels of serum chromogranin B (CgB), a proposed biomarker of heart failure, which is evidence for a link between germline DNA sequence variation, CpG methylation differences and pathophysiological phenotypes in the SHR strain. Together, these results will stimulate further investigation of the molecular basis of locally regulated variation in CpG methylation and provide a starting point for understanding the relationship between the genetic control of CpG methylation and disease phenotypes.
    Full-text · Article · Dec 2014