A reexamination of information theory-based methods for DNA-binding site identification

Department of Biological Sciences, University of Maryland-Baltimore County, Baltimore, MD, USA.
BMC Bioinformatics (Impact Factor: 2.67). 03/2009; 10:57. DOI: 10.1186/1471-2105-10-57
Source: PubMed

ABSTRACT Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods.
Our results reveal that conventional benchmarking against artificial sequence data frequently leads to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and must therefore be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions about information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results.
We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than the alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
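
To make the two quantities contrasted in the abstract concrete, the sketch below computes the information content of a set of aligned binding sites against a uniform background, and their relative entropy against an arbitrary (possibly skewed) background. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy sites and the background frequencies are hypothetical, and no small-sample correction or pseudocount is applied.

```python
import math
from collections import Counter

BASES = "ACGT"


def column_frequencies(sites):
    """Return per-position base frequencies for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: counts.get(b, 0) / len(sites) for b in BASES})
    return freqs


def information_content(sites):
    """Sum over columns of (2 - Shannon entropy), in bits.

    Assumes a uniform background (0.25 per base), so the maximum is
    2 bits per position.
    """
    total = 0.0
    for col in column_frequencies(sites):
        entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
        total += 2.0 - entropy
    return total


def relative_entropy(sites, background):
    """Sum over columns of the Kullback-Leibler divergence, in bits,
    between column frequencies and the given background frequencies."""
    total = 0.0
    for col in column_frequencies(sites):
        total += sum(
            p * math.log2(p / background[b]) for b, p in col.items() if p > 0
        )
    return total


# Hypothetical toy data: four aligned sites and an AT-rich background.
toy_sites = ["ACGT", "ACGA", "ACGT", "ACGC"]
uniform = {b: 0.25 for b in BASES}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

print(information_content(toy_sites))
print(relative_entropy(toy_sites, uniform))
print(relative_entropy(toy_sites, at_rich))
```

With a uniform background the two measures coincide. They diverge only when the background is skewed, which is the situation in which the abstract reports that skew-aware methods may fail on real genomes.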

  • ABSTRACT: Regulatory sequence detection is a fundamental challenge in computational biology. The transcription process in protein synthesis starts with the binding of the transcription factor (TF) to its binding site. These binding sites are short DNA segments called motifs. Different sites can bind to the same factor. This variability in binding sequences, together with their low information content and low specificity, increases the difficulty of detecting them with computational algorithms. This paper proposes a novel algorithm for transcription factor binding site (TFBS) detection across the entire genome that also allows discovery of new motif sequences. This is achieved by using distance metrics based on a position frequency matrix (PFM) concept that quantify the similarity between the set of conserved sequences belonging to a particular TF and the entire DNA sequence under study. Hence, the PFM in this context can be thought of as a consensus sequence, as it provides a representative measure of the set of binding sites belonging to a particular TF. The algorithm then quantifies the correlation between the PFM and each binding site belonging to a given TF. The same procedure is then applied to the genome sequence under study. The obtained distance metrics are then used to discover new potential TFBSs based on their similarity to the set of binding sites investigated. Analysis is applied to Escherichia coli (E. coli) bacterial genomes. Simulation results verify the correctness and the biological relevance of the proposed algorithm.
    1 Introduction
    In bioinformatics, one can distinguish between two separate problems regarding DNA binding sites: searching for additional members of a known DNA binding motif (the site search problem) and discovering novel DNA binding motifs in collections of functionally related sequences (the sequence motif discovery problem) [1, 2]. Many different methods have been proposed to search for binding sites. Most of them rely on the principles of information theory and have available web servers [3, 4], while other authors have resorted to machine learning methods, such as artificial neural
    The IWBBIO 2014 (2nd International Work-Conference on Bioinformatics and Biomedical Engineering), Granada, Spain; 04/2014
  • ABSTRACT: Regulatory sequence detection is a fundamental challenge in computational biology. The transcription process in protein synthesis starts with the binding of the transcription factor to its binding site. Different sites can bind to the same factor. This variability in binding sequences increases the difficulty of their detection using computational algorithms. This paper proposes a novel algorithm for transcription factor binding site (TFBS) detection. The algorithm applies a polyphase mapping scheme to represent the four nucleobases in both the DNA sequence and the set of binding sites associated with a given transcription factor (TF). The center of mass (CoM) of each set of binding sites, which can be thought of as a consensus sequence, is then calculated. The algorithm then calculates distances between the CoM and each binding site belonging to a given TF. The same procedure is then applied to the genome sequence under study. The obtained distances are then used to detect new potential TFBSs based on their similarity to the set of binding sites already known. Analysis is applied to E. coli bacterial genomes. Simulation results verify the correctness and the biological relevance of the proposed algorithm.
    HIBIT 2012 - 2012 7th International Symposium on Health Informatics and Bioinformatics (HIBIT), Cappadocia, Nevsehir, TURKEY; 04/2012
  • ABSTRACT: Epigenetic marks such as cytosine methylation are important determinants of cellular and whole-body phenotypes. However, the extent of, and reasons for, inter-individual differences in cytosine methylation, and their association with phenotypic variation are poorly characterised. Here we present the first genome-wide study of cytosine methylation at single-nucleotide resolution in an animal model of human disease. We used whole-genome bisulfite sequencing in the spontaneously hypertensive rat (SHR), a model of cardiovascular disease, and the Brown Norway (BN) control strain, to define the genetic architecture of cytosine methylation in the mammalian heart and to test for association between methylation and pathophysiological phenotypes. Analysis of 10.6 million CpG dinucleotides identified 77,088 CpGs that were differentially methylated between the strains. In F1 hybrids we found 38,152 CpGs showing allele-specific methylation and 145 regions with parent-of-origin effects on methylation. Cis-linkage explained almost 60% of inter-strain variation in methylation at a subset of loci tested for linkage in a panel of recombinant inbred (RI) strains. Methylation analysis in isolated cardiomyocytes showed that in the majority of cases methylation differences in cardiomyocytes and non-cardiomyocytes were strain-dependent, confirming a strong genetic component for cytosine methylation. We observed preferential nucleotide usage associated with increased and decreased methylation that is remarkably conserved across species, suggesting a common mechanism for germline control of inter-individual variation in CpG methylation. In the RI strain panel, we found significant correlation of CpG methylation and levels of serum chromogranin B (CgB), a proposed biomarker of heart failure, which is evidence for a link between germline DNA sequence variation, CpG methylation differences and pathophysiological phenotypes in the SHR strain. Together, these results will stimulate further investigation of the molecular basis of locally regulated variation in CpG methylation and provide a starting point for understanding the relationship between the genetic control of CpG methylation and disease phenotypes.
    PLoS Genetics 12/2014; 10(12):e1004813. DOI:10.1371/journal.pgen.1004813 · 8.17 Impact Factor
