A reexamination of information theory-based methods for DNA-binding site identification

Department of Biological Sciences, University of Maryland-Baltimore County, Baltimore, MD, USA.
BMC Bioinformatics (Impact Factor: 2.58). 03/2009; 10(1):57. DOI: 10.1186/1471-2105-10-57
Source: PubMed


Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods.
Our results reveal that conventional benchmarking against artificial sequence data frequently leads to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and must therefore be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions about information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results.
We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than the alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
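The contrast the abstract draws between plain information content and skew-aware Relative Entropy can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy sites and the skewed background frequencies are all hypothetical, and information content is computed in the simple "background entropy minus column entropy" form with no small-sample correction.

```python
import math

BASES = "ACGT"

def column_frequencies(sites):
    """Per-position base frequencies for a list of equal-length aligned sites."""
    freqs = []
    for i in range(len(sites[0])):
        counts = {b: 0 for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        freqs.append({b: counts[b] / len(sites) for b in BASES})
    return freqs

def entropy(dist):
    """Shannon entropy in bits of a base-frequency dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def information_content(freqs, background):
    """Sum over positions of (background entropy - column entropy), in bits."""
    h_bg = entropy(background)
    return sum(h_bg - entropy(col) for col in freqs)

def relative_entropy(freqs, background):
    """Sum over positions of the Kullback-Leibler divergence from the background."""
    return sum(
        p * math.log2(p / background[b])
        for col in freqs
        for b, p in col.items()
        if p > 0
    )

# Hypothetical backgrounds: a uniform genome and an AT-rich (skewed) genome.
uniform = {b: 0.25 for b in BASES}
at_rich = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}

# Toy one-position "motifs": one conserved towards the skew, one against it.
with_skew = column_frequencies(["A", "A", "A"])
against_skew = column_frequencies(["G", "G", "G"])

ic_with = information_content(with_skew, at_rich)
ic_against = information_content(against_skew, at_rich)
re_with = relative_entropy(with_skew, at_rich)
re_against = relative_entropy(against_skew, at_rich)
```

Under the uniform background the two measures coincide. Under the AT-rich background, information content treats the two conserved columns identically, whereas Relative Entropy assigns more weight to the column that runs against the skew. That built-in assumption is the one the abstract reports may not hold in real genomes.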

Available from: Ivan Erill, Oct 04, 2015
  • Source
    • "A multispecies collection of experimentally validated Gram-positive LexA-binding sites and a reference set of LexA-regulated clusters of orthologous genes (COGs) across the bacteria domain were derived from the published literature (Supplementary Tables S1 and S2). Putative LexA-binding sites were located by scoring all metagenome sequences on both strands using the Ri index as implemented in FITOM (Erill and O’Neill, 2009; Schneider, 1997). Only putative sites with scores >1.5 standard deviations below the mean of the original collection (>8.36 bits) were considered for further analysis. "
    ABSTRACT: Data from metagenomics projects remains largely untapped for the analysis of transcriptional regulatory networks. Here we provide proof-of-concept that metagenomic data can be effectively leveraged to analyze regulatory networks by characterizing the SOS meta-regulon in the human gut microbiome. We combine well-established in silico and in vitro techniques to mine the human gut microbiome data and determine the relative composition of the SOS network in a natural setting. Our analysis highlights the importance of translesion synthesis as a primary function of the SOS response. We predict the association of this network with three novel protein clusters involved in cell wall biogenesis, chromosome partitioning and restriction modification, and we confirm binding of the SOS response transcriptional repressor to sites in the promoter of a cell wall biogenesis enzyme, a phage integrase and a death-on-curing protein. We discuss the implications of these findings and the potential for this approach for metagenome analysis. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    Bioinformatics 01/2014; 30(9). DOI:10.1093/bioinformatics/btt753 · 4.98 Impact Factor
  • Source
    • "In bioinformatics, one can distinguish between two separate problems regarding DNA binding sites: searching for additional members of a known DNA binding motif (the site search problem) and discovering novel DNA binding motifs in collections of functionally related sequences (the sequence motif discovery problem) [1]. Many different methods have been proposed to search for binding sites. "
    ABSTRACT: Regulatory sequence detection is a fundamental challenge in computational biology. Transcription starts with the binding of a transcription factor to its binding site. Different sites can bind the same factor, and this variability in binding sequences makes them harder to detect with computational algorithms. This paper proposes a novel algorithm for transcription factor binding site (TFBS) detection. The algorithm applies a polyphase mapping scheme to represent the four nucleobases in both the DNA sequence and the set of binding sites associated with a given transcription factor (TF). The center of mass (CoM) of each set of binding sites, which can be thought of as a consensus sequence, is then calculated. The algorithm then calculates the distance between the CoM and each binding site belonging to a given TF. The same procedure is then applied to the genome sequence under study. The resulting distances are used to detect new potential TFBSs based on their similarity to the set of known binding sites. The analysis is applied to E. coli genomes. Simulation results verify the correctness and the biological relevance of the proposed algorithm.
    HIBIT 2012 - 2012 7th International Symposium on Health Informatics and Bioinformatics (HIBIT), Cappadocia, Nevsehir, TURKEY; 04/2012
  • Source
    • "In contrast, transcriptional regulatory network analyses use a known binding motif to search multiple genomes and elucidate the composition of the regulatory network at different levels [42,57,58]. In this approach, the identification of a candidate binding site upstream of an ortholog in different species bolsters the a priori low confidence of the in silico prediction made on each individual genome [59,60]. "
    ABSTRACT: The SOS response is a well-known regulatory network present in most bacteria and aimed at addressing DNA damage. It has also been linked extensively to stress-induced mutagenesis, virulence and the emergence and dissemination of antibiotic resistance determinants. Recently, the SOS response has been shown to regulate the activity of integrases in the chromosomal superintegrons of the Vibrionaceae, which encompasses a wide range of pathogenic species harboring multiple chromosomes. Here we combine in silico and in vitro techniques to perform a comparative genomics analysis of the SOS regulon in the Vibrionaceae, and we extend the methodology to map this transcriptional network in other bacterial species harboring multiple chromosomes. Our analysis provides the first comprehensive description of the SOS response in a family (Vibrionaceae) that includes major human pathogens. It also identifies several previously unreported members of the SOS transcriptional network, including two proteins of unknown function. The analysis of the SOS response in other bacterial species with multiple chromosomes uncovers additional regulon members and reveals that there is a conserved core of SOS genes, and that specialized additions to this basic network take place in different phylogenetic groups. Our results also indicate that across all groups the main elements of the SOS response are always found in the large chromosome, whereas specialized additions are found in the smaller chromosomes and plasmids. Our findings confirm that the SOS response of the Vibrionaceae is strongly linked with pathogenicity and dissemination of antibiotic resistance, and suggest that the characterization of the newly identified members of this regulon could provide key insights into the pathogenesis of Vibrio. 
The persistent location of key SOS genes in the large chromosome across several bacterial groups confirms that the SOS response plays an essential role in these organisms and sheds light on the mechanisms of evolution of global transcriptional networks involved in adaptability and rapid response to environmental changes, suggesting that small chromosomes may act as evolutionary test beds for the rewiring of transcriptional networks.
    BMC Genomics 02/2012; 13(1):58. DOI:10.1186/1471-2164-13-58 · 3.99 Impact Factor