Why Transcription Factor Binding Sites Are Ten Nucleotides Long

University of Pennsylvania.
Genetics (Impact Factor: 5.96). 08/2012; 192(3). DOI: 10.1534/genetics.112.143370
Source: PubMed


Gene expression is controlled primarily by transcription factors, whose DNA binding sites are typically 10 nucleotides long. We develop a population-genetic model to understand how the length and information content of such binding sites evolve. Our analysis is based on an inherent tradeoff between specificity, which is greater in long binding sites, and robustness to mutation, which is greater in short binding sites. The evolutionarily stable distribution of binding site lengths predicted by the model agrees with the empirical distribution (5 nt to 31 nt, with mean 9.9 nt for eukaryotes), and it is remarkably robust to variation in the underlying parameters of population size, mutation rate, number of transcription factor targets, and strength of selection for proper binding and selection against improper binding. In a systematic dataset of eukaryotic and prokaryotic transcription factors, we also uncover strong relationships between the length of a binding site and its information content per nucleotide, as well as between the number of targets a transcription factor regulates and the information content in its binding sites. Our analysis explains these features as well as the remarkable conservation of binding site characteristics across diverse taxa.
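
The two quantities the abstract weighs against each other can be made concrete with a small sketch. It is not the authors' population-genetic model; it only computes the standard information content of a binding site in bits, assuming a uniform background of 0.25 per base, and the expected number of chance matches to one fixed sequence in random DNA. The function names and the toy frequency matrix are hypothetical and chosen for illustration.

```python
import math

def column_information(freqs, background=0.25):
    """Information (bits) of one site position relative to a uniform background."""
    return sum(p * math.log2(p / background) for p in freqs if p > 0)

def site_information(pfm):
    """Return (total bits, bits per nucleotide) for a position frequency matrix."""
    total = sum(column_information(col) for col in pfm)
    return total, total / len(pfm)

def expected_chance_matches(length, genome_size):
    """Expected exact matches of one fixed sequence in random, uniform DNA (one strand)."""
    return genome_size * 0.25 ** length

# Hypothetical 4-position motif; each row holds (A, C, G, T) frequencies.
toy_pfm = [
    (1.0, 0.0, 0.0, 0.0),      # fully conserved position: 2 bits
    (0.5, 0.5, 0.0, 0.0),      # two equally likely bases: 1 bit
    (0.25, 0.25, 0.25, 0.25),  # no base preference: 0 bits
    (0.0, 0.0, 0.0, 1.0),      # fully conserved position: 2 bits
]
total_bits, bits_per_nt = site_information(toy_pfm)
print(total_bits, bits_per_nt)

# Longer sites are more specific: each added conserved base cuts chance matches fourfold.
for length in (5, 10, 15):
    print(length, expected_chance_matches(length, genome_size=10**7))
```

Under these assumptions a longer, fully conserved site has fewer spurious matches but also more positions at which a single mutation can disrupt binding, which is the tension the model formalises.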



Available from: Alexander J. Stewart, Jan 02, 2014
  • "Eukaryotic transcription factor binding sites are the result of a trade-off between the specificity offered by longer stretches of DNA and the robustness to mutation offered by shorter sequences and vary in length between 5 and >30nt, with an average length of 10nt (Stewart et al. 2012). It has been estimated that eukaryotic promoters may contain 10-50 binding sites for 5-15 different transcription factors (Wray et al. 2003)."
    ABSTRACT: Snake venom has been hypothesised to have originated and diversified via a process that involves duplication of genes encoding body proteins with subsequent recruitment of the copy to the venom gland, where natural selection acts to develop or increase toxicity. However, gene duplication is known to be a rare event in vertebrate genomes and the recruitment of duplicated genes to a novel expression domain (neofunctionalisation) is an even rarer process that requires the evolution of novel combinations of transcription factor binding sites in upstream regulatory regions. Whilst this hypothesis concerning the evolution of snake venom is therefore very unlikely and should be regarded with caution, it is nonetheless often assumed to be established fact, hindering research into the true origins of snake venom toxins. To critically evaluate this hypothesis we have generated transcriptomic data for body tissues and salivary and venom glands from five species of venomous and non-venomous reptiles. Our comparative transcriptomic analysis of these data reveals that snake venom does not evolve via the hypothesised process of duplication and recruitment of genes encoding body proteins. Indeed, our results show that many proposed venom toxins are in fact expressed in a wide variety of body tissues, including the salivary gland of non-venomous reptiles, and that these genes have therefore been restricted to the venom gland following duplication, not recruited. Thus snake venom evolves via the duplication and subfunctionalisation of genes encoding existing salivary proteins. These results highlight the danger of the elegant and intuitive "just-so story" in evolutionary biology.
    Genome Biology and Evolution 07/2014; 6(8). DOI:10.1093/gbe/evu166 · 4.23 Impact Factor
  • ABSTRACT: One of the large, unsolved problems in human genetics is the proportion of functional sequences in genomes. Recently, the Encyclopedia of DNA Elements (ENCODE) consortium reported that the majority of the genome is biochemically active and described this activity as biochemical function. This has been used as evidence to pronounce the death of the junk DNA concept. In evolutionary biology, junk DNAs are sequences whose gain or loss does not seriously affect the fitness of the host organism. In the human genome, a large amount of the biochemical activity is expected to repress sequences so as to avoid their harmful expression. Biochemical activity is very different from functionality in the light of evolution. The single nucleotide polymorphism sites associated with disease and other phenotypes may be functional, but their abundance in the active genome regions is not reliable evidence of functionality. Because of sequence-independent functions, the proportion of functional regions would be underestimated when sequence constraints are used alone. Knockout may be the most effective means of distinguishing functional sequences from junk DNA.
    Biochemical and Biophysical Research Communications 12/2012; 430(4). DOI:10.1016/j.bbrc.2012.12.074 · 2.30 Impact Factor
  • ABSTRACT: Although most verified functional elements in non-coding DNA contain a highly conserved core region, this concept is not generally incorporated into de novo motif inference systems. In this work, we explore the utility of adding the notion of conserved core regions to a comparative genomics approach to the search for putative functional elements in non-coding DNA. By modifying the scoring function for GAMI, Genetic Algorithms for Motif Inference, we investigate tradeoffs between the strength of conservation of the full motif vs. the strength of conservation of a core region. This work illustrates that incorporating information about the structure of transcription factor binding sites can be helpful in identifying biologically functional elements.
    Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2013 IEEE Symposium on; 01/2013