Information, probability, and the abundance of the simplest RNA active sites

Department of Computer Science, University of Colorado at Boulder, 430 UCB, Boulder, CO 80309-0430, USA.
Frontiers in Bioscience (Impact Factor: 3.52). 02/2008; 13(16):6060-71. DOI: 10.2741/3137
Source: PubMed


The abundance of simple but functional RNA sites in random-sequence pools is critical for understanding the emergence of RNA functions in nature and in the laboratory today. The complexity of a site is typically measured in terms of information, i.e., the Shannon entropy of the positions in a multiple sequence alignment. However, this calculation can be incorrect by many orders of magnitude. Here we compare several methods for estimating the abundance of RNA active-site patterns in the context of in vitro selection (SELEX), highlighting the strengths and weaknesses of each. Among these methods is a new approach that yields confidence bounds for the exact probability of finding specific kinds of RNA active sites. We show that all of the methods that take modularity into account provide far more accurate estimates of this probability than the informational methods, and that fast approximate methods are suitable for a wide range of RNA motifs.
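
The contrast between an informational estimate and one that accounts for modularity can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the methods compared in the paper: the function names are hypothetical, alignment columns are treated as independent, all four nucleotides are assumed equally likely in the random pool, and the expected number of matches is a first-order approximation that ignores overlapping occurrences.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information (bits) of one alignment column: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def alignment_information(alignment):
    """Total information (bits), summed over columns as if positions were independent."""
    return sum(column_information(col) for col in zip(*alignment))


def naive_site_probability(bits):
    """Informational estimate: chance that a fixed window matches a site of `bits` bits."""
    return 2.0 ** (-bits)


def expected_matches_contiguous(bits, seq_len, site_len):
    """Expected matches per sequence when the site is a single contiguous block."""
    placements = seq_len - site_len + 1
    return placements * naive_site_probability(bits)


def expected_matches_two_modules(bits, seq_len, len1, len2):
    """Expected matches per sequence when the site is two ordered, non-overlapping
    modules separated by a spacer of arbitrary length (assumption: spacer is unconstrained)."""
    free = seq_len - len1 - len2
    # Distribute the free positions among the three gaps around the two modules.
    placements = math.comb(free + 2, 2)
    return placements * naive_site_probability(bits)


if __name__ == "__main__":
    toy_alignment = ["GGAC", "GGAC", "GGAC"]  # hypothetical invariant 4-nt site
    bits = alignment_information(toy_alignment)
    single = expected_matches_contiguous(bits, seq_len=50, site_len=4)
    split = expected_matches_two_modules(bits, seq_len=50, len1=2, len2=2)
    print(f"information: {bits:.2f} bits")
    print(f"expected matches, contiguous: {single:.4f}")
    print(f"expected matches, two modules: {split:.4f}")
```

In this toy setting the information content is identical in both cases, yet splitting the same four positions into two modules multiplies the number of admissible placements, which is one simple way an estimate based on information alone can miss the true abundance.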

15 Reads
  • Source
    • "The modeled route to this goal (Fig. 1) combines several straightforward ideas. Firstly, I have previously emphasized the value of molecular simplicity in making a structure accessible to primitive synthesis (Illangasekare and Yarus, 1999; Kennedy et al., 2008; Yarus, 2011b). Aminoacyl-RNA synthesis via a 5 nt ribozyme is a characterized experimental example (Yarus, 2011b); such a molecule would occur via untemplated RNA synthesis as soon as mildly activated nucleotides existed. "
    ABSTRACT: A testable, explicit origin for Darwinian behavior, feasible on a chaotic early Earth, would aid origins discussion. Here I show that a pool receiving unreliable supplies of unstable ribonucleotide precursors can recurrently fill this role. By using numerical integration, the differential equations governing a sporadically fed pool are solved, yielding quantitative constraints for the proliferation of molecules that also have a chemical phenotype. For example, templated triphosphate nucleotide joining is >10^4 too slow, suggesting that a group more reactive than pyrophosphate activated primordial nucleotides. However, measured literature rates are sufficient if the Initial Darwinian Ancestor (IDA) resembles a 5'-5' cofactor-like dinucleotide RNA, synthesized via activation with a phosphorimidazolide-like group. A sporadically fed pool offers unforeseen advantages; for example, the pool hosts a novel replicator which is predominantly unpaired, even though it replicates. Such free template is optimized for effective selection during its replication. Pool nucleotides are also subject to a broadly based selection that impels the population toward replication, effective selection, and Darwinian behavior. Such a primordial pool may have left detectable modern traces. A sporadically fed ribonucleotide pool also fits a recognizable early Earth environment, has recognizable modern descendants, and suits the early shape of the phylogenetic tree of Earthly life. Finally, analysis points to particular data now needed to refine the hypothesis. Accordingly, a kinetically explicit chemical hypothesis for a terran IDA can be justified, and informative experiments seem readily accessible. Key Words: Cofactor, RNA, Origin of life, Replication, Initial Darwinian Ancestor (IDA).
    Astrobiology 09/2012; 12(9):870-83. DOI:10.1089/ast.2012.0860 · 2.59 Impact Factor
  • Source
    • "Finally, the second loop in the Trp site has several specific implications. First, previous estimates of the probability of finding particular types of RNA sites (Knight and Yarus 2003; Knight et al. 2005; Kennedy et al. 2008) may be inflated by failing to take into account undetectable, but nonetheless important parts of the active site, such as those revealed here. Second, embedding the site in a random-sequence background provides an effective means for detecting such [...] "
    [FIGURE 7. (A) A newly defined tryptophan binding site derived from selection, construction, mutagenesis, and massed sequence analysis of selected pools.]
    ABSTRACT: Conservation is often used to define essential sequences within RNA sites. However, conservation finds only invariant sequence elements that are necessary for function, rather than finding a set of sequence elements sufficient for function. Biochemical studies in several systems, including the hammerhead ribozyme and the purine riboswitch, find additional elements, such as loop-loop interactions, required for function yet not phylogenetically conserved. Here we define a critical test of sufficiency: We embed a minimal, apparently sufficient motif for binding the amino acid tryptophan in a random-sequence background and ask whether we obtain functional molecules. After a negative result, we use a combination of three-dimensional structural modeling, selection, designed mutations, high-throughput sequencing, and bioinformatics to explore functional insufficiency. This reveals an essential unpaired G in a diverse structural context, varied sequence, and flexible distance from the invariant internal loop binding site identified previously. Addition of the new element yields a sufficient binding site by the insertion criterion, binding tryptophan in 22 out of 23 tries. Random insertion testing for site sufficiency seems likely to be broadly revealing.
    RNA 10/2010; 16(10):1915-24. DOI:10.1261/rna.2220210 · 4.94 Impact Factor
  • Source
    • "The above motivation is far from artificial! For instance, extensive research is being performed to understand the evolution of complex but short RNA sequences from simpler but functional RNA sequences [14] [13]. In contexts like this, the pitfall of the Normal approximation of T_n is the slow rate of convergence of order n^(-1/2). "
    ABSTRACT: We apply Doeblin's ergodicity coefficient as a computational tool to approximate the occupancy distribution of a set of states in a homogeneous but possibly non-stationary finite Markov chain. Our approximation is based on new properties satisfied by this coefficient, which allow us to approximate a chain of duration n by independent and short-lived realizations of an auxiliary homogeneous Markov chain of duration of order ln(n). Our approximation may be particularly useful when exact calculations via first-step methods or transfer matrices are impractical, and asymptotic approximations may not be yet reliable. Our findings may find applications to pattern problems in Markovian and non-Markovian sequences that are treatable via embedding techniques. Comment: 12 pages, 2 tables

