
[Frontiers in Bioscience 13, 6060-6071, May 1, 2008]


Information, probability, and the abundance of the simplest RNA active sites

Ryan Kennedy1,2, Manuel E. Lladser2, Michael Yarus3, Rob Knight4

1Department of Computer Science, University of Colorado at Boulder, 430 UCB, Boulder, CO 80309-0430,

2Department of Applied Mathematics, University of Colorado at Boulder, 526 UCB, Boulder, CO 80309-0526,

3Department of Molecular, Cellular and Developmental Biology, University of Colorado at Boulder, 347 UCB, Boulder, CO 80309-0430,

4Department of Chemistry and Biochemistry, University of Colorado at Boulder, 215 UCB, Boulder, CO 80309

TABLE OF CONTENTS

1. Abstract

2. Introduction

3. Methods

4. Results

5. Discussion

6. Acknowledgments

7. References

1. ABSTRACT

The abundance of simple but functional RNA

sites in random-sequence pools is critical for understanding

emergence of RNA functions in nature and in the laboratory

today. The complexity of a site is typically measured in

terms of information, i.e. the Shannon entropy of the

positions in a multiple sequence alignment. However, this

calculation can be incorrect by many orders of magnitude.

Here we compare several methods for estimating the

abundance of RNA active-site patterns in the context of in

vitro selection (SELEX), highlighting the strengths and

weaknesses of each. We include in these methods a new

approach that yields confidence bounds for the exact

probability of finding specific kinds of RNA active sites.

We show that all of the methods that take modularity into

account provide far more accurate estimates of this

probability than the informational methods, and that fast

approximate methods are suitable for a wide range of RNA

motifs.

2. INTRODUCTION

Our understanding of the evolution of functions

in DNA, RNA and protein sequences rests critically on the

probability of finding sequences with specific functions by

chance in collections of random sequences. Of particular importance is calculating the abundance of RNA active sites in short sequences, as longer sequences become increasingly improbable in primitive conditions.

Although we have very good models for understanding

evolution of a set of related, or homologous, sequences

from a common ancestor according to Markov models of

evolution (1), our understanding of the probability of

sequences arising independently in different groups of

organisms is far more limited. In order to fully understand

the evolutionary processes leading to a set of functional

sequences, whether produced by natural selection over

billions of years, or by artificial selection in a few weeks in

the laboratory, we must develop methods for assessing

whether it is more probable that a given collection of


sequences arose once (and was passed on with

modifications through successive generations), or arose

many times. This problem is especially important in RNA,

which is a model for molecular evolution in the laboratory

and which may have preceded both DNA and protein in an

‘RNA World’ stage of evolution, in which RNA acted both

as catalyst and genetic material (2).

The last 25 years have brought a revolution in

RNA biology, with the recognition that RNA can play

important catalytic and regulatory roles in the cell rather

than just being a passive messenger. Of particular

importance is a laboratory technique called SELEX, or in

vitro selection, in which random-sequence RNA libraries

are synthesized and screened for various properties (3-5).

SELEX has produced RNAs that can perform many

functions relevant to the origin of modern metabolism: there

are dozens of examples, including amino acid binding

(6-12), nucleotide synthesis (13), and self-aminoacylation

(14, 15). Artificially selected RNAs can also bear striking

resemblance to natural systems. For example, an RNA

selected to bind a transition state analog of the peptidyl

transfer reaction contains a conserved 8-base sequence that

is identical to a conserved 8-base sequence in the ribosome

at the site that naturally performs this reaction in all cells

during translation (16, 17), and artificially selected RNAs

that bind amino acids recapture properties of the canonical

genetic code (18-20).

One striking feature of both naturally and

artificially selected RNAs is that they are highly modular

(21, 22). In other words, they tend to consist of short

conserved pieces of the active site (the ‘modules’) that are

separated by essentially random regions of sequence (the

‘spacer’). Modular RNAs consist of specific sequence

motifs in the context of specific secondary structure

elements. For example, the minimal tryptophan-binding site

consists of a CYA opposite a GAC in an internal loop

flanked by helices (12), i.e. two modules each consisting of

three conserved bases and flanked by several base pairs on

either side. However, essentially any sequence can occur

between the two halves of the helices. This modularity is

important because both natural and artificial selection

(SELEX) recover motifs of this form: modular motifs have

a combinatorial advantage over single-module motifs, so

should be isolated more often if they are stable. Another

important feature of RNA motifs is that they are held

together by base pairing, leading to correlations in the

sequence (e.g. in a base-paired region, if we have a C at one

location, we must have a G at the location that pairs with it).

It is important to calculate the probability of

finding a given RNA in a random-sequence background for

several reasons. First, we can estimate how likely a

particular sequence would be observed in a SELEX

experiment, and perhaps tune random-sequence pools to

maximize the probability of occurrence of interesting motifs

(23-25). Second, we have very good methods for estimating

the probability of obtaining a set of sequences given an

evolutionary model (26-29), but the probability of obtaining

the sequences through multiple origins (30) is not well

understood. For example, we know for certain that the

hammerhead ribozyme has evolved at least three times: at

least once in nature, and at least twice in artificial

selections in the Breaker and Szostak labs (31, 32).

Improving our estimates of the probabilities of modular

RNA sites can help us understand whether different RNAs

that contain the same motif most likely had a common

ancestor or evolved independently. Third, genome-wide searches for motifs, such as those performed by the Infernal package (33) and used in the Rfam database (34), return

many matches, and we thus need to calculate the statistical

significance of a given motif to rule out the null hypothesis

that it evolved by chance. Fourth, understanding the regions

of nucleotide composition that make RNA functions most

likely may provide clues about which genomes are most

likely to evolve which functions, and about the chemical

conditions under which the RNA world might have

emerged (35). For example, we might be able to address

unsolved problems such as why some bacteria use RNA for

regulation where others use proteins. Perhaps genome

composition, which varies over a huge range, favors

formation of riboswitches in species that have the right

composition. Similar considerations might apply to the use

of the hairpin, hammerhead, and HDV self-cleaving motifs,

which, along with many other self-cleaving RNAs, perform

similar functions (32).

Several methods have been proposed for

calculating the probability of finding a correlated modular

motif (referred to as ‘the motif’ in the text below). We note

that these methods cover only the first step in calculating

the probability of obtaining an active molecule: the second

step is to calculate the probability of correct folding given

that the sequence elements required for a motif are present,

as in (25), and the final step is to calculate the probability

that the molecule is functional given that it contains the

sequence elements required for activity and is predicted to

fold correctly, which can be achieved by laboratory

experiments. However, because these probabilities are

multiplied together to get the overall probability of

function, errors in the first step are propagated throughout

the calculation. These methods for calculating the first step,

the probability of obtaining the correct sequence elements,

are:

1. Information content, as used in e.g. (36, 37): in this method, a multiple sequence alignment is constructed and examined for conserved positions, which contain the same nucleotides at corresponding sites in different sequences. The information content in bits is given by Shannon’s formula (38): $I = -\Delta H$, where $H = -\sum_i p_i \log_2 p_i$, summed across the nucleotides in the sequence. The intuition here is that in a random RNA sequence, there are 4 possible states at each position, so if the bases are equiprobable there is a reduction of 2 bits of uncertainty if only one of the four choices is acceptable ($H_{before} = -4 \times 0.25 \times \log_2 0.25 = -4 \times 0.25 \times (-2) = 2$, $H_{after} = 0$, so the difference is 2 bits). Thus we have 2 bits of information per conserved nucleotide, and 2 bits of information per conserved base pair (or about 1.42 bits of information if wobble pairing is allowed, because then 6/16 rather than 4/16 of the possible pairs are acceptable, to account for the G-U and U-G pairs: the standard Watson-Crick pairs are A-U, U-A, C-G, and G-C). Although the simplicity of this

method is appealing, it assumes that all of the sequences are

drawn from a single starting sequence, with the conserved

sites appearing at specific positions within this reference

frame. Converting bits to probabilities, this model implies

that each conserved nucleotide or Watson-Crick base pair

multiplies the probability of occurrence by 1/4 (for Watson-

Crick pairs, 4/16 = 0.25 of the possible choices of two

nucleotides are valid pairs), and that each conserved

wobble pair multiplies the probability of occurrence of

the motif by 6/16. The appeal of the Shannon formula

arises from its simplicity, and from the fact that the

information content of a motif is independent of the

number of modules that it is broken into and of the

length of the sequence in which it is embedded.

However, as we shall see, these simplifying assumptions lead to substantial inaccuracies, precisely because the resulting estimate is independent of the number of spacers. This method also assumes that the four bases are equiprobable, which is often reasonable in SELEX because the incorporation rates can be controlled during chemical synthesis, but is not reasonable in genomes where we know the background base frequencies vary widely. Finally, this method assumes that there are no correlations among successive positions in the sequence, i.e. that the base at each position does not affect the frequencies of the bases that follow.

2. Poisson approximation across sites (22): in

this method, we calculate the probability of observing

the motif in a single trial (i.e. of finding it in a single

random sequence of the precise length of the motif). We

then calculate the number of ways to place the modules

of the motif within the longer sequence, and use the

Poisson formula to calculate the probability of zero

occurrences in the number of ‘trials’ corresponding to

the sequence. The complement of the probability of the

zero class is the probability that the motif occurred at

least once. This method assumes that each possible match location is independent and that a match is extremely unlikely, and essentially calculates the probability of a match anywhere in the sequence. Although the assumption of independence may lead to reasonable approximations in relatively long sequences (as compared to the original motif), modularity makes matches much more likely, thus violating one of the key assumptions justifying the Poisson approximation (39). Indeed, we shall see that highly modular motifs can lead to less accurate estimates of the probability of finding the motif when the probabilities of occurrence of the individual modules are large. Like the information method, this method assumes that there are no correlations among successive positions in the sequence.

3. State machine/transition matrix (40): unlike the

Poisson approximation, this method provides an exact way

of calculating the probability of occurrence of a modular

motif, although it is very expensive computationally. In this method, we embed the random sequences into a deterministic finite automaton (finite state machine) that detects matches with each possible pattern that could lead to the occurrence of the motif in the sequence (see Methods below for additional details). When we embed an i.i.d. (independent and identically distributed) random sequence of RNA bases into the automaton, the resulting stochastic process is a first-order homogeneous Markov chain on the states of the automaton. The probability that the motif is present in the random sequence is then equivalent to the probability that the Markov chain is absorbed into a state indicative of a match with the motif. This probability can be calculated in two ways. First, the probability transition matrix of the chain, which gives the probability of moving from each state to each other state, may be raised to the power of the sequence length, and the entries associated with matches with the motif may be summed to determine the probability of a match. Second, a network flow approach may be used, in which we simulate each additional character by visiting each state that can be reached in a given number of characters, multiplying the probability of that state by the probability of each of the possible characters, and adding the result to the probability of the state that is reached by adding a character. In practice, both methods give very

similar results when both can be implemented, but differ

substantially in run time and computer memory usage.

As we shall see, the automata-based approach becomes

computationally infeasible in both memory and CPU time even with very small numbers of correlations (i.e. base

pairs). The most complex case is when correlations

occur between modules (i.e. the modules are base paired

to one another), which forces us to consider the product

of the automata associated with each of the unique and

simpler patterns that build up the motif (in our case, a

concatenation of Aho-Corasick automata (41)). We omit

results from the network flow approach in what follows

because the method is more heuristic than mathematical,

and does not provide substantially different results from the

inclusion-exclusion approach below (on which we can

place more precise bounds).

4. Inclusion-exclusion approach: this method

can be used both for exact calculations of the probability

of occurrence of a modular motif (for small cases), or to

give bounds for this probability (for larger cases). In this

method, we also use product automata as above.

However, we aim for bounds of the same order of

magnitude rather than an exact calculation of the

probability of observing the motif. Using the inclusion-

exclusion formula (42), we can calculate the probability

of occurrence of the motif in a random sequence by determining the probability of matching individual, pairwise, three-way, etc. combinations of the unique

patterns that build up the motif. Combining this

probability as P(individual)−P(pairwise)+P(three-way)-...

we can recover the exact probability of matching the motif.

However, according to Bonferroni’s inequality (42), this

exact probability is bounded by any two successive partial

sums occurring in the inclusion-exclusion formula. In particular, if for a small k, the first k terms in the inclusion-exclusion formula provide a tight estimate of the probability, the associated bounds in Bonferroni’s inequality will be of the same order, and we only need to consider the product of at most k automata to obtain a tight approximation for the probability of matching the motif. When there are too many combinations of the k unique patterns that build up the motif, we can obtain asymptotic confidence intervals for the terms appearing in Bonferroni’s inequality via Monte Carlo simulation. As we shall see, this new method provides accurate estimates even for highly modular motifs for which the use of the Poisson approximation is poorly justified and for which the state machine/transition matrix approach is completely impractical. One important feature of this approach is that it can be extended to sequences with memory (i.e. correlations among successive positions). These correlations have been observed in genomes, and are widely used for gene-finding (43).

Figure 1. Auxiliary deterministic finite automata needed for detecting at least one match with the correlated modular motif (NA 1 CN) in a sequence of RNA nucleotides. Top-left, Aho-Corasick automaton that recognizes any match with the keyword GA in an RNA sequence. State 0: is the initial state. Visits to state 2:GA correspond to matches with the keyword GA within the RNA sequence. Top-right, automaton that recognizes the motif GA 1 CC, i.e. the motif (NA 1 CN) when the correlation is replaced with the pair GC. State 0: is again the initial state. Up to minor modifications, states 0:, 1:G and 2:A correspond to the automaton on the left, whereas states 3:, 4:C and 5:CC correspond to the states of the Aho-Corasick automaton that detects matches with the keyword CC. State 2:A is visited for the first time when GA is first encountered in the RNA sequence. Transitions from state 2:A to state 3: represent the unconstrained region of at least one nucleotide in the motif GA 1 CC. State 5:CC may only be visited from state 4:C once the keyword CC is detected. Absorption into this state guarantees a match with the uncorrelated motif GA 1 CC. Bottom, product automaton for detecting at least one match with the correlated modular motif (NA 1 CN) when the correlation N is restricted to the value A or C. States are ordered pairs of the form (v1,v2), with v1 a state in the automaton that detects at least one match with the motif AA 1 CU, and v2 a state in the automaton that detects at least one match with the motif CA 1 CG. States labeled with the prefixes 15:, 20:, 21:, 24: and 25: guarantee a match with AA 1 CU but without a match with CA 1 CG. States labeled with the prefixes 18:, 19:, 22:, 23: and 27: guarantee a match with CA 1 CG but without a match with AA 1 CU. The state labeled with the prefix 26: guarantees a match with both AA 1 CU and CA 1 CG. The automaton that detects at least one match with the motif (NA 1 CN) would require the product of four automata and is not displayed here due to limitations of space. All the plots were obtained using the software Graphviz, available at graphviz.org.
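The gap between the first two estimates described above can be seen on a toy case. The following is a minimal sketch, not the implementation used in this paper: it assumes equiprobable bases, an illustrative uncorrelated motif (GA, a spacer of at least one nucleotide, then CC), and a sequence short enough to enumerate exhaustively. All function names are hypothetical.

```python
import math
import re
from itertools import product

MODULES = ("GA", "CC")  # illustrative motif: GA, spacer of >= 1 nt, CC (assumption)
LENGTH = 8              # short enough to enumerate all 4**8 sequences

def information_estimate(modules):
    """2 bits per conserved nucleotide, converted to a probability."""
    bits = 2 * sum(len(m) for m in modules)
    return 2.0 ** -bits

def count_placements(modules, length):
    """Ways to lay the modules, in order, into `length` positions with at
    least one unconstrained nucleotide between neighbouring modules."""
    def place(idx, start):
        if idx == len(modules):
            return 1
        total = 0
        for pos in range(start, length - len(modules[idx]) + 1):
            total += place(idx + 1, pos + len(modules[idx]) + 1)
        return total
    return place(0, 0)

def poisson_estimate(modules, length):
    """1 - exp(-lambda), with lambda = placements * single-trial probability."""
    lam = count_placements(modules, length) * information_estimate(modules)
    return 1.0 - math.exp(-lam)

def exact_by_enumeration(modules, length):
    """Fraction of all sequences of this length containing the motif."""
    pattern = re.compile(".+".join(modules))  # e.g. "GA.+CC"
    hits = sum(1 for s in product("ACGU", repeat=length)
               if pattern.search("".join(s)))
    return hits / 4 ** length

if __name__ == "__main__":
    print("information:", information_estimate(MODULES))
    print("Poisson:    ", poisson_estimate(MODULES, LENGTH))
    print("exact:      ", exact_by_enumeration(MODULES, LENGTH))
```

Exhaustive enumeration is only feasible for very short sequences, which is why the automaton-based methods are needed at realistic lengths.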

In this paper, we test the accuracy of these

different methods on motifs of different length and

composition in different random-sequence backgrounds.

3. METHODS

We used an exact and a stochastic version of the inclusion-exclusion method to either estimate, or to find a 100(1−α)% asymptotic confidence interval for, the probability p of observing a given motif in a random sequence of l RNA bases (nucleotides) produced in a SELEX experiment. We typically consider α=0.01 or α=0.05.

The range of motifs considered here covers many of the small motifs routinely found in SELEX. The motifs are correlated (i.e. they contain base pairs), are composed of either one, two or three modules, and have constant regions of various lengths. For example, the motif (N(NACGUACGUAC 1 GU 1 AN)N) consists of three modules, two correlations (base pairs), and a constant region totaling thirteen nucleotides. Modules are separated by unconstrained regions. The notation 1 refers to an unconstrained region of at least one nucleotide. The modules of this motif are (N(NACGUACGUAC, GU and AN)N), where the N’s represent bases that may be any of the four nucleotides. The two bases directly within a pair of parentheses are correlated and must pair with each other.

When the motif consists of n correlations, the probability p corresponds to the probability that at least one of $m = 4^n$ simpler motifs (i.e. uncorrelated motifs) is present in the random sequence. In the example discussed above, n=2, m=16 and one of the uncorrelated motifs is CAACGUACGUAC 1 GU 1 AUG, in which the outer correlation was replaced by the pair of bases CG and the inner by AU.
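The expansion of a correlated motif into its uncorrelated motifs can be sketched as follows. This is an illustrative sketch only: the input format (one string per module, with "(N" opening a correlation and "N)" closing the most recently opened one) and the function name are assumptions made for the example, and only Watson-Crick pairs are allowed, as in the count $4^n$ above.

```python
from itertools import product

WATSON_CRICK = [("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")]

def expand_correlated_motif(modules):
    """Return every uncorrelated motif (a tuple of module strings) obtained by
    replacing each correlation with one of the four Watson-Crick pairs."""
    layout, stack, n_pairs = [], [], 0
    for module in modules:
        symbols, i = [], 0
        while i < len(module):
            if module[i] == "(":                       # "(N": opening half
                stack.append(n_pairs)
                symbols.append(("open", n_pairs))
                n_pairs += 1
                i += 2
            elif module[i] == "N" and module[i + 1:i + 2] == ")":
                symbols.append(("close", stack.pop())) # "N)": closing half
                i += 2
            else:                                      # constant nucleotide
                symbols.append(module[i])
                i += 1
        layout.append(symbols)

    expanded = []
    for choice in product(WATSON_CRICK, repeat=n_pairs):
        motif = []
        for symbols in layout:
            bases = []
            for sym in symbols:
                if isinstance(sym, str):
                    bases.append(sym)
                else:
                    side, pair = sym
                    bases.append(choice[pair][0 if side == "open" else 1])
            motif.append("".join(bases))
        expanded.append(tuple(motif))
    return expanded

if __name__ == "__main__":
    motifs = expand_correlated_motif(["(N(NACGUACGUAC", "GU", "AN)N)"])
    print(len(motifs), motifs[0])
```

Nested correlations are handled with a stack, so the first opened pair is the outer one, matching the outer and inner correlations of the example motif.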

Each of the uncorrelated motifs may be identified in a non-random sequence of nucleotides using a deterministic finite automaton (44). We construct such an automaton by concatenating the Aho-Corasick automata (41) associated with each of the constant regions. Except for the automaton associated with the last constant region, all transitions from the terminal state of an Aho-Corasick automaton are redefined so as to lead to the initial state of the next. The terminal state of the Aho-Corasick automaton associated with the last constant region is, however, reset to be an absorbing state (a state that always returns to itself when additional characters are fed into the automaton). The resulting automaton will have as many as #(nucleotides in the constant region)+#(modules) states. (See Figure 1 for a more detailed explanation of these constructions for a motif consisting of two modules, one correlation and a constant region of two nucleotides.)
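The concatenation described above can be sketched in a few lines. This is a hedged illustration rather than the code used in this work: because each constant region is a single keyword, the Aho-Corasick automaton reduces to a single-keyword matching automaton, which is built here by brute force. The names `motif_dfa` and `accepts` are hypothetical.

```python
ALPHABET = "ACGU"

def motif_dfa(modules):
    """Transition table (list of dicts) for one uncorrelated motif.
    State 0 is initial; the last state is the absorbing match state."""
    delta, offset = [], 0
    for idx, word in enumerate(modules):
        # Non-terminal states: go to the longest prefix of `word` that is a
        # suffix of the text read so far.
        for j in range(len(word)):
            row = {}
            for c in ALPHABET:
                text = word[:j] + c
                k = len(text)
                while k > 0 and text[-k:] != word[:k]:
                    k -= 1
                row[c] = offset + k
            delta.append(row)
        # Terminal state: absorbing for the last module; otherwise every
        # character (the spacer nucleotide) leads to the next initial state.
        last = idx == len(modules) - 1
        target = offset + len(word) + (0 if last else 1)
        delta.append({c: target for c in ALPHABET})
        offset += len(word) + 1
    return delta

def accepts(delta, sequence):
    """True if the motif occurs somewhere in `sequence`."""
    state = 0
    for c in sequence:
        state = delta[state][c]
    return state == len(delta) - 1

if __name__ == "__main__":
    dfa = motif_dfa(["GA", "CC"])
    print(len(dfa), accepts(dfa, "AGAGGCCA"), accepts(dfa, "GACC"))
```

For the motif GA 1 CC this yields six states, which agrees with the count #(nucleotides in the constant region)+#(modules) = 4 + 2.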

When m is of a manageable size we may consider the product of the automata associated with each of the uncorrelated motifs to determine p directly. The states of this automaton are ordered m-tuples of the form $(v_1, \ldots, v_m)$, where $v_i$ is always a state in the automaton associated with the i-th uncorrelated motif. (In principle this automaton may have at most $\{\#(\text{nucleotides in the constant region}) + \#(\text{modules})\}^m$ states. In many situations of interest, this upper bound is exaggerated because not all states of the form $(v_1, \ldots, v_m)$ may be reached from the initial state of the product automaton, i.e. the state $(q_1, \ldots, q_m)$, with $q_i$ the initial state of the automaton associated with the i-th uncorrelated motif.) By embedding a random sequence of i.i.d. nucleotides into this product automaton we are guaranteed to obtain a first-order homogeneous Markov chain (40). Indeed, if $p_b$ denotes the proportion of base $b \in \{A, C, G, U\}$ used in the SELEX experiment, then the transition probability from a state $s_1$ into a state $s_2$ is $\sum_b p_b$, where an index $b$ is accounted for in this summation if and only if there is a transition from state $s_1$ into state $s_2$ labeled with the character $b$. If P denotes the probability transition matrix of the resulting Markov chain and $P^l(s_1, s_2)$ denotes the entry associated with row $s_1$ and column $s_2$ of the power matrix $P^l$, then

$p = \sum_s P^l(q, s)$,

where q is the initial state of the product automaton, and the indices s are restricted to be those states of the form $(v_1, \ldots, v_m)$ where at least one of the entries $v_i$ is a terminal state.
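A small sketch of this calculation follows, under stated assumptions: only the reachable part of the product automaton is built, the row of $P^l$ for the initial state is obtained by l successive vector-matrix products rather than by forming $P^l$ explicitly, and the helper names (`motif_dfa`, `product_chain`, `match_probability`) are hypothetical.

```python
ALPHABET = "ACGU"

def motif_dfa(modules):
    """Concatenated keyword automata for one uncorrelated motif; the last
    state is absorbing and signals a match (spacers are >= 1 nucleotide)."""
    delta, offset = [], 0
    for idx, word in enumerate(modules):
        for j in range(len(word)):
            row = {}
            for c in ALPHABET:
                text = word[:j] + c
                k = len(text)
                while k > 0 and text[-k:] != word[:k]:
                    k -= 1
                row[c] = offset + k
            delta.append(row)
        last = idx == len(modules) - 1
        target = offset + len(word) + (0 if last else 1)
        delta.append({c: target for c in ALPHABET})
        offset += len(word) + 1
    return delta

def product_chain(dfas, base_freq):
    """Reachable product states and sparse rows of the transition matrix P."""
    start = tuple(0 for _ in dfas)
    index, queue, rows = {start: 0}, [start], []
    for state in queue:                      # queue grows while we iterate
        row = {}
        for c in ALPHABET:
            nxt = tuple(d[v][c] for d, v in zip(dfas, state))
            if nxt not in index:
                index[nxt] = len(index)
                queue.append(nxt)
            # Sum p_b over all bases b labelling a transition state -> nxt.
            row[index[nxt]] = row.get(index[nxt], 0.0) + base_freq[c]
        rows.append(row)
    return index, rows

def match_probability(motifs, length, base_freq):
    """p = sum over matching states s of P^l(q, s)."""
    dfas = [motif_dfa(m) for m in motifs]
    index, rows = product_chain(dfas, base_freq)
    dist = [0.0] * len(rows)
    dist[0] = 1.0                            # start in the initial state q
    for _ in range(length):
        new = [0.0] * len(rows)
        for s, mass in enumerate(dist):
            if mass:
                for t, pr in rows[s].items():
                    new[t] += mass * pr
        dist = new
    terminal = [len(d) - 1 for d in dfas]
    return sum(dist[i] for state, i in index.items()
               if any(v == t for v, t in zip(state, terminal)))

if __name__ == "__main__":
    uniform = {b: 0.25 for b in ALPHABET}
    four = [["AA", "CU"], ["UA", "CA"], ["CA", "CG"], ["GA", "CC"]]
    print(match_probability(four, 20, uniform))
```

Restricting the construction to reachable states is what keeps the state count below the upper bound mentioned above; the number of states still grows quickly with the number of uncorrelated motifs.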

Unfortunately, in most situations of interest, the product automata described above do not scale well. It is for these cases that an estimate of p rather than an exact formula may be more suitable. If we denote by $E_i$ the event that the i-th of the uncorrelated motifs is present in a random sequence of length l, it follows from the inclusion-exclusion formula (42) that

$p = S_1 + S_2 + \cdots + S_m$, (1)

where

$S_1 = \sum_i \mathrm{Prob}(E_i)$; $S_2 = -\sum_{i<j} \mathrm{Prob}(E_i \cap E_j)$; $S_3 = \sum_{i<j<k} \mathrm{Prob}(E_i \cap E_j \cap E_k)$; etc.

In general, if |I| denotes the number of indices in the set I we have that

$S_k = (-1)^{k+1} \sum_{I \subseteq \{1, \ldots, m\}:\, |I| = k} \mathrm{Prob}\left(\bigcap_{i \in I} E_i\right)$.
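The behaviour of the partial sums in (1) can be checked numerically on a case small enough to enumerate. The sketch below is illustrative only: it assumes equiprobable bases, replaces the automata with regular-expression matching over all sequences of a short length, and uses the identity that the sum of Prob over all index sets of size k equals the expected number of size-k subsets of the events that occur. The function name is hypothetical.

```python
import re
from collections import Counter
from itertools import accumulate, product
from math import comb

def bonferroni_partial_sums(regexes, length):
    """Return ([S_1, S_1+S_2, ..., S_1+...+S_m], exact probability) for the
    events E_i = 'regex i matches a uniform random sequence of this length'."""
    patterns = [re.compile(r) for r in regexes]
    total = 4 ** length
    # hits[h] = number of sequences in which exactly h of the events occur.
    hits = Counter(
        sum(1 for p in patterns if p.search(seq))
        for seq in map("".join, product("ACGU", repeat=length))
    )
    m = len(patterns)
    # S_k = (-1)^(k+1) * sum_{|I|=k} Prob(all E_i, i in I)
    #     = (-1)^(k+1) * E[ C(H, k) ], with H the number of events that occur.
    signed = [
        (-1) ** (k + 1) * sum(n * comb(h, k) for h, n in hits.items()) / total
        for k in range(1, m + 1)
    ]
    exact = 1.0 - hits[0] / total
    return list(accumulate(signed)), exact

if __name__ == "__main__":
    # The four uncorrelated motifs of (NA 1 CN), written as regexes.
    motifs = ["AA.+CU", "UA.+CA", "CA.+CG", "GA.+CC"]
    sums, exact = bonferroni_partial_sums(motifs, 8)
    for k, value in enumerate(sums, start=1):
        print(k, value)
    print("exact", exact)
```

Partial sums with an odd number of terms bound the exact probability from above and those with an even number of terms bound it from below, so two successive partial sums bracket p.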

To obtain $S_m$ we need to compute the probability

that all the uncorrelated motifs are present in the random

sequence of nucleotides. To the best of our knowledge this