An Incremental Redundancy Estimation for the Sequence Neighborhood Boundary

Denny C. Dai∗¹, Herbert H. Tsang¹ and Kay C. Wiese¹

Address: ¹School of Computing Science, Simon Fraser University, Surrey, BC, Canada

Email: Denny C. Dai∗ - cda18@cs.sfu.ca; Herbert H. Tsang - htsang@cs.sfu.ca; Kay C. Wiese - wiese@cs.sfu.ca

∗Corresponding author

Abstract

Background: Understanding the combinatorial properties of the RNA sequence-structure space is crucial for solving problems such as RNA secondary structure prediction and design. In this work we studied the mapping between RNA sequences and their corresponding secondary structures. In particular, we investigated the sequence-structure neighborhood ball and its properties. This study provides preliminary results for understanding the empirical hardness of related problems and for designing efficient algorithmic solutions to them.

Results: We presented an Incremental Redundancy Estimation (IRE) approach for measuring the Hamming boundary of the small-radius sequence neighborhood ball. Our IRE experiments confirmed the existence of such neighborhood structures in the sequence space and demonstrated that the neighborhood boundary grows linearly with the sequence size.

Conclusions: Our IRE results suggest directions for designing local search algorithms for RNA design and for studying neutral network proximity.

1 Background

1.1 RNA Secondary Structure

An RNA primary sequence is defined as a string of length $n$ over the alphabet $\Sigma = \{A, C, G, U\}$, whose letters represent the four nucleotides adenine, cytosine, guanine and uracil. It is experimentally known that RNA sequences fold into particular spatial structures in order to carry out certain biological functions. The RNA secondary structure is formed through base pairing between nucleotides at different positions of the primary sequence. For example, canonical Watson-Crick pairs may occur between nucleotides $\{A, U\}$ or $\{G, C\}$, and wobble pairs may occur between $\{G, U\}$. In nature, these pairings correspond to hydrogen bonds between the nitrogenous bases, and each contributes to the free energy of the molecule. Different secondary structures therefore imply different free energy levels. An RNA secondary structure is stable at a particular energy level called the ground state, where the structure attains the minimum free energy (MFE) among all possible secondary structure conformations.
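The pairing rules above can be illustrated with a minimal Python sketch. The dot-bracket notation for structures and the helper names `can_pair` and `is_compatible` are assumptions introduced here for illustration; they are not taken from this paper.

```python
# Allowed base pairs: Watson-Crick (A-U, G-C) and wobble (G-U).
ALLOWED_PAIRS = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def can_pair(a: str, b: str) -> bool:
    """Return True if nucleotides a and b can form a canonical or wobble pair."""
    return frozenset((a, b)) in ALLOWED_PAIRS


def is_compatible(sequence: str, structure: str) -> bool:
    """Check that every bracket pair in a dot-bracket structure joins pairable bases.

    Assumes a pseudoknot-free structure written with '(', ')' and '.'.
    """
    if len(sequence) != len(structure):
        return False
    stack = []  # positions of currently open brackets
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                return False  # closing bracket without a partner
            i = stack.pop()
            if not can_pair(sequence[i], sequence[j]):
                return False
        elif symbol != ".":
            return False  # unknown symbol
    return not stack  # every opening bracket must be closed
```

Compatibility only says that a structure could be formed by a sequence. Whether it is the MFE structure of that sequence has to be decided by a folding algorithm with an energy model.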

In the classical RNA secondary structure prediction problem, we seek the MFE structure of a given RNA primary sequence. In the reverse problem, RNA Secondary Structure Design (SSD), we seek, for a given secondary structure, the RNA primary sequence(s) that fold into this structure as their MFE ground state. Existing work indicates that both problems appear to be NP-hard [1], and numerous algorithmic solutions have been proposed [2–5]. In the literature, heuristic search is established as the standard technique for tackling these problems, as it enables an effective and efficient exploration of the high-dimensional sequence-structure space. For example, solving the SSD problem requires searching the combinatorial space of all possible sequences for one that folds into a predefined structure. Investigating the combinatorial properties of the sequence-structure space is therefore crucial for understanding the empirical hardness of these problems.

1.2 Combinatorics of RNA Sequence-structure Space

In this work we study the combinatorics of the RNA sequence-structure space, in particular the mapping between RNA sequences and secondary structures. Understanding sequence-structure relations is of central importance in the biophysics of biopolymer structure [6]. It is also crucial for understanding transitions in evolutionary dynamics [7].

For an RNA chain of $n$ nucleotides, the underlying sequence space contains $4^n$ unique candidate sequences. It is also empirically known that the total number of unique RNA secondary structures is much smaller, with an estimated upper bound of $0.7137 \cdot n^{3/2} \cdot 2.2888^n$ [8]. Therefore, there exists a many-to-one mapping from the sequence space to the secondary structure space. A sequence $r$ maps to a structure $s$ if and only if $s$ is the MFE structure of $r$. Two sequences $r_1$ and $r_2$ that map to the same structure are called neutral counterparts, indicating that both sequences share the same native MFE structure. For a given secondary structure $s$, the set of all sequences mapped to $s$ is called its neutral network. Earlier works [6, 8] studied the combinatorial properties of neutral networks; their biological importance for evolutionary transitions and for genotype-phenotype correlation has also been addressed [7].
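The size gap between the two spaces can be made concrete with a short calculation. The sketch below is only an illustration: the function names are hypothetical, and the bound is evaluated with the constants and exponent exactly as they are printed in the text above.

```python
def sequence_space_size(n: int) -> int:
    """Number of RNA sequences of length n over {A, C, G, U}."""
    return 4 ** n


def structure_bound(n: int) -> float:
    """Upper bound on the number of secondary structures, as quoted in the text."""
    return 0.7137 * n ** 1.5 * 2.2888 ** n


def mean_network_size_lower_bound(n: int) -> float:
    """Sequences per structure if the bound were attained.

    Because the structure count is at most the bound, the true average
    neutral network size is at least this value.
    """
    return sequence_space_size(n) / structure_bound(n)


if __name__ == "__main__":
    for n in (20, 40, 80):
        print(f"n={n}: at least {mean_network_size_lower_bound(n):.3g} sequences per structure")
```

The ratio grows roughly like $(4 / 2.2888)^n$, which is why the mapping becomes increasingly many-to-one for longer sequences.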

Schuster et al. empirically studied the structure density surface of arbitrary RNA sequences and showed that there exists a small-radius Hamming neighborhood ball around an arbitrary random sequence [9]. Within such a ball the structure coverage is high: almost every common secondary structure is the MFE structure of at least one sequence in the ball. To estimate the Hamming boundary of the neighborhood ball, they proposed a closest-approach distance measurement that finds an upper bound on the minimum distance between sequences that fold into two different structures. However, this method is computationally costly and therefore becomes infeasible for long RNA sequences.

In this paper we propose an alternative method, the Incremental Redundancy Estimation (IRE) approach, for determining the sequence neighborhood boundary. We show that the IRE method scales well with sequence size and is capable of estimating the Hamming boundary with good accuracy. Our work is motivated by the following rationale. Recent advances on the RNA secondary structure design (SSD) problem show that, although SSD is an empirically hard problem, the runtime of an SSD algorithm can be empirically bounded by a polynomial function of the sequence length [10]. Furthermore, the neighborhood ball theory states that a given RNA secondary structure is expected to be the MFE structure of some sequence(s) lying within a neighborhood region centered at an arbitrary RNA sequence. We therefore argue that exploiting the combinatorial properties of the neighborhood structure may provide insight into novel algorithm design for SSD. Specifically, the Hamming boundary of the neighborhood ball gives an empirical constraint for search algorithms that explore the sequence space to solve SSD. An efficient boundary estimation method is therefore a prerequisite for furthering this study.

2 Method

In this section we present the Incremental Redundancy Estimation (IRE) method for determining the sequence neighborhood boundary. Given a kernel sequence $\hat{r}$, the relative Hamming layer $d$ consists of all sequences at Hamming distance exactly $d$ from $\hat{r}$. Let $S_d$ be the total number of unique structures¹ within Hamming layer $d$, and let $f_d$ be the number of new structures that emerge at layer $d$, that is, structures that do not occur in the inner layers ($1$ to $d-1$). The incremental redundancy rate $p^*_d$ gives the degree of structural redundancy at Hamming layer $d$. It can be computed as

$$
p^*_d = \frac{S_d - f_d}{S_d}, \qquad p^*_d \in [0, 1] \tag{1}
$$
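For a very short kernel, Equation (1) can be evaluated exactly by enumerating every Hamming layer. The sketch below does this in Python under loudly stated assumptions: `toy_fold` is a made-up stand-in for an MFE folding routine (it is not an energy model and not the folder used in this paper), and the names `hamming_layer` and `exact_redundancy` are hypothetical.

```python
from itertools import combinations, product

ALPHABET = "ACGU"
PAIRS = {frozenset("AU"), frozenset("GC"), frozenset("GU")}


def toy_fold(seq):
    """Stand-in for an MFE folder (an assumption, NOT a real energy model):
    greedily pairs bases from both ends inward, keeping at least 3 unpaired
    bases between the innermost pair."""
    structure = ["."] * len(seq)
    i, j = 0, len(seq) - 1
    while j - i > 3 and frozenset((seq[i], seq[j])) in PAIRS:
        structure[i], structure[j] = "(", ")"
        i, j = i + 1, j - 1
    return "".join(structure)


def hamming_layer(kernel, d):
    """Yield every sequence at Hamming distance exactly d from the kernel."""
    for positions in combinations(range(len(kernel)), d):
        # At each chosen position, any of the 3 other letters may appear.
        choices = [[c for c in ALPHABET if c != kernel[p]] for p in positions]
        for letters in product(*choices):
            seq = list(kernel)
            for p, c in zip(positions, letters):
                seq[p] = c
            yield "".join(seq)


def exact_redundancy(kernel, fold=toy_fold):
    """Return {d: p*_d} with p*_d = (S_d - f_d) / S_d for d = 1..len(kernel)."""
    seen_inner = set()  # structures found in layers 1..d-1
    rates = {}
    for d in range(1, len(kernel) + 1):
        layer_structures = {fold(s) for s in hamming_layer(kernel, d)}
        s_d = len(layer_structures)               # S_d: unique structures in layer d
        f_d = len(layer_structures - seen_inner)  # f_d: structures new at layer d
        rates[d] = (s_d - f_d) / s_d
        seen_inner |= layer_structures
    return rates
```

Calling `exact_redundancy("GCAAAAGC")` returns one rate per layer. With this toy fold only a handful of structures exist, so the rate saturates within the first few layers; a real energy model would be expected to give a much slower rise.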

¹ This can be calculated, for example, by folding all sequences within the layer and counting the unique structures.

A Hamming neighborhood ball of radius $d$ centered at $\hat{r}$ contains all RNA sequences whose Hamming distance to the kernel $\hat{r}$ is less than or equal to $d$. The neighborhood boundary is defined as the Hamming layer at which the structural redundancy is approximately 1. Assuming the boundary layer is $d$, we have

$$
p^*_d = \frac{S_d - f_d}{S_d} \approx 1 \tag{2}
$$

$$
f_d \approx 0 \tag{3}
$$

This indicates that the neighborhood ball now covers most of the common secondary structures, so newly emerging structures become very rare ($f_d \approx 0$). In practice, however, direct measurement of $S_d$ and $f_d$ is infeasible, since it requires folding every sequence within the Hamming layer and checking its MFE structure: the number of sequences within a Hamming layer grows exponentially with both the sequence size and the Hamming distance. We therefore seek an alternative method that provides an estimate of $p^*_d$.
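To see how quickly exhaustive folding becomes impractical, note that a layer at Hamming distance $d$ from a length-$n$ kernel over a four-letter alphabet contains $\binom{n}{d} 3^d$ sequences: choose the $d$ mutated positions, then one of three alternative letters at each. The following Python sketch evaluates this count; the function names are hypothetical.

```python
from math import comb


def layer_size(n: int, d: int) -> int:
    """Number of sequences at Hamming distance exactly d from a length-n kernel."""
    return comb(n, d) * 3 ** d


def ball_size(n: int, d: int) -> int:
    """Number of sequences within Hamming distance d (kernel included)."""
    return sum(layer_size(n, k) for k in range(d + 1))


if __name__ == "__main__":
    for n in (40, 60, 80):
        d = n // 2
        print(f"n={n}, d={d}: {layer_size(n, d):.3e} sequences in the layer")
```

Each of these sequences would need one MFE folding call, which is what motivates sampling the layer instead of enumerating it.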

Let $p_d$ be the estimated incremental redundancy at layer $d$. To compute $p_d$, we sample $n$ arbitrary sequences within the layer that have distinct MFE structures; for each sampled sequence, we search for neutral counterparts within the inner layers (layers $1$ to $d-1$). A redundancy hit is recorded if and only if such a neutral counterpart is found, which indicates that the structure of the current sample has already been observed in an inner layer and is therefore redundant. The total number of redundancy hits is recorded, and the estimated incremental redundancy rate $p_d$ at layer $d$ is computed as

$$
p_d = \frac{k}{n} \tag{4}
$$

Here $n$ is the sample size and $k$ is the total number of redundancy hits. To see how $p_d$ approximates the real incremental redundancy rate $p^*_d$, we have

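$$
\begin{aligned}
f_d &= S_d \,(1 - p^*_d) && (5)\\
p^*_d &= \frac{S_d - f_d}{S_d} && (6)\\
&\approx \frac{k}{n}, \quad n \to S_d && (7)\\
&= p_d && (8)
\end{aligned}
$$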

Therefore, given a large enough sample size $n$, the estimated redundancy rate approaches the real value. At the neighborhood boundary layer, we have $f_d \approx 0$ and $p_d \approx 1$.
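The sampling procedure of this section could be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not specify how neutral counterparts are searched for in the inner layers, so the sketch simply probes each inner layer with a fixed number of random sequences (`inner_per_layer`, a hypothetical parameter). The folding routine is passed in as `fold`, and all function and parameter names are introduced here for illustration.

```python
import random

ALPHABET = "ACGU"


def random_at_distance(kernel, d, rng):
    """Return a random sequence at Hamming distance exactly d from the kernel."""
    seq = list(kernel)
    for p in rng.sample(range(len(kernel)), d):
        seq[p] = rng.choice([c for c in ALPHABET if c != kernel[p]])
    return "".join(seq)


def estimate_redundancy(kernel, d, fold, n=100, inner_per_layer=200,
                        max_tries=10_000, seed=0):
    """Estimate p_d = k / n at Hamming layer d (d >= 1).

    fold            -- callable mapping a sequence to its structure (assumed given)
    n               -- target number of samples with distinct structures
    inner_per_layer -- hypothetical budget for probing each inner layer
    max_tries       -- cap on folding calls spent sampling layer d
    """
    rng = random.Random(seed)

    # Structures observed in the inner layers 1..d-1 (sampled, not exhaustive).
    inner = set()
    for layer in range(1, d):
        for _ in range(inner_per_layer):
            inner.add(fold(random_at_distance(kernel, layer, rng)))

    # Sample layer d until n distinct structures are found or tries run out.
    sampled = set()
    for _ in range(max_tries):
        if len(sampled) == n:
            break
        sampled.add(fold(random_at_distance(kernel, d, rng)))

    if not sampled:
        return 0.0
    hits = sum(1 for s in sampled if s in inner)  # redundancy hits k
    return hits / len(sampled)
```

Because the inner layers are only sampled, a structure that does occur there can be missed, so this sketch tends to underestimate the redundancy rate when `inner_per_layer` is small. With the ViennaRNA Python bindings installed, `fold` could for instance be `lambda s: RNA.fold(s)[0]`; the paper does not state which folding program was used, so that choice is also an assumption.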

3 Results and Discussion

We evaluated the proposed IRE method empirically on RNA sequences. Figure 1(a) shows the incremental redundancy rate distribution over Hamming layers for different sequence lengths. For each sequence length, 100 random sequences (kernels) were generated. The redundancy rate was computed for each kernel sequence at each Hamming layer and then averaged over all kernels. The results show that the redundancy rate eventually converges to 1, and that longer sequences converge more slowly. The neighborhood boundary (where $p \approx 1$) is observed at approximately half the corresponding sequence length, and the boundary appears to grow linearly with the sequence length.
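The averaging step and the reading of a boundary layer from the resulting curve can be sketched as below. This is only an illustration: the text identifies the boundary as the layer where $p \approx 1$ without giving a numeric cutoff, so the `threshold` parameter and both function names are assumptions.

```python
def average_curves(curves):
    """Average per-kernel redundancy curves layer by layer.

    curves -- list of equal-length lists; curves[i][d - 1] is p_d for kernel i.
    """
    count = len(curves)
    return [sum(values) / count for values in zip(*curves)]


def boundary_layer(curve, threshold=0.95):
    """Return the first layer d (1-based) with p_d >= threshold, or None.

    The threshold is a hypothetical stand-in for "p approximately 1".
    """
    for d, p in enumerate(curve, start=1):
        if p >= threshold:
            return d
    return None
```

Repeating this for several sequence lengths and plotting the boundary layer against the length is one way to check the linear growth described above; the choice of threshold shifts the boundary but should not change the trend.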

Our next experiment evaluates how the redundancy rate distribution varies across different RNA categories. Figure 1(b) compares the IRE results for randomly generated RNA sequences, naturally occurring RNA sequences, and arbitrary sequences that are compatible with given structure constraints. The empirical results show that random sequences have a relatively higher convergence rate; in general, however, the difference is not significant.

We also evaluated the accuracy of our redundancy estimation by showing how the IRE result varies with the sample size. As discussed previously, the estimated redundancy rate approaches the real distribution ($p \approx p^*$) given a large enough sample size ($n \approx S_d$). Figure 1(c) shows the IRE results for different sample sizes. The comparison shows that as the sample size grows, the redundancy distribution smooths out and quickly converges. A relatively small sample size (e.g., $n = 100$) therefore appears sufficient to provide an accurate estimate of the incremental redundancy rate.

4 Conclusions

In this work we presented an incremental redundancy rate estimation for calculating the sequence neighborhood boundary. Through empirical experiments we confirmed the existence of small-radius neighborhood balls centered at arbitrary sequences; the neighborhood boundary is shown to be smaller than the corresponding sequence size (approximately half of it in our experiments). Furthermore, we demonstrated that the neighborhood boundary grows linearly with the sequence size and that the boundary does not vary significantly across different RNA categories.

Figure 1: Incremental redundancy rate distribution over Hamming layers. Each panel plots the incremental redundancy ($p$) against the Hamming layer ($d$). Panel (a) shows the IRE distribution for sequence lengths 40, 60 and 80 (curves IRE/40, IRE/60 and IRE/80). Panel (b) shows the IRE distribution for random, compatible and natural RNA sequences; all sequences have length 120 and the natural sequences are arbitrarily picked. Panel (c) shows the IRE distribution for sample sizes 10, 100 and 250.

This study may lead to a few research directions. First, the IRE result identifies sequence neighborhood regions in which most of the common secondary structures can be obtained. This provides an empirical upper bound on the degree of search diversification² required for solving the RNA secondary structure design problem. A local search algorithm is therefore a natural choice for this task. We are currently working on an updated version of a previously developed RNA design algorithm [5]. We expect that integrating the neighborhood boundary as an explicit search constraint will lead to better heuristic strategies and thereby improve overall algorithm performance.

Second, given a kernel sequence, our redundancy estimation measures the rate at which new secondary structures emerge at increasing Hamming distances. It is in effect an implicit local measurement of neutral network proximity within the sequence space. Since the density of neutral networks directly affects the growth rate of structural redundancy, we speculate that it also shapes the redundancy rate distribution. Neutral network proximity is known to have important effects on interactions between RNA molecules as well as on RNA co-folding [12]. Further investigation of this topic is therefore worthwhile.

References

1. Schnall-Levin M, Chindelevitch L, Berger B: Inverting the Viterbi Algorithm: an Abstract Framework for Structure Design. In ICML, Volume 307 of ACM International Conference Proceeding Series. Edited by Cohen WW, McCallum A, Roweis ST, ACM 2008:904–911.

2. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh. Chem. (Chemical Monthly) 1994, 125:167–188.

3. Andronescu M, Fejes AP, Hutter F, Hoos HH, Condon A: A New Algorithm for RNA Secondary Structure Design. J. Mol. Biol. 2004, 336(3):607–624.

4. Busch A, Backofen R: INFO-RNA - a Fast Approach to Inverse RNA Folding. Bioinformatics 2006, 22(15):1823–1831.

5. Dai DC, Tsang HH, Wiese KC: rnaDesign: Local Search for RNA Secondary Structure Design. Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2009.

6. Reidys C, Stadler PF, Schuster P: Generic Properties of Combinatory Maps: Neutral Networks of RNA Secondary Structures. Bulletin of Mathematical Biology 1997, 59(2):339–397.

7. Cowperthwaite MC, Meyers LA: How Mutational Networks Shape Evolution: Lessons from RNA Models. Annual Review of Ecology, Evolution, and Systematics 2007, 38:203–230.

8. Gruner W, Giegerich R, Strothmann D, Reidys C, Weber J, Hofacker IL, Stadler PF, Schuster P: Analysis of RNA Sequence Structure Maps by Exhaustive Enumeration. Working Papers 95-10-099, Santa Fe Institute 1995.

9. Schuster P, Fontana W, Stadler PF, Hofacker IL: From Sequences to Shapes and Back: A Case Study in RNA Secondary Structures. Proceedings: Biological Sciences 1994, 255(1344):279–284.

10. Aguirre-Hernández R, Hoos HH, Condon A: Computational RNA Secondary Structure Design: Empirical Complexity and Improved Methods. BMC Bioinformatics 2007, 8(34).

11. Blum C, Roli A: Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison. ACM Comput. Surv. 2003, 35(3):268–308.

12. Attolini CSO, Stadler PF: Neutral Networks of Interacting RNA Secondary Structures. Advances in Complex Systems (ACS) 2005, 2/3:275–283.

² For a definition of search intensification and diversification, refer to [11].