An Incremental Redundancy Estimation for the Sequence
Neighborhood Boundary
Denny C. Dai∗1, Herbert H. Tsang1 and Kay C. Wiese1
Address: 1School of Computing Science, Simon Fraser University, Surrey, BC, Canada
Email: Denny C. Dai∗ - cda18@cs.sfu.ca; Herbert H. Tsang - htsang@cs.sfu.ca; Kay C. Wiese - wiese@cs.sfu.ca
∗Corresponding author
Abstract
Background: Understanding the combinatorial properties of RNA sequence-structure space is crucial to-
wards solving problems such as the RNA secondary structure prediction and design. We studied in this
work the mapping correlation between RNA sequences and their corresponding secondary structures. In
particular, we focused on investigating the sequence-structure neighborhood ball and its properties. This
study provides preliminary results for understanding the empirical hardness of related problems as well as
designing efficient algorithmic solutions for them.
Results: We presented an Incremental Redundancy Estimation (IRE) approach for measuring the hamming
boundary of the small-radius sequence neighborhood ball. Our IRE experiment confirmed the existence of
such neighborhood structures in the sequence space and demonstrated that the neighborhood boundary
grows linearly with the sequence size.
Conclusions: Our IRE result suggests ways for the design of promising local search algorithms for RNA
design as well as the study of neutral network proximity.
1 Background
1.1 RNA Secondary Structure
An RNA primary sequence is defined as a string of length
n over the alphabet Σ = {A, C, G, U}, representing the
four nucleotide units Adenine, Cytosine, Guanine
and Uracil. It is experimentally known that RNA
sequences will fold into a particular spatial structure in
order to achieve certain biological functions. The RNA
secondary structure is formed through base pairing between
nucleotide bases at different positions of the primary
sequence. For example, canonical Watson-Crick
pairs may occur between nucleotides {A, U} or {G, C},
and wobble pairs may occur between {G, U}. In nature,
these pairing relations correspond to hydrogen bonds
between the nitrogenous bases, and free energy is associated with them.
Therefore, different secondary structures imply different
free energy levels. RNA secondary structure becomes
stable under a particular energy level called the ground
state, where the structure achieves a minimum free en-
ergy (MFE) level among all possible secondary structure
conformations.
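The pairing rules above can be sketched in a few lines of Python. This is only an illustration: the dot-bracket string used to write down a secondary structure, and the names `can_pair` and `is_valid_structure`, are assumptions introduced here and do not appear in the text. The check covers pairing compatibility only and says nothing about free energy.

```python
# Allowed pairs as described in the text:
# Watson-Crick (A-U, G-C) and wobble (G-U).
CANONICAL_PAIRS = {frozenset("AU"), frozenset("GC")}
WOBBLE_PAIRS = {frozenset("GU")}


def can_pair(x: str, y: str) -> bool:
    """Return True if nucleotides x and y may form a base pair."""
    pair = frozenset((x, y))
    return pair in CANONICAL_PAIRS or pair in WOBBLE_PAIRS


def is_valid_structure(seq: str, dot_bracket: str) -> bool:
    """Check that every bracket pair joins two pairable bases.

    dot_bracket is an assumed notation: "(" and ")" mark the two
    ends of a base pair, "." marks an unpaired position.
    """
    if len(seq) != len(dot_bracket):
        return False
    stack = []  # positions of "(" still waiting for their partner
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack or not can_pair(seq[stack.pop()], seq[i]):
                return False
        elif symbol != ".":
            return False
    return not stack  # a leftover "(" means the structure is malformed
```

For instance, `is_valid_structure("GGGAAACCC", "(((...)))")` holds because each G is matched with a C, whereas replacing the last C by an A breaks the outermost pair.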
In the classical RNA secondary structure prediction
problem, we seek for a given RNA primary sequence
its corresponding MFE structure. In a reverse problem,
namely the RNA Secondary Structure Design (SSD), we
search for a given secondary structure configuration, the
target RNA primary sequence(s) that would fold into
this structure as its MFE ground state. Experimental
works show that both problems appear to be NP-hard
[1] and numerous algorithmic solutions have been proposed
[2–5]. In the literature, heuristic search
is established as the standard technique for tackling these
problems, as it enables an effective and efficient exploration
of the high-dimensional sequence-structure space.
For example, solving the SSD problem requires search-
ing within the combinatorial space of all possible se-
quences for the one that folds into a predefined struc-
ture. Therefore, investigating the combinatorial proper-
ties of the sequence-structure space is crucial towards un-
derstanding the empirical hardness of these problems.
1.2 Combinatorics of RNA Sequence-structure
Space
We study in this work the combinatorics of RNA
sequence-structure space, in particular the mapping cor-
relation between RNA sequences and secondary struc-
tures. An understanding of the sequence-structure rela-
tions is of central importance in biophysics of biopoly-
mer structure [6]. Its knowledge is also crucial towards
understanding the evolutionary transitions in evolution-
ary dynamics [7].
For an RNA chain consisting of $n$ nucleotides, the underlying
sequence space contains $4^n$ unique sequence
candidates. It is also empirically known that the total
number of unique RNA secondary structures is much
smaller, with an estimated upper bound of
$0.7137 \cdot n^{3/2} \cdot 2.2888^{n}$ [8]. Therefore, there exists a many-to-one
mapping relationship between the sequence space
and the secondary structure space. A mapping between
sequence $r$ and structure $s$ exists if and only if $s$ is the
MFE structure of $r$. Two sequences $r_1$ and $r_2$ that map
to the same structure are called neutral counterparts, in-
dicating that both sequences share the same native MFE
structure. For a given secondary structure s, the set of
all sequences mapped to s constitutes a combinatorial set
called the neutral network. Earlier works [6, 8] studied
the combinatorial properties of neutral network; its bi-
ological importance in terms of evolutionary transition
as well as genotype & phenotype correlation is also ad-
dressed [7].
Schuster et al. empirically studied the structure density
surface of arbitrary RNA sequences and showed
that there exists a small-radius hamming neighborhood
ball around arbitrary random sequences [9]; within such
neighborhood ball, the structure coverage is high such
that almost all common secondary structures could be
mapped from at least one sequence within the ball. To es-
timate the hamming boundary of the neighborhood ball,
they proposed a closest approaching distance measure-
ment that finds an upper bound for a minimum distance
between sequences folded into two different structures.
However, this method is computationally costly and therefore
becomes infeasible for large RNA sequences.
We propose in this paper an alternative method, the
Incremental Redundancy Estimation (IRE) approach for
determining the sequence neighborhood boundary. We
show that the IRE method scales well with sequence size
and is capable of estimating the hamming boundary with
good accuracy. Our work is motivated by the following
rationale: recent advances in solving the RNA secondary
structure design (SSD) problem show that, although SSD
is an empirically hard problem, the runtime of an SSD
algorithm can be empirically bounded by a polynomial
function with respect to sequence length [10]. Furthermore,
the neighborhood ball theory states that a given
RNA secondary structure is expected to be the MFE structure
of at least one sequence lying within a neighborhood region centered at
an arbitrary RNA sequence. Therefore, we argue that utilizing
the combinatorial properties of the neighborhood
structure may provide insight towards novel algorithm
design for SSD. Specifically, the hamming boundary of
the neighborhood ball gives an empirical constraint for
search algorithms that explore the sequence space for
SSD solving. An efficient boundary estimation method,
therefore, is a prerequisite for furthering this study.
2 Method
We present in this section the Incremental Redundancy
Estimation (IRE) method for determining the sequence
neighborhood boundary. Given a kernel sequence $\hat{r}$, the
relative hamming layer $d$ consists of all sequences having
hamming distance $d$ with respect to the kernel $\hat{r}$. Let $S_d$
be the total number of unique structures¹ within hamming
layer $d$, and $f_d$ be the total number of new structures
that emerge at layer $d$ and do not exist in the inner layers
($1$ to $d-1$). The incremental redundancy rate $p^*_d$ gives
the degree of structural redundancy at hamming layer $d$.
It can be computed as
$$p^*_d = \frac{S_d - f_d}{S_d}, \qquad p^*_d \in [0, 1] \tag{1}$$
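As a small worked illustration of Equation (1), the sketch below computes the exact rate from fully enumerated layers. It assumes the set of unique structures in each hamming layer is already available as a Python set; the function name `exact_redundancy` and the structure labels used with it are hypothetical.

```python
def exact_redundancy(structures_by_layer: dict[int, set[str]], d: int) -> float:
    """Compute p*_d = (S_d - f_d) / S_d from fully enumerated layers.

    structures_by_layer maps a hamming layer index to the set of
    unique structures found in that layer (an assumed input format).
    """
    layer = structures_by_layer[d]
    # Union of all structures seen in the inner layers 1 .. d-1.
    seen_inside: set[str] = set()
    for inner in range(1, d):
        seen_inside |= structures_by_layer.get(inner, set())
    s_d = len(layer)                # S_d: unique structures in layer d
    f_d = len(layer - seen_inside)  # f_d: structures first seen at layer d
    if s_d == 0:
        raise ValueError("layer d contains no structures")
    return (s_d - f_d) / s_d
```

With placeholder labels, `exact_redundancy({1: {"a", "b"}, 2: {"a", "b", "c", "d"}, 3: {"a", "c"}}, 2)` gives 0.5, since two of the four structures in layer 2 already occur in layer 1.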
¹ This can be calculated, for example, by folding all sequences within the layer and calculating the total number of unique structures.
A hamming neighborhood ball of radius $d$ centered
at $\hat{r}$ contains the set of all RNA sequences having hamming
distance less than or equal to $d$ with respect to the kernel $\hat{r}$.
The neighborhood boundary is defined as the hamming
layer for which the structural redundancy approximates
1. Assuming the boundary layer is $d$, we have
$$p^*_d = \frac{S_d - f_d}{S_d} \approx 1 \tag{2}$$
$$f_d \approx 0 \tag{3}$$
This indicates that the neighborhood ball now
covers most of the common secondary structures, so
new emerging structures become very rare ($f_d \approx 0$).
In practice, however, direct measurement of $S_d$ and
$f_d$ is infeasible since it requires folding all sequences
within the hamming layer and checking their corresponding
MFE structures: the number of sequences
within the hamming layer grows exponentially with both
the sequence size and hamming distance. Therefore we
seek alternative methods that provide an estimation for
$p^*_d$.
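To make the infeasibility argument concrete, the sketch below counts the sequences in a hamming layer. The counting formula is a standard combinatorial fact rather than something stated in the text: a layer at distance $d$ from a kernel of length $n$ contains $\binom{n}{d} \cdot 3^{d}$ sequences, because $d$ positions are chosen and each takes one of the three other nucleotides. The function names are hypothetical.

```python
from math import comb


def hamming_layer_size(n: int, d: int) -> int:
    """Number of sequences at hamming distance exactly d from a kernel.

    n is the sequence length; the alphabet has 4 letters, so each
    mutated position has 3 alternative nucleotides.
    """
    return comb(n, d) * 3 ** d


def hamming_ball_size(n: int, d: int) -> int:
    """Number of sequences at hamming distance at most d (kernel included)."""
    return sum(hamming_layer_size(n, k) for k in range(d + 1))
```

Summing all layers recovers the whole sequence space, so `hamming_ball_size(n, n)` equals `4 ** n`; every one of these sequences would have to be folded to measure $S_d$ and $f_d$ directly.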
Let $p_d$ be the estimated incremental redundancy at
layer $d$. To compute $p_d$, we sample $n$ arbitrary sequences
within the layer having unique mapping structures; for
each sampled sequence, we search for its corresponding
neutral counterparts within the inner layers (layer $1$ to
$d-1$). A redundancy hit is recorded if and only if such a
neutral counterpart is found. This indicates that the mapping
structure of the current sample sequence has been
observed within the inner layers and is therefore redundant. The
total number of redundancy hits is recorded and the estimated
incremental redundancy rate $p_d$ at layer $d$ is computed as
$$p_d = \frac{k}{n} \tag{4}$$
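The sampling procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not part of the text: `toy_fold` is a placeholder greedy pairing rule standing in for a real MFE folding routine, the search for neutral counterparts is approximated by sampling `inner_samples` sequences from each inner layer, and all function and parameter names are hypothetical.

```python
import random

ALPHABET = "ACGU"
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_fold(seq: str) -> str:
    """Placeholder folding (greedy nested pairing), NOT an MFE computation."""
    structure = ["."] * len(seq)
    stack = []  # indices of still-unpaired bases, innermost last
    for j, base in enumerate(seq):
        for depth in range(len(stack) - 1, -1, -1):
            i = stack[depth]
            if j - i > 3 and (seq[i], base) in PAIRS:
                structure[i], structure[j] = "(", ")"
                del stack[depth:]  # bases enclosed by (i, j) stay unpaired
                break
        else:
            stack.append(j)
    return "".join(structure)


def mutate_to_layer(kernel: str, d: int, rng: random.Random) -> str:
    """Return a random sequence at hamming distance exactly d from kernel."""
    seq = list(kernel)
    for pos in rng.sample(range(len(kernel)), d):
        seq[pos] = rng.choice([b for b in ALPHABET if b != kernel[pos]])
    return "".join(seq)


def estimate_redundancy(kernel: str, d: int, n: int, inner_samples: int,
                        rng: random.Random, fold=toy_fold) -> float:
    """Estimate p_d = k / n for hamming layer d around kernel."""
    # Structures observed in inner layers 1 .. d-1 (sampled, not exhaustive).
    inner_structures = set()
    for layer in range(1, d):
        for _ in range(inner_samples):
            inner_structures.add(fold(mutate_to_layer(kernel, layer, rng)))

    # Sample layer d until n distinct structures are collected (or give up).
    sampled = set()
    attempts = 0
    while len(sampled) < n and attempts < 50 * n:
        sampled.add(fold(mutate_to_layer(kernel, d, rng)))
        attempts += 1

    # k: sampled structures that already occur in an inner layer.
    k = sum(1 for s in sampled if s in inner_structures)
    return k / len(sampled)
```

Because the inner layers are only sampled here, a structure can be missed and the sketch may undercount redundancy hits; a real experiment would substitute an actual folding routine for `toy_fold` and whatever counterpart search the study used.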
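Here $n$ is the sample size and $k$ is the total number of
redundancy hits. To see how $p_d$ approximates the real
incremental redundancy rate $p^*_d$, we have
\begin{align}
f_d &= S_d \cdot (1 - p^*_d) \tag{5} \\
p^*_d &= \frac{S_d - f_d}{S_d} \tag{6} \\
&\approx \frac{k}{n}, \qquad n \to S_d \tag{7} \\
&= p_d \tag{8}
\end{align}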
Therefore, given a large enough sample size $n$, the
estimated redundancy rate will approach the real value.
At the neighborhood boundary layer, we have $f_d \approx 0$
and $p_d \approx 1$.
3 Results and Discussion
We have evaluated the proposed IRE method on RNA
sequences empirically. Figure 1(a) shows the incremen-
tal redundancy rate distribution over hamming layers un-
der different sequence lengths. For each sequence length,
100 random sequences (kernels) are generated. The ex-
periment is conducted by computing the redundancy rate
for each kernel sequence at each hamming layer, and
then averaged over all kernels. Results show that the
redundancy rate eventually converges to 1; larger se-
quences are also shown to have relatively lower conver-
gence rate. The neighborhood boundary is observed at
approximately half the value of the corresponding sequence
length (where $p \approx 1$) and the boundary appears
to grow linearly with the sequence length.
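Reading a boundary off a redundancy curve can be sketched as below. The cutoff `threshold` is an assumption introduced here to make "approximately 1" operational; the text does not specify a numeric cutoff, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def boundary_layer(p_by_layer: Sequence[float],
                   threshold: float = 0.95) -> Optional[int]:
    """Return the first hamming layer d (1-based) with p_d >= threshold.

    p_by_layer[0] holds p_1, p_by_layer[1] holds p_2, and so on.
    Returns None if the curve never reaches the threshold.
    """
    for d, p in enumerate(p_by_layer, start=1):
        if p >= threshold:
            return d
    return None
```

Applied to the averaged curve of each sequence length, the returned layer divided by the sequence length gives the ratio that the text reports as roughly one half.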
Our next experiment evaluates how redundancy rate
distribution varies across different RNA categories. Fig-
ure 1(b) compares the IRE results among randomly
generated RNA sequences, naturally existing RNA se-
quences as well as arbitrary sequences that are compat-
ible with given structure constraints. Empirical results
show that random sequences have relatively higher con-
vergence rate; however, the difference is not significant
in general.
We also evaluated the accuracy of our redundancy estimation
by showing how the IRE result varies under different
sample sizes. As discussed previously, the estimated
redundancy rate approaches the real distribution ($p \approx p^*$)
given a large enough sample size ($n \approx S_d$). Figure 1(c)
shows the IRE results under different sample sizes. The
comparison shows that as the sample size grows larger,
the redundancy distribution smooths out and quickly
converges. Therefore, a relatively small sample size
(e.g., $n = 100$) appears to be sufficient to provide
an accurate estimation of the incremental redundancy
rate.
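One way to see why a modest sample size may already give a smooth curve is the binomial standard error of a proportion. This rests on an assumption the text does not make, namely that the $n$ samples behave like independent trials that each score a redundancy hit with probability $p$; the function name is hypothetical.

```python
from math import sqrt


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of the estimate k / n under independent sampling.

    p is the underlying hit probability and n the sample size;
    independence of the samples is an assumption of this sketch.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sqrt(p * (1.0 - p) / n)
```

The error is largest at `p = 0.5`, where it is about 0.16 for `n = 10`, 0.05 for `n = 100` and about 0.03 for `n = 250`, and it shrinks to zero as `p` approaches 1 near the boundary.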
4 Conclusions
We presented in this work an incremental redundancy
rate estimation for calculating the sequence neighbor-
hood boundary. We confirmed through empirical ex-
periments the existence of small-radius neighborhood
balls centered at arbitrary sequences; it is shown that the
neighborhood boundary is smaller than the corresponding
sequence size (approximately half of it in our experiments). Furthermore, we demonstrated
that the neighborhood boundary grows linearly with the
sequence size and that such boundary does not vary sig-
nificantly across different RNA categories.
[Figure 1, panels (a) to (c): incremental redundancy (p), from 0 to 1, plotted against hamming layer (d). Panel (a): series IRE/40, IRE/60 and IRE/80, layers 0 to 40. Panel (b): series compatible, natural and random, layers 0 to 60. Panel (c): series sample-size-10, sample-size-100 and sample-size-250, layers 0 to 30.]
Figure 1: Incremental redundancy rate distribution over hamming layers. Figure (a) shows the IRE distribution
under sequence lengths 40, 60 and 80. Figure (b) shows the IRE distribution for random sequences, compatible
sequences and natural RNA sequences. All sequences have length 120 and the natural sequences are arbitrarily picked.
Figure (c) shows the IRE distribution under different sample sizes.
There are a few research directions to which this study
may lead. Firstly, the IRE result identifies sequence
neighborhood regions in which most of the common sec-
ondary structures could be obtained. This provides an
empirical upper bound for the degree of search diversification²
required towards solving the RNA secondary
structure design problem. A local search algorithm,
therefore, could be naturally employed for such task. We
are currently working on an updated version of a pre-
viously developed RNA design algorithm [5]. It is ex-
pected that an integration of the neighborhood boundary
as explicit search constraints will introduce better heuris-
tic strategies, therefore improving the overall algorithm
performance.
Given a kernel sequence, our redundancy estimation
measures the degree of emerging secondary structures at
increasing hamming distances. It is in fact an implicit
local measurement of neutral network proximity within
the sequence space. Since the density of neutral networks
directly affects the growth rate of structural redundancy,
we speculate that it also shapes the redundancy rate dis-
tribution. Neutral network proximity is known to have
important effects on RNA molecule interactions as well
as RNA co-folding [12]. Therefore, further investigation
of this topic is worthwhile.
References
1. Schnall-Levin M, Chindelevitch L, Berger B: Inverting the Viterbi Algorithm: an Abstract Framework for Structure Design. In ICML, Volume 307 of ACM International Conference Proceeding Series. Edited by Cohen WW, McCallum A, Roweis ST, ACM 2008:904–911.
2. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh. Chem. (Chemical Monthly) 1994, 125:167–188.
3. Andronescu M, Fejes AP, Hutter F, Hoos HH, Condon A: A New Algorithm for RNA Secondary Structure Design. J. Mol. Biol. 2004, 336(3):607–624.
4. Busch A, Backofen R: INFO-RNA - a Fast Approach to Inverse RNA Folding. Bioinformatics 2006, 22(15):1823–1831.
5. Dai DC, Tsang HH, Wiese KC: rnaDesign: Local Search for RNA Secondary Structure Design. Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2009.
6. Reidys C, Stadler PF, Schuster P: Generic Properties of Combinatory Maps: Neutral Networks of RNA Secondary Structures. Bulletin of Mathematical Biology 1997, 59(2):339–397.
7. Cowperthwaite MC, Meyers LA: How Mutational Networks Shape Evolution: Lessons from RNA Models. Annual Review of Ecology, Evolution, and Systematics 2007, 38:203–230.
8. Gruner W, Giegerich R, Strothmann D, Reidys C, Weber J, Hofacker IL, Stadler PF, Schuster P: Analysis of RNA Sequence Structure Maps by Exhaustive Enumeration. Working Papers 95-10-099, Santa Fe Institute 1995.
9. Schuster P, Fontana W, Stadler PF, Hofacker IL: From Sequences to Shapes and Back: A Case Study in RNA Secondary Structures. Proceedings: Biological Sciences 1994, 255(1344):279–284.
10. Aguirre-Hernández R, Hoos HH, Condon A: Computational RNA Secondary Structure Design: Empirical Complexity and Improved Methods. BMC Bioinformatics 2007, 8(34).
11. Blum C, Roli A: Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison. ACM Comput. Surv. 2003, 35(3):268–308.
12. Attolini CSO, Stadler PF: Neutral Networks of Interacting RNA Secondary Structures. Advances in Complex Systems (ACS) 2005, 2/3:275–283.
² For a definition of search intensification and diversification, refer to [11].