An Incremental Redundancy Estimation for the Sequence
Neighborhood Boundary
Denny C. Dai¹, Herbert H. Tsang¹ and Kay C. Wiese¹
Address: ¹School of Computing Science, Simon Fraser University, Surrey, BC, Canada
Email: Denny C. Dai - cda18@cs.sfu.ca; Herbert H. Tsang - htsang@cs.sfu.ca; Kay C. Wiese - wiese@cs.sfu.ca
Corresponding author
Abstract
Background: Understanding the combinatorial properties of RNA sequence-structure space is crucial towards
solving problems such as RNA secondary structure prediction and design. We studied in this
work the mapping correlation between RNA sequences and their corresponding secondary structures. In
particular, we focused on investigating the sequence-structure neighborhood ball and its properties. This
study provides preliminary results for understanding the empirical hardness of related problems as well as
designing efficient algorithmic solutions for them.
Results: We presented an Incremental Redundancy Estimation (IRE) approach for measuring the hamming
boundary of the small-radius sequence neighborhood ball. Our IRE experiment confirmed the existence of
such neighborhood structures in the sequence space and demonstrated that the neighborhood boundary
grows linearly with the sequence size.
Conclusions: Our IRE result suggests ways for the design of promising local search algorithms for RNA
design as well as the study of neutral network proximity.
1 Background
1.1 RNA Secondary Structure
An RNA primary sequence is defined as a string of length
$n$ over the alphabet Σ = {A, C, G, U}, representing the
four nucleotide units adenine, cytosine, guanine
and uracil. It is experimentally known that RNA
sequences will fold into a particular spatial structure in
order to achieve certain biological functions. The RNA
secondary structure is formed through base pairing between
nucleotide bases at different positions of the primary
sequence. For example, canonical Watson-Crick
pairs may occur between nucleotides {A, U} or {G, C},
and wobble pairs may occur between {G, U}. In nature,
these pairings are formed by hydrogen bonds between the
bases, and each pairing contributes to the free energy of the molecule.
Therefore, different secondary structures imply different
free energy levels. RNA secondary structure becomes
stable under a particular energy level called the ground
state, where the structure achieves a minimum free en-
ergy (MFE) level among all possible secondary structure
conformations.
In the classical RNA secondary structure prediction
problem, we seek for a given RNA primary sequence
its corresponding MFE structure. In a reverse problem,
namely the RNA Secondary Structure Design (SSD), we
search for a given secondary structure configuration, the
target RNA primary sequence(s) that would fold into
this structure as its MFE ground state. Experimental
works show that both problems appear to be NP-hard
[1] and numerous algorithmic solutions have been pro-
posed [2–5]. In the literature, heuristic search algorithm
is established as the standard technique for tackling these
problems, as it enables an effective and efficient explo-
ration of the high-dimensional sequence-structure space.
For example, solving the SSD problem requires search-
ing within the combinatorial space of all possible se-
quences for the one that folds into a predefined struc-
ture. Therefore, investigating the combinatorial proper-
ties of the sequence-structure space is crucial towards un-
derstanding the empirical hardness of these problems.
1.2 Combinatorics of RNA Sequence-structure
Space
We study in this work the combinatorics of RNA
sequence-structure space, in particular the mapping cor-
relation between RNA sequences and secondary struc-
tures. An understanding of the sequence-structure rela-
tions is of central importance in biophysics of biopoly-
mer structure [6]. Its knowledge is also crucial towards
understanding the evolutionary transitions in evolution-
ary dynamics [7].
For an RNA chain consisting of $n$ nucleotides, the
underlying sequence space contains $4^n$ unique sequence
candidates. It is also empirically known that the total
number of unique RNA secondary structures is much
smaller, with an estimated upper bound of
$0.7137 \times n^{-3/2} \times 2.2888^n$ [8]. Therefore, there exists a
many-to-one mapping relationship between the sequence space
and the secondary structure space. A mapping between
sequence $r$ and structure $s$ exists if and only if $s$ is the
MFE structure of $r$. Two sequences $r_1$ and $r_2$ that map
to the same structure are called neutral counterparts, in-
dicating that both sequences share the same native MFE
structure. For a given secondary structure s, the set of
all sequences mapped to s constitutes a combinatorial set
called the neutral network. Earlier works [6, 8] studied
the combinatorial properties of neutral network; its bi-
ological importance in terms of evolutionary transition
as well as genotype & phenotype correlation is also ad-
dressed [7].
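To give a feel for how lopsided this many-to-one mapping is, the short sketch below compares the number of sequences with the structure-count bound. It assumes the bound has the form $0.7137 \times n^{-3/2} \times 2.2888^n$ (the exponents are a reading of the formula, not something verified here), and the function names are illustrative only. Because the structure count is an upper bound, the ratio is a lower bound on the average neutral network size.

```python
def sequence_count(n: int) -> int:
    """Number of RNA sequences of length n over {A, C, G, U}."""
    return 4 ** n


def structure_bound(n: int) -> float:
    """Assumed form of the bound: 0.7137 * n^(-3/2) * 2.2888^n."""
    return 0.7137 * n ** -1.5 * 2.2888 ** n


def mean_neutral_set_size(n: int) -> float:
    """Lower bound on the average number of sequences per structure."""
    return sequence_count(n) / structure_bound(n)


if __name__ == "__main__":
    for n in (40, 60, 80):
        print(n, f"{mean_neutral_set_size(n):.3e}")
```

The ratio grows quickly with $n$, which is why many distinct sequences are expected to share each common structure.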
Schuster et al. empirically studied the structure density
surface of arbitrary RNA sequences and showed
that there exists a small-radius hamming neighborhood
ball around arbitrary random sequences [9]; within such
neighborhood ball, the structure coverage is high such
that almost all common secondary structures could be
mapped from at least one sequence within the ball. To es-
timate the hamming boundary of the neighborhood ball,
they proposed a closest approaching distance measure-
ment that finds an upper bound for a minimum distance
between sequences folded into two different structures.
However, this method is computationally costly and
therefore becomes infeasible for large RNA sequences.
We propose in this paper an alternative method, the
Incremental Redundancy Estimation (IRE) approach for
determining the sequence neighborhood boundary. We
show that the IRE method scales well with sequence size
and is capable of estimating the hamming boundary with
good accuracy. Our work is motivated by the following
rationale: recent advances in solving the RNA secondary
structure design (SSD) problem show that, although SSD
is an empirically hard problem, the runtime of an SSD
algorithm can be empirically bounded by a polynomial
function with respect to sequence length [10]. Further-
more, the neighborhood ball theory states that for a given
RNA secondary structure, it is expected to map to se-
quence(s) lying within a neighborhood region centered at
arbitrary RNA sequences. Therefore, we argue that uti-
lizing the combinatorial properties of the neighborhood
structure may provide insight towards novel algorithm
design for SSD. Specifically, the hamming boundary of
the neighborhood ball gives an empirical constraint for
search algorithms that explore the sequence space for
SSD solving. An efficient boundary estimation method,
therefore, is a prerequisite for furthering this study.
2 Method
We present in this section the Incremental Redundancy
Estimation (IRE) method for determining the sequence
neighborhood boundary. Given a kernel sequence $\hat{r}$, the
relative hamming layer $d$ consists of all sequences having
hamming distance $d$ with respect to kernel $\hat{r}$. Let $S_d$
be the total number of unique structures¹ within hamming
layer $d$, and $f_d$ be the total number of new structures
emerging at layer $d$ which do not exist in inner layers
(1 to $d-1$). The incremental redundancy rate $p^{*}_d$ gives
the degree of structural redundancy at hamming layer $d$.
It can be computed as

$$p^{*}_d = \frac{S_d - f_d}{S_d}, \qquad p^{*}_d \in [0, 1] \tag{1}$$
¹ This can be calculated, for example, by folding all sequences within the layer and calculating the total number of unique structures.

A hamming neighborhood ball of radius $d$ centered
at $\hat{r}$ contains the set of all RNA sequences having hamming
distance less than or equal to $d$ with respect to kernel $\hat{r}$.
The neighborhood boundary is defined as the hamming
layer for which the structural redundancy approximates
1. Assuming the boundary layer is $d$, we have

$$p^{*}_d = \frac{S_d - f_d}{S_d} \approx 1 \tag{2}$$

$$f_d \approx 0 \tag{3}$$
This indicates that the neighborhood ball now
covers most of the common secondary structures, so
newly emerging structures become very rare ($f_d \approx 0$).
In practice, however, direct measurement of $S_d$ and
$f_d$ is infeasible since it requires folding all sequences
within the hamming layer and checking their corresponding
MFE structures: the number of sequences
within the hamming layer grows exponentially with both
the sequence size and hamming distance. Therefore we
seek alternative methods that provide an estimation for
$p^{*}_d$.
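The growth of the hamming layer can be made concrete with a small counting sketch. A sequence at hamming distance exactly $d$ from the kernel differs at $d$ chosen positions, each of which can take one of the 3 other nucleotides, so a layer holds $\binom{L}{d} 3^d$ sequences for a kernel of length $L$. The symbol $L$ and the function names below are introduced here for illustration only.

```python
from math import comb


def layer_size(length: int, d: int) -> int:
    """Number of sequences at Hamming distance exactly d from a kernel."""
    return comb(length, d) * 3 ** d


def ball_size(length: int, radius: int) -> int:
    """Number of sequences within Hamming distance <= radius of a kernel."""
    return sum(layer_size(length, d) for d in range(radius + 1))


if __name__ == "__main__":
    # Layer sizes for a length-40 kernel at a few distances.
    for d in (1, 2, 5, 10, 20):
        print(d, layer_size(40, d))
```

For a length-40 kernel, the layer at $d = 1$ holds 120 sequences and the layer at $d = 2$ already holds 7020, so folding every sequence in the outer layers quickly becomes impractical.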
Let $p_d$ be the estimated incremental redundancy at
layer $d$. To compute $p_d$, we sample $n$ arbitrary sequences
within the layer having unique mapping structures; for
each sampled sequence, we search for its corresponding
neutral counterparts within the inner layers (layer 1 to
$d-1$). A redundancy hit is recorded if and only if such a
neutral counterpart is found. This indicates that the mapping
structure of the current sample sequence has been
observed within inner layers and is therefore redundant. The
total number of redundancy hits is recorded and the estimated
incremental redundancy rate $p_d$ at layer $d$ is computed as

$$p_d = \frac{k}{n} \tag{4}$$
Here $n$ is the sample size and $k$ is the total number of
redundancy hits. To see how $p_d$ approximates the real
incremental redundancy rate $p^{*}_d$, we have

$$f_d = S_d (1 - p^{*}_d) \tag{5}$$

$$p^{*}_d = \frac{S_d - f_d}{S_d} \tag{6}$$

$$\approx \frac{k}{n}, \qquad n \to S_d \tag{7}$$

$$= p_d \tag{8}$$

Therefore, given a large enough sample size $n$, the
estimated redundancy rate will approach the real value.
At the neighborhood boundary layer, we have $f_d \approx 0$
and $p_d \approx 1$.
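The sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the `fold` argument stands for any routine that returns the MFE structure of a sequence, the search for neutral counterparts in inner layers is approximated by remembering the structures seen while sampling layers 1 to $d-1$, and the names `sample_at_distance`, `ire_profile` and `max_tries` are hypothetical.

```python
import random
from typing import Callable, List, Optional, Set

ALPHABET = "ACGU"


def sample_at_distance(kernel: str, d: int, rng: random.Random) -> str:
    """Return a random sequence at Hamming distance exactly d from kernel."""
    seq = list(kernel)
    for i in rng.sample(range(len(kernel)), d):
        seq[i] = rng.choice([c for c in ALPHABET if c != kernel[i]])
    return "".join(seq)


def ire_profile(
    kernel: str,
    fold: Callable[[str], str],
    max_layer: int,
    n: int = 100,
    max_tries: Optional[int] = None,
    seed: int = 0,
) -> List[float]:
    """Estimate p_d = k / n for layers d = 1 .. max_layer.

    Assumption: inner layers are represented only by the structures
    sampled so far, not by an exhaustive search.
    """
    rng = random.Random(seed)
    if max_tries is None:
        max_tries = 20 * n
    seen: Set[str] = set()  # structures observed in layers 1 .. d-1
    profile: List[float] = []
    for d in range(1, max_layer + 1):
        layer: Set[str] = set()  # distinct structures sampled at layer d
        tries = 0
        while len(layer) < n and tries < max_tries:
            layer.add(fold(sample_at_distance(kernel, d, rng)))
            tries += 1
        k = len(layer & seen)  # redundancy hits
        profile.append(k / len(layer))
        seen |= layer
    return profile
```

Because the inner layers are only sampled, this sketch can miss neutral counterparts that exist but were never drawn, so it tends to underestimate the redundancy at small sample sizes.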
3 Results and Discussion
We have evaluated the proposed IRE method on RNA
sequences empirically. Figure 1(a) shows the incremen-
tal redundancy rate distribution over hamming layers un-
der different sequence lengths. For each sequence length,
100 random sequences (kernels) are generated. The ex-
periment is conducted by computing the redundancy rate
for each kernel sequence at each hamming layer, and
then averaged over all kernels. Results show that the
redundancy rate eventually converges to 1; longer
sequences are also shown to have a relatively lower
convergence rate. The neighborhood boundary is observed at
approximately half of the corresponding
sequence length (where $p \approx 1$), and the boundary appears
to grow linearly with the sequence length.
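One way to read a boundary off an estimated redundancy profile is to take the first layer at which the rate comes within a tolerance of 1. The sketch below illustrates this; the tolerance `eps` and the function name are assumptions introduced here, since the text does not state a numerical threshold.

```python
from typing import Optional, Sequence


def boundary_layer(profile: Sequence[float], eps: float = 0.05) -> Optional[int]:
    """Return the first layer d (1-based) with p_d >= 1 - eps, or None.

    profile[i] is taken to be the redundancy estimate at layer i + 1.
    """
    for d, p in enumerate(profile, start=1):
        if p >= 1.0 - eps:
            return d
    return None
```

Applying this to profiles obtained at several sequence lengths gives one boundary value per length, which can then be checked for a linear trend.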
Our next experiment evaluates how redundancy rate
distribution varies across different RNA categories. Fig-
ure 1(b) compares the IRE results among randomly
generated RNA sequences, naturally existing RNA se-
quences as well as arbitrary sequences that are compat-
ible with given structure constraints. Empirical results
show that random sequences have relatively higher con-
vergence rate; however, the difference is not significant
in general.
We also evaluated the accuracy of our redundancy
estimation by showing how the IRE result varies under different
sample sizes. As discussed previously, the estimated
redundancy rate approaches the real distribution ($p_d \to p^{*}_d$)
given a large enough sample size ($n \to S_d$). Figure 1(c)
shows the IRE results under different sample sizes. The
comparison shows that as the sample size grows larger,
the redundancy distribution smooths out and quickly
converges. Therefore, a relatively small sample size
(e.g., $n = 100$) appears to be sufficient to
provide an accurate estimation of the incremental redundancy
rate.
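A rough way to see why a modest sample size can suffice is to look at the sampling error of a proportion. The sketch below assumes that redundancy hits behave like independent Bernoulli trials with success probability $p$, which is a simplification introduced here and not a claim made by the experiments; under that assumption the standard error of $k/n$ is $\sqrt{p(1-p)/n}$.

```python
from math import sqrt


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under a Bernoulli model."""
    return sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Worst case p = 0.5 for the three sample sizes shown in Figure 1(c).
    for n in (10, 100, 250):
        print(n, round(standard_error(0.5, n), 4))
```

In the worst case ($p = 0.5$) the standard error is 0.05 at $n = 100$, and it shrinks further as $p$ approaches 1 near the boundary.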
4 Conclusions
We presented in this work an incremental redundancy
rate estimation for calculating the sequence neighbor-
hood boundary. We confirmed through empirical ex-
periments the existence of small-radius neighborhood
balls centered at arbitrary sequences; it is shown that the
neighborhood boundary is much smaller than the corre-
sponding sequence size. Furthermore, we demonstrated
that the neighborhood boundary grows linearly with the
sequence size and that such boundary does not vary sig-
nificantly across different RNA categories.
Figure 1: Incremental redundancy rate distribution over hamming layers. In each panel the x-axis is the hamming layer (d)
and the y-axis is the incremental redundancy (p), ranging from 0 to 1. Figure (a) shows the IRE distribution
under sequence lengths 40, 60 and 80 (curves IRE/40, IRE/60 and IRE/80). Figure (b) shows the IRE distribution for random sequences, compatible
sequences and natural RNA sequences. All sequences have length 120 and the natural sequences are arbitrarily picked.
Figure (c) shows the IRE distribution under different sample sizes (10, 100 and 250).
There are a few research directions where this study
may lead to: Firstly, the IRE result identifies sequence
neighborhood regions in which most of the common sec-
ondary structures could be obtained. This provides an
empirical upper bound for the degree of search
diversification² required towards solving the RNA secondary
structure design problem. A local search algorithm,
therefore, could be naturally employed for such task. We
are currently working on an updated version of a pre-
viously developed RNA design algorithm [5]. It is ex-
pected that an integration of the neighborhood boundary
as explicit search constraints will introduce better heuris-
tic strategies, therefore improving the overall algorithm
performance.
Given a kernel sequence, our redundancy estimation
measures the degree of emerging secondary structures at
increasing hamming distances. It is in fact an implicit
local measurement of neutral network proximity within
the sequence space. Since the density of neutral networks
directly affects the growth rate of structural redundancy,
we speculate that it also shapes the redundancy rate dis-
tribution. Neutral network proximity is known to have
important effects on RNA molecule interaction as well
as RNA co-folding [12]. Therefore, further investigation
of this topic is worthwhile.
References
1. Schnall-Levin M, Chindelevitch L, Berger B: Inverting the Viterbi Algorithm: an Abstract Framework for Structure Design. In ICML, Volume 307 of ACM International Conference Proceeding Series. Edited by Cohen WW, McCallum A, Roweis ST, ACM 2008:904–911.
2. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh. Chem. (Chemical Monthly) 1994, 125:167–188.
3. Andronescu M, Fejes AP, Hutter F, Hoos HH, Condon A: A New Algorithm for RNA Secondary Structure Design. J. Mol. Biol. 2004, 336(3):607–624.
4. Busch A, Backofen R: INFO-RNA - a Fast Approach to Inverse RNA Folding. Bioinformatics 2006, 22(15):1823–1831.
5. Dai DC, Tsang HH, Wiese KC: rnaDesign: Local Search for RNA Secondary Structure Design. Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2009.
6. Reidys C, Stadler PF, Schuster P: Generic Properties of Combinatory Maps: Neutral Networks of RNA Secondary Structures. Bulletin of Mathematical Biology 1997, 59(2):339–397.
7. Cowperthwaite MC, Meyers LA: How Mutational Networks Shape Evolution: Lessons from RNA Models. Annual Review of Ecology, Evolution, and Systematics 2007, 38:203–230.
8. Gruner W, Giegerich R, Strothmann D, Reidys C, Weber J, Hofacker IL, Stadler PF, Schuster P: Analysis of RNA Sequence Structure Maps by Exhaustive Enumeration. Working Papers 95-10-099, Santa Fe Institute 1995.
9. Schuster P, Fontana W, Stadler PF, Hofacker IL: From Sequences to Shapes and Back: A Case Study in RNA Secondary Structures. Proceedings: Biological Sciences 1994, 255(1344):279–284.
10. Aguirre-Hernández R, Hoos HH, Condon A: Computational RNA Secondary Structure Design: Empirical Complexity and Improved Methods. BMC Bioinformatics 2007, 8(34).
11. Blum C, Roli A: Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison. ACM Comput. Surv. 2003, 35(3):268–308.
12. Attolini CSO, Stadler PF: Neutral Networks of Interacting RNA Secondary Structures. Advances in Complex Systems (ACS) 2005, 2/3:275–283.
² For a definition of search intensification and diversification, refer to [11].