ORIGINAL RESEARCH ARTICLE
published: 26 April 2012
doi: 10.3389/fgene.2012.00036
Multi-objective genetic algorithm for pseudoknotted RNA
sequence design
Akito Taneda*
Graduate School of Science and Technology, Hirosaki University, Hirosaki, Japan
Edited by:
Michael Rossbach, Genome Institute
of Singapore, Singapore
Reviewed by:
Kengo Sato, Keio University, Japan
Prasanna R. Kolatkar, Genome
Institute of Singapore, Singapore
Fei Li, Nanjing Agricultural University,
China
*Correspondence:
Akito Taneda, Graduate School of
Science and Technology, Hirosaki
University, 3 Bunkyo-cho, Hirosaki,
Aomori 036-8561, Japan.
e-mail: taneda@cc.hirosaki-u.ac.jp
RNA inverse folding is a computational technology for designing RNA sequences which fold into a user-specified secondary structure. Although pseudoknots are functionally important motifs in RNA structures, fewer reports have addressed the inverse folding of pseudoknotted RNAs than pseudoknot-free RNA design. In this paper, we present a new version of our multi-objective genetic algorithm (MOGA), MODENA, which we previously proposed for pseudoknot-free RNA inverse folding. In the new version of MODENA, (i) a new crossover operator is implemented and (ii) pseudoknot prediction methods, IPknot and HotKnots, are used to evaluate the designed RNA sequences, allowing us to perform the inverse folding of pseudoknotted RNAs. The new version of MODENA with the new crossover operator was benchmarked with a dataset composed of natural pseudoknotted RNA secondary structures, and we found that MODENA can successfully design more pseudoknotted RNAs than the other pseudoknot design algorithm. In addition, a sequence constraint function newly implemented in the new version of MODENA was tested by designing RNA sequences which fold into the pseudoknotted structure of a hepatitis delta virus ribozyme; as a result, we successfully designed eight RNA sequences. The new version of MODENA is downloadable from http://rna.eit.hirosaki-u.ac.jp/modena/.
Keywords: inverse folding, pseudoknot, secondary structure, pseudobase, Rfam, sequence constraint
1. INTRODUCTION
Evolutionarily related non-coding RNAs have characteristic secondary structures corresponding to their functions, and it is well known that the secondary structures play key roles in the functions of the RNA sequences. This accumulated biochemical knowledge indicates that we can generate functional synthetic RNAs if we can control the secondary structure of the RNAs. In this context, various synthetic RNAs, such as ribozymes (Schultes and Bartel, 2000), micro RNAs (Schwab et al., 2006), riboswitches (Breaker, 2004), and RNA nano structures (Jaeger et al., 2001) have been successfully designed.
RNA inverse folding is a computational methodology for designing RNA sequences which fold into a given target structure (Hofacker et al., 1994). The name “inverse” comes from the fact that inverse folding is defined as the inverse problem of RNA secondary structure prediction, where the RNA secondary structure prediction problem is referred to as the “direct problem”. Since an RNA inverse folding problem usually has multiple solutions and no deterministic algorithm is known that can enumerate all of them, previous RNA inverse folding algorithms have adopted heuristic approaches to find desired RNA sequences. The following six RNA inverse folding algorithms can be found in the literature: five local search algorithms [RNAinverse (Hofacker et al., 1994), RNA-SSD (Andronescu et al., 2004), INFO-RNA (Busch and Backofen, 2006), Inv (Gao et al., 2010), design in NUPACK (Zadeh et al., 2011)] and a genetic algorithm [GA; MODENA (Taneda, 2011)].
The local search algorithms are well characterized by the initialization step and the refinement step of their exploration procedures for obtaining desired RNA sequences. First, in these local search approaches, a single RNA sequence is generated. The pioneering RNA inverse folding algorithm, RNAinverse, uses a pure random initialization. RNA-SSD randomly initializes an RNA sequence in a more sophisticated manner, where a base composition and a tabu mechanism for avoiding undesired stem formation are taken into account. INFO-RNA utilizes a dynamic programming algorithm to obtain a good initial sequence: the lowest energy RNA sequence, determined under the assumption that the sequence folds into the given structure, is used as the initial sequence. In the refinement step after the initialization, RNAinverse performs an adaptive walk to improve the initial sequence, and RNA-SSD and INFO-RNA use stochastic local search (Hoos and Stützle, 2004). In the refinement step, RNAinverse and RNA-SSD employ a structure decomposition strategy to reduce the number of folding calculations for a whole sequence. Inv and NUPACK also utilize structure decomposition strategies in their refinement steps. Inv is an RNA inverse folding algorithm designed for a restricted pseudoknot class and can perform the inverse folding of pseudoknotted RNAs. NUPACK is a suite of programs for computational nucleic acid analysis and includes a program named design. Design generates sequences by minimizing an ensemble defect (Zadeh et al., 2011); the lower the ensemble defect, the more specifically the designed RNA sequence folds into the given target structure. MODENA is a multi-objective genetic algorithm (MOGA; Deb, 2001) for RNA inverse folding. As objective functions, MODENA uses two quantities, a structure similarity measure and a stability measure (e.g., free energy). By virtue of the simultaneous optimization of these objective functions, MODENA can explore sequences which not only fold into the desired target structure but also have a high stability (= a low free energy).
www.frontiersin.org
April 2012 | Volume 3 | Article 36 | 1
Pseudoknots are important functional motifs in RNA structure (Staple and Butcher, 2005). In contrast to other RNA structure motifs such as hairpin loops, bulge loops, internal loops, and multi-branch loops, where no two base pairs (i, j) and (k, l) have a relationship such that i < k < j < l (where i, j, k, and l are nucleotide positions), pseudoknots are defined as structures which have base pairs satisfying the condition i < k < j < l. Since pseudoknots have various enzymatic functions (Staple and Butcher, 2005), they are intriguing targets for functional RNA design. However, among the previous RNA inverse folding algorithms, only Inv can design pseudoknots. Moreover, there is no algorithm which can design pseudoknotted RNAs with sequence constraints, which are an important feature for designing a molecule with a known functional sequence motif. For these reasons, development of a novel pseudoknotted RNA inverse folding algorithm is important in order to promote the sequence design of RNA pseudoknots.
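The crossing condition i < k < j < l can be checked directly on a list of base pairs. The following is a minimal sketch for illustration only, not code taken from MODENA; the function name `has_pseudoknot` and the representation of a structure as a list of position pairs are assumptions.

```python
def has_pseudoknot(pairs):
    """Return True if two base pairs (i, j) and (k, l) satisfy i < k < j < l.

    `pairs` is a list of (position, position) tuples; this representation
    is an assumption made for this sketch.
    """
    # Normalize each pair so the upstream position comes first, then sort
    # by upstream position so that k >= i for every later pair.
    ordered = sorted((min(p), max(p)) for p in pairs)
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return True
    return False


# Nested pairs (a hairpin stem) do not cross; the second example does.
print(has_pseudoknot([(1, 10), (2, 9)]))
print(has_pseudoknot([(1, 6), (4, 10)]))
```

The quadratic scan is adequate for the sequence lengths discussed here (up to a few hundred nucleotides).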
In this paper, we present an extension of the MODENA algorithm to the inverse folding of pseudoknotted RNAs. In the MODENA algorithm, designed RNA sequences are evaluated by performing secondary structure prediction with an RNA folding program such as RNAfold (Hofacker, 2003), and we can easily substitute a different RNA folding program (in the context of inverse folding, we refer to an RNA structure prediction program as a “direct problem solver”). This advantage of the MODENA algorithm also holds for pseudoknotted RNA inverse folding, where we have to use a pseudoknotted RNA secondary structure prediction program as the direct problem solver.
In the rest of the present paper, we first describe the MODENA algorithm for pseudoknotted RNA design, where a multi-objective genetic algorithm is used in combination with state-of-the-art pseudoknotted RNA structure prediction programs. After that, the performance of the MODENA algorithm is evaluated by benchmarks based on natural RNA secondary structures, where not only pseudoknotted structures but also pseudoknot-free structures are taken into account. Then, the sequence constraint function available in MODENA is demonstrated with a biological example taken from the literature.
2. MATERIALS AND METHODS
Since the details of the MODENA algorithm for pseudoknot-free RNAs are described in Taneda (2011) and the present version for pseudoknotted RNA design shares all parts of the previous pseudoknot-free version of MODENA, the parts that are algorithmically common to the two versions are described only briefly below.
The MODENA algorithm is an RNA inverse folding algorithm based on MOGA. GA is a population-based algorithm for optimization and search (Goldberg, 1987), which is inspired by the mechanism of evolution. MOGA is a GA for exploring an objective function space consisting of multiple objective functions, while a standard GA uses a single objective function. In the MODENA algorithm, we use the following two objective functions, a structure stability score ? and a structure similarity score σ:
? = −E,            (1)
σ = (N − d)/N,     (2)
where E is the lowest free energy of a designed sequence, N is the total number of nucleotides, and d is the structure distance between the target and predicted structures (Taneda, 2011). The structure distance d is defined as the number of bases which have a different base-pairing status between the target structure and the structure predicted for the designed sequence.
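As a worked illustration of Equations 1 and 2, the sketch below computes d, σ, and the energy-based stability score. It is not MODENA's implementation: the function names, the "pair table" representation (entry k holds the partner of position k, or -1 if unpaired), and the reading of "base-pairing status" as "same partner or both unpaired" are assumptions.

```python
def structure_distance(target, predicted):
    """Number of positions whose pairing status differs (the distance d).

    Both arguments are pair tables: table[k] is the 0-based partner of
    position k, or -1 if k is unpaired (an assumed representation).
    """
    if len(target) != len(predicted):
        raise ValueError("structures must have the same length")
    return sum(1 for t, p in zip(target, predicted) if t != p)


def similarity_score(target, predicted):
    """Equation 2: sigma = (N - d) / N."""
    n = len(target)
    return (n - structure_distance(target, predicted)) / n


def stability_score(lowest_free_energy):
    """Equation 1: the stability score is the negated lowest free energy E."""
    return -lowest_free_energy


# Target pairs positions 0 and 3; the prediction is fully unpaired,
# so two of four positions differ.
target = [3, -1, -1, 0]
predicted = [-1, -1, -1, -1]
print(structure_distance(target, predicted), similarity_score(target, predicted))
```

With this convention σ equals 1 exactly when the predicted structure matches the target at every position, which is the success criterion used in the benchmarks.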
In the MODENA algorithm, we utilize multi-objective optimization (MOO; Deb, 2001) to explore solutions (i.e., RNA sequences) with better values of both objective functions. In MOO, two solutions are compared based on their dominance. Let us consider two solutions, x_a and x_b. When “all objective function values of x_a are better than or equal to those of x_b” and “at least one objective function value of x_a is not equal to that of x_b”, x_a dominates x_b; a solution which is not dominated by any other solution is called a Pareto optimal solution. If “all objective function values of x_a are better than those of x_b”, x_a strongly dominates x_b; a weak Pareto optimal solution is defined as a solution which is not strongly dominated by any other solution. The MODENA algorithm explores weak Pareto optimal solutions for the RNA inverse folding problem (Taneda, 2011).
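The two dominance relations differ only in whether ties are tolerated, which the following sketch makes concrete. It assumes that every objective is to be maximized and that a solution is represented by a tuple of objective values; the function names are hypothetical and the code is not taken from MODENA or NSGA2.

```python
def dominates(a, b):
    """a dominates b: no objective is worse and at least one differs."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x != y for x, y in zip(a, b)))


def strongly_dominates(a, b):
    """a strongly dominates b: every objective is strictly better."""
    return all(x > y for x, y in zip(a, b))


def pareto_front(solutions):
    """Solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions)]


def weak_pareto_front(solutions):
    """Solutions not strongly dominated by any other solution."""
    return [s for s in solutions
            if not any(strongly_dominates(o, s) for o in solutions)]


# Tuples are (stability score, similarity score); values are made up.
scores = [(10.0, 0.8), (12.0, 0.8), (5.0, 0.5)]
print(pareto_front(scores))       # only (12.0, 0.8) survives
print(weak_pareto_front(scores))  # (10.0, 0.8) also survives: it ties on similarity
```

The weak front is a superset of the Pareto front, so exploring weak Pareto optimal solutions retains sequences that tie on one objective.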
Since it is usually difficult to enumerate all (weak) Pareto optimal solutions for a given MOO problem, MOGA computes an approximate set (partial solutions) of the (weak) Pareto optimal solutions. MODENA is developed based on one of the standard MOGAs, non-dominated sorting genetic algorithm 2 (NSGA2; Deb, 2001). NSGA2 proceeds in accordance with a framework similar to that of a standard GA, which is composed of initialization, evaluation, and reproduction for a population of solutions. In the initialization step, a user-defined number of solutions (RNA sequences) are randomly generated. In the present study, we use 50 or 100 solutions in one population.
In the evaluation step, we perform an RNA structure prediction for each solution in the current generation, and then assign stability and similarity scores to the solutions. We use an RNA structure prediction program as a direct problem solver to obtain the objective function values. In MODENA, RNAfold (Hofacker, 2003), CentroidFold (Hamada et al., 2009), or UNAFold (Markham and Zuker, 2008) can be used as a direct problem solver for pseudoknot-free RNA design, and IPknot (Sato et al., 2011) or HotKnots (Ren et al., 2005) can be utilized for pseudoknotted RNA sequence design (in the present study, we used IPknot 0.0.2 and HotKnots 2.0). Since IPknot does not output a free energy or any other quantity indicating the stability of the predicted structure, instead of using Equation 1, we assign the total number of guanine and cytosine pairs in the predicted base pairs to each solution as a stability score when we use IPknot as a direct problem solver.
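The substitute stability score used with IPknot can be sketched as a simple count. The function name `gc_pair_count` and the input format (a sequence string plus a list of 0-based predicted base pairs) are assumptions for illustration; the sketch does not call IPknot itself.

```python
def gc_pair_count(sequence, predicted_pairs):
    """Count predicted base pairs formed by one guanine and one cytosine.

    `predicted_pairs` is a list of 0-based (i, j) positions taken from a
    structure prediction (an assumed input format).
    """
    seq = sequence.upper()
    return sum(1 for i, j in predicted_pairs if {seq[i], seq[j]} == {"G", "C"})


# One G-C pair (positions 0 and 3) and one A-U pair (positions 1 and 2).
print(gc_pair_count("GAUC", [(0, 3), (1, 2)]))
```

Like the score of Equation 1, this quantity is larger for structures expected to be more stable, so it can be maximized in the same way.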
Based on the solutions of the current generation, the reproduction step generates child solutions for the next generation. The MODENA algorithm generates child solutions by invoking three GA operators with equal probability: structural n-point crossover, point accepted mutation, and error diagnosis mutation (Taneda, 2011). Structural n-point crossover generates a child solution by concatenating subsequences taken from two parent solutions; point accepted mutation randomly changes a nucleotide; error diagnosis mutation compares the predicted and target structures, and then changes the nucleotides which have different base pairs between the predicted and target structures.
Frontiers in Genetics | Non-Coding RNA
Point accepted mutation and error diagnosis mutation can be applied to pseudoknotted RNA sequence design without any modification. Structural n-point crossover, however, has to be changed for pseudoknotted RNAs, since its original algorithm (Taneda, 2011) assumes no pseudoknot in a target structure (i.e., a nucleotide k, [i < k < j], never forms a base pair with a nucleotide l [l < i or j < l]). Structural n-point crossover is composed of four steps (Taneda, 2011, p. 5), and we modified Step 2 to take pseudoknots into account. After a crossover parameter n_c (= 2 in the present study) and a randomly determined x_0 (= 0 or 1) are given, structural n-point crossover allowing pseudoknots is performed as follows:
Step 1 Set l = 0 and set x_i = x_0 for all i (1 ≤ i ≤ N; N is the sequence length). Randomly select a base pair (i, j) (1 ≤ i < j ≤ N).
Step 2 For each x_k (i ≤ k ≤ j) whose position k does not form a base pair with a position outside the interval [i, j] (i.e., a position smaller than i or larger than j) in the target structure, perform the following: if x_k is zero, change x_k to one; otherwise change x_k to zero. Then increment l by one.
Step 3 If l < n_c and “the number of the base pairs whose upstream nucleotide position m satisfies i < m (where m < N)” is larger than or equal to one, randomly select a base pair (i_new, j_new), where i < i_new < j_new ≤ N; then set i = i_new and j = j_new, and move to Step 2; otherwise go to Step 4.
Step 4 Generate a child solution according to x_i for all i (1 ≤ i ≤ N); if x_i = 0, copy the value of the nucleotide s^A_i in parent A to the corresponding nucleotide s^child_i of the child solution; if x_i = 1, the value of the nucleotide s^B_i in parent B is copied to s^child_i.
Note that Step 1, Step 3, and Step 4 are exactly the same as those in Taneda (2011). By using this modified version, we can cross over two parent solutions without destroying any base complementarity in the target structure. An example of structural n-point crossover is depicted in Figure 1.
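The four steps can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the MODENA source: positions are 0-based, the target structure is given as a pair table (entry k is the partner of k, or -1 if unpaired), and the function name and `rng` parameter are hypothetical.

```python
import random


def structural_crossover(parent_a, parent_b, pair_table, n_c=2, rng=random):
    """Sketch of structural n-point crossover allowing pseudoknots.

    Returns the child sequence and the crossover position indicator x.
    """
    n = len(parent_a)
    pairs = [(i, j) for i, j in enumerate(pair_table) if j > i]
    if not pairs:
        raise ValueError("target structure has no base pair")
    # Step 1: initialize the indicator and pick a random base pair.
    x = [rng.randint(0, 1)] * n
    count = 0
    i, j = rng.choice(pairs)
    while True:
        # Step 2: flip x_k inside [i, j], skipping positions whose partner
        # lies outside [i, j] (these belong to a pseudoknotted region).
        for k in range(i, j + 1):
            partner = pair_table[k]
            if partner == -1 or i <= partner <= j:
                x[k] ^= 1
        count += 1
        # Step 3: optionally continue with a base pair further downstream.
        downstream = [(p, q) for p, q in pairs if p > i]
        if count < n_c and downstream:
            i, j = rng.choice(downstream)
        else:
            break
    # Step 4: copy from parent A where x is 0 and from parent B where x is 1.
    child = "".join(a if flag == 0 else b
                    for a, b, flag in zip(parent_a, parent_b, x))
    return child, x


# Pairs (0, 6), (1, 5), and (3, 9); (3, 9) crosses (0, 6), i.e., a pseudoknot.
table = [6, 5, -1, 9, -1, 1, 0, -1, -1, 3]
child, x = structural_crossover("A" * 10, "G" * 10, table, rng=random.Random(0))
print(child, x)
```

Because a position is flipped only together with its partner, both nucleotides of every target base pair are always inherited from the same parent, which is the property the modified Step 2 is meant to preserve.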
2.1. SEQUENCE CONSTRAINTS
Since the previous version of MODENA does not support sequence constraints, we have added this function to the present version of MODENA. The sequence constraints of MODENA are specified in accordance with the IUPAC notation of nucleotide codes. By using the sequence constraints, a user can design pseudoknotted RNA sequences containing user-specified sequence motifs.
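A constraint written in IUPAC notation can be interpreted as a set of allowed nucleotides per position. The sketch below shows one way to check a sequence against such a constraint and to draw a random sequence that satisfies it. How MODENA enforces constraints internally is not described here, so the function names and the approach are assumptions; the table uses U rather than T because the sequences are RNA.

```python
import random

# IUPAC nucleotide codes, written for RNA (U instead of T).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def satisfies_constraint(sequence, constraint):
    """True if every nucleotide is allowed by the code at its position."""
    if len(sequence) != len(constraint):
        return False
    return all(base in IUPAC_RNA[code]
               for base, code in zip(sequence.upper(), constraint.upper()))


def random_constrained_sequence(constraint, rng=random):
    """Draw a random sequence in which each position obeys its code."""
    return "".join(rng.choice(IUPAC_RNA[code]) for code in constraint.upper())


print(satisfies_constraint("GAUC", "NRYC"))
print(random_constrained_sequence("NNGGNN", random.Random(1)))
```

Positions without a required nucleotide are written as N, so an unconstrained design corresponds to a constraint consisting only of N.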
FIGURE 1 | An example of the structural n-point crossover operator for a pseudoknotted target structure. Trgt and x_k indicate a target structure and a crossover position indicator, x_k, respectively. (A) An initial state. All x_k are set to zero. (B) Position i is randomly selected. Position j is the position complementary to position i. (C) The values of x_k between i and j are changed to 1. The values in the pseudoknotted region are not changed. The positions with x_k = 1 are shaded. We can use the x_k values obtained after this step as a crossover position indicator for a 4-point crossover. (D) In addition, we can randomly select one more position i to increase the number of crossover points. (E) In this example, as a result, we obtain a crossover position indicator for a 6-point crossover, which is composed of 7 subsequence regions.
2.2. A NOTE ON INPUT TARGET STRUCTURE
In MODENA, the user inputs a target structure in bracket notation, where (), <>, {}, [], and letters (AaBbCcDdEe) can be used to specify base pairs (uppercase and lowercase letters indicate upstream and downstream nucleotide positions, respectively). The user can freely input a target structure using these bracket notations, but if the direct problem solver selected by the user cannot predict the class (e.g., Condon et al., 2004) of the input pseudoknot, the user will never obtain sequences folding into the target structure.
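A target structure written in this extended bracket notation can be converted into a pair table by keeping one stack per bracket type. The following sketch is illustrative and is not MODENA's parser; the function name, the 0-based pair table, and the treatment of every other character (for example ".") as unpaired are assumptions.

```python
def parse_structure(structure):
    """Convert extended bracket notation into a pair table.

    table[k] is the 0-based partner of position k, or -1 if unpaired.
    Characters that are not brackets are treated as unpaired (an
    assumption of this sketch).
    """
    openers = {"(": ")", "<": ">", "{": "}", "[": "]"}
    openers.update({c: c.lower() for c in "ABCDE"})
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    table = [-1] * len(structure)
    for pos, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(pos)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched '{ch}' at position {pos}")
            partner = stack.pop()
            table[partner], table[pos] = pos, partner
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return table


# A simple H-type pseudoknot: the [ ] stem crosses the ( ) stem.
print(parse_structure("((..[[..))..]]"))
```

Using a separate stack for each bracket type is what allows stems of different types to cross, while pairs of the same type remain nested.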
2.3. DATASET FOR BENCHMARK
We evaluated the design performance of MODENA with a dataset which contains the natural pseudoknotted structures taken from Pseudobase (Batenburg et al., 2000). Since the original 342 pseudoknotted structures downloaded from Pseudobase are redundant, i.e., different Pseudobase entries can share exactly the same structure, we removed the redundancy to guarantee that all structures are unique in our dataset. Consequently, we obtained 266 pseudoknotted structures for the performance evaluation. We refer to this dataset as the Pseudobase dataset.
In addition to the benchmark for the pseudoknotted target structures, we performed a benchmark for pseudoknot-free target structures, where the Rfam dataset of our previous paper (Taneda, 2011) was used. Note that a pseudoknot prediction method (IPknot) was used as a direct problem solver in this pseudoknot-free benchmark. The reason why we performed a benchmark for pseudoknot-free target structures in the present study is as follows. If we use a non-pseudoknot prediction method as a direct problem solver to design an RNA sequence, the designed RNA sequence may fold into a pseudoknotted structure when we fold the designed sequence with a pseudoknot prediction method. By using a pseudoknot prediction method as a direct problem solver for pseudoknot-free RNA sequence design, we can decrease the probability with which undesired pseudoknots accidentally form in the designed RNA sequence. That is, inverse folding of pseudoknotted RNAs is useful not only to design pseudoknotted RNA sequences but also to design pseudoknot-free ones.
3. RESULTS
3.1. BENCHMARK RESULTS
We evaluated the pseudoknot design performance of MODENA with the Pseudobase dataset, where IPknot and HotKnots were used as a direct problem solver. We set both the population size and the maximum iteration number to 50 in our GA. In this performance evaluation, we obtained successfully designed RNA sequences for 207 and 198 pseudoknotted target structures by MODENA+IPknot and MODENA+HotKnots, respectively, out of the 266 target structures of the Pseudobase dataset, where MODENA+IPknot and MODENA+HotKnots denote the sequence design utilizing IPknot and HotKnots as a direct problem solver, respectively (“successfully designed RNAs” means the RNA sequences which fold into the input target structure). Inv obtained successfully designed RNAs for 181 pseudoknotted target structures with the same dataset. Figure 2 shows the sequence length dependence of the pseudoknot design performances for MODENA and Inv, where the performance is indicated by “the rate of successfully designed RNAs” = 100 × (number of the target structures for which a successfully designed RNA is obtained)/(total number of the target structures). The total number of the target structures included in each length bin is given in Figure 3. As can be seen from Figure 2, MODENA+IPknot outperforms Inv for all bins of sequence lengths. MODENA+IPknot showed the best performance for the length range between 21 and 60 nucleotides. For the range between 61 and 80 nucleotides, MODENA+IPknot and MODENA+HotKnots have comparable performances. For longer target structures with lengths from 81 to 140 nucleotides, MODENA+HotKnots gives the best results among MODENA+IPknot, MODENA+HotKnots, and Inv.
For the target structures longer than 85 nucleotides, Inv completely failed to design pseudoknots. MODENA also could not obtain successfully designed pseudoknotted RNAs when the target structures have a length longer than 137 nucleotides. Note that the number of target structures longer than 81 nucleotides is much smaller than that of the shorter target structures (Figure 3); a benchmark with more long target structures may give a different result. Details of the results for the Pseudobase dataset are tabulated in Table S1 in Supplementary Material, which is downloadable from the MODENA website.
FIGURE 2 | Sequence length dependence of the pseudoknot design performances for MODENA and Inv. “MODENA+IPknot” and “MODENA+HotKnots” denote the sequence designs using IPknot and HotKnots as a direct problem solver, respectively. In each bin, results for MODENA+IPknot, MODENA+HotKnots, and Inv are shown from left to right.
FIGURE 3 | Distribution of the target structures in the Pseudobase dataset.
To examine whether a larger calculation, in which both the population size and the iteration number are set to 100, improves the pseudoknot design performance, we performed the inverse folding of the 59 target structures for which the design failed when we used a value of 50. By using the larger parameter values, we successfully designed 15 pseudoknots (Pseudobase PKB-number: PKB00050, PKB00129, PKB00138, PKB00148, PKB00170, PKB00171, PKB00178, PKB00179, PKB00211, PKB00217, PKB00219, PKB00228, PKB00267, PKB00329, PKB00333) of the 59 target structures. These results indicate that a larger calculation can improve the design performance; however, the computational time for the larger calculation becomes longer, i.e., there is a tradeoff between computational time and design performance.
The logarithm of the computational times needed for the Pseudobase benchmark of MODENA+IPknot, MODENA+HotKnots, and Inv is plotted in Figure 4. The computational times were measured on a Core i7 PC (3.33 GHz; 24 GB memory; CentOS 5.6 [x86_64]). Since we performed fifty independent runs with Inv for each target structure, the mean computational time for each target structure is used as the computational time of Inv in Figure 4 (and in Table S1 in Supplementary Material). Figure 4 clearly reveals the difference between the two direct problem solvers we used for MODENA in the present study; i.e., IPknot is much faster than HotKnots. For the target structures shorter than 50 nucleotides, Inv is faster than MODENA. However, for longer target structures, we found that Inv often becomes much slower than MODENA+IPknot. In addition, Inv completely failed to design the pseudoknotted RNAs longer than 85 nucleotides. Inv quickly terminates its calculation when the inverse folding of the input target structure is impossible (Inv analyzes the input target structure before performing a stochastic search). The results of such terminated calculations are not plotted in Figure 4; the computational times of the terminated Inv runs can be seen in Table S1 in Supplementary Material.
FIGURE 4 | The logarithm of the computational times needed for the Pseudobase benchmark of MODENA+IPknot, MODENA+HotKnots, and Inv (horizontal axis: sequence length in nucleotides; vertical axis: log10 of the time in seconds). Each symbol corresponds to one target structure. Each computational time for Inv is the mean over fifty independent runs. The results of failed Inv runs are not included in this figure.
To compare the convergence properties of the different direct problem solvers, we averaged the converged GA iteration numbers over all target structures in the Pseudobase dataset (where we used the results for a population size and a maximum iteration number of 50). As convergence criteria, similar to Taneda (2011), the GA iteration stops when (i) the maximum iteration number is reached or (ii) the number of weak Pareto optimal solutions has not changed during 30 consecutive iterations. As a result, we found that MODENA needs on average 41.8 and 36.9 GA iterations when IPknot and HotKnots, respectively, are used as a direct problem solver.
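The two stopping criteria can be expressed as a small predicate. This is a sketch under assumptions, not MODENA's code: the function name, the argument names, and the reading of criterion (ii) as "the last 31 recorded front sizes are identical, i.e., 30 consecutive iterations without a change" are all hypothetical.

```python
def should_stop(iteration, max_iterations, front_sizes, patience=30):
    """Return True when either convergence criterion is met.

    `front_sizes` lists the number of weak Pareto optimal solutions
    recorded after each iteration so far (an assumed bookkeeping scheme).
    """
    # Criterion (i): the maximum iteration number is reached.
    if iteration >= max_iterations:
        return True
    # Criterion (ii): the front size did not change over the last
    # `patience` iterations, i.e., patience + 1 identical records.
    if len(front_sizes) > patience:
        recent = front_sizes[-(patience + 1):]
        return len(set(recent)) == 1
    return False


# The front grew once and then stayed at 5 solutions for 30 iterations.
history = [4] + [5] * 31
print(should_stop(len(history), 50, history))
```

With a maximum of 50 iterations and a window of 30, a run can stop early only after iteration 30, which is consistent with the average iteration numbers reported above lying between 30 and 50.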
The inverse folding results for the Rfam dataset, which is composed of pseudoknot-free target structures, are summarized in Table 1, where the results obtained by MODENA+IPknot alone are shown. This is because the target structures of the Rfam dataset are too long for MODENA+HotKnots in terms of computational time, and Inv is limited to short target structures. In this benchmark for the pseudoknot-free target structures, MODENA successfully designed RNA sequences for 22 target structures. This result is comparable to our previous result (Taneda, 2011) obtained by using direct problem solvers which cannot predict pseudoknots. The present result indicates that pseudoknot prediction methods are useful even for designing pseudoknot-free RNA sequences, since they allow us to reduce the possibility of an accidental pseudoknot formation when designing pseudoknot-free RNAs.

Table 1 | The benchmark results for the pseudoknot-free Rfam dataset.
Rfam AC   Rfam ID           l (nt)   succ.    GC_high   GC_low   t (s)
RF00001   5S_rRNA           117      0/50     51        33       20.253
RF00002   5_8S_rRNA         151      24/50    39        34       34.430
RF00003   U1                161      0/50     60        38       34.468
RF00004   U2                193      37/50    76        35       54.946
RF00005   tRNA              74       40/50    39        12       9.523
RF00006   Vault             89       30/50    35        14       11.024
RF00007   U12               154      39/50    74        37       35.436
RF00008   Hammerhead_3      54       39/50    24        9        6.056
RF00009   RNaseP_nuc        348      0/50     84        74       160.996
RF00010   RNaseP_bact_a     357      0/50     149       143      314.373
RF00011   RNaseP_bact_b     382      0/50     158       154      305.932
RF00012   U3                215      38/50    71        33       58.912
RF00013   6S                185      39/50    79        40       48.594
RF00014   DsrA              87       36/50    51        17       13.750
RF00015   U4                140      36/50    46        26       28.434
RF00016   SNORD14           129      38/50    32        4        19.416
RF00017   SRP_euk_arch      301      26/50    164       116      205.422
RF00018   CsrB              360      23/50    77        56       192.333
RF00019   Y_RNA             83       38/50    39        13       11.252
RF00020   U5                119      0/50     53        22       20.646
RF00021   Spot_42           118      44/50    67        22       26.401
RF00022   GcvB              148      31/50    58        30       28.952
RF00024   Telomerase-vert   451      27/50    172       119      367.750
RF00025   Telomerase-cil    210      35/50    59        41       47.194
RF00026   U6                102      33/50    8         3        10.493
RF00027   Let-7             79       34/50    59        18       14.306
RF00028   Intron_gpI        344      0/50     85        61       192.231
RF00029   Intron_gpII       73       32/50    30        14       8.725
RF00030   RNase_MRP         340      35/50    105       76       151.637
The “l”, “succ.”, and “t” columns represent the length (= number of nucleotides) of a target structure, a success rate, and a computational time in seconds, respectively; x/y indicates a “success rate” in such a way that we obtained x successfully designed sequences when we used a GA population size of y. “GC_high” and “GC_low” are the highest and lowest nGCs, respectively, where nGC is the total number of guanine and cytosine pairs in the predicted base pairs. Computational times were measured on a Core i7 PC (3.33 GHz; 24 GB memory; CentOS 5.6 [x86_64]).
3.2. DESIGN WITH SEQUENCE CONSTRAINTS
To demonstrate the sequence constraint function in MODENA, we performed an RNA inverse folding with the secondary structure and sequence of a known hepatitis delta virus (HDV) self-cleaving ribozyme, which has been used as a prototype for generating artificial ribozymes (Schultes and Bartel, 2000). The pseudoknotted secondary structure and the sequence motifs (key nucleotides) of the HDV ribozyme design were taken from Figure 1 in the paper by Schultes and Bartel (2000). The key nucleotides, which are important for the activity of the ribozyme, were used as constraint sequences. By using MODENA+IPknot with a population size of 100 and an iteration number of 100, we successfully designed 8 RNA sequences folding into the structure of the prototype HDV ribozyme with the constraint sequence motifs. The 8 designed HDV ribozyme sequences are shown in Figure 5, in which the target structure, constraint sequences, and nucleotide positions are also indicated. As can clearly be seen from the figure, the 8 designed sequences share all constraint sequences. Moreover, interestingly, the 8 designed sequences are highly “conserved”. To illustrate the sequence conservation among the designed sequences, we drew the sequence logo of the 8 sequences by using WebLogo (Crooks et al., 2004; Figure 6). The low sequence conservation in the region between positions 51 and 74 is mainly due to seq3, since seq3 has a very different subsequence from the other sequences in this region. Outside the region between positions 51 and 74, seq3 is very similar to the other sequences, hence we can guess that seq3 shares an ancestral sequence with the other seven successfully designed sequences in our GA. In addition, the region between positions 51 and 74 corresponds to a hairpin structure [the P4 stem+L4 loop (Schultes and Bartel, 2000)] of the HDV ribozyme. These results imply that the sequence difference between seq3 and the other seven successfully designed sequences in the region between positions 51 and 74 was generated by structural n-point crossover in our GA.
This constrained design of the HDV ribozyme is a relatively hard calculation; we could not design the HDV ribozyme with the constraints when we set the population size and the iteration number to a smaller value, 50; MODENA+HotKnots failed to design the pseudoknotted ribozyme even when we set both the population size and the iteration number to 100.
FIGURE 5 | Eight HDV ribozyme sequences designed by MODENA. The top eight rows are designed RNA sequences. Trgt and cnst rows correspond to the target pseudoknotted secondary structure in bracket notation and constraint sequences, respectively. A set of pos1 and pos2 indicates a nucleotide position.
FIGURE 6 | The sequence logo for the eight designed HDV ribozyme sequences. The sequence logo was generated by WebLogo (Crooks et al., 2004).
DISCUSSION
We have proposed a multi-objective genetic algorithm for pseudoknotted RNA sequence design, which is a modified version of our previous pseudoknot-free RNA design algorithm. The important differences between the current version, which can design pseudoknots, and the previous pseudoknot-free version are as follows. (i) We utilize a new structural n-point crossover operator in the current version, by which we can generate child solutions without breaking complementary relationships in parent solutions even when pseudoknots are included in the target structure. (ii) We allow MODENA to use pseudoknotted RNA structure prediction methods as the direct problem solver. As a result, the current version of MODENA can directly evaluate whether designed sequences have a desired pseudoknot structure or not. This feature is indispensable for the inverse engineering of pseudoknotted RNAs. (iii) The third important point introduced in the current version of MODENA is the sequence constraint. Since the current version of MODENA can work as both a pseudoknotted and a pseudoknot-free RNA sequence designer, the sequence constraint function of MODENA can be utilized to design not only pseudoknotted RNAs but also pseudoknot-free ones.
The new version of MODENA, in which the new features
for pseudoknot design are implemented, was tested with two
benchmark datasets: the Pseudobase dataset, which is a non-
redundant dataset and is composed of 266 target structures
taken from the Pseudobase, and the Rfam dataset which does
not contain pseudoknots. In both datasets, MODENA showed
high sequence design performances. For the Pseudobase dataset,
another pseudoknot design algorithm, Inv, was also bench-
marked and it was found that MODENA can successfully
design pseudoknotted RNAs for more target structures compared
to Inv.
The sequence constraint function of MODENA was tested through the inverse folding of an HDV ribozyme. In this test, we successfully obtained 8 RNA sequences which fold into the target pseudoknotted structure of the HDV ribozyme. All of the 8 designed RNA sequences have the key nucleotides important for the activity of the ribozyme, which were specified as sequence constraints when running MODENA.
The present results clearly indicate that a multi-objective genetic algorithm is a promising approach for the inverse folding of pseudoknotted RNA. One important issue concerning computational inverse folding is the reliability of the design; in other words, “Do the designed sequences truly fold into the target structure in vivo and/or in vitro?” Although theorists have no answer to this question, it is noteworthy that the inverse folding methods can be improved as structure prediction methods improve. In inverse folding, the prediction accuracy of the direct problem solver (structure prediction method) determines the design reliability. As an extreme case, if we can use a perfect structure prediction method as a direct problem solver (where “perfect” means that the RNA sequence for which a structure is predicted strictly folds into the predicted structure in vivo and/or in vitro), the designed RNA sequence will perfectly fold into the target structure. Recent drastic progress in RNA structure prediction methods has enabled us to perform very accurate and efficient RNA structure prediction. Improvement of structure prediction methods will continue not only at the secondary structure level but also at the tertiary structure level (Das and Baker, 2007; Parisien and Major, 2008), and the design reliability of RNA inverse folding methods will also continue to improve.
ACKNOWLEDGMENTS
This work was partially supported by KAKENHI (22700304) and
a “Grant for Hirosaki University Institutional Research”.
REFERENCES
Andronescu, M., Fejes, A. P., Hutter, F., Hoos, H. H., and Condon, A. (2004). A new algorithm for RNA secondary structure design. J. Mol. Biol. 336, 607–624.
Batenburg, F. H. V., Gultyaev, A. P., Pleij, C. W., Ng, J., and Oliehoek, J. (2000). PseudoBase: a database with RNA pseudoknots. Nucleic Acids Res. 28, 201–204.
Breaker, R. R. (2004). Natural and
engineered nucleic acids as tools
to explore biology. Nature 432,
838–845.
Busch, A., and Backofen, R. (2006). INFO-RNA–a fast approach to inverse RNA folding. Bioinformatics 22, 1823–1831.
Condon, A., Davy, B., Rastegari, B., Zhao, S., and Tarrant, F. (2004). Classifying RNA pseudoknotted structures. Theor. Comp. Sci. 320, 35–50.
Crooks, G. E., Hon, G., Chandonia, J. M., and Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190.
Das, R., and Baker, D. (2007). Automated de novo prediction of native-like RNA tertiary structures. Proc. Natl. Acad. Sci. U.S.A. 104, 14664–14669.
Deb, K. (2001). Multi-Objective Optimization using Evolutionary Algorithms. Chichester: John Wiley & Sons.
Gao, J. Z., Li, L. Y., and Reidys, C. M. (2010). Inverse folding of RNA pseudoknot structures. Algorithms Mol. Biol. 5, 27.
Goldberg, D. E. (1987). Genetic Algorithms in Search, Optimization and Machine Learning. New York: Addison-Wesley.
Hamada, M., Kiryu, H., Sato, K., Mituyama, T., and Asai, K. (2009). Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 25, 465–473.
Hofacker, I. (2003). Vienna RNA secondary structure server. Nucleic Acids Res. 31, 3429–3431.
Hofacker, I., Fontana, W., Stadler, P., Bonhoeffer, L., Tacker, M., and Schuster, P. (1994). Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 167–188.
Hoos, H. H., and Stützle, T. (2004). Stochastic Local Search: Foundations and Applications. San Francisco: Elsevier/Morgan Kaufmann.
Jaeger, L., Westhof, E., and Leontis,
N. B. (2001). TectoRNA: modular
assembly units for the construction
of RNA nano-objects. Nucleic Acids
Res. 29, 455–463.
Markham, N. R., and Zuker, M. (2008). UNAFold: software for nucleic acid folding and hybridization. Methods Mol. Biol. 453, 3–31.
Parisien, M., and Major, F. (2008). The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 452, 51–55.
Ren, J., Rastegari, B., Condon, A., and Hoos, H. H. (2005). HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA 11, 1494–1504.
Sato, K., Kato, Y., Hamada, M., Akutsu, T., and Asai, K. (2011). IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics 27, 85–93.
Schultes, E. A., and Bartel, D. P. (2000). One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science 289, 448–452.
Schwab, R., Ossowski, S., Riester, M., Warthmann, N., and Weigel, D. (2006). Highly specific gene silencing by artificial microRNAs in Arabidopsis. Plant Cell 18, 1121–1133.
Staple, D. W., and Butcher, S. E. (2005). Pseudoknots: RNA structures with diverse functions. PLoS Biol. 3, e213. doi:10.1371/journal.pbio.0030213
Taneda, A. (2011). MODENA: a multi-objective RNA inverse folding. Adv. Appl. Bioinform. Chem. 4, 1–12.
Zadeh, J. N., Steenberg, C. D., Bois, J. S., Wolfe, B. R., Pierce, M. B., Khan, A. R., Dirks, R. M., and Pierce, N. A. (2011). NUPACK: analysis and design of nucleic acid systems. J. Comput. Chem. 32, 170–173.
Conflict of Interest Statement: The
author declares that the research was
conducted in the absence of any com-
mercial or financial relationships that
could be construed as a potential con-
flict of interest.
Received: 10 November 2011; accepted: 25 February 2012; published online: 26 April 2012.
Citation: Taneda A (2012) Multi-objective genetic algorithm for pseudoknotted RNA sequence design. Front. Gene. 3:36. doi: 10.3389/fgene.2012.00036
This article was submitted to Frontiers in Non-Coding RNA, a specialty of Frontiers in Genetics.
Copyright © 2012 Taneda. This is an
open-access article distributed under the
terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution,
and reproduction in other forums, pro-
vided the original authors and source are
credited.