
ORIGINAL RESEARCH ARTICLE

published: 26 April 2012

doi: 10.3389/fgene.2012.00036

Multi-objective genetic algorithm for pseudoknotted RNA sequence design

Akito Taneda*

Graduate School of Science and Technology, Hirosaki University, Hirosaki, Japan

Edited by:

Michael Rossbach, Genome Institute

of Singapore, Singapore

Reviewed by:

Kengo Sato, Keio University, Japan

Prasanna R. Kolatkar, Genome

Institute of Singapore, Singapore

Fei Li, Nanjing Agricultural University,

China

*Correspondence:

Akito Taneda, Graduate School of Science and Technology, Hirosaki University, 3 Bunkyo-cho, Hirosaki, Aomori 036-8561, Japan.

e-mail: taneda@cc.hirosaki-u.ac.jp

RNA inverse folding is a computational technology for designing RNA sequences that fold into a user-specified secondary structure. Although pseudoknots are functionally important motifs in RNA structures, fewer studies have addressed the inverse folding of pseudoknotted RNAs than pseudoknot-free RNA design. In this paper, we present a new version of our multi-objective genetic algorithm (MOGA), MODENA, which we previously proposed for pseudoknot-free RNA inverse folding. In the new version of MODENA, (i) a new crossover operator is implemented and (ii) the pseudoknot prediction methods IPknot and HotKnots are used to evaluate the designed RNA sequences, allowing us to perform the inverse folding of pseudoknotted RNAs. The new version of MODENA with the new crossover operator was benchmarked with a dataset composed of natural pseudoknotted RNA secondary structures, and we found that MODENA can successfully design more pseudoknotted RNAs than the other pseudoknot design algorithm. In addition, a sequence constraint function newly implemented in the new version of MODENA was tested by designing RNA sequences that fold into the pseudoknotted structure of a hepatitis delta virus ribozyme; as a result, we successfully designed eight RNA sequences. The new version of MODENA is downloadable from http://rna.eit.hirosaki-u.ac.jp/modena/.

Keywords: inverse folding, pseudoknot, secondary structure, pseudobase, Rfam, sequence constraint

1. INTRODUCTION

Evolutionarily related non-coding RNAs have a characteristic secondary structure corresponding to their function, and it is well known that secondary structures play key roles in the functions of RNA sequences. The biochemical knowledge accumulated to date indicates that we can generate functional synthetic RNAs if we can control the secondary structure of the RNAs. In this context, various synthetic RNAs, such as ribozymes (Schultes and Bartel, 2000), micro RNAs (Schwab et al., 2006), riboswitches (Breaker, 2004), and RNA nano structures (Jaeger et al., 2001), have been successfully designed.

RNA inverse folding is a computational methodology for designing RNA sequences that fold into a given target structure (Hofacker et al., 1994). The name “inverse” arises because inverse folding is defined as the inverse problem of RNA secondary structure prediction, where the RNA secondary structure prediction problem is referred to as the “direct problem”. Since an RNA inverse folding problem usually has multiple solutions and we have no deterministic algorithm that can enumerate all solutions of a given RNA inverse folding problem, previous RNA inverse folding algorithms have adopted heuristic approaches to find desired RNA sequences. The following six RNA inverse folding algorithms can be found in the literature: local search algorithms [RNAinverse (Hofacker et al., 1994), RNA-SSD (Andronescu et al., 2004), INFO-RNA (Busch and Backofen, 2006), Inv (Gao et al., 2010), and design in NUPACK (Zadeh et al., 2011)] and a genetic algorithm [GA; MODENA (Taneda, 2011)].

The local search algorithms are characterized by the initialization step and the refinement step of their exploration procedures for obtaining desired RNA sequences. First, in these local search approaches, a single RNA sequence is generated. The pioneering RNA inverse folding algorithm, RNAinverse, uses a purely random initialization. RNA-SSD randomly initializes an RNA sequence in a more sophisticated manner, where a base composition and a tabu mechanism for avoiding undesired stem formation are taken into account. INFO-RNA utilizes a dynamic programming algorithm to obtain a good initial sequence for the RNA inverse folding: the lowest-energy RNA sequence, determined under the assumption that the sequence folds into the given structure, is used as the initial sequence. In the refinement step after the initialization, RNAinverse performs an adaptive walk to improve the initial sequence, and RNA-SSD and INFO-RNA use stochastic local search (Hoos and Stützle, 2004) to improve the initial sequence. In the refinement step, RNAinverse and RNA-SSD employ a structure decomposition strategy to reduce the number of folding calculations for a whole sequence. Inv and NUPACK also utilize structure decomposition strategies in their refinement step. Inv is an RNA inverse folding algorithm designed for a restricted pseudoknot class and can perform the inverse folding of pseudoknotted RNAs. NUPACK is a suite of programs for computational nucleic acid analysis and includes a program named design. Design generates sequences by minimizing an ensemble defect (Zadeh et al., 2011); the lower the ensemble defect, the more specifically the designed RNA sequence folds into the given target structure. MODENA

www.frontiersin.org

April 2012 | Volume 3 | Article 36 | 1


Taneda Pseudoknotted RNA sequence design

is a multi-objective genetic algorithm (MOGA; Deb, 2001) for RNA inverse folding. As objective functions, MODENA uses two quantities, a structure similarity measure and a stability measure (e.g., free energy). By virtue of the simultaneous optimization of these objective functions, MODENA can search for a sequence that not only folds into the desired target structure but also has a high stability (= a low free energy).

Pseudoknots are important functional motifs in RNA structure (Staple and Butcher, 2005). In other RNA structure motifs, such as hairpin loops, bulge loops, internal loops, and multi-branch loops, no two base pairs (i, j) and (k, l) satisfy i < k < j < l (where i, j, k, and l are nucleotide positions); pseudoknots, in contrast, are defined as structures containing base pairs that satisfy the condition i < k < j < l. Since pseudoknots have various enzymatic functions (Staple and Butcher, 2005), they are intriguing targets of functional RNA design. However, among the previous RNA inverse folding algorithms, only Inv can design pseudoknots. Moreover, there is no algorithm that can design pseudoknotted RNAs with sequence constraints, which are an important feature for designing a molecule with a known functional sequence motif. For these reasons, the development of a novel pseudoknotted RNA inverse folding algorithm is important for promoting the sequence design of RNA pseudoknots.
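The crossing condition i < k < j < l can be checked directly on a list of base pairs. The following Python sketch is only an illustration and is not part of MODENA; the function names and the two example pair lists are hypothetical, and positions are assumed to be 1-based with each nucleotide taking part in at most one base pair.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return every couple of base pairs ((i, j), (k, l)) with i < k < j < l."""
    # Normalise each pair so that the upstream position comes first, then sort.
    ordered = sorted((min(p), max(p)) for p in pairs)
    return [
        ((i, j), (k, l))
        for (i, j), (k, l) in combinations(ordered, 2)
        if i < k < j < l
    ]


def is_pseudoknotted(pairs):
    """A structure is pseudoknotted if at least one couple of base pairs crosses."""
    return bool(crossing_pairs(pairs))


# Hypothetical examples: a nested hairpin stem and a small pseudoknot.
nested = [(1, 10), (2, 9), (3, 8)]
knotted = [(1, 7), (2, 6), (4, 10), (5, 9)]

print(is_pseudoknotted(nested))
print(is_pseudoknotted(knotted))
```

For the nested stem no couple of base pairs crosses, whereas in the second example each pair of the first stem crosses each pair of the second stem.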

In this paper, we present an extension of the MODENA algorithm to the inverse folding of pseudoknotted RNAs. In the MODENA algorithm, designed RNA sequences are evaluated by performing secondary structure prediction with an RNA folding program such as RNAfold (Hofacker, 2003), and the RNA folding program can easily be replaced by a different one (in the context of inverse folding, we refer to an RNA structure prediction program as a “direct problem solver”). This advantage of the MODENA algorithm also holds for pseudoknotted RNA inverse folding, where a pseudoknotted RNA secondary structure prediction program has to be used as the direct problem solver. In the rest of the present paper, we first describe the MODENA algorithm for pseudoknotted RNA design, in which a multi-objective genetic algorithm is used in combination with state-of-the-art pseudoknotted RNA structure prediction programs. After that, the performance of the MODENA algorithm is evaluated by benchmarks based on natural RNA secondary structures, where not only pseudoknotted structures but also pseudoknot-free structures are taken into account. Then, a sequence constraint function available in MODENA is demonstrated with a biological example taken from the literature.

2. MATERIALS AND METHODS

Since the details of the MODENA algorithm for pseudoknot-free RNAs are described in Taneda (2011) and the present version for pseudoknotted RNA design shares all parts of the previous pseudoknot-free version of MODENA, the parts that are algorithmically common to the two versions are only briefly described below.

The MODENA algorithm is an RNA inverse folding algorithm based on MOGA. GA is a population-based algorithm for optimization and search (Goldberg, 1987), which is inspired by the mechanism of evolution. MOGA is a GA for exploring the objective

function space consisting of multiple objective functions, while

standard GA uses a single objective function. In the MODENA algorithm, we use the following two objective functions, a structure stability score (Equation 1) and a structure similarity score σ (Equation 2):

$$\text{stability score} = -E, \qquad (1)$$

$$\sigma = (N - d)/N, \qquad (2)$$

where E is the lowest free energy of a designed sequence, N is the total number of nucleotides, and d is the structure distance between the target and predicted structures (Taneda, 2011). The structure distance d is defined as the number of bases that have a different base-pairing status between the target structure and the structure predicted for the designed sequence.
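As a small worked illustration of the similarity score, the Python sketch below computes d and σ = (N − d)/N from two lists of base pairs. It is a sketch under stated assumptions rather than the MODENA implementation: the helper names are hypothetical, positions are 1-based, and "a different base-pairing status" is read here as "a different pairing partner (or paired versus unpaired)".

```python
def partner_table(pairs, n):
    """partner[i] is the 1-based pairing partner of position i, or 0 if unpaired."""
    partner = [0] * (n + 1)  # index 0 is unused
    for i, j in pairs:
        partner[i], partner[j] = j, i
    return partner


def structure_distance(target_pairs, predicted_pairs, n):
    """Number of positions whose pairing partner differs between the two structures."""
    target = partner_table(target_pairs, n)
    predicted = partner_table(predicted_pairs, n)
    return sum(1 for i in range(1, n + 1) if target[i] != predicted[i])


def similarity_score(target_pairs, predicted_pairs, n):
    """Structure similarity score sigma = (N - d) / N."""
    d = structure_distance(target_pairs, predicted_pairs, n)
    return (n - d) / n


# Hypothetical 10-nucleotide example: the prediction misses the innermost pair.
target = [(1, 10), (2, 9), (3, 8)]
predicted = [(1, 10), (2, 9)]
print(structure_distance(target, predicted, 10))
print(similarity_score(target, predicted, 10))
```

In this example positions 3 and 8 differ, so d = 2 and σ = (10 − 2)/10 = 0.8; a sequence whose predicted structure equals the target has d = 0 and σ = 1.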

In the MODENA algorithm, we utilize multi-objective optimization (MOO; Deb, 2001) to explore solutions (i.e., RNA sequences) with better values of both objective functions. In MOO, two solutions are compared based on their dominance. Let us consider two solutions, x_a and x_b. When “all objective function values of x_a are better than or equal to those of x_b” and “at least one objective function value of x_a is not equal to that of x_b”, x_a dominates x_b; a solution that is not dominated by any other solution is called a Pareto optimal solution. If “all objective function values of x_a are better than those of x_b”, x_a strongly dominates x_b; a weak Pareto optimal solution is defined as a solution that is not strongly dominated by any other solution. The MODENA algorithm explores weak Pareto optimal solutions for the RNA inverse folding problem (Taneda, 2011).
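The two dominance relations can be written down in a few lines. The following Python sketch assumes that every objective is to be maximized (as is the case for the stability and similarity scores) and represents each solution simply by its tuple of objective values; the function names and the example values are hypothetical.

```python
def dominates(a, b):
    """a dominates b: a is at least as good everywhere and differs somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def strongly_dominates(a, b):
    """a strongly dominates b: a is strictly better in every objective."""
    return all(x > y for x, y in zip(a, b))


def pareto_set(points):
    """Solutions not dominated by any other solution."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def weak_pareto_set(points):
    """Solutions not strongly dominated by any other solution."""
    return [p for p in points if not any(strongly_dominates(q, p) for q in points)]


# Hypothetical (stability score, similarity score) values of four solutions.
population = [(10.0, 1.0), (12.0, 0.9), (10.0, 0.9), (8.0, 0.8)]
print(pareto_set(population))
print(weak_pareto_set(population))
```

Here (10.0, 0.9) is dominated by (10.0, 1.0) but not strongly dominated by any solution, so it belongs to the weak Pareto set only; (8.0, 0.8) is strongly dominated and belongs to neither set.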

Since it is usually difficult to enumerate all (weak) Pareto optimal solutions for a given MOO problem, MOGA computes an approximate set (partial solutions) of the (weak) Pareto optimal solutions. MODENA is based on one of the standard MOGAs, the non-dominated sorting genetic algorithm 2 (NSGA2; Deb, 2001). NSGA2 follows a framework similar to that of a standard GA, which is composed of initialization, evaluation, and reproduction for a population of solutions. In the initialization step, a user-defined number of solutions (RNA sequences) are randomly generated. In the present study, we use 50 or 100 solutions in one population.

In the evaluation step, we perform an RNA structure prediction for each solution in the current generation, and then assign stability and similarity scores to the solutions. We use an RNA structure prediction program as a direct problem solver to obtain the objective function values. In MODENA, RNAfold (Hofacker, 2003), CentroidFold (Hamada et al., 2009), or UNAFold (Markham and Zuker, 2008) can be used as a direct problem solver for pseudoknot-free RNA design, and IPknot (Sato et al., 2011) or HotKnots (Ren et al., 2005) can be utilized for pseudoknotted RNA sequence design (in the present study, we used IPknot 0.0.2 and HotKnots 2.0). Since IPknot does not output a free energy or any other quantity indicating the stability of the predicted structure, when IPknot is used as the direct problem solver we assign, instead of using Equation 1, the total number of guanine and cytosine pairs in the predicted base pairs to each solution as its stability score.
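A GC-based stability score of this kind can be sketched as follows. This is one possible reading only: the sketch counts predicted base pairs formed between a guanine and a cytosine, whereas the exact counting convention used by MODENA is not spelled out in the text. The function name and the example sequences are hypothetical, and positions are 1-based.

```python
def gc_pair_count(sequence, predicted_pairs):
    """Count predicted base pairs (i, j) that join a G with a C (1-based positions)."""
    seq = sequence.upper()
    return sum(
        1
        for i, j in predicted_pairs
        if {seq[i - 1], seq[j - 1]} == {"G", "C"}
    )


# Hypothetical predicted structure with three stacked base pairs.
pairs = [(1, 10), (2, 9), (3, 8)]
print(gc_pair_count("GGCAAAAGCC", pairs))  # all three pairs are G-C or C-G
print(gc_pair_count("GACAAAAGUC", pairs))  # the middle pair is A-U
```

Because the score depends only on the sequence and the predicted pairs, it can stand in for the free energy with any direct problem solver that returns a base-pair list.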

Based on the solutions of the current generation, the reproduction step generates child solutions for the next generation. MODENA

Frontiers in Genetics | Non-Coding RNA


algorithm generates child solutions by invoking three GA operators with equal probability: structural n-point crossover, point accepted mutation, and error diagnosis mutation (Taneda, 2011). Structural n-point crossover generates a child solution by concatenating subsequences taken from two parent solutions; point accepted mutation randomly changes a nucleotide; error diagnosis mutation compares the predicted and target structures, and then changes the nucleotides whose base pairs differ between the predicted and target structures.

Point accepted mutation and error diagnosis mutation can be applied to pseudoknotted RNA sequence design without any modification. Structural n-point crossover, however, has to be changed for pseudoknotted RNAs, since its original algorithm (Taneda, 2011) assumes no pseudoknot in a target structure (i.e., a nucleotide k [i < k < j] never forms a base pair with a nucleotide l [l < i or j < l]). Structural n-point crossover is composed of four steps (Taneda, 2011, p. 5), and we modified Step 2 to take pseudoknots into account. After a crossover parameter n_c (= 2 in the present study) and a randomly determined x_0 (= 0 or 1) are given, structural n-point crossover allowing pseudoknots is performed as follows:

Step 1: Set l = 0 and set x_i = x_0 for all i (1 ≤ i ≤ N; N is the sequence length). Randomly select a base pair (i, j) (1 ≤ i < j ≤ N).

Step 2: For each x_k (i ≤ k ≤ j) that does not form a base pair with x_l (l < i or j < l) in the target structure, perform the following: if x_k is zero, change x_k to one; otherwise change x_k to zero. Then increment l by one.

Step 3: If l < n_c and “the number of base pairs whose upstream nucleotide position m satisfies i < m (where m < N)” is larger than or equal to one, randomly select a base pair (i_new, j_new), where i < i_new < j_new ≤ N; then set i = i_new and j = j_new, and move to Step 2; otherwise go to Step 4.

Step 4: Generate a child solution according to x_i for all i (1 ≤ i ≤ N); if x_i = 0, copy the value of the nucleotide s^A_i in parent A to the corresponding nucleotide s^child_i of the child solution; if x_i = 1, the value of the nucleotide s^B_i in parent B is copied to s^child_i.

Note that Step 1, Step 3, and Step 4 are exactly the same as those in Taneda (2011). By using this modified version, we can cross over two parent solutions without destroying any base complementarity in the target structure. An example of structural n-point crossover is depicted in Figure 1.
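The four steps can be sketched in Python as follows. This is a minimal sketch written from the step descriptions, not the MODENA source code: the function name, its signature, and the two example parents are hypothetical, positions are 1-based, and the counter of Step 2 is called count here to keep it apart from the nucleotide index l.

```python
import random


def structural_crossover(parent_a, parent_b, target_pairs, n_c=2, rng=None):
    """Structural n-point crossover allowing pseudoknots (sketch of Steps 1-4)."""
    rng = rng or random.Random()
    n = len(parent_a)
    partner = [0] * (n + 1)  # partner[k] == 0 means position k is unpaired
    for i, j in target_pairs:
        partner[i], partner[j] = j, i
    pairs = sorted((min(p), max(p)) for p in target_pairs)

    # Step 1: initialise all indicators to x0 and pick a random base pair (i, j).
    x0 = rng.randint(0, 1)
    x = [x0] * (n + 1)  # x[0] is unused
    count = 0
    i, j = rng.choice(pairs)

    while True:
        # Step 2: flip x_k inside [i, j], except where k pairs outside [i, j].
        for k in range(i, j + 1):
            q = partner[k]
            if q and (q < i or q > j):
                continue
            x[k] ^= 1
        count += 1
        # Step 3: optionally continue with a base pair located downstream of i.
        downstream = [(a, b) for a, b in pairs if a > i]
        if count < n_c and downstream:
            i, j = rng.choice(downstream)
        else:
            break

    # Step 4: copy each nucleotide from parent A (x = 0) or parent B (x = 1).
    child = "".join(
        parent_a[k - 1] if x[k] == 0 else parent_b[k - 1] for k in range(1, n + 1)
    )
    return child, x[1:]


# Hypothetical pseudoknotted target and two parents compatible with it.
pairs = [(1, 7), (2, 6), (4, 10), (5, 9)]
parent_a, parent_b = "GGAAACCAUU", "CUUGCAGUGC"
child, indicator = structural_crossover(parent_a, parent_b, pairs, rng=random.Random(0))
print(child, indicator)
```

In every pass of Step 2 the two partners of a target base pair are either both flipped (both inside the interval), both left alone (both outside), or both left alone because one of them pairs across the interval boundary. The two positions of each target base pair therefore always carry the same indicator value and are inherited from the same parent, which is why complementarity is preserved.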

2.1. SEQUENCE CONSTRAINTS

Since the previous version of MODENA does not support sequence constraints, we have added this function to the present version of MODENA. The sequence constraints of MODENA can be specified in accordance with the IUPAC notation of nucleotide codes. By using the sequence constraints, the user can design pseudoknotted RNA sequences containing user-specified sequence motifs.
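To illustrate what an IUPAC-coded constraint means, the Python sketch below checks whether an RNA sequence is compatible with a constraint string of the same length. The lookup table follows the standard IUPAC nucleotide codes (with U in place of T); the function name and the example strings are hypothetical, and the way MODENA itself reads constraints is not described here.

```python
# Standard IUPAC nucleotide codes, written for RNA (U instead of T).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def satisfies_constraint(sequence, constraint):
    """True if every nucleotide is allowed by the IUPAC code at its position."""
    if len(sequence) != len(constraint):
        raise ValueError("sequence and constraint must have the same length")
    return all(
        base in IUPAC_RNA[code]
        for base, code in zip(sequence.upper(), constraint.upper())
    )


# Hypothetical constraint: a purine at position 3 and a pyrimidine at position 4.
print(satisfies_constraint("GGAUCC", "NNRYNN"))
print(satisfies_constraint("GGCUCC", "NNRYNN"))
```

The first sequence is compatible with the constraint, while the second has a cytosine at the position restricted to purines. A check of this kind could be applied after each mutation or crossover, or the operators could be restricted to the allowed nucleotides at each position.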

2.2. A NOTE ON INPUT TARGET STRUCTURE

In MODENA, the user inputs a target structure using a bracket notation, where (), <>, {}, [], and letters of the alphabet (AaBbCcDdEe) are allowed to specify a base pair (uppercase and lowercase letters indicate upstream and downstream nucleotide positions, respectively). The user can freely input a target structure using these bracket notations, but note that if the direct problem solver selected by the user cannot predict the class (e.g., Condon et al., 2004) of the input pseudoknot, the user will never obtain sequences folding into the target structure.

FIGURE 1 | An example of the structural n-point crossover operator for a pseudoknotted target structure. Trgt and x_k indicate a target structure and a crossover position indicator, x_k, respectively. (A) An initial state. All x_k values are set to zero. (B) Position i is randomly selected. Position j is the position complementary to position i. (C) The values of x_k between i and j are changed to 1. The values in the pseudoknotted region are not changed. The positions with x_k = 1 are shaded. We can use the x_k values obtained after this step as a crossover position indicator for a 4-point crossover. (D) In addition, we can randomly select one more position i to increase the number of crossover points. (E) In this example, as a result, we obtain a crossover position indicator for a 6-point crossover, which is composed of 7 subsequence regions.
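A target structure written in such an extended bracket notation can be converted into a list of base pairs with one stack per bracket type. The Python sketch below is an illustration only, not MODENA's parser: the function name is hypothetical, positions are 1-based, and unpaired positions are assumed to be written as ".", which is the usual dot-bracket convention but is not stated in the text.

```python
def parse_structure(structure):
    """Convert an extended bracket string into a sorted list of base pairs."""
    openers = {"(": ")", "<": ">", "{": "}", "[": "]"}
    closers = {close: opener for opener, close in openers.items()}
    stacks = {}  # one stack of open positions per bracket type
    pairs = []
    for pos, ch in enumerate(structure, start=1):
        if ch in openers or ch in "ABCDE":
            stacks.setdefault(ch, []).append(pos)
        elif ch in closers or ch in "abcde":
            # A lowercase letter closes the matching uppercase letter.
            key = closers.get(ch, ch.upper())
            if not stacks.get(key):
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            pairs.append((stacks[key].pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if any(stacks.values()):
        raise ValueError("unbalanced structure: some brackets were never closed")
    return sorted(pairs)


print(parse_structure("((.[[.))]]"))
print(parse_structure("A.(.a.)"))
```

Keeping a separate stack for each bracket type is what allows crossing pairs: in "((.[[.))]]" the round brackets close while the square brackets are still open, giving the pairs (1, 8), (2, 7), (4, 10), and (5, 9).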

2.3. DATASET FOR BENCHMARK

We evaluated the design performance of MODENA with a dataset that contains the natural pseudoknotted structures taken from Pseudobase (Batenburg et al., 2000). Since the original 342 pseudoknotted structures downloaded from Pseudobase are redundant, i.e., different Pseudobase entries can share exactly the same structure, we removed the redundancy to guarantee that all structures are unique in our dataset. Consequently, we obtained 266 pseudoknotted structures for the performance evaluation. We refer to this dataset as the Pseudobase dataset.

In addition to the benchmark for the pseudoknotted target structures, we performed a benchmark for pseudoknot-free target structures, where the Rfam dataset of our previous paper (Taneda, 2011) was used. Note that a pseudoknot prediction method


(IPknot) was used as a direct problem solver in this pseudoknot-

free benchmark. The reason why we performed a benchmark for

pseudoknot-free target structures in the present study is as fol-

lows. If we use a non-pseudoknot prediction method as a direct

problem solver to design an RNA sequence, the designed RNA

sequence may fold into a pseudoknotted structure when we fold

the designed sequence with a pseudoknot prediction method. By

using a pseudoknot prediction method as a direct problem solver

for pseudoknot-free RNA sequence design, we can decrease the

probability with which undesired pseudoknots accidentally form

in the designed RNA sequence. That is, inverse folding of pseudoknotted RNAs is useful not only to design pseudoknotted RNA

sequences but also to design pseudoknot-free ones.

3. RESULTS

3.1. BENCHMARK RESULTS

We evaluated the pseudoknot design performance of MODENA with the Pseudobase dataset, where IPknot and HotKnots were used as direct problem solvers. We set both the population size and the maximum iteration number to 50 in our GA. In this performance evaluation, we obtained successfully designed RNA sequences for 207 and 198 of the 266 pseudoknotted target structures in the Pseudobase dataset by MODENA+IPknot and MODENA+HotKnots, respectively, where MODENA+IPknot and MODENA+HotKnots denote the sequence design utilizing IPknot and HotKnots as a direct problem solver, respectively (“successfully designed RNAs” means RNA sequences that fold into the input target structure). Inv obtained successfully designed RNAs for 181 pseudoknotted target structures with the same dataset. Figure 2 shows the sequence length dependence of the pseudoknot design performances for MODENA and Inv, where the performance is indicated by “the rate of successfully designed RNAs” = 100 × (number of the target structures for which a successfully designed RNA is obtained)/(total number of the target structures). The total number of the target structures included in each length bin is given in Figure 3. As can be seen from Figure 2, MODENA+IPknot outperforms Inv for all bins of sequence lengths. MODENA+IPknot showed the best performance for the length range between 21 and 60 nucleotides. For the range between 61 and 80 nucleotides, MODENA+IPknot and MODENA+HotKnots have comparable performances. For longer target structures with lengths from 81 to 140 nucleotides, MODENA+HotKnots gives the best results among MODENA+IPknot, MODENA+HotKnots, and Inv.
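The rate of successfully designed RNAs can also be evaluated over the whole Pseudobase dataset from the counts given above. The Python sketch below applies the same formula to the overall counts (207, 198, and 181 of 266); the per-bin counts behind Figure 2 are not given in the text, so they are not reproduced here, and the function and variable names are hypothetical.

```python
def success_rate(n_success, n_total):
    """Rate of successfully designed RNAs in percent."""
    return 100.0 * n_success / n_total


TOTAL_TARGETS = 266  # size of the non-redundant Pseudobase dataset
successes = {
    "MODENA+IPknot": 207,
    "MODENA+HotKnots": 198,
    "Inv": 181,
}

for method, n_success in successes.items():
    print(f"{method}: {success_rate(n_success, TOTAL_TARGETS):.1f}%")
```

Over the whole dataset this gives approximately 77.8% for MODENA+IPknot, 74.4% for MODENA+HotKnots, and 68.0% for Inv.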

For the target structures longer than 85 nucleotides, Inv completely failed to design pseudoknots. MODENA also could not obtain successfully designed pseudoknotted RNAs when the target structures were longer than 137 nucleotides. Note that the number of target structures longer than 81 nucleotides is much smaller than that of the shorter target structures (Figure 3); a benchmark with more long target structures may give a different result. Details of the results for the Pseudobase dataset are tabulated in Table S1 in Supplementary Material, which is downloadable from the MODENA website.

FIGURE 2 | Sequence length dependence of the pseudoknot design performances for MODENA and Inv. “MODENA+IPknot” and “MODENA+HotKnots” denote the sequence designs using IPknot and HotKnots as a direct problem solver, respectively. In each bin, results for MODENA+IPknot, MODENA+HotKnots, and Inv are shown from left to right.

FIGURE 3 | Distribution of the target structures in the Pseudobase dataset.

To examine whether a larger calculation, in which both the population size and the iteration number are set to 100, improves the pseudoknot design performance, we performed the inverse folding of the 59 target structures for which the design failed when we used a value of 50. By using the larger parameter values, we successfully designed 15 pseudoknots (Pseudobase PKB-number: PKB00050, PKB00129, PKB00138, PKB00148, PKB00170, PKB00171, PKB00178, PKB00179, PKB00211, PKB00217, PKB00219, PKB00228, PKB00267, PKB00329, PKB00333) of the 59 target structures. These results indicate that a larger calculation can improve the design performance; note, however, that the computational time for the larger calculation becomes longer, i.e., there is a tradeoff between computational time and design performance.

FIGURE 4 | The logarithm of the computational times needed for the Pseudobase benchmark of MODENA+IPknot, MODENA+HotKnots, and Inv, plotted as log10(time in seconds) against sequence length (nucleotides). Each symbol corresponds to one target structure. Each computational time for Inv is the mean over fifty independent runs. The results of failed Inv runs are not included in this figure.

The logarithm of the computational times needed for the Pseudobase benchmark of MODENA+IPknot, MODENA+HotKnots, and Inv is plotted in Figure 4. The computational times were measured on a Core i7 PC (3.33 GHz; 24 GB memory; CentOS 5.6 [x86_64]). Since we performed fifty independent runs with Inv for each target structure, the mean computational times for the target structures are used as the computational

time of Inv in Figure 4 (and in Table S1 in Supplementary Material). Figure 4 clearly reveals the difference between the two direct problem solvers we used for MODENA in the present study; i.e., IPknot is much faster than HotKnots. For the target structures shorter than 50 nucleotides, Inv is faster than MODENA. However, for longer target structures, we found that Inv often becomes much slower than MODENA+IPknot. In addition, Inv completely failed to design the pseudoknotted RNAs longer than 85 nucleotides. Inv quickly terminates its calculation when the inverse folding of the input target structure is impossible (Inv analyzes the input target structure before performing a stochastic search). The results of such terminated calculations are not plotted in Figure 4; the computational times of the terminated Inv runs can be seen in Table S1 in Supplementary Material.

To compare the convergence properties of the different direct problem solvers, we averaged the converged GA iteration numbers over all target structures in the Pseudobase dataset (where we used the results for a population size and a maximum iteration number of 50). As convergence criteria, similar to Taneda (2011), the GA iteration stops when (i) the maximum iteration number is reached or (ii) the number of weak Pareto optimal solutions has not changed during 30 consecutive iterations. As a result, we found that MODENA needs on average 41.8 and 36.9 GA iterations when IPknot and HotKnots, respectively, are used as a direct problem solver.
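The two stopping criteria can be expressed as a small predicate on the history of the number of weak Pareto optimal solutions. The Python sketch below is an assumption-laden illustration: the function name and parameter names are hypothetical, and "not changed during 30 consecutive iterations" is interpreted here as "the last 31 recorded values are identical", which may differ in detail from the bookkeeping used in MODENA.

```python
def should_stop(front_sizes, max_iterations=50, patience=30):
    """Decide whether the GA should stop.

    front_sizes[t] is the number of weak Pareto optimal solutions recorded
    after iteration t + 1.
    """
    # Criterion (i): the maximum iteration number has been reached.
    if len(front_sizes) >= max_iterations:
        return True
    # Criterion (ii): no change over the last `patience` iterations, i.e. the
    # value before those iterations and all values since then are equal.
    if len(front_sizes) > patience:
        recent = front_sizes[-(patience + 1):]
        return len(set(recent)) == 1
    return False


print(should_stop([5] * 31))              # unchanged for 30 iterations
print(should_stop([1, 2] + [3] * 30))     # changed 30 iterations ago
print(should_stop(list(range(1, 51))))    # maximum iteration number reached
```

With a maximum of 50 iterations and a patience of 30, criterion (ii) can fire no earlier than iteration 31 under this reading.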

The inverse folding results for the Rfam dataset, which is composed of pseudoknot-free target structures, are summarized in Table 1, where only the results obtained by MODENA+IPknot are shown. This is because the target structure lengths of the Rfam dataset are too long for MODENA+HotKnots in terms of computational time, and Inv is limited to short target structures. In this benchmark for the pseudoknot-free target structures, MODENA successfully designed RNA sequences for 22 of the 29 target structures. This result is comparable to our previous result (Taneda, 2011) obtained by using direct problem solvers that cannot predict pseudoknots. The present result indicates that pseudoknot prediction methods are useful even for designing pseudoknot-free RNA sequences, since they reduce the possibility of accidental pseudoknot formation when designing pseudoknot-free RNAs.

Table 1 | The benchmark results for the pseudoknot-free Rfam dataset.

Rfam AC   Rfam ID           l (nt)  succ.   GC_high  GC_low  t (s)
RF00001   5S_rRNA           117     0/50    51       33      20.253
RF00002   5_8S_rRNA         151     24/50   39       34      34.430
RF00003   U1                161     0/50    60       38      34.468
RF00004   U2                193     37/50   76       35      54.946
RF00005   tRNA              74      40/50   39       12      9.523
RF00006   Vault             89      30/50   35       14      11.024
RF00007   U12               154     39/50   74       37      35.436
RF00008   Hammerhead_3      54      39/50   24       9       6.056
RF00009   RNaseP_nuc        348     0/50    84       74      160.996
RF00010   RNaseP_bact_a     357     0/50    149      143     314.373
RF00011   RNaseP_bact_b     382     0/50    158      154     305.932
RF00012   U3                215     38/50   71       33      58.912
RF00013   6S                185     39/50   79       40      48.594
RF00014   DsrA              87      36/50   51       17      13.750
RF00015   U4                140     36/50   46       26      28.434
RF00016   SNORD14           129     38/50   32       4       19.416
RF00017   SRP_euk_arch      301     26/50   164      116     205.422
RF00018   CsrB              360     23/50   77       56      192.333
RF00019   Y_RNA             83      38/50   39       13      11.252
RF00020   U5                119     0/50    53       22      20.646
RF00021   Spot_42           118     44/50   67       22      26.401
RF00022   GcvB              148     31/50   58       30      28.952
RF00024   Telomerase-vert   451     27/50   172      119     367.750
RF00025   Telomerase-cil    210     35/50   59       41      47.194
RF00026   U6                102     33/50   8        3       10.493
RF00027   Let-7             79      34/50   59       18      14.306
RF00028   Intron_gpI        344     0/50    85       61      192.231
RF00029   Intron_gpII       73      32/50   30       14      8.725
RF00030   RNase_MRP         340     35/50   105      76      151.637

“l”, “succ.”, and “t” columns represent the length (= number of nucleotides) of a target structure, a success rate, and a computational time in seconds, respectively; x/y indicates a “success rate” in such a way that we obtained x successfully designed sequences when we used a GA population size of y. “GC_high” and “GC_low” are the highest and lowest n_GC values, respectively, where n_GC is the total number of guanine and cytosine pairs in the predicted base pairs. Computational times were measured on a Core i7 PC (3.33 GHz; 24 GB memory; CentOS 5.6 [x86_64]).

3.2. DESIGN WITH SEQUENCE CONSTRAINTS

To demonstrate the sequence constraint function in MODENA, we performed an RNA inverse folding with the secondary structure and sequence of a known hepatitis delta virus (HDV) self-cleaving ribozyme, which has been used as a prototype for generating artificial ribozymes (Schultes and Bartel, 2000). The pseudoknotted secondary structure and the sequence motifs (key nucleotides) of the HDV ribozyme design were taken from Figure 1 in the paper by Schultes and Bartel (2000). The key nucleotides, which are important for the activity of the ribozyme, were used as constraint sequences. By using MODENA+IPknot with a population size of 100 and an iteration number of 100, we successfully designed 8 RNA sequences folding into the structure of the prototype HDV ribozyme with the constraint sequence motifs. The 8 designed HDV ribozyme sequences are shown in Figure 5, in which the target structure, constraint sequences, and nucleotide positions are also indicated. As can clearly be seen from the figure, the 8 designed sequences share all constraint sequences. Moreover, interestingly, the 8 designed sequences are highly “conserved”. To illustrate the sequence conservation among the designed sequences, we drew the sequence logo of the 8 sequences by using WebLogo (Crooks et al., 2004; Figure 6). The low sequence conservation in the region between positions 51 and 74 is mainly due to seq3, since seq3 has a very different subsequence from the other sequences in this region. Except for the region between positions 51 and 74, seq3 is very similar to the other sequences; hence we can infer that seq3 shares an ancestral sequence with the other seven successfully designed sequences in our GA. In addition, the region between positions 51 and 74 corresponds to a hairpin structure [the P4 stem + L4 loop (Schultes and Bartel, 2000)] of the HDV ribozyme. These results imply that the sequence difference between seq3 and the other seven successfully designed sequences in the region between positions 51 and 74 was generated by structural n-point crossover in our GA.

FIGURE 5 | Eight HDV ribozyme sequences designed by MODENA. The top eight rows are designed RNA sequences. Trgt and cnst rows correspond to the target pseudoknotted secondary structure in bracket notation and constraint sequences, respectively. A set of pos1 and pos2 indicates a nucleotide position.

This constrained design of the HDV ribozyme is a relatively hard calculation: we could not design the HDV ribozyme with the constraints when we set the population size and the iteration number to a smaller value, 50, and MODENA+HotKnots failed to design the pseudoknotted ribozyme even when we set both the population size and the iteration number to 100.

FIGURE 6 | The sequence logo for the eight designed HDV ribozyme sequences. The sequence logo was generated by WebLogo (Crooks et al., 2004).


DISCUSSION

We have proposed a multi-objective genetic algorithm for pseudoknotted RNA sequence design, which is a modified version of

our previous pseudoknot-free RNA design algorithm. Important

differences between the current version which can design pseudo-

knots and the previous pseudoknot-free version are as follows.

(i) We utilize a new structural n-point crossover operator in

the current version, by which we can generate child solutions

without breaking complementary relationships in parent solu-

tions even when pseudoknots are included in the target struc-

ture. (ii) We allow MODENA to use pseudoknotted RNA struc-

ture prediction methods as direct problem solver. As a result,

the current version of MODENA can directly evaluate whether

designed sequences have a desired pseudoknot structure or

not. This feature is indispensable for the inverse engineering

of pseudoknotted RNAs. (iii) The third important point intro-

duced in the current version of MODENA is sequence con-

straint. Since the current version of MODENA can work as both

pseudoknotted and pseudoknot-free RNA sequence designer, the

sequence constraint function of MODENA can be utilized to

design not only pseudoknotted RNAs but also pseudoknot-free

ones.

The new version of MODENA, in which the new features for pseudoknot design are implemented, was tested with two benchmark datasets: the Pseudobase dataset, which is a non-redundant dataset composed of 266 target structures taken from Pseudobase, and the Rfam dataset, which does not contain pseudoknots. On both datasets, MODENA showed high sequence design performance. For the Pseudobase dataset, another pseudoknot design algorithm, Inv, was also benchmarked, and we found that MODENA successfully designs pseudoknotted RNAs for more target structures than Inv.

The sequence constraint function of MODENA was tested through the inverse folding of an HDV ribozyme. In this test, we successfully obtained eight RNA sequences that fold into the target pseudoknotted structure of the HDV ribozyme. All eight designed RNA sequences contain the key nucleotides important for the activity of the ribozyme, which were specified as sequence constraints when running MODENA.
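A sequence constraint of this kind can be illustrated with a short sketch. It is written under assumptions and does not describe MODENA's actual input format: the constraint is assumed to be a string of IUPAC nucleotide codes of the same length as the sequence, with `N` marking unconstrained positions, and the function names are hypothetical. The mutation shown here handles only the sequence constraint; keeping paired positions complementary would need additional logic.

```python
import random

# IUPAC nucleotide codes for RNA. Using them as the constraint alphabet
# is an assumption of this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def satisfies_constraint(sequence, constraint):
    """Return True if every base is allowed by the code at the same position."""
    if len(sequence) != len(constraint):
        return False
    return all(base in IUPAC[code] for base, code in zip(sequence, constraint))


def constrained_mutation(sequence, constraint, rate=0.1, rng=random):
    """Hypothetical point mutation that resamples bases only from the allowed set."""
    mutated = []
    for base, code in zip(sequence, constraint):
        if rng.random() < rate:
            # A fixed position (e.g. code "G") can only be redrawn as itself.
            base = rng.choice(IUPAC[code])
        mutated.append(base)
    return "".join(mutated)
```

For example, with the constraint `NNGCNNYN`, positions 3 and 4 stay `G` and `C` in every mutated sequence, which is how key nucleotides could be preserved throughout a design run.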

The present results clearly indicate that the multi-objective genetic algorithm is a promising approach for the inverse folding of pseudoknotted RNA. One important issue concerning computational inverse folding is the question “Do the designed sequences truly fold into the target structure in vivo and/or in vitro?”, in other words, the reliability of the design. Although theorists have no answer to this question, it is noteworthy that inverse folding methods can be improved as structure prediction methods improve. In inverse folding, the prediction accuracy of the direct problem solver (the structure prediction method) determines the design reliability. As an extreme case, if we could use a perfect structure prediction method as the direct problem solver (where “perfect” means that the RNA sequence for which a structure is predicted strictly folds into the predicted structure in vivo and/or in vitro), the designed RNA sequence would fold perfectly into the target structure. Recent drastic progress in RNA structure prediction methods has enabled us to perform very accurate and efficient RNA structure prediction. Improvement of structure prediction methods will continue not only at the secondary structure level but also at the tertiary structure level (Das and Baker, 2007; Parisien and Major, 2008), and the design reliability of RNA inverse folding methods will also continue to improve.
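The dependence of a design check on the direct problem solver can be sketched as follows. This is an illustration under assumptions, not MODENA's objective function: structures are assumed to be dot-bracket strings with several bracket types, the base-pair distance is one possible measure chosen here for simplicity, and `predict` is a hypothetical callable standing in for any structure prediction method.

```python
# Bracket types used for pseudoknotted dot-bracket strings (an assumption of this sketch).
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def base_pairs(structure):
    """Return the set of (i, j) index pairs, i < j, encoded by a dot-bracket string."""
    closers = {close: open_ for open_, close in BRACKETS.items()}
    stacks = {open_: [] for open_ in BRACKETS}
    pairs = set()
    for i, ch in enumerate(structure):
        if ch in BRACKETS:
            stacks[ch].append(i)
        elif ch in closers:
            pairs.add((stacks[closers[ch]].pop(), i))
    return pairs


def base_pair_distance(predicted, target):
    """Number of base pairs present in exactly one of the two structures."""
    return len(base_pairs(predicted) ^ base_pairs(target))


def is_successful_design(sequence, target, predict):
    """Judge a design with a caller-supplied predictor (hypothetical interface).

    `predict` maps a sequence to a dot-bracket string, so the verdict is
    only as reliable as the predictor that is plugged in.
    """
    return base_pair_distance(predict(sequence), target) == 0
```

Comparing sets of base pairs rather than raw strings means that two notations of the same pseudoknot that merely swap bracket types are treated as identical. Replacing `predict` with a more accurate method changes the verdict without any change to the design loop, which is the sense in which design reliability follows prediction accuracy.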

ACKNOWLEDGMENTS

This work was partially supported by KAKENHI (22700304) and a “Grant for Hirosaki University Institutional Research”.

REFERENCES

Andronescu, M., Fejes, A. P., Hutter, F., Hoos, H. H., and Condon, A. (2004). A new algorithm for RNA secondary structure design. J. Mol. Biol. 336, 607–624.

Batenburg, F. H. V., Gultyaev, A. P., Pleij, C. W., Ng, J., and Oliehoek, J. (2000). PseudoBase: a database with RNA pseudoknots. Nucleic Acids Res. 28, 201–204.

Breaker, R. R. (2004). Natural and engineered nucleic acids as tools to explore biology. Nature 432, 838–845.

Busch, A., and Backofen, R. (2006). INFO-RNA–a fast approach to inverse RNA folding. Bioinformatics 22, 1823–1831.

Condon, A., Davy, B., Rastegari, B., Zhao, S., and Tarrant, F. (2004). Classifying RNA pseudoknotted structures. Theor. Comp. Sci. 320, 35–50.

Crooks, G. E., Hon, G., Chandonia, J. M., and Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190.

Das, R., and Baker, D. (2007). Automated de novo prediction of native-like RNA tertiary structures. Proc. Natl. Acad. Sci. U.S.A. 104, 14664–14669.

Deb, K. (2001). Multi-Objective Optimization using Evolutionary Algorithms. Chichester: John Wiley & Sons.

Gao, J. Z., Li, L. Y., and Reidys, C. M. (2010). Inverse folding of RNA pseudoknot structures. Algorithms Mol. Biol. 5, 27.

Goldberg, D. E. (1987). Genetic Algorithms in Search, Optimization and Machine Learning. New York: Addison-Wesley.

Hamada, M., Kiryu, H., Sato, K., Mituyama, T., and Asai, K. (2009). Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 25, 465–473.

Hofacker, I. (2003). Vienna RNA secondary structure server. Nucleic Acids Res. 31, 3429–3431.

Hofacker, I., Fontana, W., Stadler, P., Bonhoeffer, L., Tacker, M., and Schuster, P. (1994). Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 167–188.

Hoos, H. H., and Stützle, T. (2004). Stochastic Local Search: Foundations and Applications. San Francisco: Elsevier/Morgan Kaufmann.

Jaeger, L., Westhof, E., and Leontis, N. B. (2001). TectoRNA: modular assembly units for the construction of RNA nano-objects. Nucleic Acids Res. 29, 455–463.

Markham, N. R., and Zuker, M. (2008). UNAFold: software for nucleic acid folding and hybridization. Methods Mol. Biol. 453, 3–31.

Parisien, M., and Major, F. (2008). The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 452, 51–55.

Ren, J., Rastegari, B., Condon, A., and Hoos, H. H. (2005). HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA 11, 1494–1504.

Sato, K., Kato, Y., Hamada, M., Akutsu, T., and Asai, K. (2011). IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics 27, 85–93.

Schultes, E. A., and Bartel, D. P. (2000). One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science 289, 448–452.

Schwab, R., Ossowski, S., Riester, M., Warthmann, N., and Weigel, D. (2006). Highly specific gene silencing by artificial microRNAs in Arabidopsis. Plant Cell 18, 1121–1133.

Staple, D. W., and Butcher, S. E. (2005). Pseudoknots: RNA structures with diverse functions. PLoS Biol. 3, e213. doi:10.1371/journal.pbio.0030213

Taneda, A. (2011). MODENA: a multi-objective RNA inverse folding. Adv. Appl. Bioinform. Chem. 4, 1–12.

Zadeh, J. N., Steenberg, C. D., Bois, J. S., Wolfe, B. R., Pierce, M. B., Khan, A. R., Dirks, R. M., and Pierce, N. A. (2011). NUPACK: analysis and design of nucleic acid systems. J. Comput. Chem. 32, 170–173.

Conflict of Interest Statement: The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Received: 10 November 2011; accepted: 25 February 2012; published online: 26 April 2012.

Citation: Taneda A (2012) Multi-objective genetic algorithm for pseudoknotted RNA sequence design. Front. Gene. 3:36. doi: 10.3389/fgene.2012.00036

This article was submitted to Frontiers in Non-Coding RNA, a specialty of Frontiers in Genetics.

Copyright © 2012 Taneda. This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.
