
ORIGINAL RESEARCH ARTICLE

published: 26 April 2012

doi: 10.3389/fgene.2012.00036

Multi-objective genetic algorithm for pseudoknotted RNA sequence design

Akito Taneda*

Graduate School of Science and Technology, Hirosaki University, Hirosaki, Japan

Edited by:
Michael Rossbach, Genome Institute of Singapore, Singapore

Reviewed by:
Kengo Sato, Keio University, Japan
Prasanna R. Kolatkar, Genome Institute of Singapore, Singapore
Fei Li, Nanjing Agricultural University, China

*Correspondence:
Akito Taneda, Graduate School of Science and Technology, Hirosaki University, 3 Bunkyo-cho, Hirosaki, Aomori 036-8561, Japan.
e-mail: taneda@cc.hirosaki-u.ac.jp

RNA inverse folding is a computational technology for designing RNA sequences which fold into a user-specified secondary structure. Although pseudoknots are functionally important motifs in RNA structures, fewer studies have addressed the inverse folding of pseudoknotted RNAs than pseudoknot-free RNA design. In this paper, we present a new version of our multi-objective genetic algorithm (MOGA), MODENA, which we have previously proposed for pseudoknot-free RNA inverse folding. In the new version of MODENA, (i) a new crossover operator is implemented and (ii) pseudoknot prediction methods, IPknot and HotKnots, are used to evaluate the designed RNA sequences, allowing us to perform the inverse folding of pseudoknotted RNAs. The new version of MODENA with the new crossover operator was benchmarked with a dataset composed of natural pseudoknotted RNA secondary structures, and we found that MODENA can successfully design more pseudoknotted RNAs than the other pseudoknot design algorithm. In addition, a sequence constraint function newly implemented in the new version of MODENA was tested by designing RNA sequences which fold into the pseudoknotted structure of a hepatitis delta virus ribozyme; as a result, we successfully designed eight RNA sequences. The new version of MODENA is downloadable from http://rna.eit.hirosaki-u.ac.jp/modena/.

Keywords: inverse folding, pseudoknot, secondary structure, pseudobase, Rfam, sequence constraint

1. INTRODUCTION

Evolutionarily related non-coding RNAs have their own characteristic secondary structure corresponding to each function, and it is well known that the secondary structures play key roles in the functions of the RNA sequences. The biochemical knowledge accumulated to date indicates that we can generate functional synthetic RNAs if we can control the secondary structure of the RNAs. In this context, various synthetic RNAs, such as ribozymes (Schultes and Bartel, 2000), micro RNAs (Schwab et al., 2006), riboswitches (Breaker, 2004), and RNA nano structures (Jaeger et al., 2001) have been successfully designed.

RNA inverse folding is a computational methodology for designing RNA sequences which fold into a given target structure (Hofacker et al., 1994). The name “inverse” reflects that inverse folding is defined as the inverse problem of RNA secondary structure prediction, where the RNA secondary structure prediction problem is referred to as the “direct problem”. Since there are usually multiple solutions for an RNA inverse folding problem and we have no deterministic algorithm which can enumerate all solutions of a given RNA inverse folding problem, previous RNA inverse folding algorithms have adopted heuristic approaches to find desired RNA sequences. We can find the following six RNA inverse folding algorithms in the literature: local search algorithms [RNAinverse (Hofacker et al., 1994), RNA-SSD (Andronescu et al., 2004), INFO-RNA (Busch and Backofen, 2006), Inv (Gao et al., 2010), design in NUPACK (Zadeh et al., 2011)] and a genetic algorithm [GA; MODENA (Taneda, 2011)].

The local search algorithms are well characterized by their initialization step and refinement step in the exploration procedures for obtaining desired RNA sequences. First, in these local search approaches, a single RNA sequence is generated. The pioneering RNA inverse folding algorithm, RNAinverse, uses a purely random initialization. RNA-SSD randomly initializes an RNA sequence in a more sophisticated manner, where a base composition and a tabu mechanism for avoiding undesired stem formation are taken into account. INFO-RNA utilizes a dynamic programming algorithm to obtain a good initial sequence for the RNA inverse folding, where the lowest energy RNA sequence, determined assuming that the RNA sequence folds into the given structure, is used as an initial sequence. In the refinement step after the initialization, RNAinverse performs an adaptive walk to improve the initial sequence, and RNA-SSD and INFO-RNA use stochastic local search (Hoos and Stützle, 2004) to improve the initial sequence. In the refinement step, RNAinverse and RNA-SSD employ a structure decomposition strategy to reduce the number of folding calculations for a whole sequence. Inv and NUPACK also utilize structure decomposition strategies in their refinement step. Inv is an RNA inverse folding algorithm designed for a restricted pseudoknot class and can perform the inverse folding of pseudoknotted RNAs. NUPACK is a suite of programs for computational nucleic acid analysis and includes a program named design. Design generates the sequences by minimizing an ensemble defect (Zadeh et al., 2011); the lower the value of the ensemble defect, the more specifically the designed RNA sequence folds into a given target structure. MODENA

www.frontiersin.org

April 2012 | Volume 3 | Article 36 | 1


Taneda Pseudoknotted RNA sequence design

is a multi-objective genetic algorithm (MOGA; Deb, 2001) for RNA inverse folding. As objective functions, MODENA uses two quantities, a structure similarity measure and a stability measure (e.g., free energy). By virtue of simultaneous optimization of these objective functions, MODENA can explore sequences which not only fold into the desired target structure but also have a high stability (i.e., a low free energy).

Pseudoknots are important functional motifs in RNA structure (Staple and Butcher, 2005). In contrast to other RNA structure motifs such as hairpin loops, bulge loops, internal loops, and multi-branch loops, where no two base pairs (i, j) and (k, l) have a relationship such that i < k < j < l (where i, j, k, and l are nucleotide positions), pseudoknots are defined as structures containing base pairs that satisfy the condition i < k < j < l. Since pseudoknots have various enzymatic functions (Staple and Butcher, 2005), they are intriguing targets of functional RNA design. However, among the previous RNA inverse folding algorithms, only Inv can design pseudoknots. Moreover, there is no algorithm which can design pseudoknotted RNAs with sequence constraints, which are an important feature for designing a molecule with a known functional sequence motif. For these reasons, development of a novel pseudoknotted RNA inverse folding algorithm is important in order to promote the sequence design of RNA pseudoknots.
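The crossing condition i < k < j < l can be checked directly on a list of base pairs. The following is a minimal sketch, not part of MODENA; the function name `has_pseudoknot` and the representation of a structure as a list of position tuples are assumptions made for illustration.

```python
from itertools import combinations

def has_pseudoknot(pairs):
    """Return True if two base pairs (i, j) and (k, l) satisfy i < k < j < l."""
    # Normalize each pair to (upstream, downstream) and sort by upstream position.
    ordered = sorted((min(p), max(p)) for p in pairs)
    for (i, j), (k, l) in combinations(ordered, 2):
        # Because of the sort, i <= k; the pairs cross when k is inside (i, j)
        # while its partner l lies beyond j.
        if i < k < j < l:
            return True
    return False

# Hypothetical example structures (positions are arbitrary).
nested = [(1, 10), (2, 9), (3, 8)]       # hairpin-like stem, no crossing
crossing = [(1, 10), (2, 9), (5, 15)]    # (2, 9) and (5, 15): 2 < 5 < 9 < 15
```

With these inputs, `has_pseudoknot(nested)` is false and `has_pseudoknot(crossing)` is true. The quadratic scan over pair combinations is chosen for readability only.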

In this paper, we present an extension of the MODENA algorithm to the inverse folding of pseudoknotted RNAs. In the MODENA algorithm, designed RNA sequences are evaluated by performing secondary structure prediction with an RNA folding program such as RNAfold (Hofacker, 2003), and we can easily substitute a different RNA folding program (in the context of inverse folding, we refer to an RNA structure prediction program as a “direct problem solver”). This advantage of the MODENA algorithm also remains in the case of pseudoknotted RNA inverse folding, where we have to use a pseudoknotted RNA secondary structure prediction program as the direct problem solver. In the rest of the present paper, first, we describe the MODENA algorithm for pseudoknotted RNA design, where a multi-objective genetic algorithm is used in combination with state-of-the-art pseudoknotted RNA structure prediction programs. After that, the performance of the MODENA algorithm is evaluated by benchmarks based on natural RNA secondary structures, where not only pseudoknotted structures but also pseudoknot-free structures are taken into account. Then, a sequence constraint function available in MODENA is demonstrated with a biological example taken from the literature.

2. MATERIALS AND METHODS

Since the detail of the MODENA algorithm for pseudoknot-free RNAs is described in Taneda (2011) and the present version for pseudoknotted RNA design shares all parts of the previous pseudoknot-free version of MODENA, the algorithmically common parts between the two versions are briefly described below.

The MODENA algorithm is an RNA inverse folding algorithm based on MOGA. GA is a population-based algorithm for optimization and search (Goldberg, 1987), which is inspired by the mechanism of evolution. MOGA is a GA for exploring the objective function space consisting of multiple objective functions, while the standard GA uses a single objective function. In the MODENA algorithm, we use the following two objective functions, a structure stability score ? and a structure similarity score σ:

? = −E, (1)

σ = (N − d)/N, (2)

where E is the lowest free energy of a designed sequence, N is the total number of nucleotides, and d is the structure distance between the target and predicted structures (Taneda, 2011). Structure distance d is defined as the number of bases which have a different base-pairing status between the target structure and the structure predicted for the designed sequence.
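Equation 2 can be illustrated with a short calculation. This is a sketch under stated assumptions: each structure is encoded as a list holding the 0-based partner of every position (or -1 for an unpaired base), and "different base-pairing status" is read as "different partner"; both the encoding and the function names are hypothetical and not taken from MODENA.

```python
def structure_distance(target, predicted):
    """Count positions whose pairing partner differs between two structures.

    Each structure is a list where entry p is the 0-based partner of
    position p, or -1 if position p is unpaired (assumed encoding).
    """
    if len(target) != len(predicted):
        raise ValueError("structures must have the same length")
    return sum(1 for t, p in zip(target, predicted) if t != p)

def similarity_score(target, predicted):
    """Structure similarity sigma = (N - d) / N, as in Equation 2."""
    n = len(target)
    return (n - structure_distance(target, predicted)) / n

# Hypothetical 8-nucleotide example.
target = [7, 6, 5, -1, -1, 2, 1, 0]        # three nested base pairs
predicted = [7, 6, -1, -1, -1, -1, 1, 0]   # innermost pair is missing
```

Here positions 2 and 5 differ, so d = 2 and σ = (8 − 2)/8 = 0.75; a prediction identical to the target gives σ = 1.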

In the MODENA algorithm, we utilize multi-objective optimization (MOO; Deb, 2001) to explore solutions (i.e., RNA sequences) with better values of both of the two objective functions. In MOO, two solutions are compared based on their dominance. Let us consider two solutions, x_a and x_b. When “all objective function values of x_a are better than or equal to those of x_b” and “at least one objective function value of x_a is not equal to that of x_b”, x_a dominates x_b; a solution which is not dominated by any other solution is called a Pareto optimal solution. If “all objective function values of x_a are better than those of x_b”, x_a strongly dominates x_b; a weak Pareto optimal solution is defined as a solution which is not strongly dominated by any other solution. The MODENA algorithm explores weak Pareto optimal solutions for the RNA inverse folding problem (Taneda, 2011).
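The difference between dominance and strong dominance can be made concrete with a small example. The sketch below assumes that both objective functions are maximized (larger stability and similarity scores are "better") and represents each solution by a hypothetical tuple (stability, similarity); it is an illustration of the definitions, not the NSGA2 sorting procedure used inside MODENA.

```python
def dominates(a, b):
    """a dominates b: no objective is worse and at least one differs."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x != y for x, y in zip(a, b)))

def strongly_dominates(a, b):
    """a strongly dominates b: every objective is strictly better."""
    return all(x > y for x, y in zip(a, b))

def pareto_set(points):
    """Solutions that no other solution dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def weak_pareto_set(points):
    """Solutions that no other solution strongly dominates."""
    return [p for p in points
            if not any(strongly_dominates(q, p) for q in points)]

# Hypothetical (stability, similarity) values for four sequences.
points = [(10.0, 1.0), (12.0, 0.8), (10.0, 0.9), (8.0, 0.7)]
```

In this example (10.0, 0.9) is dominated by (10.0, 1.0) but not strongly dominated, since the stability scores are equal; it is therefore excluded from the Pareto set yet kept in the weak Pareto set, while (8.0, 0.7) is excluded from both.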

Since it is usually difficult to enumerate all (weak) Pareto optimal solutions for a given MOO problem, MOGA computes an approximate set (partial solutions) of the (weak) Pareto optimal solutions. MODENA is developed based on one of the standard MOGAs, the non-dominated sorting genetic algorithm 2 (NSGA2; Deb, 2001). NSGA2 proceeds in accordance with a framework similar to that of the standard GA, which is composed of initialization, evaluation, and reproduction for a population of solutions. In the initialization step, a user-defined number of solutions (RNA sequences) are randomly generated. In the present study, we use 50 or 100 solutions in one population.

In the evaluation step, we perform an RNA structure prediction for each solution in the current generation, and then assign stability and similarity scores to the solutions. We use an RNA structure prediction program as a direct problem solver to obtain the objective function values. In MODENA, RNAfold (Hofacker, 2003), CentroidFold (Hamada et al., 2009), or UNAFold (Markham and Zuker, 2008) can be used as a direct problem solver for pseudoknot-free RNA design, and IPknot (Sato et al., 2011) or HotKnots (Ren et al., 2005) can be utilized for pseudoknotted RNA sequence design (in the present study, we used IPknot 0.0.2 and HotKnots 2.0). Since IPknot does not output a free energy or some quantity indicating stability of the predicted structure, instead of using Equation 1, we assign the total number of guanine and cytosine pairs in the predicted base pairs to each solution as a stability score when we use IPknot as a direct problem solver.
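The GC-pair stability score used with IPknot amounts to a simple count over the predicted base pairs. The sketch below assumes the predicted structure has already been obtained from the direct problem solver and converted to a list of 0-based position pairs; it does not call IPknot, and the function name is hypothetical.

```python
def gc_pair_count(sequence, pairs):
    """Count predicted base pairs formed by one guanine and one cytosine."""
    seq = sequence.upper()
    return sum(1 for i, j in pairs if {seq[i], seq[j]} == {"G", "C"})

# Hypothetical designed sequence and predicted base pairs (0-based positions).
sequence = "GCAUAAAUGC"
predicted_pairs = [(0, 9), (1, 8), (2, 7)]   # G-C, C-G, A-U
```

For this example the score is 2, because the A-U pair does not contribute; G-U wobble pairs are likewise not counted under this reading of "guanine and cytosine pairs".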

Based on the solutions of the current generation, the reproduction step generates child solutions for the next generation. The MODENA

Frontiers in Genetics | Non-Coding RNA


algorithm generates child solutions by invoking three GA operators with an equal probability: structural n-point crossover, point accepted mutation, and error diagnosis mutation (Taneda, 2011); structural n-point crossover generates a child solution by concatenating subsequences taken from two parent solutions; point accepted mutation randomly changes a nucleotide; error diagnosis mutation compares the predicted and target structures, and then changes the nucleotides which have different base pairs between the predicted and target structures.

Point accepted mutation and error diagnosis mutation can be applied to pseudoknotted RNA sequence design without any modification. Structural n-point crossover, however, has to be changed for pseudoknotted RNAs, since its original algorithm (Taneda, 2011) assumes no pseudoknot in a target structure (i.e., a nucleotide k, [i < k < j], never forms a base pair with a nucleotide l [l < i or j < l]). Structural n-point crossover is composed of four steps (Taneda, 2011, p. 5), and we modified Step 2 to take pseudoknots into account. After a crossover parameter n_c (= 2 in the present study) and a randomly determined x_0 (= 0 or 1) are given, structural n-point crossover allowing pseudoknots is performed as follows:

Step 1 Set l = 0 and set x_i = x_0 for all i (1 ≤ i ≤ N; N is a sequence length). Randomly select a base pair (i, j) (1 ≤ i < j ≤ N).

Step 2 For each x_k (i ≤ k ≤ j) which does not form a base pair with x_l (l < i or j < l) in the target structure, perform the following: if x_k is zero, change x_k to one, otherwise change x_k to zero. Then increment l by one.

Step 3 If l < n_c and “the number of the base pairs whose upstream nucleotide position m satisfies i < m (where m < N)” is larger than or equal to one, randomly select a base pair (i_new, j_new), where i < i_new < j_new ≤ N; then we set i = i_new and j = j_new, and move to Step 2; otherwise we go to Step 4.

Step 4 Generate a child solution according to x_i for all i (1 ≤ i ≤ N); if x_i = 0, copy the value of a nucleotide s_i^A in parent A to the corresponding nucleotide s_i^child of the child solution; if x_i = 1, the value of a nucleotide s_i^B in parent B is copied to s_i^child.

Note that Step 1, Step 3, and Step 4 are exactly the same as those in Taneda (2011). By using this modified version, we can cross over two parent solutions without destroying any base complementarity in the target structure. An example of structural n-point crossover is depicted in Figure 1.
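The four steps can be sketched in a few lines of code. This is an illustrative reading of the description above, not the MODENA source: positions are 0-based, the target structure is assumed to be given as a partner list with -1 for unpaired bases, a separate counter is used for the number of passes through Step 2, and the function and variable names are hypothetical.

```python
import random

def structural_crossover(parent_a, parent_b, partner, n_c=2, rng=random):
    """Sketch of structural n-point crossover allowing pseudoknots."""
    n = len(partner)
    # Base pairs of the target structure as (upstream, downstream) tuples.
    pairs = [(p, q) for p, q in enumerate(partner) if q > p]
    if not pairs:
        raise ValueError("target structure has no base pairs")
    # Step 1: initialize the indicator with x_0 and pick a random base pair.
    x = [rng.randint(0, 1)] * n
    count = 0
    i, j = rng.choice(pairs)
    while True:
        # Step 2: flip x_k inside [i, j] unless k pairs with a position outside.
        for k in range(i, j + 1):
            q = partner[k]
            if q == -1 or i <= q <= j:
                x[k] ^= 1
        count += 1
        # Step 3: continue with a base pair that starts downstream of i, if any.
        downstream = [pq for pq in pairs if pq[0] > i]
        if count < n_c and downstream:
            i, j = rng.choice(downstream)
        else:
            break
    # Step 4: copy each nucleotide from parent A (x = 0) or parent B (x = 1).
    child = "".join(a if flag == 0 else b
                    for a, b, flag in zip(parent_a, parent_b, x))
    return child, x

# Hypothetical pseudoknotted target: two stems whose base pairs cross.
pairs = [(0, 12), (1, 11), (2, 10), (5, 18), (6, 17), (7, 16)]
partner = [-1] * 20
for p, q in pairs:
    partner[p], partner[q] = q, p

child, x = structural_crossover("A" * 20, "G" * 20, partner,
                                rng=random.Random(0))
```

Because a position is flipped only together with its partner, both nucleotides of every target base pair are inherited from the same parent, which is the property that keeps base complementarity intact.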

2.1. SEQUENCE CONSTRAINTS

Since the previous version of MODENA does not support sequence constraints, we have added this function to the present version of MODENA. The sequence constraints of MODENA can be specified in accordance with the IUPAC notation of nucleotide codes. By using the sequence constraints, the user can design pseudoknotted RNA sequences with sequence motifs specified by the user.
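As an illustration of IUPAC-based constraints, the following sketch checks whether an RNA sequence satisfies a constraint string written in IUPAC nucleotide codes. The table lists the standard IUPAC codes for RNA; the function name and the one-code-per-position input format are assumptions and do not describe the actual MODENA input syntax.

```python
# Standard IUPAC nucleotide codes, written for RNA (U instead of T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}

def satisfies_constraint(sequence, constraint):
    """True if every nucleotide is allowed by the code at its position."""
    if len(sequence) != len(constraint):
        raise ValueError("sequence and constraint must have the same length")
    return all(base in IUPAC[code]
               for base, code in zip(sequence.upper(), constraint.upper()))
```

For example, the constraint `NNRYS` accepts `GGAUC` but rejects `GGCUC`, because C is not a purine (R). A design loop could use such a check to restrict mutations to the nucleotides allowed at each position.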

2.2. A NOTE ON INPUT TARGET STRUCTURE

In MODENA, the user inputs a target structure using a bracket notation, where (), <>, {}, [], and letters (AaBbCcDdEe) are allowed to specify a base pair (where uppercase and lowercase letters indicate upstream and downstream nucleotide positions, respectively). The user can freely input a target structure using these bracket notations, but note that if the direct problem solver selected by the user cannot predict the class (e.g., Condon et al., 2004) of the input pseudoknot, the user will never obtain sequences folding into the target structure.

FIGURE 1 | An example of the structural n-point crossover operator for a pseudoknotted target structure. Trgt and x_k indicate a target structure and a crossover position indicator, x_k, respectively. (A) An initial state. All x_k are set to zero. (B) Position i is randomly selected. Position j is the position complementary to position i. (C) The values of x_k between i and j are changed to 1. The values in the pseudoknotted region are not changed. The positions where x_k = 1 are shaded. We can use the x_k obtained after this step as a crossover position indicator for a 4-point crossover. (D) In addition, we can randomly select one more position i to increase the crossover points. (E) In this example, as a result, we can obtain a crossover position indicator for a 6-point crossover, which is composed of 7 subsequence regions.
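A bracket notation with several bracket types can be converted to a list of pairing partners by keeping one stack per bracket type. The sketch below is an assumption-laden illustration rather than MODENA's parser: it treats every character that is not one of the listed brackets or letters (for example ".") as an unpaired position, and it returns 0-based partner indices with -1 for unpaired bases.

```python
def parse_structure(structure):
    """Convert a multi-bracket structure string to a partner list."""
    openers = {"(": ")", "<": ">", "{": "}", "[": "]"}
    openers.update({c: c.lower() for c in "ABCDE"})   # A pairs with a, etc.
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    partner = [-1] * len(structure)
    for pos, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(pos)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            mate = stack.pop()
            partner[pos], partner[mate] = mate, pos
        # Any other character is treated as an unpaired position (assumption).
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return partner

# Hypothetical H-type pseudoknot written with two bracket types.
example = parse_structure("((..[[..))..]]")
```

Using separate stacks is what allows crossing base pairs: the square brackets in the example open inside the round-bracket stem and close outside it, which a single-stack parser would reject.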

2.3. DATASET FOR BENCHMARK

We evaluated the design performance of MODENA with a dataset which contains the natural pseudoknotted structures taken from Pseudobase (Batenburg et al., 2000). Since the original 342 pseudoknotted structures downloaded from Pseudobase are redundant, i.e., different Pseudobase entries can share strictly the same structure, we removed the redundant entries to guarantee that all structures are unique in our dataset. Consequently, we obtained 266 pseudoknotted structures for the performance evaluation. We refer to this dataset as the Pseudobase dataset.

In addition to the benchmark for the pseudoknotted target structures, we performed a benchmark for pseudoknot-free target structures, where the Rfam dataset of our previous paper (Taneda, 2011) was used. Note that a pseudoknot prediction method (IPknot) was used as a direct problem solver in this pseudoknot-free benchmark. The reason why we performed a benchmark for pseudoknot-free target structures in the present study is as follows. If we use a non-pseudoknot prediction method as a direct problem solver to design an RNA sequence, the designed RNA sequence may fold into a pseudoknotted structure when we fold the designed sequence with a pseudoknot prediction method. By using a pseudoknot prediction method as a direct problem solver for pseudoknot-free RNA sequence design, we can decrease the probability with which undesired pseudoknots accidentally form in the designed RNA sequence. That is, inverse folding of pseudoknotted RNAs is useful not only to design pseudoknotted RNA sequences but also to design pseudoknot-free ones.

3. RESULTS

3.1. BENCHMARK RESULTS

We evaluated the pseudoknot design performance of MODENA with the Pseudobase dataset, where IPknot and HotKnots were used as direct problem solvers. We set both the population size and the maximum iteration number to 50 in our GA. In this performance evaluation, MODENA+IPknot and MODENA+HotKnots obtained successfully designed RNA sequences for 207 and 198, respectively, of the 266 pseudoknotted target structures of the Pseudobase dataset, where MODENA+IPknot and MODENA+HotKnots denote the sequence design utilizing IPknot and HotKnots as a direct problem solver, respectively (“successfully designed RNAs” means the RNA sequences which fold into the input target structure). Inv obtained successfully designed RNAs for 181 pseudoknotted target structures with the same dataset. Figure 2 shows the sequence length dependence of the pseudoknot design performances for MODENA and Inv, where the performance is indicated by “the rate of successfully designed RNAs” = 100 × (number of the target structures for which a successfully designed RNA is obtained)/(total number of the target structures). The total number of the target structures included in each length bin is given in Figure 3. As can be seen from Figure 2, MODENA+IPknot outperforms Inv for all bins of sequence lengths. MODENA+IPknot showed the best performance for the length range between 21 and 60 nucleotides. For the range between 61 and 80 nucleotides, MODENA+IPknot and MODENA+HotKnots have comparable performances. For longer target structures with lengths from 81 to 140 nucleotides, MODENA+HotKnots gives the best results among MODENA+IPknot, MODENA+HotKnots, and Inv.
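As a worked example of the rate formula, the overall counts reported above can be converted to percentages. Applying the formula to the whole dataset rather than to a single length bin is an assumption made here for illustration; the per-bin values plotted in Figure 2 are not reproduced.

```python
def success_rate(n_success, n_targets):
    """Rate of successfully designed RNAs, in percent."""
    return 100.0 * n_success / n_targets

# Counts reported in the text for the 266 Pseudobase target structures.
TOTAL_TARGETS = 266
successes = {"MODENA+IPknot": 207, "MODENA+HotKnots": 198, "Inv": 181}

rates = {name: round(success_rate(n, TOTAL_TARGETS), 1)
         for name, n in successes.items()}
```

This gives overall rates of 77.8% for MODENA+IPknot, 74.4% for MODENA+HotKnots, and 68.0% for Inv.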

For the target structures longer than 85 nucleotides, Inv completely failed to design pseudoknots. MODENA also could not obtain successfully designed pseudoknotted RNAs when the target structures were longer than 137 nucleotides. Note that the number of target structures longer than 81 nucleotides is much smaller than that of the shorter target structures (Figure 3); a benchmark with more long target structures may give a different result. Details of the results for the Pseudobase dataset are tabulated in Table S1 in Supplementary Material, which is downloadable from the MODENA website.

FIGURE 2 | Sequence length dependence of the pseudoknot design performances for MODENA and Inv. “MODENA+IPknot” and “MODENA+HotKnots” denote the sequence designs using IPknot and HotKnots as a direct problem solver, respectively. In each bin, results for MODENA+IPknot, MODENA+HotKnots, and Inv are shown from left to right.

FIGURE 3 | Distribution of the target structures in the Pseudobase dataset.

To examine whether a larger calculation, where both the population size and the iteration number are set to 100, improves the pseudoknot design performance, we performed the inverse folding of the 59 target structures for which the design failed when we used a value of 50. By using the larger parameter values, we successfully designed 15 pseudoknots (Pseudobase PKB-number: PKB00050, PKB00129, PKB00138, PKB00148, PKB00170, PKB00171, PKB00178, PKB00179, PKB00211, PKB00217, PKB00219, PKB00228, PKB00267, PKB00329, PKB00333) of the 59 target structures. These results indicate that a larger calculation can improve the design performance; note that the computational time for the larger calculation becomes longer, i.e., there is a tradeoff between computational time and design performance.

FIGURE 4 | The logarithm of the computational times needed for the Pseudobase benchmark of MODENA+IPknot, MODENA+HotKnots, and Inv. Each symbol corresponds to one target structure. Each computational time for Inv is the mean over fifty independent runs. The results of failed Inv runs are not included in this figure. (Horizontal axis: sequence length in nucleotides; vertical axis: log10 of time in seconds.)

The logarithm of the computational times needed for the Pseudobase benchmark of MODENA+IPknot, MODENA+HotKnots, and Inv is plotted in Figure 4. The computational times were measured on a Core i7 PC (3.33 GHz; 24 GB memory; CentOS 5.6 [x86_64]). Since we performed fifty independent runs with Inv for each target structure, the mean computational times for the target structures are used as the computational