
January 7, 2004 18:21 WSPC/INSTRUCTION FILE local-ssa

Journal of Bioinformatics and Computational Biology

© Imperial College Press

LOCAL SEQUENCE-STRUCTURE MOTIFS IN RNA

ROLF BACKOFEN and SEBASTIAN WILL

Chair for Bioinformatics at the Institute of Computer Science,

Friedrich-Schiller-Universitaet Jena, Ernst-Abbe-Platz 2,

D-07743 Jena, Germany,

{backofen,will}@inf.uni-jena.de

RNA enjoys increasing interest in molecular biology; despite this interest, fundamental algorithms are lacking, e.g. for identifying local motifs. Like proteins, RNA molecules have a distinctive structure. Therefore, in addition to sequence information, structure plays an important part in assessing the similarity of RNAs. Furthermore, common sequence-structure features in two or several RNA molecules are often only spatially local, where possibly large parts of the molecules are dissimilar. Consequently, we address the problem of comparing RNA molecules by computing an optimal local alignment with respect to sequence and structure information. While local alignment is superior to global alignment for identifying local similarities, no general local sequence-structure alignment algorithms are currently known. We suggest a new general definition of locality for sequence-structure alignments that is biologically motivated and efficiently tractable. To show the former, we discuss locality of RNA and prove that the defined locality means connectivity by atomic and non-atomic bonds. To show the latter, we present an efficient algorithm for the newly defined pairwise local sequence-structure alignment (lssa) problem for RNA. For molecules of lengths $n$ and $m$, the algorithm has a worst-case time complexity of $O(n^2 \cdot m^2 \cdot \max(n,m))$ and a space complexity of only $O(n \cdot m)$. An implementation of our algorithm is available at http://www.bio.inf.uni-jena.de. Its runtime is competitive with global sequence-structure alignment.

Keywords: RNA; local alignment; local sequence-structure alignment; lssa.

1. Introduction

The role of ribonucleic acid (RNA) in biological systems was largely underestimated for a long time. Today, RNA enjoys increasing attention due to recent discoveries such as the existence of small RNAs, which are strongly involved in cell control (see Cousin¹). Despite increasing interest in RNA, the important problem of comparing RNA molecules in order to identify local motifs is still unsolved; even its formal understanding is unsatisfying.

We illustrate this problem by means of a typical scenario in which one has to compare RNA molecules. Suppose we are interested in the RNA motif that binds to a certain protein. For example, we want to identify the mRNA element SECIS (short for SElenoCysteine Insertion Sequence), as investigated e.g. by Wilting et al.², which binds to the protein SelB. By certain means, e.g. by SELEX (see Klug and Famulok³), we get a pool of RNAs that contain the motif of interest.

However, for our approach neither the origin of the set of RNAs nor the kind of the binding motif is limited. In any case, given a set of RNA molecules, we are left with determining the similar regions in them. Whereas in such cases today's biologists compare the RNA molecules manually, we identify the similarities automatically. To this end, we introduce the local sequence-structure alignment (lssa); the local comparison of RNAs is complicated by two particularities.

[Figure 1: panel a) shows the putative SECIS elements of the two RNAs fdhA and fwdB; panel b) shows an allowed and a disallowed pattern of exclusions.]

Fig. 1. a) Putative SECIS-motif (see Wilting et al.²). The identical bases, which form the minimal local motif, are highlighted. b) Allowed vs. disallowed exclusions.

First, in our example the protein binds to the RNA due to its sequence and structure. Consequently, the similarity of the bound RNAs is based on sequence and structure. In general, RNA molecules have a distinct and complex three-dimensional structure, due to the single-stranded occurrence of RNA. Sequence and structure of RNA molecules are biologically meaningful and thus conserved in evolution. It is therefore essential to compare RNA considering both features. Second, as is the case for the SECIS motif, similarities between two or several molecules are often only local, i.e. some parts of the molecules share great similarity, whereas other parts are unrelated.

In contrast to the first aspect, which is also the subject of the recently developed global sequence-structure alignment algorithms, the second aspect has lacked a proper understanding up to now, as we elucidate in the following subsection.

1.1. A Note on Locality

Since the meaning of locality is more intricate in the context of sequence-structure alignment than of pure sequence alignment, we explain our notion in analogy to sequence alignment. A local alignment of sequences is commonly defined as a (global) alignment of one pair of subsequences of the input sequences. Note that the bases in a subsequence are connected via the backbone, which constitutes a dependency. For RNA, several definitions of local alignment are possible. If we define the local alignment again as the best alignment of subsequences, we ignore the RNA structure completely. Hence, in a next step we require that the subsequences represent complete substructures. (Later we are going to introduce the term arc-complete for this property.) This kind of locality is required for an appropriate definition of local sequence-structure alignment. Additionally, one can exclude certain substructures from a substructure, while the spatial locality is preserved due to the connection of bases by non-atomic H-bonds, in the following called arcs. Our algorithm handles this form of locality, i.e. connected substructures with excluded substructures. The small example in Fig. 1a shows that this indeed is the preferable notion of locality. The figure shows the putative SECIS-elements in the archaeon Methanococcus jannaschii proposed by Wilting et al.² (see Fig. 1a, where the putative motif is boxed). Since the apical subsequence AAUAUAAAAUAAUAC in the left molecule fdhA has no correspondence in fwdB, a correct local alignment of the two RNAs aligns two pairs of subsequences, which are isolated on the sequence level but connected by structure.

This should not be confused with the output of local sequence alignment programs such as BLAST⁴, which typically yield several isolated pairs of aligned subsequences. In sequence alignment, these subsequence pairs can be aligned and scored independently and are just the k best non-overlapping local alignments. However, for sequence-structure alignment, the dependency created by the arcs forbids this independent treatment. We will not discuss the analogous extension of sequence-structure alignment for yielding a number of best non-overlapping alignments here.

Furthermore, it is reasonable to consider only conserved arcs for forming a connection

of two otherwise isolated subsequences. Conserved arcs are those arcs that are matched

by our alignment. In consequence, our locality can be defined only for alignments and not

for single RNA sequence-structures. Additionally, we treat an arc as an entity, i.e. we align

either both bases of an arc or none.

In order to get exactly those connected subsets of bases as local motifs, we allow at most one exclusion of a substructure in each loop (regardless of which kind of loop). Consider Fig. 1b as an example. Whereas the left structure contains one exclusion and is therefore an admissible local motif, the right one contains two exclusions in the same loop, thus producing an unconnected, and hence forbidden, motif.

Note that since the exclusion of certain substructures in substructures is algorithmically most challenging, our presentation and the examples given in this paper focus on this aspect of locality.

1.2. Related Work

For a proper classification and comparison to related work, we have to discuss RNA structure in more detail. It is common for RNA algorithms to handle only the secondary structure of the molecules, i.e. the set of non-atomic bonds (arcs) between pairs of bases within one molecule. When considering secondary structure, one further differentiates between general structures (also known as crossing structures) and the class of nested structures, which forbid pairs of crossing arcs. Often it is only the restriction to nested structures that makes a problem algorithmically tractable. For example, the most prominent RNA structure prediction algorithms compute nested secondary structure (see Zuker⁵, Wuchty et al.⁶).

Consequently, one distinguishes several RNA alignment problems with different complexity. Namely, these problems are alignment of crossing vs. crossing, nested vs. crossing, nested vs. nested, as well as alignments of structure vs. plain sequence (see Jiang et al.⁷). In this paper, we consider local nested vs. nested alignment.

Recently, Jiang et al.⁷ defined a global general edit distance of RNA based on sequence and structure. They give an efficient $O(n^2 \cdot m^2)$ dynamic programming algorithm for global alignment of nested against crossing structures for a specialization of their distance score. Furthermore, they show that the global alignment of crossing against plain (and in consequence, nested against crossing) RNA is NP-complete for their general distance score.

Despite NP-completeness, there are approaches to align crossing structures against plain sequence. Lenhof et al.⁸ give an integer linear programming (ILP) approach to the alignment of RNA structure to plain sequence. Eddy⁹ discusses the similar problem of aligning a sequence against a covariance model, i.e. a description of an ensemble of RNA sequences and structures.

Furthermore, special cases of local alignment of RNAs have been dealt with previously. Gorodkin et al.¹⁰ have identified common stem loops. Improving this significantly, Hoechsmann et al.¹¹, who handle RNA alignment by tree alignment based on an earlier work of Jiang et al.¹², examine the problem of finding the most similar subtrees. However, neither algorithm can identify the motif of the RNAs in Fig. 1a as given in the literature, since in contrast to our approach excluded substructures in a substructure are not considered.ᵃ The tree alignment approach imposes some restrictions on the sequences of edit operations (and thus the alignments). In order to overcome these restrictions, Jiang et al.⁷ recently developed a general edit distance, which constitutes the basis for our general similarity. For further discussion, see Jiang et al.⁷ and Hoechsmann et al.¹¹.

Finally, protein structure alignment is closely related to RNA alignment. In principle, if the 3D-structure of the RNAs is known, then methods developed for proteins, as discussed by Gerstein et al.¹³, are applicable to RNA as well. However, algorithms based on RNA secondary structure are generally more efficient for this purpose. The ILP branch and cut algorithm for protein structure alignment described by Lancia et al.¹⁴ uses contact maps of proteins, a representation of structure that is almost equal to crossing secondary structure of RNAs. Therefore, the problem there is closely related to crossing vs. crossing RNA alignment, which is NP-complete. The problem of aligning a plain sequence to a sequence and structure is called protein threading. When using contact maps as done by Xu et al.¹⁵, protein threading is very close to the alignment of plain RNA sequence vs. RNA with crossing structure.

1.3. Contribution and Plan of the Paper

We present a sequence-structure alignment method that handles local motifs as shown e.g.

in Figs. 1a and 8, where none of the previously mentioned approaches can be applied.

Therefore, we define the local sequence-structure alignment (lssa) problem and then for-

mulate an efficient algorithm to solve it.

ᵃ Hoechsmann et al. discuss this issue as well and refer to it as local pattern similarity between trees, which is not handled by their algorithm.


In the subsection on similarity, we develop a general similarity, which is as general as the recently defined edit distance by Jiang et al.⁷. A similarity score is a necessary prerequisite for local alignment. A major contribution of this paper is our biologically motivated definition of local alignments. As discussed previously, the notion of locality is by no means trivial in the context of RNA alignment. We translate the biological intuition to a mathematical concept, namely connectivity of a motif graph, and show that our definition of locality is equivalent to this formalization. To our knowledge, the definition of locality presented here is the first non-trivial and algorithmically tractable one for RNA.

The discussion of similarity and locality then allows a definition of the lssa problem.

Finally, we will show how the lssa problem is efficiently tractable by dynamic program-

ming. We will demonstrate the use of the defined terms and applicability of the algorithm

by giving real world examples in the results section.

2. Local Alignment of RNA

2.1. Preliminaries

A sequence $S$ is a word over the alphabet $\{A,C,G,U\}$; $S[i]$ denotes the $i$-th symbol in $S$. An arc $a$ is a pair $(i,j) \in \mathbb{N}\times\mathbb{N}$ such that $i < j$; $i$ and $j$ are called the ends of the arc $a$. We also use the notations $a^l$ and $a^r$ for $i$ and $j$, respectively.

A structure $P$ is a set of arcs such that no end of an arc appears more than once in $P$. We call two arcs $(i_1,i'_1),(i_2,i'_2)$ crossing if and only if $i_1 < i_2 < i'_1 < i'_2 \,\vee\, i_2 < i_1 < i'_2 < i'_1$. A structure containing at least one pair of crossing arcs is called crossing, otherwise it is called nested. We call the tuple $\mathcal{S} = (S,P)$ a sequence-structure. We impose a partial order $\prec_P$ between arcs by defining $(i_1,i'_1) \prec_P (i_2,i'_2)$ if and only if $i_2 < i_1 < i'_1 < i'_2$.

A range $[k..k']$ is the set of positions $\{k,k+1,\ldots,k'\}$. Let $I,J \subset \mathbb{N}$ denote two arbitrary sets of positions. The symbol $-$ denotes a gap. An alignment $A$ of two sequences $S_1$ and $S_2$ is a subset of $[1..|S_1|]\cup\{-\} \times [1..|S_2|]\cup\{-\}$, where for all pairs $(i,j),(i',j') \in A$ holds 1.) $i \le i' \Rightarrow j \le j'$, 2.) $i = i' \ne - \Rightarrow j = j'$, and 3.) $j = j' \ne - \Rightarrow i = i'$. Intentionally, unaligned positions in the sequences are allowed, which means that alignments can be local.ᵇ
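The definitions of structure, crossing arcs, nestedness and the order $\prec_P$ can be made concrete with a minimal Python sketch. It assumes that arcs are represented as tuples of 1-based positions; the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def is_structure(arcs):
    """A structure: every arc (i, j) has i < j and no position ends two arcs."""
    ends = [e for arc in arcs for e in arc]
    return all(i < j for i, j in arcs) and len(ends) == len(set(ends))

def crossing(a, b):
    """Two arcs cross iff their ends interleave."""
    (i1, j1), (i2, j2) = a, b
    return i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1

def is_nested(arcs):
    """A structure is nested iff it contains no pair of crossing arcs."""
    return not any(crossing(a, b) for a, b in combinations(arcs, 2))

def precedes(a, b):
    """a comes before b in the partial order: a lies strictly inside b."""
    (i1, j1), (i2, j2) = a, b
    return i2 < i1 < j1 < j2

# Hypothetical toy structures, not taken from the paper.
nested_example = {(1, 10), (2, 5), (6, 9)}
crossing_example = {(1, 5), (3, 8)}
```

Here `nested_example` is nested because the arcs (2, 5) and (6, 9) lie side by side inside (1, 10), whereas the two arcs of `crossing_example` interleave.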

In the following, we fix two sequence-structures $\mathcal{S}_1 = (S_1,P_1)$ and $\mathcal{S}_2 = (S_2,P_2)$, where $P_1$ and $P_2$ are nested. Given an arc $(i,i') \in P_1$ (resp. $(i,i') \in P_2$), the arc $(i,i')$ is called aligned in the alignment $A$ if and only if there is an arc $(j,j')$ in the other structure $P_2$ (resp. $P_1$), such that $(i,j),(i',j') \in A$ (resp. $(j,i),(j',i') \in A$).

$I$ is called arc-complete for a structure $P$, written $\mathrm{ac}_P(I)$, iff for every arc $(i,i') \in P$ holds either $i,i' \in I$ or $i,i' \notin I$. The term arc-complete formalizes our treatment of an arc as an entity. By restricting ourselves to arc-complete sets, we disallow separating the two bases of any arc.

Let $\pi_1(A)$ (resp. $\pi_2(A)$) denote the projection to the aligned positions in the first (resp. second) sequence of $A$. Then, we call $A$ arc-complete for $P_1$ and $P_2$, iff $\pi_l(A)$ is arc-complete for $P_l$ for $l \in \{1,2\}$.
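As an illustration of arc-completeness, the following Python sketch checks the property for a set of positions and for an alignment. It rests on two assumptions that are not fixed by the text: a gap is represented by `None`, and a position that is aligned to a gap still counts as an aligned position in the projection. All names are hypothetical.

```python
GAP = None  # assumed representation of the gap symbol

def projection(alignment, l):
    """Aligned positions of sequence l (1 or 2) in a set of edges (i, j)."""
    idx = l - 1
    return {edge[idx] for edge in alignment if edge[idx] is not GAP}

def arc_complete_set(positions, arcs):
    """True iff every arc has both ends inside or both ends outside."""
    pos = set(positions)
    return all((i in pos) == (j in pos) for i, j in arcs)

def arc_complete_alignment(alignment, P1, P2):
    """True iff both projections are arc-complete for their structures."""
    return (arc_complete_set(projection(alignment, 1), P1)
            and arc_complete_set(projection(alignment, 2), P2))

# Hypothetical toy alignment: base 2 of the first sequence meets a gap.
A = {(1, 1), (2, GAP), (3, 2)}
```

With this toy alignment, the arc (1, 3) in the first structure keeps both of its ends, while an arc (1, 4) would be separated because position 4 is not aligned.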

ᵇ In contrast, the alignment $A$ is called global if for every $1 \le i \le |S_1|$ there is an edge $(i,j) \in A$ and for every $1 \le j \le |S_2|$ there is an edge $(i,j) \in A$.


We define a binary 0/1-function $\mathrm{inc}_P$ by $\mathrm{inc}_P(i) = 1$ iff there is an arc $a \in P$ that has $i$ as one of its ends. Since we fixed the structures $P_1$ and $P_2$, we will write $\mathrm{inc}_1$ instead of $\mathrm{inc}_{P_1}$ in the following. Analogously, we will use the abbreviations $\mathrm{inc}_2$, $\mathrm{ac}_1$, and $\mathrm{ac}_2$.

2.2. Similarity

Our similarity function differentiates matches, insertions and deletions of bases without incident arcs as a score for sequence alignment. Furthermore, it scores matches of arcs and breakings of arcs. Here, we have an arc breaking whenever either at least one end of an arc is aligned to a gap or the two ends are aligned to bases that are not connected by an arc. In contrast, Jiang et al.⁷ distinguish between arc breaking, arc altering and arc removing, which we handle as sub-cases of arc breaking.

In the definition of the similarity score, the functions $s_b$, $s_{arc}$, $s_{br1}$, and $s_{br2}$ have the following semantics. $s_b(i,j)$ denotes the similarity between the bases $S_1[i]$ and $S_2[j]$. If $i = -$ (resp. $j = -$), then $s_b(i,j)$ is the similarity between $S_2[j]$ (resp. $S_1[i]$) and a gap. $s_{arc}(a_1,a_2)$ is the similarity between arcs $a_1$ and $a_2$. Moreover, $s_{br1}(a_1,j,j')$ is the penalty (i.e. a negative similarity) for breaking the arc $a_1$ by aligning its ends to $j$ and $j'$ in sequence $S_2$, where $(j,j') \notin P_2$, or to gaps. Analogously, $s_{br2}(i,i',a_2)$ is defined as the penalty for breaking the arc $a_2$ in $P_2$.

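Let $A$ be an (arbitrary) alignment of $S_1$ and $S_2$. We define the general similarity score of $A$ given by the functions $s_b$, $s_{arc}$, $s_{br1}$, and $s_{br2}$ for $\mathcal{S}_1$ and $\mathcal{S}_2$ as

$$
\begin{aligned}
\mathrm{SIMSCORE}(\mathcal{S}_1,\mathcal{S}_2,A) ={}
& \sum_{\substack{(i,j)\in A\\ \neg\mathrm{inc}_1(i)\wedge\neg\mathrm{inc}_2(j)}} s_b(i,j)
+ \sum_{\substack{(i,i')\in P_1,\,(j,j')\in P_2\\ (i,j)\in A,\,(i',j')\in A}} s_{arc}((i,i'),(j,j')) \\
& + \sum_{\substack{(i,i')\in P_1,\,(j,j')\notin P_2\\ (i,j)\in A,\,(i',j')\in A}} s_{br1}((i,i'),j,j')
+ \sum_{\substack{(i,i')\notin P_1,\,(j,j')\in P_2\\ (i,j)\in A,\,(i',j')\in A}} s_{br2}(i,i',(j,j')).
\end{aligned}
\tag{1}
$$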

According to this definition, the similarity score is a sum of base similarities between free bases and other free bases or gaps that are aligned by $A$, the similarities of arcs matched by $A$, and the penalties for arc breakings.
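The composition of the general similarity score can be illustrated by a small Python sketch. It is not the authors' implementation: it assumes that an alignment is a set of edges with `None` as gap, that an arc is broken exactly when both of its ends are aligned but not to the two ends of an arc of the other structure, and that the scoring functions are supplied by the caller. The toy scores at the end are invented for the example.

```python
GAP = None  # assumed representation of the gap symbol

def simscore(A, P1, P2, s_b, s_arc, s_br1, s_br2):
    """Sum of free-base similarities, arc matches and arc-break penalties."""
    P1, P2 = set(P1), set(P2)
    inc1 = {e for arc in P1 for e in arc}  # positions with an incident arc
    inc2 = {e for arc in P2 for e in arc}
    to2 = {i: j for i, j in A if i is not GAP}  # partner of each S1 position
    to1 = {j: i for i, j in A if j is not GAP}  # partner of each S2 position
    score = 0
    for i, j in A:  # free bases aligned to free bases or gaps
        if i not in inc1 and j not in inc2:
            score += s_b(i, j)
    for i, i2 in P1:  # arcs of the first structure: matched or broken
        if i in to2 and i2 in to2:
            j, j2 = to2[i], to2[i2]
            if (j, j2) in P2:
                score += s_arc((i, i2), (j, j2))
            else:
                score += s_br1((i, i2), j, j2)
    for j, j2 in P2:  # broken arcs of the second structure
        if j in to1 and j2 in to1:
            i, i2 = to1[j], to1[j2]
            if (i, i2) not in P1:
                score += s_br2(i, i2, (j, j2))
    return score

# Toy scoring scheme (assumed values, for illustration only).
S1, S2 = "GAC", "GAC"

def s_b(i, j):
    if i is GAP or j is GAP:
        return -1
    return 1 if S1[i - 1] == S2[j - 1] else 0

s_arc = lambda a1, a2: 3
s_br1 = lambda a1, j, j2: -2
s_br2 = lambda i, i2, a2: -2

A = {(1, 1), (2, 2), (3, 3)}
matched = simscore(A, {(1, 3)}, {(1, 3)}, s_b, s_arc, s_br1, s_br2)
broken = simscore(A, {(1, 3)}, set(), s_b, s_arc, s_br1, s_br2)
```

In the first call the arc (1, 3) is matched and the free middle base contributes a base match; in the second call the same arc is aligned to two unpaired bases and therefore counts as broken.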

Like Jiang et al.⁷, we use a slightly specialized scoring scheme in the efficient alignment algorithm, for which we will use the term similarity score.ᶜ Therefore, we restrict the functions $s_{br1}$ and $s_{br2}$ by

$$
s_{br1}((i,i'),j,j') = s^l_{br1}(i,j) + s^r_{br1}(i',j') \quad\text{and}\quad s_{br2}(i,i',(j,j')) = s^l_{br2}(i,j) + s^r_{br2}(i',j'),
$$

where $s^l_{br1}(i,j)$, $s^r_{br1}(i',j')$, $s^l_{br2}(i,j)$, and $s^r_{br2}(i',j')$ are the score contributions of the broken arc's left and right ends. For the similarity score, in the case of arc-complete alignments Equation 1 is equivalently defined by summing over the two ends of broken arcs independently. As a technicality, assume that the similarity score is defined in this way for arbitrary alignments, such that we score half arc-breaks for non-arc-complete alignments.

ᶜ Note that Jiang et al. give only an approximation algorithm using their general score, since the problem is NP-complete for the general score.


[Figure 2: the alignment of the two sequence-structures in dot-bracket notation, together with drawings of both structures annotated with base positions.]

Fig. 2. A local alignment of two sequence-structures and their motif graphs in the alignment. These graphs, which represent the aligned, local parts, consist of only the dark nodes and dark grey edges. Nevertheless, for convenience we show the completed structures and unaligned arcs with light ink.

Note that distinguishing scores for left and right ends causes no additional work in the algorithm and is biologically justified, since RNA molecules are directed. However, for simplicity, we will not distinguish the two ends in the following and set, for $l \in \{1,2\}$, $s_{brl}(i,j) = s^{l}_{brl}(i,j) = s^{r}_{brl}(i,j)$.

2.3. Locality

As mentioned in the introduction, our notion of locality implies that all bases in local

motifs are connected. In this subsection, we will make this claim more precise by defining

locality of a sequence-structure alignment and showing that locality is equivalent to the

connectivity of corresponding motif graphs. Fig. 2 illustrates local alignment.

A successor of a base $k$ is defined as an arc $(i,i') \in P$ such that $(i,i')$ spans $k$ (i.e. $i < k < i'$). The immediate successor of $k$ is its minimal successor w.r.t. $\prec_P$. For a range $[k..k']$, we denote $(i,i')$ as a successor (resp. immediate successor) of $[k..k']$ if $(i,i')$ is a successor (resp. immediate successor) of both $k$ and $k'$. Note that, since our structures are nested, any range in a sequence-structure has at most one immediate successor, which is then determined unambiguously.
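The successor relation can be sketched in a few lines of Python. The sketch assumes a nested structure given as a set of arcs over 1-based positions; in that case the successors of a base form a chain, so the minimal one is simply the spanning arc with the largest left end. The function names and the toy structure are hypothetical.

```python
def successors(k, arcs):
    """All arcs (i, i2) that span base k, i.e. i < k < i2."""
    return [(i, i2) for i, i2 in arcs if i < k < i2]

def immediate_successor(k, arcs):
    """Innermost spanning arc of k in a nested structure, or None."""
    succ = successors(k, arcs)
    return max(succ, key=lambda arc: arc[0]) if succ else None

def immediate_successor_of_range(k, k2, arcs):
    """Immediate successor shared by both ends of [k..k2], or None."""
    a, b = immediate_successor(k, arcs), immediate_successor(k2, arcs)
    return a if a is not None and a == b else None

# Hypothetical nested toy structure.
arcs = {(1, 12), (2, 11), (3, 6), (7, 10)}
```

For instance, the range [4..5] lies inside the hairpin closed by (3, 6), while the range [4..8] has no immediate successor in the above sense because its two ends sit in different hairpins.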

Before defining lssa, we will introduce further terms. $A$ denotes an alignment. We define an exclusion of $A$ in sequence $l = 1,2$ as a range $[k..k']$, where $k \le k'$, such that no position of $[k..k']$ is in $\pi_l(A)$ (i.e. $[k..k'] \cap \pi_l(A) = \emptyset$) and $k-1, k'+1 \in \pi_l(A)$. Furthermore, we define $P^A_l$ as the substructure of $P_l$ that consists only of arcs that are aligned in $A$. For a node $k$ in sequence $l$, we define the immediate aligned successor for $k$ as the immediate successor of $k$ in $P^A_l$. Similarly, we define the immediate aligned successor for an exclusion $[k..k']$ of $A$ to be the immediate aligned successor for both $k$ and $k'$.

Definition 1 (LSSA Problem). Let $\mathcal{S}_1 = (S_1,P_1)$ and $\mathcal{S}_2 = (S_2,P_2)$ be sequence-structures with nested structures. We call an alignment $A$ of $S_1$ and $S_2$ a local sequence-structure alignment (lssa) of $\mathcal{S}_1$ and $\mathcal{S}_2$ if and only if 1.) $A$ is arc-complete and 2.) any exclusion of $A$ has an immediate aligned successor $a$ and no second exclusion has $a$ as immediate aligned successor. Further, given $\mathcal{S}_1$, $\mathcal{S}_2$, and a similarity score SIMSCORE, the lssa problem is to determine

$$
\operatorname*{argmax}_{A \text{ lssa of } \mathcal{S}_1 \text{ and } \mathcal{S}_2} \mathrm{SIMSCORE}(\mathcal{S}_1,\mathcal{S}_2,A).
$$
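The second condition of Definition 1 can be checked for one sequence with a short Python sketch. It assumes that the aligned positions and the aligned arcs of that sequence are already known, and it does not test arc-completeness (the first condition). The helper names and the toy inputs are hypothetical.

```python
def exclusions(aligned):
    """Maximal unaligned ranges (k, k2) between two aligned positions."""
    pos = sorted(aligned)
    return [(a + 1, b - 1) for a, b in zip(pos, pos[1:]) if b - a > 1]

def immediate_successor(k, arcs):
    """Innermost arc spanning k in a nested structure, or None."""
    succ = [(i, i2) for i, i2 in arcs if i < k < i2]
    return max(succ, key=lambda arc: arc[0]) if succ else None

def satisfies_exclusion_condition(aligned, aligned_arcs):
    """Each exclusion needs its own immediate aligned successor."""
    used = set()
    for k, k2 in exclusions(aligned):
        a = immediate_successor(k, aligned_arcs)
        if a is None or a != immediate_successor(k2, aligned_arcs) or a in used:
            return False
        used.add(a)
    return True

# Hypothetical toy cases.
one_exclusion = ({1, 2, 3, 4, 9, 10, 11, 12}, {(1, 12), (2, 11)})
two_in_one_loop = ({1, 2, 5, 6, 9, 10}, {(1, 10)})
no_successor = ({1, 2, 5, 6}, set())
```

The first case excludes a single range below the arc (2, 11) and is admissible; the second places two exclusions in the loop closed by (1, 10); the third has an exclusion that is not spanned by any aligned arc.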

Now, local alignments are connected in the following sense. Let $\mathcal{S} = (S,P)$ be a nested sequence-structure. Define $G_{\mathcal{S}} = (V,E)$ as the graph where $V = \{i \mid 1 \le i \le |S|\}$ and $E = \{\{i,i+1\} \mid 1 \le i < |S|\} \cup \{\{i,j\} \mid (i,j) \in P\}$. The motif graph of a sequence-structure $\mathcal{S}_l$ ($l = 1,2$) in an alignment $A$ is the subgraph $G^A_l$ of $G_{\mathcal{S}_l}$ consisting of only the bases and arcs aligned by $A$. Thus, the motif graph represents the aligned, local part of the sequence-structure. Formally, $G^A_l = (V_l,E_l)$ is the subgraph of $G_{\mathcal{S}_l}$, where $V_l = \pi_l(A)$ and $E_l = \{\{i,i+1\} \mid i \in V_l, i+1 \in V_l\} \cup \{\{i,j\} \mid (i,j) \in P^A_l\}$.

Theorem 1. Let $A$ be an arc-complete alignment of $(\mathcal{S}_1,\mathcal{S}_2)$, and $G^A_l$ be the motif graphs of the sequence-structures $\mathcal{S}_l$ ($l = 1,2$) in $A$. Then $A$ is local if and only if the graphs $G^A_l$ are connected.
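The motif graph and the connectivity criterion of Theorem 1 can be sketched in Python as follows. The sketch assumes that the aligned positions and aligned arcs of one sequence are given and that both ends of every aligned arc are aligned positions; connectivity is tested by a plain breadth-first search. All names and toy inputs are hypothetical.

```python
from collections import deque

def motif_graph(aligned, aligned_arcs):
    """Adjacency sets: backbone edges between neighbours plus aligned arcs."""
    nodes = set(aligned)
    adj = {v: set() for v in nodes}
    for i in nodes:
        if i + 1 in nodes:  # backbone edge {i, i+1}
            adj[i].add(i + 1)
            adj[i + 1].add(i)
    for i, j in aligned_arcs:  # arc edge {i, j}
        adj[i].add(j)
        adj[j].add(i)
    return adj

def is_connected(adj):
    """Breadth-first search from an arbitrary node reaches every node."""
    if not adj:
        return True
    start = next(iter(adj))
    seen, queue = {start}, deque([start])
    while queue:
        for w in adj[queue.popleft()]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(adj)

# Hypothetical toy cases: one exclusion below an arc vs. two in one loop.
connected_case = motif_graph({1, 2, 3, 4, 9, 10, 11, 12}, {(1, 12), (2, 11)})
split_case = motif_graph({1, 2, 5, 6, 9, 10}, {(1, 10)})
```

In the first case the aligned arcs bridge the single exclusion, so the graph is connected. In the second case the positions 5 and 6 between the two exclusions are cut off from the rest, which matches the disallowed situation of two exclusions in the same loop.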

Proof sketch. For clarity, we are hiding some tedious technical details. The first direction, "local implies connected", is proven separately for both graphs by induction over the nested structures $P^A_l$ in the $\prec_{P^A_l}$-ordering. Hence, we fix $l \in \{1,2\}$. Then, one proves inductively for every arc $(i,j) \in P^A_l$ the connectivity of the subgraphs $G_{i,j}$ and $G^{loop}_{i,j}$ of $G^A_l$, which are defined as follows. Let $(i,j) \in P^A_l$; then $G_{i,j}$ is the subgraph of $G^A_l$ restricted to the nodes in $[i..j]$. An arc $(\hat{i},\hat{j}) \in P^A_l$, where the arc $(i,j)$ is the immediate (aligned) successor of the range $[\hat{i}..\hat{j}]$, is called immediate (aligned) predecessor of $(i,j)$. $G^{loop}_{i,j}$ denotes the subgraph of $G_{i,j}$ that is restricted to the nodes in the outer multi-loop. Formally, $G^{loop}_{i,j}$ is generated from $G_{i,j}$ by removing the nodes in $[\hat{i}+1..\hat{j}-1]$, for each immediate aligned predecessor $(\hat{i},\hat{j})$ of $(i,j)$. An example of the subgraphs is given in Fig. 3.

Still, $(i,j)$ denotes an arc in $P^A_l$. For the base case of the induction, there is no immediate aligned predecessor of $(i,j)$. In consequence, $G^{loop}_{i,j}$ equals $G_{i,j}$. By the definition of locality, there is at most one exclusion with the immediate aligned successor $(i,j)$. If there is such an exclusion $[k..k']$, then the graph consists of the nodes $[i..k-1]$ and $[k'+1..j]$. The nodes in each of the two subsets are connected, since they are consecutive, and the two subsets are connected by the arc $(i,j)$.

For the induction step, for each immediate aligned predecessor $(\hat{i},\hat{j})$ of $(i,j)$, the graph $G_{\hat{i},\hat{j}}$ is connected by induction hypothesis. Then, by construction, $G_{i,j}$ is connected if the corresponding loop-subgraph $G^{loop}_{i,j}$ is connected.

[Figure 3: drawing of the first sequence-structure of Fig. 2, with positions 8 and 42 marked.]

.....(((..((....))..((....))...((....))..)))..
CGACCCGUCGACUCUAGUAGAGUUGAUUGACUAUAUCUAGGACGGG

Fig. 3. The complete graph drawing represents the motif graph $G^A_1$ for the alignment of Fig. 2. Recall that the motif graph contains only aligned bases and arcs. In particular there is no edge for the arc (7,43) and no nodes for the excluded bases CUAUAUCUAAG. The subgraph $G_{8,42}$ of $G^A_1$ is shown with black thin line, and the loop-subgraph $G^{loop}_{8,42}$ of $G^A_1$ is drawn with gray thick line.

By definition of locality there is at most one exclusion $[k..k']$ with the immediate successor $(i,j)$. We discuss the case that there is one. In order to show the connectivity of $G^{loop}_{i,j}$, we choose two nodes $s < t$ of $G^{loop}_{i,j}$ and show that they are connected (by case distinction). Note that the nodes smaller than $k$ are connected in $G^{loop}_{i,j}$, since $[k..k']$ is the only exclusion with successor $(i,j)$. The same holds for the nodes greater than $k'$. Therefore, the single remaining case is $s < k$ and $k' < t$. There, the connection of $s$ and $t$ is only due to the connection of $i$ and $j$ by the arc $(i,j)$.

Instead of proving the second direction "connected implies local" directly, we prove its logically equivalent form "non local implies non connected". If the arc-complete alignment $A$ is not local, then there are exclusions which contradict the second condition in the definition of local alignment. Either there is one exclusion without an immediate aligned successor, or there are two exclusions sharing the same immediate successor. In both cases, we can identify sets of nodes in the motif graph that are isolated, since a connection contradicts the assumption that the structures are nested.

2.4. Dynamic Programming Algorithm

Here, we introduce a dynamic programming algorithm for the lssa problem for sequence-structures $\mathcal{S}_1 = (S_1,P_1)$ and $\mathcal{S}_2 = (S_2,P_2)$, where $P_1$ and $P_2$ are nested.

For applying dynamic programming to this problem, we develop several recursion equations that recursively define the maximal similarity score of a local alignment of sequences $S_1$ and $S_2$. These recursion equations can be efficiently evaluated while filling matrices, i.e. materializing intermediate results. Finally, we obtain the actual optimal alignment by traceback from the matrices. We further argue in this subsection that the equations imply an $O(|S_1|^2 \cdot |S_2|^2)$ time and $O(|S_1| \cdot |S_2|)$ space lssa algorithm.

The recursion scheme presented below has been inspired by an algorithm developed by Jiang et al.⁷ for finding global nested/crossing sequence-structure alignments. Whereas their algorithm is clearly presented by a single recursion equation, we have to introduce several interlocked recursion equations in order to handle locality.

In the following, fix arcs $a_1 \in P_1$ and $a_2 \in P_2$. For sets $I$ and $J$, we introduce $E(I,J)$ as an abbreviation for the set of edges $I \cup\{-\} \times J \cup\{-\}$. At the center of our system of recursion equations, we define $D(a_1,a_2)$ as the maximal similarity score of a lssa $A \subseteq E([a_1^l..a_1^r],[a_2^l..a_2^r])$ of $\mathcal{S}_1$ and $\mathcal{S}_2$, where the two arcs $a_1$ and $a_2$ match. The scores are materialized in a matrix $D$ and will be used to compute the maximal score of a lssa of $\mathcal{S}_1$ and $\mathcal{S}_2$ in the top-level computation step. For the recursive definition of an entry $D(a_1,a_2)$ we will introduce four further recursion equations.

To compute a score $D(a_1,a_2)$ we have to consider subsets of local alignments restricted to bases between the left and right ends of the two arcs $a_1$ and $a_2$. Since the arc-match between $a_1$ and $a_2$ justifies exclusions in the local alignments, those restrictions, which do not contain this arc-match, are not necessarily local alignments themselves. Furthermore, they need not be arc-complete. We express this by the following definitions. For a structure $P$, we define the restriction of $P$ to a set $I$, written $P|_I$, as $\{(i,j) \in P \mid i,j \in I\}$. We call an alignment $A \subseteq E([i..i'],[j..j'])$ of $\mathcal{S}_1$ and $\mathcal{S}_2$ sub-arc-complete if it is arc-complete for $P_1|_{[i..i']}$ and $P_2|_{[j..j']}$. A top-level exclusion in sequence $l$ of alignment $A$ ($l = 1,2$) is an exclusion $[k..k']$ in sequence $l$ of $A$, where neither $k$ nor $k'$ have a successor in $P^A_l$. A local sub-alignment $A \subseteq E([i..i'],[j..j'])$ of $\mathcal{S}_1$ and $\mathcal{S}_2$ is an alignment satisfying the conditions of a local alignment, except that in each sequence one top-level exclusion is allowed and the alignment is sub-arc-complete instead of arc-complete (compare to Definition 1).

Now, we define the maximal similarity score of all local sub-alignments $A \subseteq E([a_1^l..i],[a_2^l..j])$ of $\mathcal{S}_1$ and $\mathcal{S}_2$, where $i < a_1^r$ and $j < a_2^r$, recursively going back to smaller $i$ and $j$. As a particularity, we have to ensure that there is at most one top-level exclusion in each sequence.

For this reason, we count the top-level exclusions by keeping track of four states. Namely, for $i < a_1^r$ and $j < a_2^r$, we define the maximal similarity score of a local sub-alignment $A \subseteq E([a_1^l..i],[a_2^l..j])$ of $\mathcal{S}_1$ and $\mathcal{S}_2$, where the sub-alignment has

(1) at most one exclusion in every sequence (arbitrary local sub-alignment) as ${}^{x}_{x}M^{a_1}_{a_2}(i,j)$
(2) at most one exclusion in the first sequence as ${}^{x}_{\circ}M^{a_1}_{a_2}(i,j)$
(3) at most one exclusion in the second sequence as ${}^{\circ}_{x}M^{a_1}_{a_2}(i,j)$
(4) no exclusions (true local alignment) as ${}^{\circ}_{\circ}M^{a_1}_{a_2}(i,j)$.
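The auxiliary notions used above (restriction of a structure, sub-arc-completeness and top-level exclusions) can be illustrated by a short Python sketch for a single sequence. It assumes that aligned positions and aligned arcs are given as sets; the function names and the toy structure are hypothetical.

```python
def restrict(P, I):
    """Restriction of structure P to the position set I."""
    I = set(I)
    return {(i, j) for i, j in P if i in I and j in I}

def sub_arc_complete(aligned, P, lo, hi):
    """Arc-completeness w.r.t. the arcs that lie entirely in [lo..hi]."""
    pos = set(aligned)
    inner = restrict(P, range(lo, hi + 1))
    return all((i in pos) == (j in pos) for i, j in inner)

def top_level_exclusions(aligned, aligned_arcs):
    """Exclusions whose two ends are not spanned by any aligned arc."""
    pos = sorted(aligned)
    gaps = [(a + 1, b - 1) for a, b in zip(pos, pos[1:]) if b - a > 1]

    def spanned(k):
        return any(i < k < j for i, j in aligned_arcs)

    return [(k, k2) for k, k2 in gaps if not spanned(k) and not spanned(k2)]

# Hypothetical toy structure: an outer arc (1, 10) and an inner arc (3, 6).
P = {(1, 10), (3, 6)}
```

Restricted to the range [2..9], only the inner arc remains, so a sub-alignment covering the positions 2, 3, 6 and 7 is sub-arc-complete although the outer arc is ignored. An exclusion below the inner arc is not top-level, whereas an exclusion outside of it is.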

For applying dynamic programming, we will introduce four recursion equations, one for each state. Again, the intermediate results are materialized in matrices.

We define the scores in $D$ using ${}^{x}_{x}M^{a_1}_{a_2}$ by

$$
D(a_1,a_2) = {}^{x}_{x}M^{a_1}_{a_2}(a_1^r-1,\,a_2^r-1) + s_{arc}(a_1,a_2).
$$

Then, it remains to give the recursion equations for ${}^{x}_{x}M^{a_1}_{a_2}$, ${}^{x}_{\circ}M^{a_1}_{a_2}$, ${}^{\circ}_{x}M^{a_1}_{a_2}$ and ${}^{\circ}_{\circ}M^{a_1}_{a_2}$.

Since the cases where no top-level exclusions are involved are part of all recursion equations, we define a helper function $\mathrm{NOEX}(M^{a_1}_{a_2},i,j)$ for any matrix $M^{a_1}_{a_2} \in \{{}^{\circ}_{\circ}M^{a_1}_{a_2}, {}^{x}_{\circ}M^{a_1}_{a_2}, {}^{\circ}_{x}M^{a_1}_{a_2}, {}^{x}_{x}M^{a_1}_{a_2}\}$, where $(i,j) \in [a_1^l..a_1^r-1] \times [a_2^l..a_2^r-1]$. $\mathrm{NOEX}(M^{a_1}_{a_2},i,j)$ denotes the maximal similarity score of the alignments corresponding to the matrix entry $M^{a_1}_{a_2}(i,j)$ which do not end (at the right) with an exclusion in either sequence. In consequence, we get

$$
{}^{\circ}_{\circ}M^{a_1}_{a_2}(i,j) = \mathrm{NOEX}({}^{\circ}_{\circ}M^{a_1}_{a_2},i,j).
$$


The maximal score $\mathrm{NOEX}(M^{a_1}_{a_2},i,j)$ can stem from four classes of alignments. In the first case, the alignments match $S_1[i]$ and $S_2[j]$. Then, the score is computed as the sum of $M^{a_1}_{a_2}(i-1,j-1)$, the similarity of the matched bases, and possibly the penalty (negative similarity) for breaking the arcs that are incident to $i$ or $j$. In the next two cases, the alignment ends with a base of the first sequence (resp. the second sequence) aligned to a gap. The score is the sum of $M^{a_1}_{a_2}(i-1,j)$ (resp. $M^{a_1}_{a_2}(i,j-1)$), the gap dissimilarity, and possibly a term for arc-breaking. In the last case, an alignment ends in an arc-match. There, we add the maximal similarity of alignments left of the arc-match to the maximal similarity for alignments that are framed by the arc-match, which is given in $D$.

In the following definition of NOEX, we use the helper functions $inc_1$ and $inc_2$ for a compact notation. As defined in our preliminaries, $inc_l(i) = 1$ if there is an incident arc to base $i$ in the structure $P_l$ and $inc_l(i) = 0$ otherwise (for $l = 1,2$). Multiplying one term by $inc_l(i)$ and another term by $1-inc_l(i)$ supports the distinction of cases.

$$
\mathrm{NOEX}(M^{a_1}_{a_2},i,j) = \max
\begin{cases}
M^{a_1}_{a_2}(i-1,j-1) + (1-inc_1(i))(1-inc_2(j))\,s_b(i,j) \\
\qquad {} + inc_1(i)\,s_{br1}(i,j) + inc_2(j)\,s_{br2}(i,j) \\
M^{a_1}_{a_2}(i-1,j) + (1-inc_1(i))\,s_b(i,-) + inc_1(i)\,s_{br1}(i,-) \\
M^{a_1}_{a_2}(i,j-1) + (1-inc_2(j))\,s_b(-,j) + inc_2(j)\,s_{br2}(-,j) \\
M^{a_1}_{a_2}(i'-1,j'-1) + D((i',i),(j',j)) \quad \text{if } (i',i) \in P_1 \text{ and } (j',j) \in P_2.
\end{cases}
$$

For the initial cases, one of the subsequences has zero length, i.e. $i = a_1^l$ or $j = a_2^l$. Since no top-level exclusions are handled here, we get

$$
\begin{aligned}
\mathrm{NOEX}(M^{a_1}_{a_2},a_1^l,a_2^l) &= 0 \\
\mathrm{NOEX}(M^{a_1}_{a_2},i,a_2^l) &= M^{a_1}_{a_2}(i-1,a_2^l) + (1-inc_1(i))\,s_b(i,-) + inc_1(i)\,s_{br1}(i,-) \\
\mathrm{NOEX}(M^{a_1}_{a_2},a_1^l,j) &= M^{a_1}_{a_2}(a_1^l,j-1) + (1-inc_2(j))\,s_b(-,j) + inc_2(j)\,s_{br2}(-,j).
\end{aligned}
$$
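The NOEX recursion and its initial cases can be sketched in a few lines of Python. This is only an illustration under assumptions: the matrix is a dictionary keyed by `(i, j)`, a gap is represented by `None`, and the names `noex`, `left1`, `left2` and the toy scoring functions are hypothetical and do not come from the authors' C++ implementation. The arc-match case is only taken when the required entries exist in `M` and `D`.

```python
def noex(M, i, j, al1, al2, inc1, inc2, left1, left2, D, sb, sbr1, sbr2):
    """Best score for entry (i, j) among alignments not ending in an exclusion.

    M            -- dict (i, j) -> score, already filled for smaller indices
    al1, al2     -- left ends of the framing arcs (the zero row and column)
    inc1, inc2   -- functions returning 1 if a base is incident to an arc, else 0
    left1, left2 -- dicts mapping the right end of an arc to its left end
    D            -- dict ((i', i), (j', j)) -> score of the framed alignments
    sb, sbr1, sbr2 -- base and arc-breaking scores; None stands for a gap
    """
    def base_of_s1_to_gap():
        return M[i - 1, j] + (1 - inc1(i)) * sb(i, None) + inc1(i) * sbr1(i, None)

    def base_of_s2_to_gap():
        return M[i, j - 1] + (1 - inc2(j)) * sb(None, j) + inc2(j) * sbr2(None, j)

    # Initial cases: at least one subsequence has zero length.
    if i == al1 and j == al2:
        return 0.0
    if j == al2:
        return base_of_s1_to_gap()
    if i == al1:
        return base_of_s2_to_gap()

    candidates = [
        M[i - 1, j - 1]
        + (1 - inc1(i)) * (1 - inc2(j)) * sb(i, j)
        + inc1(i) * sbr1(i, j)
        + inc2(j) * sbr2(i, j),
        base_of_s1_to_gap(),
        base_of_s2_to_gap(),
    ]
    # Arc-match: both i and j are right ends of arcs.
    if i in left1 and j in left2:
        ip, jp = left1[i], left2[j]
        key = ((ip, i), (jp, j))
        if (ip - 1, jp - 1) in M and key in D:
            candidates.append(M[ip - 1, jp - 1] + D[key])
    return max(candidates)


# Toy usage without arcs: match +1, mismatch or gap -1 (assumed scores).
S1, S2 = "GA", "GA"

def toy_sb(i, j):
    if i is None or j is None:
        return -1.0
    return 1.0 if S1[i - 1] == S2[j - 1] else -1.0

no_arc = lambda pos: 0
no_break = lambda i, j: 0.0

M = {}
for i in range(0, len(S1) + 1):
    for j in range(0, len(S2) + 1):
        M[i, j] = noex(M, i, j, 0, 0, no_arc, no_arc, {}, {}, {},
                       toy_sb, no_break, no_break)
```

Filling the table row by row guarantees that every entry read by `noex` has already been computed, which mirrors the dependency order of the recursion.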

Now we define the recursion cases for the remaining matrices ${}^{x}_{\circ}M^{a_1}_{a_2}$, ${}^{\circ}_{x}M^{a_1}_{a_2}$, and ${}^{x}_{x}M^{a_1}_{a_2}$. Besides the recursion cases where no new exclusion is inserted, we need additional cases to introduce new top-level exclusions, which are arc-complete. (Note that one proves by contradiction that every exclusion of a local alignment $A$ is itself arc-complete.) The additional cases reflect the insertion of an exclusion that starts with $i$ in the first sequence (resp. $j$ in the second sequence) and extends to the left until some $k < i$ (resp. $k < j$) such that the exclusion $[k+1..i]$ (resp. $[k+1..j]$) is arc-complete. We need to examine all such values of $k$.$^{d}$ This results for $(i,j) \in [a_1^l..a_1^r-1]\times[a_2^l..a_2^r-1]$ in the recursion equations

$${}^{\circ}_{\circ}M^{a_1}_{a_2}(i,j) = \mathrm{NOEX}({}^{\circ}_{\circ}M^{a_1}_{a_2},i,j),$$

$^{d}$ Since we already run over these values of $k$, here it is possible to impose further restrictions on the admissible exclusions, e.g. it seems reasonable to limit the minimal length of exclusions, which can be done without additional work in the algorithm.


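Fig. 4. Local sub-alignment considered in ${}^{x}_{\circ}M^{a_1}_{a_2}(i,j)$. The areas of alignment edges are marked grey. [Figure not reproduced; its labels mark the range from $a_1^l+1$ to $i$ with "(at most) one toplevel excl." and the range from $a_2^l+1$ to $j$ with "no toplevel excl."]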

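$$
{}^{x}_{\circ}M^{a_1}_{a_2}(i,j) = \max
\begin{cases}
\mathrm{NOEX}({}^{x}_{\circ}M^{a_1}_{a_2},i,j) \\
\max\limits_{a_1^l \le k < i,\ ac_1([k+1..i])} {}^{\circ}_{\circ}M^{a_1}_{a_2}(k,j),
\end{cases}
$$

$$
{}^{\circ}_{x}M^{a_1}_{a_2}(i,j) = \max
\begin{cases}
\mathrm{NOEX}({}^{\circ}_{x}M^{a_1}_{a_2},i,j) \\
\max\limits_{a_2^l \le k < j,\ ac_2([k+1..j])} {}^{\circ}_{\circ}M^{a_1}_{a_2}(i,k),
\end{cases}
$$

and

$$
{}^{x}_{x}M^{a_1}_{a_2}(i,j) = \max
\begin{cases}
\mathrm{NOEX}({}^{x}_{x}M^{a_1}_{a_2},i,j) \\
\max\limits_{a_1^l \le k < i,\ ac_1([k+1..i])} {}^{\circ}_{x}M^{a_1}_{a_2}(k,j) \\
\max\limits_{a_2^l \le k < j,\ ac_2([k+1..j])} {}^{x}_{\circ}M^{a_1}_{a_2}(i,k).
\end{cases}
$$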
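The maximization over arc-complete exclusions can be illustrated with a short Python sketch. It assumes that a range is arc-complete when every arc has either both ends or no end inside it, which is our reading of the predicate $ac_l$ from the preliminaries; the function names and the dictionary-based matrix are hypothetical.

```python
import math


def arc_complete(arcs, lo, hi):
    """True if every arc has both ends or no end in [lo..hi] (assumed definition)."""
    return all((lo <= l <= hi) == (lo <= r <= hi) for l, r in arcs)


def exclusion_starts(arcs, al, i):
    """All k with al <= k < i such that the exclusion [k+1..i] is arc-complete."""
    return [k for k in range(al, i) if arc_complete(arcs, k + 1, i)]


def with_exclusion(noex_value, M_prev, arcs, al, i, j, in_first=True):
    """Combine the NOEX value with the best entry reached by a new exclusion.

    M_prev is the matrix of the state without the new exclusion. With
    in_first=True the exclusion lies in the first sequence and the entries
    (k, j) are inspected, otherwise the entries (i, k).
    """
    ks = exclusion_starts(arcs, al, i if in_first else j)
    best = max(
        ((M_prev[k, j] if in_first else M_prev[i, k]) for k in ks),
        default=-math.inf,  # maximum over an empty range
    )
    return max(noex_value, best)


# Toy usage: one arc (2, 5); exclusions ending at position 5 must contain it.
arcs = [(2, 5)]
starts_at_5 = exclusion_starts(arcs, 0, 5)
starts_at_6 = exclusion_starts(arcs, 0, 6)
```

Because each call inspects up to $i - a_1^l$ (resp. $j - a_2^l$) values of $k$, this step is the one that contributes the linear factor to the running time of a single matrix entry.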

Note that for $i = a_1^l$ (resp. $j = a_2^l$) the maximizations run over empty ranges. We define such a maximum over an empty range as $-\infty$. For example, this sets the entries $(a_1^l,a_2^l)$ of the four matrices to 0. Fig. 4 shows a local sub-alignment handled by the equation for ${}^{x}_{\circ}M^{a_1}_{a_2}(i,j)$.

Finally, we use our helper function NOEX (and thus implicitly the table $D$) for defining recursion equations for the complete alignment, i.e. for the top-level of our alignment. Here, we look for the best scoring alignment of a pair of subsequences of $S_1$ and $S_2$, which is similar to local sequence alignment. In the case of local sequence alignment, this is done by dropping prefix and suffix alignments with negative scores. However, in our case the aligned subsequences have to be arc-complete. In consequence, one cannot simply drop arbitrary prefixes, as is done in local sequence alignment.

However, we can avoid searching through all possible starting and ending points of the subsequences. Fortunately, searching over all $(a_1^l+1,a_2^l+1)$ as starting positions, where $a_1$ (resp. $a_2$) is an arc in $P_1$ (resp. $P_2$) or spans the whole sequence, turns out to suffice. If we additionally allow the dropping of negatively scored arc-complete prefix alignments in the recursion equation, we can account for every pair of arc-complete subsequences.

For defining the top-level recursion, fix start indices $i_0$ and $j_0$. Choose $i_2$ and $j_2$ to be maximal such that $ac_1([i_0..i_2])$ and $ac_2([j_0..j_2])$. We define $T(i,j)$ for $(i,j) \in [i_0-1..i_2]\times[j_0-1..j_2]$ recursively by

$$
T(i,j) = \max\left\{ \mathrm{NOEX}(T,i,j),\
\begin{cases}
0 & \text{if } ac_1([i_0..i]) \wedge ac_2([j_0..j]) \\
-\infty & \text{otherwise}
\end{cases}
\right\}.
$$

Fig. 5. Example for the last step ("top-level") of the local alignment. The figure illustrates the meaning of the indices that are used in the description ($i_0$, $i_{min}$, $i$, $i_2$ and $j_0$, $j_{min}$, $j$, $j_2$). The region of the optimal lssa that is scored by $T(i,j)$ is highlighted. [Figure not reproduced.]
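The top-level recursion for $T$ can be sketched as follows. The sketch passes the NOEX computation in as a callable and, for the toy run, uses an arc-free stand-in (`plain_noex`, a hypothetical helper with assumed scores), in which case every range is arc-complete and the recursion behaves like ordinary local sequence alignment with prefix dropping. The arc-completeness test again assumes that an arc must lie either completely inside or completely outside a range.

```python
import math


def arc_complete(arcs, lo, hi):
    """True if every arc has both ends or no end in [lo..hi] (assumed definition)."""
    return all((lo <= l <= hi) == (lo <= r <= hi) for l, r in arcs)


def top_level(i0, j0, i2, j2, arcs1, arcs2, noex):
    """Fill T on [i0-1..i2] x [j0-1..j2]; noex(T, i, j) supplies the NOEX value."""
    T = {}
    for i in range(i0 - 1, i2 + 1):
        for j in range(j0 - 1, j2 + 1):
            droppable = arc_complete(arcs1, i0, i) and arc_complete(arcs2, j0, j)
            # An arc-complete prefix alignment may be dropped, so its score
            # is limited to be at least zero.
            T[i, j] = max(noex(T, i, j), 0.0 if droppable else -math.inf)
    return T


def best_entry(T, i0, j0, arcs1, arcs2):
    """Maximal entry (i1, j1) with ac1([i0..i1]) and ac2([j0..j1])."""
    return max(
        v for (i, j), v in T.items()
        if arc_complete(arcs1, i0, i) and arc_complete(arcs2, j0, j)
    )


# Toy run without arcs: match +2, mismatch -1, gap -1 (assumed scores).
S1, S2 = "AGA", "CGAC"


def plain_noex(T, i, j):
    if i == 0 and j == 0:
        return 0.0
    candidates = []
    if i > 0 and j > 0:
        candidates.append(T[i - 1, j - 1] + (2.0 if S1[i - 1] == S2[j - 1] else -1.0))
    if i > 0:
        candidates.append(T[i - 1, j] - 1.0)
    if j > 0:
        candidates.append(T[i, j - 1] - 1.0)
    return max(candidates)


T = top_level(1, 1, len(S1), len(S2), [], [], plain_noex)
best = best_entry(T, 1, 1, [], [])
```

In the toy run the common substring `GA` is found with score 4, and the entries before it are held at zero, which is the effect of dropping negatively scored prefix alignments.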

Note that by limiting the score of arc-complete prefix alignments to be at least zero, we accomplish the effect of dropping negatively scored prefix alignments. For $ac_1([i_0..i]) \wedge ac_2([j_0..j])$, $T(i,j)$ yields the score of the optimal lssa $A \subseteq E([i_0..i],[j_0..j])$, where $i$ and $j$ are aligned and where, for the minimal positions $i_{min}$ and $j_{min}$ that are aligned by $A$, $ac_1([i_0..i_{min}-1])$ and $ac_2([j_0..j_{min}-1])$ hold. Fig. 5 illustrates the top-level alignment.

Finally, to get the maximal score we search through all $i_0 = a_1^l+1$ and $j_0 = a_2^l+1$ for arc-pairs $(a_1,a_2) \in P_1\times P_2$, and determine the maximal entry $(i_1,j_1)$ in the matrices $T$, where $ac_1([i_0..i_1])$ and $ac_2([j_0..j_1])$.

Theorem 2. For a similarity score SIMSCORE and sequence-structures $\mathbf{S}_1 = (S_1,P_1)$ and $\mathbf{S}_2 = (S_2,P_2)$ with nested structures $P_1$ and $P_2$, there is an $O(|S_1|^2\cdot|S_2|^2\cdot\max(|S_1|,|S_2|))$ time and $O(|S_1|\cdot|S_2|)$ space algorithm for the lssa problem.

Proof sketch. The existence of an algorithm has been shown already, although a rigorous proof of the algorithm's correctness has been omitted. Since the scores in the matrices $D$, $M^{a_1}_{a_2}$, and $T$ are explicitly defined as maximal scores of certain classes of alignments, the correctness is proven by showing the equivalence of these definitions to the recursion equations. This is shown inductively.

It remains to discuss the algorithm's complexity. The algorithm computes and stores the $O(|S_1|\cdot|S_2|)$ many entries of the matrix $D$. For each entry of $D$, we compute the $O(|S_1|\cdot|S_2|)$ many entries of the matrices $M^{a_1}_{a_2}$. For computing each of the entries in $D$, we compute separate matrices $M^{a_1}_{a_2}$. However, since there is no dependency between the matrices $M^{a_1}_{a_2}$ for different arc-pairs, it suffices to store the matrix $D$ as well as the matrix $T$ and only one instance of the matrices $M^{a_1}_{a_2}$ at one time. Hence, we need only $O(|S_1|\cdot|S_2|)$ space.

Regarding time complexity, in the computation of the matrices $M^{a_1}_{a_2}$, we maximize over all values of $k$ in a certain range, whose size is limited by the sequence length. The complexity of computing $D$ is dictated by the total number of steps we perform in these maximizations, since the other cases of our recursion equations are evaluated in constant time.


The worst-case time complexity of the algorithm is thus $O(|S_1|^2\cdot|S_2|^2\cdot\max(|S_1|,|S_2|))$, since the top-level computation has a time complexity of only $O(|S_1|^2\cdot|S_2|^2)$ and, finally, the traceback step, where we recompute the matrices $M^{a_1}_{a_2}$, does not increase time or space complexity.
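The counting argument of the proof sketch can be written out as a single product. The annotations below only restate the three factors named in the text; no new bound is claimed.

```latex
\[
\underbrace{O(|S_1|\cdot|S_2|)}_{\text{arc-pairs } (a_1,a_2),\ \text{i.e. entries of } D}
\;\cdot\;
\underbrace{O(|S_1|\cdot|S_2|)}_{\text{entries } (i,j) \text{ of } M^{a_1}_{a_2}}
\;\cdot\;
\underbrace{O(\max(|S_1|,|S_2|))}_{\text{values of } k \text{ per entry}}
\;=\;
O\bigl(|S_1|^2\cdot|S_2|^2\cdot\max(|S_1|,|S_2|)\bigr).
\]
```

The space bound follows separately, because $D$, $T$ and a single instance of the matrices $M^{a_1}_{a_2}$ each have $O(|S_1|\cdot|S_2|)$ entries.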

3. Results

An implementation of the introduced algorithm written in C++ is available on the webpage http://www.bio.inf.uni-jena.de. We have aligned RNase P from two different organisms, Ralstonia eutrophus and Streptomyces bikiniensis. Whereas the algorithm allows for fine-tuning of parameters, here the scores are chosen ad hoc.$^{e}$ The example is computed in about 75 seconds on an Intel Pentium 4 running at 2.4 GHz, while the alignment of Fig. 2 is computed in a few milliseconds. Jiang et al.'s implementation needs approximately the same time to produce a global alignment. The resulting local alignment is shown in Fig. 6. It shows sequences and structures as well as the aligned bases. Only the bases that are connected by | are aligned. Fig. 8 shows the secondary structures of the two RNA molecules, where the parts included and excluded by the alignment of Fig. 6 are distinguished.

The sequences and structures are taken from Brown's RNase P Database.$^{16}$ The reader may compare our example to a corresponding example given by Jiang et al.,$^{7}$ where the two RNase P molecules are aligned globally.$^{f}$ As a drawback of giving this opportunity for a comparison, the two molecules, which were originally chosen for global alignment, have small dissimilar parts. In consequence, the exclusions are small and there are no unaligned prefixes or suffixes.

As a further test case, we used our pairwise lssa algorithm to produce input for T-coffee.$^{17}$ The result of aligning SECIS elements of M. jannaschii from Wilting et al.$^{2}$ is shown in Fig. 7.

Acknowledgment

We thank the anonymous referees, who helped to improve the quality of the paper. Furthermore, we thank Sven Siebert for the many discussions on global and local sequence-structure alignment and his help with the multiple alignment example.

References

1. J. Couzin. Breakthrough of the year: Small RNAs make big splash. Science, 298:2296–97, 2002.

2. R. Wilting, S. Schorling, B. C. Persson, and A. Böck. Selenoprotein synthesis in archaea: Identification of an mRNA element of Methanococcus jannaschii probably directing selenocysteine insertion. J. Mol. Biol., 266(4):637–41, 1997.

3. S. J. Klug and M. Famulok. All you wanted to know about SELEX. Mol. Biol. Rep., 20(2):97–107, 1994.

$^{e}$ We are aware that, in general, the quality of predicted alignments depends strongly on the actual parameters for the similarity score. Unfortunately, there are no theoretical foundations to estimate such parameters systematically.

$^{f}$ Jiang et al. refer to the recently renamed organism R. eutrophus as Alcaligenes eutrophus.


((((((((((((((((((.(((((((((((....)))))))))))....(((.(((((((((...((((.((((((((((...

AAAGCAGGCCAGGCAACCGCUGCCUGCACCGCAAGGUGCAGGGGGAGGAAAGUCCGGACUCCACAGGGCAGGGUGUUGGCUAA

|||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||

CGAGCCGGGCGGGCGGCCGCGUGGGGGUCUUC GGACCUCCCCGAGGAACGUCCGGGCUCCACAGAGCAGGGUGGUGGCUAA

((((((((((((((((((.((((((((((... .))))))))))....(((.(((((((((...((((.((((((((((...

..)))))(((((....)))).)((...(((((........... (((((((((((((....)))))))))))))........

CAGCCAUCCACGGCAACGUGCGGAAUAGGGCCACAGAGACGAG-UCUUGCCGCCGGGUUCGCCCGGCGGGAAGGGUGAAACG

|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

CGGCCACCCGGGGUGACCCGCGGGACAGUGCCACAGAAAACAGACC--GCCGGGGACCUCGGUCCUCGGUAAGGGUGAAAC G

..)))))(((((....)))).)((...(((((............(((((((((((....)))))))))..))....... (

.)))..)))))))))))))...((((.....

CGGUAACCUCCACCUGGAGCAAUCCCAAAUA

|||||||||||||||||||||||||| ||||

GUGGUGUAAGAGACCACCAGCGCCUGAGGCGACUCAGGCGGCUAGGUAAACCCCACUCGGAGCAAGGUC AAGAGGGGACACC

(((((.......))))))((((((((((....)))))))).)).)))..)))))))))))))...(((( ....((((((((.

. ((((((... (((((.

G-GCAGGCGAU GAAGCG

||||||||||| |||||

))))))))))))))).....))))......((((((((....)))))))).

GCCCGCUGAGUCUGCGGGUAGGGAGCUGGAGCCGGCUGGUAACAGCCGGCC

|||||||||||||||||||||||||||||||||||||||||||||||||||

CCGGUGUCCCUGCGCGGAUGUUCGAGGG CUGCUCGCCCGAGUCCGCGGGUAGACCGCACGAGGCCGGCGGCAACGCCGGCCC

...))))))))..((((((....((((( (.)))))))))))))))).....))))......((((((((....)))))))).

.... .....)))))))(((((((((((( ........ ))))))))))))......)))))))).).))))))))))....

UAGA GGAAUGGUUGUCACGCACCGUUUG-CCGCAAGG CGGGCGGGGCGCACAGAAUCCGGCUUAUCGGCCUGCUUUGCUU

|||| || |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||

UAGAUGG AUGGCCGUC--GCCCCG-ACGACCGCGAGGUCC--CGGGG---ACAGAACCCGGCGUACAGCCCGACUCGUCUG

....... ..))))))).((((( .((..........)) ))))) ......)))))))).).))))))))))....

Fig. 6. Local sequence-structure alignment of RNase P of R. eutrophus and S. bikiniensis. The sequences and structures are taken from the RNase P Database by Brown.$^{16}$ Only the bases that are shown connected by | are aligned. The algorithm correctly identifies and excludes substructures that occur only in one of the RNA molecules.

fdhA  CG-C-CACCCUGC-GAACCCAAUAUAAAAUAAUACAAGGGA-GCA-GGUGG-CG-
fruA  ---CCUCGA-GGG-GAACCCG-------------AAAGGGA-CCCGAGAGG----
fwdB  AU-GUUGGA-GGG-GAACCCG-------------UAAGGGA-CCC-UCCAAGAU-
hdrA  ---GGCACC-ACUCGAAGGC--------------UAAGCCAAAGU-GGUGCU---
selD  UUACGAUGU-GCC-GAACCCUU------------UAAGGGA-GGC-ACAUCGAAA
vhuD  G--UUCUCU-CGG-GAACCCGU------------CAAGGGA-CCG-AGAGAAC--
vhuU  AG-CUCACA-ACC-GAACCCA-------------UUUGGGA-GGU-UGUGAGCU-

fdhA  ((-(-((((.(((-...(((..(((......)))...))).-)))-)))))-))-
fruA  ---((((..-(((-...(((.-------------...))).-)))..))))----
fwdB  ((-.(((((-(((-...(((.-------------...))).-)))-))))).))-
hdrA  ---((((((-(((....(((--------------...)))..)))-))))))---
selD  ...((((((-(((-...(((..------------...))).-)))-))))))...
vhuD  (--((((((-(((-...(((..------------...))).-)))-)))))))--
vhuU  ((-((((((-(((-...(((.-------------...))).-)))-))))))))-

Fig. 7. Alignment of SECIS elements from Methanococcus jannaschii using lssa. The bases that supposedly belong to the common motif are colored, where we further mark the bases that occur in both sequences identically.
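The structures in Fig. 7 are written in dot-bracket notation with alignment gaps. As a hedged illustration of how such a row relates to the arc set $P_l$ and the indicator $inc_l$ used in the recursions, the following Python sketch parses one row. The function names are hypothetical, and the choice to skip gap characters before numbering the bases is an assumption about how the figure should be read, not a description of the authors' input format.

```python
def parse_dot_bracket(structure):
    """Return the arcs (left, right), 1-based, of a dot-bracket string.

    Gap characters '-' are skipped, so positions refer to the ungapped
    sequence.
    """
    stack, arcs, pos = [], [], 0
    for ch in structure:
        if ch == "-":
            continue
        pos += 1
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            arcs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '('")
    return sorted(arcs)


def incident(arcs):
    """Build inc(i): 1 if base i is an end of some arc, else 0."""
    ends = {p for arc in arcs for p in arc}
    return lambda i: 1 if i in ends else 0


# The fruA row of Fig. 7.
fruA = "---((((..-(((-...(((.-------------...))).-)))..))))----"
arcs = parse_dot_bracket(fruA)
inc = incident(arcs)
```

Since matching brackets are always properly nested, every structure obtained this way is nested in the sense required by Theorem 2.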

4. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215(3):403–10, 1990.

5. M. Zuker. Prediction of RNA secondary structure by energy minimization. Meth. Mol. Biol., 25:267–94, 1994.

6. S. Wuchty, W. Fontana, I. L. Hofacker, and P. Schuster. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, 49(2):145–65, 1999.

7. T. Jiang, G. Lin, B. Ma, and K. Zhang. A general edit distance between RNA structures. J. Comput. Biol., 9(2):371–88, 2002.

8. H. P. Lenhof, K. Reinert, and M. Vingron. A polyhedral approach to RNA sequence structure alignment. J. Comput. Biol., 5(3):517–30, 1998.

Fig. 8. Secondary structures of the RNase P molecules of R. eutrophus and S. bikiniensis. We show the parts of the structures that are not aligned by our algorithm only lightly. Notably, in the presented form, the algorithm produces small exclusions, which can be prevented easily, if wanted. Please also see Footnote d on page 11 on this issue. [Figure not reproduced.]

9. S. R. Eddy. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3(1):18, 2002.

10. J. Gorodkin, S. L. Stricklin, and G. D. Stormo. Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res., 29(10):2135–44, 2001.

11. M. Höchsmann, T. Töller, R. Giegerich, and S. Kurtz. Local similarity in RNA secondary structures. In Proc. of Comput. Sys. Bioinf. (CSB 2003), 2003.

12. T. Jiang, J. Wang, and K. Zhang. Alignment of trees - an alternative to tree edit. Theor. Comput. Sci., 143(1):137–148, 1995.

13. M. Gerstein and M. Levitt. Comprehensive assessment of automatic structural alignment against a manual standard, the SCOP classification of proteins. Prot. Sci., 7(2):445–56, 1998.

14. G. Lancia, R. Carr, B. Walenz, and S. Istrail. 101 optimal PDB structure alignments: a branch-and-cut algorithm for the maximum contact map overlap problem. In Proc. of 5th Int. Conf. on Comput. Mol. Biol. (RECOMB 2001), 2001.

15. J. Xu, M. Li, D. Kim, and Y. Xu. RAPTOR: Optimal protein threading by linear programming. J. Bioinf. Comput. Biol., 2003.

16. J. W. Brown. The ribonuclease P database. Nucleic Acids Res., 27(1):314, 1999.

17. C. Notredame, D. G. Higgins, and J. Heringa. T-coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302(1):205–17, 2000.