Using multiple alignments to improve seeded

local alignment algorithms

Jason Flannick* and Serafim Batzoglou

Department of Computer Science, Stanford University, Stanford, CA 94304, USA

Received June 8, 2005; Revised July 6, 2005; Accepted July 27, 2005

ABSTRACT

Multiple alignments among genomes are becoming

increasingly prevalent. This trend motivates the

development of tools for efficient homology search

between a query sequence and a database of multiple

alignments. In this paper, we present an algorithm that

uses the information implicit in a multiple alignment

to dynamically build an index that is weighted most

heavily towards the promising regions of the multiple

alignment. We have implemented Typhon, a local

alignment tool that incorporates our indexing algorithm,

which our test results show to be more sensitive

than algorithms that index only a sequence. This

suggests that when applied on a whole-genome

scale, Typhon should provide improved homology

searches in time comparable to existing algorithms.

INTRODUCTION

Sequence alignment is certainly one of the most well-

developed and pervasive topics of computational molecular

biology. Algorithms in this vein are widely used for tasks

varying from the comparative analysis of rodent (1–5) and

chicken (6) genomes to the construction of networks of protein

interactions (7). With the current sequencing of many genomes

(8), fast and sensitive sequence alignment algorithms will

likely maintain or increase their role in biological research.

As more and more genomic data has become available,

algorithms for locally aligning query sequences to genomic

databases have become increasingly important (9–13).

Because the exact Smith–Waterman (14) algorithm is imprac-

tical for large sequences, database search techniques are

almost always based upon the paradigm of seeded alignments.

The BLAST algorithm (10) was pivotal in popularizing

such a technique, and it has since been incorporated into

many tools, a few of which are BLASTZ (4), BLAT (13)

and Exonerate (15). In such algorithms, a set of seeds is

first generated between the database and the query. Each

seed is then extended to determine whether it is part of a

high-scoring local alignment. Extensions typically consist

of two phases: first the seed is extended into an un-gapped

alignment, and if this alignment scores above a threshold, the

seed is then extended with the allowance of gaps. An enhance-

ment to this simple model is to extend only pairs of seeds close

to each other (11). Seeds for the BLAST algorithm are tradi-

tionally fixed-length words present in both the database and

the query, with the word length referred to as the seed’s

weight. This leads to an inevitable speed/sensitivity trade-

off; heavier seeds prune a larger fraction of the search

space but miss more alignments than do seeds with a smaller

weight.
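The seeding step described above can be sketched as follows. This is a minimal illustration of contiguous fixed-length word seeds only, not the BLAST implementation; the function names `build_word_index` and `find_seeds` are hypothetical, and extension of seeds is omitted.

```python
from collections import defaultdict


def build_word_index(database, weight):
    """Map every word of length `weight` in the database to its start positions."""
    index = defaultdict(list)
    for pos in range(len(database) - weight + 1):
        index[database[pos:pos + weight]].append(pos)
    return index


def find_seeds(index, query, weight):
    """Return (database_pos, query_pos) pairs where a query word occurs in the index."""
    hits = []
    for q in range(len(query) - weight + 1):
        for d in index.get(query[q:q + weight], ()):
            hits.append((d, q))
    return hits


database = "ACGTACGT"
query = "TACG"
heavy = find_seeds(build_word_index(database, 4), query, 4)
light = find_seeds(build_word_index(database, 3), query, 3)
```

On this toy input the heavier seed yields fewer candidate positions to extend than the lighter one, which is the speed/sensitivity trade-off in miniature.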

In recent years, the introduction of spaced seeds has led to

significantly improved local alignment algorithms (12,16–20).

Spaced seeds allow non-contiguous patterns of matching

nucleotides to initiate a local alignment, and algorithms

have been developed (17–22) to compute the probability

that a seed will be found within an un-gapped alignment of

a given length between two sequences. The optimal seed can

then be chosen as the seed that maximizes this probability. It

is useful to think of un-gapped alignments of homologous

regions as being generated by a probabilistic model that

specifies the distribution over matches and mismatches

(17,20,22). The model outputs a bit string where each position

corresponds to a position in the alignment; the bit is 1 if

there is a match in the alignment and 0 if there is a mismatch.

While higher-order models are possible (19), in this article

we will focus on models that output a 1 independently in each

position with a fixed probability, which is called the similarity

level (12).
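Under this independent-position model, the probability that a seed hits an un-gapped alignment can be computed exactly with a small dynamic program over the most recent bits. The sketch below is one straightforward way to do this and is not taken from the cited algorithms; the function name `hit_probability` is hypothetical, and the state space grows as 2^(span - 1), so it is only practical for short seeds.

```python
from collections import defaultdict


def hit_probability(seed, length, similarity):
    """Probability that `seed` matches somewhere in a random bit string.

    seed: increasing tuple of offsets with seed[0] == 0.
    length: length of the un-gapped alignment (bit string).
    similarity: probability that each bit is 1, independently.
    """
    span = seed[-1] - seed[0] + 1
    # Bit (span - 1) of the window is the oldest position, bit 0 the newest.
    mask = sum(1 << (span - 1 - (i - seed[0])) for i in seed)
    keep = (1 << (span - 1)) - 1
    # no_hit[state] = probability of reaching `state` without any hit so far.
    # Positions before the alignment start are treated as 0 bits; because the
    # seed always requires its oldest position, this cannot create false hits.
    no_hit = {0: 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for state, prob in no_hit.items():
            for bit, q in ((1, similarity), (0, 1.0 - similarity)):
                window = (state << 1) | bit
                if window & mask == mask:
                    continue  # the seed matches here; drop from the no-hit mass
                nxt[window & keep] += prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())
```

Evaluating this function for several candidate seeds of equal weight at a fixed length and similarity level is one way to compare them.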

In addition to being provably more sensitive than consecu-

tive seeds in some cases (21), spaced seeds allow an important

new speed/sensitivity trade-off. Rather than lowering the

weight of a seed to boost sensitivity, one can index multiple

seeds per position and obtain a linear, rather than exponential,

rise in the size of the search space (18,20). Spaced seed design

operates under a resource-constrained paradigm (17), where

the weight and number of seeds is specified and the goal is to

design an optimal set of seeds that fits these constraints.
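The linear-versus-exponential contrast can be made concrete with a back-of-the-envelope count of chance seed matches. The sketch assumes uniformly distributed, independent nucleotides, which is a simplification introduced here; `expected_random_hits` is a hypothetical helper.

```python
def expected_random_hits(db_len, query_len, weight, num_seeds=1):
    """Approximate number of chance seed matches between two random sequences."""
    return num_seeds * db_len * query_len / 4 ** weight


base = expected_random_hits(10**6, 10**3, weight=11)
lighter = expected_random_hits(10**6, 10**3, weight=10)              # weight lowered by one
two_seeds = expected_random_hits(10**6, 10**3, weight=11, num_seeds=2)
```

Each unit of weight removed multiplies the chance matches by four, whereas each additional seed of the same weight only adds one more multiple of the base count.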

*To whom correspondence should be addressed. Tel: +1 650 289 0295; Fax: +1 650 725 1449; Email: flannick@cs.stanford.edu

Correspondence may also be addressed to Serafim Batzoglou. Tel: +1 650 723 3334; Fax: +1 650 725 1449; Email: serafim@cs.stanford.edu

© The Author 2005. Published by Oxford University Press. All rights reserved.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access

version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press

are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but

only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oupjournals.org

Nucleic Acids Research, 2005, Vol. 33, No. 14 4563–4577

doi:10.1093/nar/gki767

In this article, we seek to build on these developments by

taking advantage of increasing amounts of available genomic

data as well as rapidly improving global multiple sequence

alignment algorithms (23–26). We predict that in the near

future, these trends will lead to the proliferation of genomic

databases consisting of multiple alignments. Information

implicit in an alignment has been used to aid in a variety

of bioinformatics tasks (27–30), and, similarly, one can

hope that a multiple alignment can be utilized to improve

database search algorithms. Previous research on searching

between multiple alignments has concentrated on position

specific scoring schemes (11,31,32). PSI-BLAST (11) is the

most popular such program; given a query sequence, it builds

a multiple alignment, or profile, from a set of high scoring

alignments of the query to the database. It then uses the con-

structed profile to iterate searches for improved sensitivity.

Approaches in this vein have been successful, but in this

paper, our focus is orthogonal to such techniques.

The problem we tackle is to align a query sequence to a fixed

multiple alignment database. As an example, it may be desir-

able to augment a multiple alignment of mammalian genomes

with a newly sequenced mammalian or vertebrate genome.

Our approach uses the multiple alignment database to improve

search sensitivity over that obtained using only a sequence

database. To do this, we extend the resource-constrained

paradigm to apply not only to seed design but also to

seed allocation; we allow different positions in the database

to index different sets of seeds and determine the best way

to do so based on the information implicit in the multiple

alignment.

We have implemented a local alignment tool, Typhon,

which incorporates our indexing algorithm. Tests on

real-world data show that Typhon is substantially more sensitive

than standard sequence indexing algorithms as well as

algorithms that index multiple alignments without using our

dynamic indexing methodology. The performance improve-

ment is most dramatic for indexes with a small number of

spaced seeds, which is important for large-scale database

searches. Source code for Typhon is available under the

GNU public license at http://typhon.stanford.edu.

ALGORITHMS

Overview

We are initially given a multiple alignment, a hypothetical

query sequence and a phylogenetic tree of all species in the

alignment as well as the query. We convert the multiple align-

ment into a probabilistic profile, where each position in the

profile is a tuple of six numbers $(p_{\mathrm{present}}, p_A, p_C, p_T, p_G, p_{\mathrm{id}})$.

The first number, $p_{\mathrm{present}}$, is the existence probability. It represents

the probability that a homologue of the position is present

in the query or, equivalently, the probability that the query

would align to the position without a gap. The next four values

indicate the respective conditional probabilities that the homologous

position contains an A, C, T or G, given that there exists

a homologous position in the query. The nucleotide with the

highest such value is called the consensus character; gaps are

ignored when determining this. Note that the consensus character

in principle depends on the position of the query in the

tree. The final value, $p_{\mathrm{id}}$, is the conditional similarity level (12)

of the position given that there exists a homologous position in

the query. In other words, it represents the probability that the

corresponding position in the query sequence contains the

consensus character of the multiple alignment. For the remainder

of this paper, we will use the terms profile and probabilistic

profile interchangeably.

To begin with, we define several terms. A seed of weight

w is a sequence of possibly non-consecutive positions

$(i_1 < \cdots < i_w)$; by convention, $i_1 = 0$. The span of a seed is

defined as $i_w - i_1 + 1$. We define an un-gapped homology

$h$ of length $l$ beginning at position $p$ in sequence $s$ and position

$q$ in sequence $t$ as two sub-sequences $(s[p], \ldots, s[p + l])$ and

$(t[q], \ldots, t[q + l])$ that have descended from a common ancestor;

one can think of such a homology as a bit string of length

$l$ with $h[i] = 1$ if and only if $s[p + i] = t[q + i]$. A seed is said

to match the homology at offset $j$ if $h[j + i_1] = 1, \ldots,$

$h[j + i_w] = 1$, and indexing a seed at position $p$ in a sequence

corresponds to recording in the index the presence of

$(s[p + i_1], \ldots, s[p + i_w])$ at position $p$. A seed matches a

homology if we index the seed at every position in the homology

and the seed matches the homology at at least one offset. A

set of seeds matches a homology if we index every seed in the

set at every position in the homology and at least one seed in

the set matches the homology.
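These definitions translate almost directly into code. The following sketch uses hypothetical function names and represents a seed as a tuple of offsets and a homology as a list of 0/1 values.

```python
def seed_span(seed):
    """Span of a seed given as an increasing tuple of offsets (i_1, ..., i_w)."""
    return seed[-1] - seed[0] + 1


def homology_bits(s, t):
    """Bit string of an un-gapped homology: 1 where the two sequences agree."""
    return [1 if a == b else 0 for a, b in zip(s, t)]


def matches_at(seed, h, j):
    """True if every seed position, shifted by offset j, falls on a 1 in h."""
    return j + seed[-1] < len(h) and all(h[j + i] == 1 for i in seed)


def seed_matches(seed, h):
    """True if the seed matches the homology at at least one offset."""
    return any(matches_at(seed, h, j) for j in range(len(h) - seed_span(seed) + 1))


def index_key(s, p, seed):
    """The characters recorded in the index when the seed is indexed at position p."""
    return "".join(s[p + i] for i in seed)


s, t = "ACGTAC", "ACTTAG"
h = homology_bits(s, t)
spaced, contiguous = (0, 1, 3), (0, 1, 2)
```

In this toy example the spaced seed matches the homology although the contiguous seed of the same weight does not, because the single mismatch falls on a position the spaced seed ignores.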

We extend these notions to a multiple alignment in a simple

manner; homologies are defined to exist between the query

and the alignment. An un-gapped homology beginning at

position p in the alignment and position q in the query is a

set of sub-sequences, one from each species in the alignment as

well as the query, that have all descended from a common

ancestor. In this case, we define $h[i] = 1$ if and only if the

consensus character at position $p + i$ in the alignment matches

the query character at position $q + i$. Indexing a seed at a

position in a multiple alignment corresponds to recording

the presence of the string consisting of the consensus characters

of the multiple alignment. The notion of a seed matching a

homology follows from these definitions. We observe that

more complex definitions of homology between a query and

an alignment are possible, but we do not address them here.

With these definitions, we can formulate our problem as

follows:

Given a probabilistic profile, a set of candidate seeds and a

budget B, index a subset of the candidate set at each position in

the profile such that the average number of seeds indexed per

position of the database does not exceed B. The goal is to

maximize the expected number of homologies matched by at

least one seed.

Without a budget constraint, this value would obviously be

maximized by indexing every seed in the candidate set at every

position in the alignment. The value of the budget determines

the size of the index and, therefore, the expected number of

seed matches; higher values will result in more seed extensions

and therefore lead to larger running times. Algorithms that

build indexes from sequence databases assign the same set

of seeds, with cardinality equal to the budget, to each position

in the database. Because we have extra information in

the multiple alignment, we can be more flexible. Intuitively,

we would like to assign more seeds to positions where

this increase is most likely to result in additional detected

homologies. We must respect the constraint that the average

number of seeds indexed per position does not exceed our

budget.
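The budget constraint itself is simple to state in code. The sketch below only checks the constraint for a given assignment and does not choose one; the region lengths and seed counts are invented for illustration, and the helper names are hypothetical.

```python
def average_seeds_per_position(regions):
    """regions: list of (length, seeds_assigned) pairs covering the database."""
    total_positions = sum(length for length, _ in regions)
    return sum(length * seeds for length, seeds in regions) / total_positions


def within_budget(regions, budget):
    """True if the average number of seeds indexed per position is at most `budget`."""
    return average_seeds_per_position(regions) <= budget


uniform = [(600, 2), (400, 2)]      # same two seeds everywhere
reallocated = [(600, 1), (400, 3)]  # fewer seeds in one region, more in the other
```

Both assignments respect a budget of 2 seeds per position, but the second concentrates seeds in the shorter region, which is the kind of flexibility the multiple alignment is meant to guide.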

Within a multiple alignment, local rates of conservation

vary widely due to both random fluctuations in the number

of accumulated mutations as well as differential selective

pressures. Both of these effects cause some portions of the

multiple alignment to be naturally less likely to contain a

match to a homologue in a query sequence. Our algorithm

exploits this property to determine how to vary the subset of

candidate seeds indexed at each position. Specifically, we use

local conservation rates to partition the multiple alignment into

regions, which are contiguous blocks of positions whose

boundaries reflect changes in the conservation level among

species in the alignment (Figure 1).

Our approach, then, is to change the set of seeds assigned on

a region-by-region basis. By assigning fewer seeds to unpro-

mising regions, we can pay more attention to promising

regions and increase sensitivity while still respecting our bud-

get. A high-level outline of our algorithm is shown in Figure 2.

First, we convert the multiple alignment into a probabilistic

profile. We then use a hidden Markov model to partition the

profile into a set of regions where each region gives us the

necessary information to evaluate the probability that a

homology will exist in that region as well as the probability

that a seed will match the homology. Finally, we choose the set

of seeds to assign to each position based upon the region to

which it belongs. We aim to assign enough seeds to ensure a

high probability of finding a match, but not so many that we waste

budget that could be used more effectively elsewhere.

We note that in theory one could use a different candidate

set of seeds for each position. Such an approach would be

particularly useful in cases where highly variable similarity

levels strongly influence the optimal shape of the candidate

seeds. A typical case is coding exons, where seeds with sig-

nificant 3-periodicity such as the (0,1,2,8,9,11,12,14,15,17,18)

seed (17) have been shown to perform well because they

accommodate 3rd-base wobble positions. We do not pursue

these options here; rather, we fix the candidate set for all

regions and vary only the number of seeds assigned to each

position.

Generation of the profile

Our first goal is to convert the alignment into a probabilistic

profile. We assume that we are given a phylogenetic tree and

the position in the tree at which we expect the query to lie; we

root this tree at the query. For each position, we work our way

up from the leaves, obtaining $p_{\mathrm{present}}$ and $p_N$ for each nucleotide

$N \in \{A, C, T, G\}$ at each node in the tree; we obtain $p_{\mathrm{id}}$

only at the root. A leaf has $p_{\mathrm{present}} = 1$ if the corresponding

sequence is un-gapped at the position in the multiple alignment

and 0 otherwise. Furthermore, it has probability $p_N = 1$

for the nucleotide present in the sequence; if $p_{\mathrm{present}}$ is 0, then

we set $p_N = 0$ for all $N$.

As we work up the tree, we obtain $p_{\mathrm{present}}$ and $p_N$ independently.

For each internal node, we obtain $p_N$ by applying

Felsenstein’s algorithm with a Kimura rate matrix (34–36).

Since some children can have $p_N = 0$ for each nucleotide, we

consider only children for which at least one $p_N$ is positive.

Figure 1. Sample region boundaries. Boundaries between the two regions in the

multiple alignment reflect the changes in conservation among the species in the

alignment. Both (a) and (b) are taken from real data. In the first case, the latter

region is more likely to yield alignments to a query sequence, while in the

second case, the former region is more likely to yield alignments.

Figure 2. High-level diagram of the Typhon algorithm for indexing a multiple

alignment. The overall flow of Typhon consists of three main algorithmic

components; above, data is shown in ovals and methods are shown in rectangles.

Given a tree and query, the multiple alignment is first converted into a probabilistic

profile. Then, the profile is decoded recursively using a simple Hidden

Markov Model. Finally, the regions are assigned a set of seeds to index.

The task of obtaining $p_{\mathrm{present}}$ is somewhat more problematic

because there is not as well-developed an evolutionary theory

for insertions and deletions as there is for nucleotide substitutions.

For a node $n$, our method sets $p_{\mathrm{present}}(n)$ to be a weighted

average $\sum_i w_i \cdot p_{\mathrm{present}}(n_i)$, where the sum is taken over all

children $n_i$ of $n$. We choose the weight assigned to a child to be

proportional to the inverse of the branch length between the

parent and the child and normalize the weights to sum to one.
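The weighted average for the existence probability can be sketched as follows. The function name `p_present_internal` is hypothetical, and branch lengths are assumed to be strictly positive.

```python
def p_present_internal(children):
    """Existence probability of an internal node.

    children: list of (p_present_of_child, branch_length_to_child) pairs.
    Each child is weighted by the inverse of its branch length, and the
    weights are normalized to sum to one.
    """
    inverse = [1.0 / branch for _, branch in children]
    total = sum(inverse)
    return sum((w / total) * p for w, (p, _) in zip(inverse, children))


# An un-gapped leaf (p_present = 1) on a short branch and a gapped leaf
# (p_present = 0) on a branch three times as long.
value = p_present_internal([(1.0, 0.1), (0.0, 0.3)])
```

The closer child dominates the average, so the parent inherits most of its existence probability from the un-gapped leaf.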

When we have reached the root, we choose $p_{\mathrm{id}}$ as

$\max_{N \in \{A, C, T, G\}}(p_N)$. This is because the maximum $p_N$ corresponds

to the conditional probability that the consensus

character of the multiple alignment is present at the homologous

position in the query, given that there is a homologous

position in the query. Because at a given position in the multiple

alignment we choose the consensus character as the character

to record in the index, the average value of $p_{\mathrm{id}}$ obtained in

such a way over a region will correspond to the conditional

similarity level of the alignment of that region to a homologue

in the query.
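One plausible reading of the nucleotide computation is sketched below: Felsenstein-style pruning with the Kimura two-parameter substitution probabilities, followed by taking the maximum at the root. The rate values `alpha` and `beta`, the uniform prior implied by normalizing the likelihoods, and all function names are assumptions made for this illustration rather than details given in the text.

```python
import math

NUCLEOTIDES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k80(x, y, t, alpha=1.0, beta=0.5):
    """Kimura two-parameter probability that x becomes y over branch length t."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    if x == y:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if (x, y) in TRANSITIONS:
        return 0.25 + 0.25 * e1 - 0.5 * e2
    return 0.25 - 0.25 * e1


def node_p_n(children):
    """Normalized nucleotide probabilities at a node.

    children: list of (p_N dict, branch_length) pairs. Children whose p_N is
    zero for every nucleotide (gapped lineages) are skipped.
    """
    likelihood = {x: 1.0 for x in NUCLEOTIDES}
    used = False
    for p_n, t in children:
        if not any(v > 0 for v in p_n.values()):
            continue
        used = True
        for x in NUCLEOTIDES:
            likelihood[x] *= sum(k80(x, y, t) * p_n[y] for y in NUCLEOTIDES)
    if not used:
        return {x: 0.0 for x in NUCLEOTIDES}
    total = sum(likelihood.values())
    return {x: likelihood[x] / total for x in NUCLEOTIDES}


def consensus_and_p_id(root_p_n):
    """Consensus character and p_id = max_N p_N at the root."""
    consensus = max(NUCLEOTIDES, key=lambda x: root_p_n[x])
    return consensus, root_p_n[consensus]


leaf_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
gapped = {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0}
root = node_p_n([(leaf_a, 0.2), (leaf_a, 0.3), (gapped, 0.1)])
consensus, p_id = consensus_and_p_id(root)
```

Normalizing at every node only rescales the likelihoods by a constant, so it does not change the final probabilities at the root.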

It turns out that reconstructing the profile in this manner

yields slightly inaccurate values in practice. In particular,

it tends to overestimate actual values of $p_{\mathrm{id}}$, which presents

difficulties because small changes in the value of $p_{\mathrm{id}}$ can have

large effects on the estimated hit probability of a seed (12,18).

Furthermore, such an estimate for $p_{\mathrm{present}}$ is heuristic and is not

guaranteed to yield accurate predictions.

A complete treatment of profile reconstruction is beyond the

scope of this paper. For our purposes, we plotted the predicted

versus experimental values of $p_{\mathrm{present}}$ and $p_{\mathrm{id}}$ for several species

and assessed accuracy. To do this, we began with a multiple

alignment, removed one species as the test species, and

converted the remaining alignment to a probabilistic profile

using the method described above. We then grouped the values

of $p_{\mathrm{present}}$ and $p_{\mathrm{id}}$ as predicted by the profile into a finite set of

buckets, each representing a discrete value.

For each discrete value of $p_{\mathrm{present}}$ and $p_{\mathrm{id}}$ as predicted by the

profile, we calculated the experimental values of $p_{\mathrm{present}}$ and

$p_{\mathrm{id}}$ for the test species as follows. To obtain the experimental

$p_{\mathrm{present}}$ for a given discrete predicted value, we first counted the

total number of positions in the profile at which $p_{\mathrm{present}}$ was

that value. Of those positions, we counted the number of

positions that were un-gapped in the test species. The experimental

value was then determined as the latter number divided

by the former. The experimental values of $p_{\mathrm{id}}$ were obtained

similarly.
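The bucketed comparison for the existence probability can be sketched as follows. The number of buckets and the function name `experimental_p_present` are assumptions for illustration; the text does not state how many discrete values were used.

```python
from collections import defaultdict


def experimental_p_present(predicted, ungapped, num_buckets=10):
    """Observed un-gapped fraction of the test species per predicted-value bucket.

    predicted: predicted p_present for each profile position (values in [0, 1]).
    ungapped: for each position, True if the held-out test species is un-gapped.
    Returns {bucket_index: observed_fraction}.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [positions, un-gapped positions]
    for p, present in zip(predicted, ungapped):
        bucket = min(int(p * num_buckets), num_buckets - 1)
        counts[bucket][0] += 1
        if present:
            counts[bucket][1] += 1
    return {b: hits / n for b, (n, hits) in sorted(counts.items())}


observed = experimental_p_present(
    predicted=[0.95, 0.92, 0.91, 0.15],
    ungapped=[True, True, False, False],
)
```

Plotting each bucket's predicted value against its observed fraction gives the kind of predicted-versus-experimental comparison described above.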

If our predictions were perfectly accurate, resulting plots of

experimental versus predicted values would show a linear

relationship with slope 1. Plots for $p_{\mathrm{present}}$ had an extremely

high variance and did not appear to follow an obvious pattern,

and, based upon these results, we kept our predictions for

$p_{\mathrm{present}}$ unchanged. Improved predictions are likely possible

and can only improve the performance of our algorithm; however,

they do not appear immediately available. The plots for

$p_{\mathrm{id}}$, on the other hand, did appear to follow a pattern; Figure 3

shows sample plots for $p_{\mathrm{id}}$ for two different test species, cat

and chicken, obtained from an alignment consisting of human,

chimp, baboon, dog, cat, pig, cow and chicken. These plots are

similar to plots obtained using other species as test cases.

We do not attempt to address any possible theoretical foun-

dations for the above plots here, as that would take us far from

our current focus. Currently, our chief concern is only to

convert our initial predictions into values that will work

well when given as input to our indexing algorithm, and we

found that fitting an exponential curve of the form

$g(x) = a e^{bx} + d$ to our data was more than adequate for

this purpose in practice. We chose $a$ and $d$ to fix

$g(0) = 0.25$ and $g(1) = 1$; this leaves $b$ as an adjustable parameter.

Based upon our observations, a value of $b = 4$ seems to

work fairly well for a variety of species, and we fixed this

parameter for all of our tests.
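Assuming the curve has the form g(x) = a·exp(b·x) + d, the two endpoint conditions determine a and d once b is chosen: a + d = 0.25 and a·exp(b) + d = 1 give a = 0.75 / (exp(b) - 1) and d = 0.25 - a. A minimal sketch, with the hypothetical helper name `make_correction`:

```python
import math


def make_correction(b=4.0):
    """Return g(x) = a * exp(b * x) + d with g(0) = 0.25 and g(1) = 1."""
    a = 0.75 / (math.exp(b) - 1.0)
    d = 0.25 - a

    def g(x):
        return a * math.exp(b * x) + d

    return g


g = make_correction(b=4.0)
corrected = g(0.5)  # corrected value for an initial prediction of 0.5
```

With b = 4 the curve lies below the diagonal between the endpoints, so it lowers intermediate predictions, which is consistent with the observation that the initial estimates tend to be too high.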

Region decomposition

As mentioned above, one advantage of a multiple alignment is

that it delineates different regions, each of which can be char-

acterized by a conservation level among the species in the

alignment. Therefore, before building an index, our algorithm

partitions the probabilistic profile into a set of such regions.

For simplicity, we group regions into a finite number of region

classes; a region class is a pair of characteristics $(\bar{p}_{\mathrm{present}}, \bar{p}_{\mathrm{id}})$,

which represent typical values of $p_{\mathrm{present}}$ and $p_{\mathrm{id}}$, respectively,

for each region in the region class. All regions in a region class

index the same set of seeds.

Partitioning a profile into regions can be done with the aid of

a simple Hidden Markov Model, where the states are region

classes that emit values of $p_{\mathrm{present}}$ and $p_{\mathrm{id}}$. Similar ideas have

been explored before (19,36); for our purposes here, it is

enough that all regions in a region class possess roughly

the same properties. It is important that the cardinalities of

the region classes be roughly equal so that we have maximum

flexibility when assigning seeds; if one region class is enor-

mous, then in order to free enough budget to assign extra seeds

to it we must choose to assign fewer seeds to a large number of

smaller region classes, which may be undesirable.

We begin this section by considering a conceptually

straightforward approach for decomposing a profile in order

to introduce the basic ideas of our method. We then describe

how our particular approach extends this idea.

Suppose that we build an HMM consisting of states for each

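region class $(\bar{p}_{\mathrm{present}}, \bar{p}_{\mathrm{id}})$. Each state emits a position of the

profile with values $(p_{\mathrm{present}}, p_{\mathrm{id}})$ with probability proportional

to $\exp(-|p_{\mathrm{present}} - \bar{p}_{\mathrm{present}}|) \cdot \exp(-|p_{\mathrm{id}} - \bar{p}_{\mathrm{id}}|)$ and transitions

to every state other than itself with equal probability.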

This probability can be chosen to ensure that the optimal

Viterbi parse (33) gives no region shorter than a minimum

length; this length should be large enough to ensure that every

region is at least larger than the span of our seeds, and we

found that a minimum region length of 64 works reasonably

well for seeds of span approximately 20.

Once this HMM has been constructed, each position can be

assigned to the region class corresponding to the state that

emits it in the optimal parse obtained via the Viterbi algorithm.

A region boundary then occurs between two positions that

belong to different region classes.
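The basic decoding step can be sketched with a log-space Viterbi pass. The switching probability `p_switch`, the uniform start, and the function names are assumptions for illustration; emissions are left unnormalized, which does not affect which path is optimal when all states share the same emission form.

```python
import math


def viterbi_regions(profile, classes, p_switch=0.05):
    """Assign each profile position to a region class.

    profile: list of (p_present, p_id) pairs, one per position.
    classes: list of (class_p_present, class_p_id) pairs, one per state.
    p_switch: total probability of leaving a state, split equally among
              the other states.
    """
    k = len(classes)
    log_stay = math.log(1.0 - p_switch)
    log_move = math.log(p_switch / (k - 1)) if k > 1 else float("-inf")

    def log_emit(state, pos):
        # log of exp(-|dp_present|) * exp(-|dp_id|)
        return -abs(pos[0] - classes[state][0]) - abs(pos[1] - classes[state][1])

    def step(prev, cur):
        return log_stay if prev == cur else log_move

    score = [log_emit(s, profile[0]) for s in range(k)]
    back = []
    for pos in profile[1:]:
        new_score, pointers = [], []
        for s in range(k):
            best = max(range(k), key=lambda r: score[r] + step(r, s))
            new_score.append(score[best] + step(best, s) + log_emit(s, pos))
            pointers.append(best)
        score = new_score
        back.append(pointers)

    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def region_boundaries(path):
    """Positions at which the assigned region class changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


classes = [(0.9, 0.9), (0.2, 0.4)]
profile = [(0.9, 0.9)] * 5 + [(0.2, 0.4)] * 5
path = viterbi_regions(profile, classes)
```

Lowering `p_switch` makes short regions more expensive, which is the mechanism by which a minimum region length can be encouraged.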

This basic algorithm suffers from the problem that we must

determine the set of region classes at the beginning of the

algorithm in order to be able to construct the HMM. The

weakness of this approach is shown in Figure 4a. If we choose

to represent each position in the profile as a point $(p_{\mathrm{present}}, p_{\mathrm{id}})$,

then choosing a set of region classes is conceptually related to

partitioning the plane in which the positions lie. All points

contained in a rectangle in the partition are closest to the center

of a specific region class. By fixing the region classes a priori,

we will likely make choices that do not fit the structure of the

profile. The chief problem occurs when several region classes

are empty and one region class contains many more regions

than the others; in this case partitioning achieves little.

An alternative method is to adaptively choose region classes

to match the manner in which the positions are distributed, as

shown in Figure 4b. One way of doing this is to use k-means

clustering (37) and choose region classes corresponding to

the resulting clusters. This does not translate directly to our

problem, however, as choosing region classes in this manner

cannot predict how the profile will actually be decoded by

the HMM. Instead, we use a related algorithm that somewhat

corrects this problem. This is related to the fundamental idea

behind wavelets (38), which can analyze data dynamically by

decomposing a signal into pieces that can be represented at

different scales of resolution.
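For reference, the k-means alternative mentioned above can be sketched in a few lines. This illustrates only the clustering of profile points into candidate region classes, which, as noted, does not by itself predict how the HMM will decode the profile; it is not the algorithm adopted here. The function name, the fixed iteration count and the initial centers are assumptions.

```python
def kmeans(points, centers, iterations=20):
    """Lloyd's algorithm on 2-D points; returns the final cluster centers.

    points: list of (p_present, p_id) pairs, one per profile position.
    centers: initial guesses for the region-class centers.
    """
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for pt in points:
            nearest = min(
                range(len(centers)),
                key=lambda c: (pt[0] - centers[c][0]) ** 2 + (pt[1] - centers[c][1]) ** 2,
            )
            clusters[nearest].append(pt)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else old
            for cl, old in zip(clusters, centers)
        ]
    return centers


points = [(0.9, 0.9), (0.8, 0.9), (1.0, 0.9), (0.2, 0.4), (0.3, 0.4), (0.1, 0.4)]
region_classes = kmeans(points, centers=[(1.0, 1.0), (0.0, 0.0)])
```

The resulting centers adapt to where the positions actually lie, in contrast to a fixed a priori grid of region classes.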

Figure 3. Plots of predicted versus experimental profile values. For use in correcting predictions of profile values, we plotted predicted versus experimental values of

(a) $p_{\mathrm{id}}$ for cat and (b) $p_{\mathrm{id}}$ for chicken. Although not shown, we examined plots for other species, which are similar. Plots for $p_{\mathrm{present}}$ did not obey an immediate pattern

and thus did not lead us to change our predictions. Each cross represents a plotted data point; shown also is the function we used for converting our initial predictions of

$p_{\mathrm{id}}$ to our final predictions, as well as the linear fit that would be suitable if our predictions matched the experimental values.

Our algorithm is shown graphically in Figure 5. At a high

level, we progressively split the profile into regions belonging

to one of two region classes. We perform the decoding at each