Page 1

Inexact Local Alignment Search over Suffix Arrays

Mohammadreza Ghodsi and

Department of Computer Science, University of Maryland, College Park, MD 20742, USA

Mihai Pop

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD

20742, USA

Mohammadreza Ghodsi: ghodsi@cs.umd.edu; Mihai Pop: mpop@umiacs.umd.edu

Abstract

We describe an algorithm for finding approximate seeds for DNA homology searches. In contrast

to previous algorithms that use exact or spaced seeds, our approximate seeds may contain

insertions and deletions. We present a generalized heuristic for finding such seeds efficiently and

prove that the heuristic does not affect sensitivity. We show how to adapt this algorithm to work

over the memory efficient suffix array with provably minimal overhead in running time.

We demonstrate the effectiveness of our algorithm on two tasks: whole genome alignment of

bacteria and alignment of the DNA sequences of 177 genes that are orthologous in human and

mouse. We show our algorithm achieves better sensitivity and uses less memory than other

commonly used local alignment tools.

Keywords

local alignment; inexact seeds; suffix array

I. Introduction

Finding local matches between two long sequences is a fundamental problem in

computational biology. For example one might want to find similar regions in the genome

sequence of two organisms. Due to differences in the genome sequence of different

organisms or to sequencing error we have to look for approximate matches. Many

algorithms and tools have been created that address this need, yet finding local alignments

still remains one of the most computationally intensive steps in biological sequence analysis.

We define the local alignment search problem as follows: Given two strings S1 and S2 of

lengths n1 and n2, a minimum match length l and a maximum distance k, find all pairs of

substrings of S1 and S2 of length l that are within distance k. We assume all strings are from

a finite alphabet Σ of size σ. A few popular distance metrics are Hamming distance, unit-cost

edit distance (Levenshtein distance) and general edit distance based on a substitution cost

matrix.

A string matching problem is called offline if we are allowed to pre-process the text and

make an index data structure, and online if we are not allowed to pre-process the text. The

exact alignment problems can be solved optimally (meaning the running time is bounded by

NIH Public Access

Author Manuscript

Proceedings (IEEE Int Conf Bioinformatics Biomed). Author manuscript; available in PMC

2011 January 27.

Published in final edited form as:

Proceedings (IEEE Int Conf Bioinformatics Biomed). 2009 November 1; 2009(1-4): 83–97. doi:10.1109/

BIBM.2009.25.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 2

a linear function of the size of input strings) using eg. the KMP algorithm for the online and

suffix trees for the offline case. In this paper we will focus on the offline problem.

The online local alignment problem can be solved in O(n1 · n2 · l) time by using each length

l substring of S2 as a pattern and using a variation of well known Smith-Waterman [14]. This

running time, however, is impractical for the size of the data-sets commonly encountered in

practice.

In practice the most common solution to the inexact local alignment problem is to find seeds

and try to extend them to longer local alignments [2], [4], [6]. A seed is a short substring of

one sequence that matches closely to a substring of the other sequence. Seed and extend

algorithms are fast because they avoid trying to align every region of the two sequences and

only focus on the regions that are likely to align well. Two most common type of seeds used

are exact seeds and spaced seeds. Exact seeds need to match perfectly, whereas spaced seeds

allow mismatches at some positions. Neither allow insertions and deletions within the seed.

In this work we look at more general inexact seeds that are based on the same notion of

similarity as in Smith-Waterman [14]. We describe an algorithm for finding these inexact

seed, and show that an implementation of this algorithm has good performance in practice.

Since our algorithm is based on dynamic programming it can use a general substitution cost

matrix for distance computation.

Seed finding algorithms are evaluated by their sensitivity and specificity. Sensitivity

corresponds to fraction of true local alignments that are found by the seeds. Specificity

corresponds to fraction of reported seeds that can be extended to true local alignments.

In the case of exact seeds, their sensitivity and specificity depends on only parameter: seed

length. Spaced seeds provide an additional parameter - weight - the number of characters

within the seed that must match within the aligned strings. For the same weight spaced seeds

achieve better sensitivity than exact seeds without sacrificing specificity. Computing

sensitivity of a spaced seed is, however, NP-hard [10]. Most spaced seed tools use a pre-

computed seed or set of seeds, making it difficult to tune their parameters at run-time in

order to achieve the best trade-off between sensitivity and specificity for a given alignment

task.

We will prove that our seed finding algorithm is perfectly sensitive i.e. if a good enough

local alignment is present it will be detected by our seeds. Furthermore since we allow

inexact seeds, by using longer seeds we can achieve better specificity too. With our

algorithm the user can pick any seed length and similarity at running time without the need

to re-compute the index.

Most of the tools used in practice today are based on one of two index data structures for

large strings; inverted k-mer index and the family of indexes related to suffix trees (suffix

arrays and Burrows-Wheeler index.) BLAST [2] uses an inverted index of exact seeds.

MUMmer [5] is a whole genome alignment tool that uses suffix trees. Bowtie [9] uses a

Burrows-Wheeler index of the human or other genome and can align short reads with few

differences to the reference. Our algorithm uses a suffix array index.

II. Suffix Array Search Algorithm

Our algorithm systematically tries to compute dynamic programming score (similar to

Needleman-Wunsch [13]) for aligning every pair of substrings of S1 and S2. We assume the

“cost” of aligning two characters is zero if they are identical and is some positive number

otherwise. What makes this algorithm fast is that it gives up as soon as it notices that the

alignment cost (i.e. distance or score) is getting “too high”.

Ghodsi and PopPage 2

Proceedings (IEEE Int Conf Bioinformatics Biomed). Author manuscript; available in PMC 2011 January 27.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 3

A. Recursive Search of Suffix Trie

It is easier to explain our algorithm using the Suffix Trie data structure. A suffix trie is a

rooted tree data structure that stores every suffix of a string. Unfortunately this

representation requires up to Θ(n2) space in general which makes it impractical for long

sequences. In the next section we will adapt this algorithm to use suffix array which require

only linear space. Let us assume we have built a suffix trie for each of S1 and S2. A

representation of an example suffix trie is shown in Figure 1. Each node of the suffix trie

corresponds to a substring of the original string that is spelled out by the labels of the edges

on the unique path from root to that node. An exact search algorithm starts at the root and

follows down the edges labeled with the same character in both trees. In other words, if the

search function is called on some internal node u from the first trie and v from the second

trie, then for each letter of the alphabet if both u and v have a child labeled with that letter

the search function is called recursively on the corresponding child nodes. After traversing l

edges down from the root (reaching depth l) the leaves under the internal node of each tree

represent suffixes from each sequence that share the first l characters exactly. Thus, we find

exact “alignments” of length l.

In the case of approximate search however we may have to follow edges labeled with

characters that are different in the two suffix tries. Therefore we have to call the search

function recursively for every pair of children of u and v. Without limiting the number of

mismatches allowed this recursive search (which is called for every pair of children) will try

to align every substring of length l from S1 to every substring of S2. A simple solution is to

stop following down branches from a pair of internal nodes as soon as the distance between

the two strings corresponding to those nodes is already greater than some threshold (also

when we reach depth l of course). i.e. as soon as the alignment cost becomes “too high” we

simply return from recursion and will not call search function for the children of current

nodes. We calculate distance by updating a dynamic programming table on the fly as

illustrated in Figure 3. We only need to update one row and one column of the dynamic

programming table for each edge traversal. If the distance is smaller than the threshold we

recursively call the search function for every pair of children of the two current internal

nodes. When we reach depth l we report all pairs of leaves under the two nodes. The cost

threshold can either be a constant or for example a function of the length of the alignment so

far.

A naive choice is to use a constant threshold equal to k, i.e. to stop when the distance

between strings corresponding to the two current nodes is greater than k. This strategy is

slow in practice however, because for most practical sequences the top levels of the suffix

trie are nearly full (each node has nearly σ children), whereas deep nodes (depth ≥ logσ n) of

the trie have about one child on average. The constant threshold strategy will spend lots of

time traversing top levels of the trie allowing many errors in the first few characters of the

alignment. In fact for example, in the unit-cost edit distance model, it aligns every substring

of length k of the first sequence to every substring of length k of the other sequence. The

main intuition is that we can look for shorter seeds with the desirable property that they do

not have many differences in the beginning. This causes the algorithm to backtrack more

easily on the first few levels of the tree and significantly improves the running time.

We show in Lemma 1 that if L1 and L2 (e.g. substrings of S1 and S2, respectively) of length l

match with distance ≤ k then there exists a seed E1 of length e (1 ≤ e < l) in L1 that matches

a substring E2 in L2 in such a way that the cost of edits necessary to match any prefix of E1

to some prefix of E2 is never more than times the length of the prefix. This is a

generalization of exact seeds. This lemma proves that our seed heuristic does not affect the

sensitivity of the algorithm at all.

Ghodsi and PopPage 3

Proceedings (IEEE Int Conf Bioinformatics Biomed). Author manuscript; available in PMC 2011 January 27.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 4

After finding the seeds one should verify that they can be extended to alignments of length l.

(It is also possible to use these seed directly in a chaining algorithm if we desire a global

alignment). This can be done in a post-processing step. But we assume for sufficiently long

seeds (e ≥ logσ n) the number of seed hits are small enough that the running time will be

dominated by the seed search phase. Of course this is not true for highly repetitive

sequences.

Lemma 1—If L1 of length l matches L2 with cost ≤ k, then for any seed length e, 1 ≤ e < l

there exists a seed E1 a substring of L1 of length e and E2 a substring of L2 such that for

every prefix of E1 of length j, E1 [1..j] matches some prefix of E2 with distance.

Proof: Given an alignment of L1 to L2 with distance ≤ k let f (i) be a the edit distance of L1

[1..i] from the prefix of L2 to which it is aligned. See Figure 4. It is easy to see that f is a

non-decreasing function, and by definition f (l) ≤ k and f (0) = 0.

We need to show that there exists an i ∈ {0, …, l − e} such that for all j = 1 … e,

.

As a way of contradiction assume for all i = 0 … (l − e), there exists j ∈ {1 … e} such that

or equivalently .

In particular for i1 = 0 there exists j1 such that

there exists j2 such that

on up to some z such that l − e < 0 + j1 + j2 + … + jz ≤ l. We have:

, and for i2 = (0 + j1)

, and so

Which implies f (l) > k which is a contradiction.

B. Adapting the Algorithm to Suffix Arrays

A suffix array [12] is the list of indexes all suffixes of a string in lexicographically sorted

order, as illustrated in Figure 2(b). A suffix array can be built in linear time 1 and occupies n

log2 n bits. Even though theoretically asymptotic memory requirements of suffix arrays and

suffix trees is the same, in practice highly optimized implementations of suffix trees require

over 10 bytes of memory per input character [8] whereas a basic suffix array implementation

requires just 4+1 bytes per character2. Space requirements of suffix arrays can be further

reduced to O(n) bits using compressed suffix arrays. Finally it has been shown that by

storing some auxiliary tables, Enhanced Suffix Arrays [1] can be used to simulate any type

of traversal of suffix trees in the same time complexity.

Our algorithm over suffix arrays is inspired by the simplest exact search algorithm on a

suffix array which is in essence a binary search. The main property of the suffix array that

our algorithm takes advantage of is that if some prefix of the suffix pointed to by ith element

of suffix array is equal to that of the suffix pointed to by j the element of suffix array, then

1Assuming log n is smaller than the word size of the machine and operations on them takes constant time

2Assuming length of input string can be stored in a 32 bit integer.

Ghodsi and PopPage 4

Proceedings (IEEE Int Conf Bioinformatics Biomed). Author manuscript; available in PMC 2011 January 27.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 5

for every k ∈ [i … j], every suffix pointed to by the kth element of suffix array shares the

same prefix. This property easily follows from the fact that the suffixes are lexicographically

sorted. The recursive algorithm described over suffix tries can be adapted to work on suffix

arrays with at most a log2 n factor increase in running time as will be shown in Lemma 2.

The main algorithm is as follows: We maintain two windows of the two suffix arrays and

the depth (length of the prefix) that we have aligned so far, If the characters at this depth

from both sides of the current window match (for each window of the two suffix arrays),

there is only one character to follow down in this window so we update the dynamic

programming table and call search for depth+1, otherwise we simply divide the window (for

which the characters from both sides at current depth do not match) in two equal windows

and recursively find the matches on both halves. The window size will always remain

greater than or equal to one.

The main search function

void

sasearch (

int

l1,

int

r1,

int

l2,

int

r2,

int

depth);

recursively finds all approximate matches between two windows of the two suffix arrays ([l1

… r1] in the first suffix array and [l2 … r2] in the second) assuming the first “depth-1”

characters in each window are the same. And the distance between the shared prefix of

suffixes in first window and shared prefix of suffixes in the second window is stored in a

global dynamic programming table like Figure 3.

Lemma 2—The number of calls to the recursive suffix array search function is at most log2

n times the number of suffix trie nodes visited by recursive search algorithm, where n is the

length of string.

Ghodsi and Pop Page 5

Proceedings (IEEE Int Conf Bioinformatics Biomed). Author manuscript; available in PMC 2011 January 27.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript