Procrastination Leads to Efficient Filtration for Local Multiple Alignment.
Aaron E. Darling1†, Todd J. Treangen2†, Louxin Zhang5, Carla Kuiken6,
Xavier Messeguer2, Nicole T. Perna4
1Dept. of Computer Science, Univ. of Wisconsin-Madison, USA
darling@cs.wisc.edu,
2Dept. of Computer Science, Technical Univ. of Catalonia, Barcelona, Spain
treangen@lsi.upc.edu,
3Dept. of Animal Health and Biomedical Sciences, Genome Center,
Univ. of Wisconsin-Madison, USA
4Dept. of Mathematics, National University of Singapore, Singapore
5T-10 Theoretical Biology Division, Los Alamos National Laboratory, USA
†These authors contributed equally to this work
Abstract. We describe an efficient local multiple alignment filtration heuristic for identification of conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low multiplicity matches are encountered. The resulting local multiple alignments may have nucleotide substitutions and internal gaps as large as w characters in any occurrence of the motif. The algorithm consumes O(wN) memory and O(wN log wN) time where N is the sequence length. We score the significance of multiple alignments using entropy-based motif scoring methods. We demonstrate the performance of our filtration method on Alu-repeat rich segments of the human genome and a large set of Hepatitis C virus genomes. The GPL implementation of our algorithm in C++ is called procrastAligner and is freely available from http://gel.ahabs.wisc.edu/procrastination
1 Introduction
Pairwise local sequence alignment has a long and fruitful history in computational biology and new approaches continue to be proposed [1–4]. Advanced filtration methods based on spaced seeds have greatly improved the sensitivity, specificity, and efficiency of many local alignment methods [5–9]. Common applications of local alignment range from orthology mapping [10] to genome assembly [11] to information engineering tasks such as data compression [12]. Recent advances in sequence data acquisition technology [13] provide low-cost sequencing and will continue to fuel the growth of molecular sequence databases. To cope with advances in data volume, corresponding advances in computational methods are necessary; thus we present an efficient method for local multiple alignment of DNA sequence.
[Figure 1: the seed pattern 1*1*1 applied to the example sequence ACAGCTAGCATGGCAA...GTTACCTAG. Step 1: apply the seed pattern at each position to extract either the forward or reverse seed. Step 2: hash seeds to identify matches of two or more seeds.]
Fig. 1. Application of the palindromic seed pattern 1*1*1 to identify degenerate matching subsequences in a nucleotide sequence of length N. The lexicographically lesser of the forward and reverse complement subsequence induced by the seed pattern is used at each sequence position.
Unlike pairwise alignment, local multiple alignment constructs a single multiple alignment for all occurrences of a motif in one or more sequences. The motif occurrences may be identical or have degeneracy in the form of mismatches and indels. As such, local multiple alignments identify the basic repeating units in one or more sequences and can serve as a basis for downstream analysis tasks such as multiple genome alignment [14–17], global alignment with repeats [18, 19], or repeat classification and analysis [20]. Local multiple alignment differs from traditional pairwise methods for repeat analysis, which either identify repeat families de novo [21] or use a database of known repeat motifs [22].
Previous work on local multiple alignment includes an Eulerian path approach proposed by Zhang and Waterman [23]. Their method uses a de Bruijn graph based on exactly matching k-mers as a filtration heuristic. Our method can be seen as a generalization of the de Bruijn filtration to arbitrary spaced seeds or seed families. However, our method employs a different approach to seed extension that can identify long, low-copy-number repeats.
The local multiple alignment filtration method we present has been designed to efficiently process large amounts of sequence data. It is not designed to detect subtle motifs such as transcription factor binding sites in small, targeted sequence regions; stochastic methods are better suited for such tasks [24].
2 Overview of the Method
Our local multiple alignment filtration method begins by generating a set of candidate multi-matches using palindromic spaced seed patterns (listed in Table 1). The seed pattern is evaluated at every position of the input sequence, and the lexicographically lesser of the forward and reverse complement subsequence induced by the seed pattern is hashed to identify seed matches (Figure 1). The use of palindromic seed patterns offers computational savings by allowing both strands of DNA to be processed simultaneously.
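The seeding step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procrastAligner implementation: the function names (`reverse_complement`, `seed_key`, `seed_matches`) are hypothetical, positions are 0-based, and a plain dictionary stands in for whatever hash structure the real program uses.

```python
from collections import defaultdict

# Complement table for the DNA alphabet {A, C, G, T}.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]


def seed_key(window, pattern):
    """Lexicographically lesser of the forward and reverse-complement seed.

    `pattern` is a palindromic spaced seed such as "1*1*1"; only positions
    marked '1' contribute characters to the seed.
    """
    assert pattern == pattern[::-1], "pattern must be palindromic"
    fwd = "".join(c for c, p in zip(window, pattern) if p == "1")
    rc = reverse_complement(window)
    rev = "".join(c for c, p in zip(rc, pattern) if p == "1")
    return min(fwd, rev)


def seed_matches(seq, pattern):
    """Hash the seed at every position; keep seeds seen two or more times."""
    table = defaultdict(list)
    span = len(pattern)
    for pos in range(len(seq) - span + 1):
        table[seed_key(seq[pos:pos + span], pattern)].append(pos)
    return {key: hits for key, hits in table.items() if len(hits) >= 2}
```

Because the pattern reads the same in both directions, a window and its reverse complement always produce the same key, so a single pass over one strand is enough to find matches on both.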
Given an initial set of matching sequence regions, our algorithm then maximally extends each match to cover the entire surrounding region of sequence identity. A visual example of maximal extension is given by the black match in Figure 2. In order to extend over each region of sequence O(1) times, our method extends matches in order of decreasing multiplicity: we extend the highest multiplicity matches first. When a match can no longer be extended without including a gap larger than w characters, our method identifies the neighboring subset matches within w characters, i.e. the light gray seed in Figure 2. We then link each neighboring subset match to the extended match. We refer to the extended match as a superset match. Rather than immediately extend the subset match(es), we procrastinate and extend the subset match later when it has the highest multiplicity of any match waiting to be extended. When extending a match with a linked superset (light gray in Figure 2), we immediately include the entire region covered by the linked superset match, obviating the need to re-examine sequence already covered by a previous match extension.

Weight  Pattern
5       11*1*11
6       1*11***11*1
7       11**1*1*1**11
8       111**1**1**111
9       111*1**1**1*111
10      111*1**1*1**1*111
11      1111**1*1*1**1111
12      1111**1*1*1*1**1111
13      1111**1**1*1*1**1**1111
14      1111**11*1*1*11**1111
15      1111*1*11**1**11*1*1111
16      1111*1*11**11**11*1*1111
18      11111**11*1*11*1*11**11111
19      1111*111**1*111*1**111*1111
20      11111*1*11**11*11**11*1*11111
21      11111*111*11*1*11*111*11111
[Seed rank columns for 65%, 70%, 75%, 80%, 85%, and 90% sequence identity are not reproduced here.]
Table 1. Palindromic spaced seeds used by procrastAligner. The sensitivity ranking of a seed at various levels of sequence identity is given in the columns at right. A seed with rank 1 is the most sensitive seed pattern for a given weight and percent sequence identity. The default seeds used by procrastAligner are listed here, while the full list of high-ranking seeds appears on the website.
We score alignments generated by our method using the entropy equation and exact p-value method in [25]. Our method may produce many hundreds or thousands of local multiple alignments for a given genome sequence, thus it is important to rank them by significance. When computing column entropy, we treat gap characters as missing data.
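The treatment of gaps as missing data can be illustrated with a small sketch. It uses plain Shannon entropy over the nucleotides observed in a column, which is an assumption made for illustration; the exact entropy equation and p-value computation of [25] are not reproduced here. The function name `column_entropy` is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of one alignment column.

    Gap characters are treated as missing data: they are dropped before
    frequencies are computed, so they neither add to nor dilute the score.
    """
    counts = Counter(c for c in column if c != gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # a column of only gaps carries no information
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this convention a column such as "AC--" scores the same as "AC": the two gapped rows are simply ignored.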
[Figure 2: an example sequence (ACGGATTAGAT...) with rows labelled Sequence, Seed Matches, Maximal extension of black seed, and Subset link to light gray seed.]
Fig. 2. Seed match extension. Three seed matches are depicted as black, gray, and light gray regions of the sequence. Black and gray have multiplicity 3, while light gray has multiplicity 2. We maximally extend the black seed to the left and right and in doing so, the black seed chains with the gray seed to the left. The light gray seed is adjacent to only two out of three components in the extended black seed. We procrastinate and extend the light gray seed later. We create a link between light gray and the extended black seed match.
3 Algorithm
3.1 Notation and Assumptions
Given a sequence $S = s_1, s_2, \ldots, s_N$ of length $N$ defined over an alphabet $\{A, C, G, T\}$, our goal is to identify local multiple alignments on subsequences of $S$. Our filtration method first generates candidate chains of ungapped alignments, which are later scored and possibly realigned. Denote an ungapped alignment, or match, among subsequences in $S$ as an object $M$. We assume as input a set of ungapped alignments $\mathcal{M}$. We refer to the number of regions in $S$ matched by a given match $M_i \in \mathcal{M}$ as the multiplicity of $M_i$, denoted $|M_i|$. We refer to each matching region of $M_i$ as a component of $M_i$. Note that $|M_i| \geq 2 \;\forall\, M_i \in \mathcal{M}$. We denote the left-end coordinates in $S$ of each component of $M_i$ as $M_i.L_1, M_i.L_2, \ldots, M_i.L_{|M_i|}$, and similarly we denote the right-end coordinates as $M_i.R_x$. When aligning DNA sequences, matches may occur on the forward or reverse complement strands. To account for this phenomenon we add an orientation value to each matching region: $M_i.O_x \in \{1, -1\}$, where $1$ indicates a forward strand match and $-1$ a reverse strand match.
Our algorithm has an important limitation on the matches in $\mathcal{M}$: no two matches $M_i$ and $M_j$ may have the same left-end coordinate, i.e. $M_i.L_x \neq M_j.L_y \;\forall\, i, j, x, y$ except for the identity case when $i = j$ and $x = y$. This constraint has been referred to by others as consistency and transitivity [26] of matches. In the present work we only require consistency and transitivity of matches longer than the seed length, i.e. seed matches may overlap.
3.2 Data structures
Our algorithm begins with an initialization phase that creates three data structures. The first data structure is a set of Match Records, one for each match $M \in \mathcal{M}$. The Match Record stores $M$, a unique identifier for $M$, and two items which will be described later in Section 3.3: a set of linked match records, and a subsuming match pointer. The linked match records are further subdivided into four classes: a left and right superset link, and left and right subset links. The subsuming match pointer is initially set to a NULL value. Figure 3 shows a schematic of the match record.
We refer to the second data structure as a Match Position Lookup Table, or $P$. The table has $N$ entries $p_1, p_2, \ldots, p_N$, one per character of $S$. The entry $p_t$ stores the unique identifier of the match $M_i$ and the component $x$ for which $M_i.L_x = t$, or the NULL identifier if no match has $t$ as a left-end coordinate. We call the third data structure a Match extension procrastination queue, or simply the procrastination queue. Again, we denote the multiplicity of a match $M$ by $|M|$. The procrastination queue is a binary heap of matches ordered on $|M|$, with higher values of $|M|$ appearing near the top of the heap. The heap is initially populated with all $M \in \mathcal{M}$. This queue dictates the order in which matches will be considered for extension.
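The three data structures can be sketched as follows. This is an illustrative layout under stated assumptions, not the layout used in procrastAligner: the class and field names (`MatchRecord`, `subsumed_by`, `build_structures`) are hypothetical, coordinates are 0-based rather than 1-based, and Python's `heapq` min-heap is turned into a max-heap on multiplicity by negating the key.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MatchRecord:
    match_id: int
    left: list    # left-end coordinate of each component
    right: list   # right-end coordinate of each component
    orient: list  # +1 (forward) or -1 (reverse) for each component
    left_superset: Optional[tuple] = None
    right_superset: Optional[tuple] = None
    left_subsets: list = field(default_factory=list)
    right_subsets: list = field(default_factory=list)
    subsumed_by: Optional[int] = None  # subsuming match pointer, initially NULL

    @property
    def multiplicity(self):
        return len(self.left)


def build_structures(matches, n):
    """Return (match records, position lookup table P, procrastination queue)."""
    records = {m.match_id: m for m in matches}
    lookup = [None] * n  # one entry per character of S
    for m in matches:
        for x, pos in enumerate(m.left):
            lookup[pos] = (m.match_id, x)  # match id and component index
    # Negated multiplicity so the highest-multiplicity match is popped first.
    queue = [(-m.multiplicity, m.match_id) for m in matches]
    heapq.heapify(queue)
    return records, lookup, queue
```

Storing only the match identifier in the queue keeps heap entries small and lets the subsuming match pointer be checked at pop time.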
3.3 Extending Matches
Armed with the three aforementioned data structures, our algorithm begins the chaining process with the match at the front of the procrastination queue. For a match $M_i$ that has not been subsumed, the algorithm first attempts extension to the left, then to the right. Extension in each direction is done separately in an identical manner and we arbitrarily choose to describe leftward extension first. The first step in leftward match extension for $M_i$ is to check whether it has a left superset link. If so, we perform a link extension as described later. For extension of $M_i$ without a superset link, we use the Match Position Lookup Table $P$ to enumerate all matches within a fixed distance $w$ of $M_i$. For each component $x = 1, 2, \ldots, |M_i|$ and distance $d = 1, 2, \ldots, w$ we first evaluate whether $p_{M_i.L_x - (d \cdot M_i.O_x)}$ is NULL. If it is not NULL, then $p_{M_i.L_x - (d \cdot M_i.O_x)}$ stores an entry $\langle M_j, y \rangle$ which is a pointer to neighboring match $M_j$ and the matching component $y$ of $M_j$.
In order to consider matches on both forward and reverse strands, we must evaluate whether $M_i.O_x$ and $M_j.O_y$ are consistent with each other. We define the relative orientation of $M_i.O_x$ and $M_j.O_y$ as $o_{i,j,x,y} = M_i.O_x \cdot M_j.O_y$, which gives $o_{i,j,x,y} = 1$ if both $M_i.O_x$ and $M_j.O_y$ match the same strand and $-1$ otherwise. We create a tuple of the form $\langle M_j, o_{i,j,x,y}, x, d, y \rangle$ and add it to a list called the neighborhood list. In other words, the tuple stores (1) the unique match ID of the match with a left-end at sequence coordinate $M_i.L_x - (d \cdot M_i.O_x)$, (2) the relative orientation of $M_i.O_x$ and $M_j.O_y$, (3) the matching component $x$ of $M_i$, (4) the distance $d$ between $M_i$ and $M_j$, and (5) the matching component $y$ of $M_j$. If $M_j = M_i$ for a given value of $d$, we stop adding neighborhood list entries after processing that one. The neighborhood list is then scanned to identify groups of entries with the same match ID $M_j$ and relative orientation $o_{i,j,x,y}$. We refer to such groups as neighborhood groups. Entries in the same neighborhood group that have identical $x$ or $y$ values are considered "ties" and need to be broken. Ties are resolved by discarding the entry with the larger value of $d$ in the fourth tuple element: we prefer to chain over shorter distances.
[Figure 3: schematic of the Match Record List, the Procrastination Queue, and the fields of a match record (left and right subset and superset links, subsuming match pointer), with panels (A) to (D) showing successive extension steps and the resulting local multiple alignment chain.]
Fig. 3. The match extension process and associated data structures. (A) First we pop the match at the front of the procrastination queue, $M_1$, and begin its leftward extension. Starting with the leftmost position of $M_1$, we use the Match Position Lookup Table to enumerate every match with a left-end within some distance $w$. Only $M_4.L_1$ is within $w$ of $M_1$, so it forms a singleton neighborhood group which we discard. (B) $M_1$ has no neighborhood groups to the left, so we begin extending $M_1$ to the right. We enumerate all matches within $w$ to the right of $M_1$. $M_2$ lies to the right of 3 of 4 components of $M_1$ and so is not subsumed, but instead gets linked as a right-subset of $M_1$. We add a left-superset link from $M_2$ to $M_1$. (C) Once finished with $M_1$ we pop $M_2$ from the front of the procrastination queue and begin leftward extension. We find the left-superset link from $M_2$ to $M_1$, so we extend the left-end coordinates of $M_2$ to cover $M_1$ accordingly. No further leftward extension of $M_2$ is possible because $M_1$ has no left-subset links. (D) Beginning rightward extension on $M_2$ we construct a neighborhood list and find a chainable match $M_3$, and a subset $M_4$. We extend $M_2$ to include $M_3$ and mark $M_4$ as inconsistent and hence not extendable. Upon completion of the chaining process we have generated a list of local multiple alignments.
After tie-breaking, each neighborhood group falls into one of several categories:
– Superset: The neighborhood group contains $|M_i|$ separate entries. $M_j$ has higher multiplicity than $M_i$, i.e. $|M_j| > |M_i|$. We refer to $M_j$ as a superset of $M_i$.
– Chainable: The neighborhood group contains $|M_i|$ separate entries. $M_j$ and $M_i$ have equal multiplicity, i.e. $|M_j| = |M_i|$. We can chain $M_j$ and $M_i$.
– Subset: The neighborhood group contains $|M_j|$ separate entries such that $|M_j| < |M_i|$. We refer to $M_j$ as a subset of $M_i$.
– Novel Subset: The neighborhood group contains $r$ separate entries such that $r < |M_i| \wedge r < |M_j|$. We refer to the portion of $M_j$ in the list as a novel subset of $M_i$ and $M_j$ because this combination of matching positions does not exist as a match in the initial set of matches $\mathcal{M}$.
The algorithm considers each neighborhood group for chaining in the order given above: chainable, subset, and finally, novel subset. Superset groups are ignored, as any superset links would have already been created when processing the superset match.
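The neighborhood list construction and the four-way classification can be sketched together. This is a simplified illustration under several assumptions that go beyond the text: matches are plain dictionaries with `"L"` (left-ends) and `"O"` (orientations), the lookup table is a dictionary from coordinate to `(match id, component)`, components are numbered from 0, singleton groups are discarded as in the Figure 3 caption, and groups where $M_j = M_i$ are skipped because they belong to the tandem repeat case. All function names are hypothetical.

```python
from collections import defaultdict


def left_neighborhood(i, matches, lookup, w):
    """Tuples (j, o, x, d, y) for left-ends found within w to the left of match i."""
    entries = []
    mi = matches[i]
    for x, (lx, ox) in enumerate(zip(mi["L"], mi["O"])):
        for d in range(1, w + 1):
            hit = lookup.get(lx - d * ox)
            if hit is None:
                continue
            j, y = hit
            entries.append((j, ox * matches[j]["O"][y], x, d, y))
            if j == i:
                break  # assumption: stop scanning this component after meeting M_i
    return entries


def classify_groups(i, matches, entries):
    """Group entries by (match id, relative orientation), break ties, classify."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry[0], entry[1])].append(entry)
    result = {}
    for (j, o), group in groups.items():
        if j == i:
            continue  # self-neighborhoods are the tandem repeat case
        group.sort(key=lambda e: e[3])  # smallest distance d first
        by_x = {}
        for e in group:
            by_x.setdefault(e[2], e)  # keep the nearest entry for each x
        by_y = {}
        for e in by_x.values():
            by_y.setdefault(e[4], e)  # keep the nearest entry for each y
        r = len(by_y)
        if r < 2:
            continue  # singleton groups are discarded
        mult_i, mult_j = len(matches[i]["L"]), len(matches[j]["L"])
        if r == mult_i and mult_j > mult_i:
            label = "superset"
        elif r == mult_i and mult_j == mult_i:
            label = "chainable"
        elif r == mult_j and mult_j < mult_i:
            label = "subset"
        else:
            label = "novel subset"
        result[(j, o)] = (label, sorted((e[2], e[4]) for e in by_y.values()))
    return result
```

The returned component pairs `(x, y)` are what the chaining and linking steps would consume next.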
Chainable matches. To chain match $M_i$ with chainable match $M_j$ we first update the left-end coordinates of $M_i$ by assigning $M_i.L_x \leftarrow \min(M_i.L_x, M_j.L_y)$ for each $\langle i, j, x, y \rangle$ in the neighborhood group entries. Similarly, we update the right-end coordinates: $M_i.R_x \leftarrow \max(M_i.R_x, M_j.R_y)$ for each $\langle i, j, x, y \rangle$ in the group. If any of the coordinates in $M_i$ change we make note that a chainable match has been chained. We then update the Match Record for $M_j$ by setting its subsuming match pointer to $M_i$, indicating that $M_j$ is now invalid and is subsumed by $M_i$. Any references to $M_j$ in the Match Position Lookup Table and elsewhere may be lazily updated to point to $M_i$ as they are encountered. If $M_j$ has a left superset link, the link is inherited by $M_i$ and any remaining neighborhood groups with chainable matches are ignored. Chainable groups are processed in order of increasing $d$ value so that the nearest chainable match with a superset link will be encountered first. A special case exists when $M_i = M_j$. This occurs when $M_i$ represents an inverted repeat within $w$ nucleotides. We never allow $M_i$ to chain with itself.
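The coordinate update and the lazy pointer resolution can be sketched as below. The record layout (dictionaries with `"id"`, `"L"`, `"R"`, `"subsumed_by"`, `"left_superset"`) and the function names `chain` and `resolve` are assumptions made for this illustration only.

```python
def chain(mi, mj, pairs):
    """Chain match mj into mi over the component pairs (x, y).

    Returns True if any coordinate of mi changed.
    """
    if mi["id"] == mj["id"]:
        return False  # a match is never chained with itself
    changed = False
    for x, y in pairs:
        new_left = min(mi["L"][x], mj["L"][y])
        new_right = max(mi["R"][x], mj["R"][y])
        if (new_left, new_right) != (mi["L"][x], mi["R"][x]):
            mi["L"][x], mi["R"][x] = new_left, new_right
            changed = True
    mj["subsumed_by"] = mi["id"]  # mj is now invalid and subsumed by mi
    if mj.get("left_superset") is not None:
        mi["left_superset"] = mj["left_superset"]  # inherit the superset link
    return changed


def resolve(records, match_id):
    """Follow subsuming match pointers lazily to the match that is still valid."""
    while records[match_id]["subsumed_by"] is not None:
        match_id = records[match_id]["subsumed_by"]
    return match_id
```

A stale entry in the lookup table can then be corrected on demand by calling `resolve` on the identifier it stores.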
Subset matches. We defer subset match processing until no more chainable matches exist in the neighborhood of $M_i$. A subset match $M_j$ is considered to be completely contained by $M_i$ when for all $x, y$ pairs in the neighborhood group, $M_i.L_x \leq M_j.L_y \wedge M_j.R_y \leq M_i.R_x$. When subset match $M_j$ is completely contained by $M_i$, we set the subsuming match pointer of $M_j$ to $M_i$. If the subset match is not contained we create a link from $M_i$ to $M_j$. The subset link is a tuple of the form $\langle M_i, M_j, x_1, x_2, \ldots, x_{|M_j|} \rangle$ where the variables $x_1, \ldots, x_{|M_j|}$ are the $x$ values associated with $y = 1, \ldots, |M_j|$ from the neighborhood list group entries. The link is added to the left subset links of $M_i$ and we remove any pre-existing right superset link in $M_j$ and replace it with the new link.
Novel subset matches. A novel subset may only be formed when both $M_i$ and $M_j$ have already been maximally extended; otherwise we discard any novel subset matches. When a novel subset match exists we create a new match record $M_{novel}$ with left- and right-ends equal to the outward boundaries of $M_i$ and $M_j$. Rather than extend the novel subset match immediately, we procrastinate and place the novel subset in the procrastination queue. Recall that the novel subset match contains $r$ matching components of $M_i$ and $M_j$. In constructing $M_{novel}$, we create links between $M_{novel}$ and each of $M_i$ and $M_j$ such that $M_{novel}$ is a left and a right subset of $M_i$ and $M_j$, respectively. The links are tuples of the form outlined in the previous section on subset matches.
[Figure 4: a sequence with the components of the black match labelled 1 to 7 and the neighborhood of size w drawn next to component 7.]
Fig. 4. Interplay between tandem repeats and novel subset matches. There are two initial seed matches, one black, one gray. The black match has components labelled 1 to 7, and the neighborhood size $w$ is shown with respect to component 7. As we attempt leftward extension of the black match we discover the gray match in the neighborhood of components 2 and 5 of black. A subset link is created. We also discover that some components of the black match are within each other's neighborhood. We classify the black match as a tandem repeat and construct a novel subset match with one component for each of the four tandem repeat units: {1}, {2,3,4}, {5,6}, {7}.
Occasionally a neighborhood group representing a novel subset match may have $M_i = M_j$. This can occur when $M_i$ has two or more components that form a tandem or overlapping repeat. If $M_i.L_x$ has $M_i.L_y$ in its neighborhood, and $M_i.L_y$ has $M_i.L_z$ in its neighborhood, then we refer to $\{x, y, z\}$ as a tandem unit of $M_i$. A given tandem unit contains between one and $|M_i|$ components of $M_i$, and the set of tandem units forms a partition on the components of $M_i$. In this situation we construct a novel subset match record with one component for each tandem unit of $M_i$. If $M_i$ has only a single tandem unit then we continue without creating a novel subset match record. Figure 4 illustrates how we process tandem repeats.
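The partition into tandem units amounts to grouping components whose left-ends lie within each other's neighborhood, transitively. A minimal sketch follows, assuming forward-strand components only, 0-based component numbers, and a hypothetical function name `tandem_units`; the coordinates in the usage note are invented to mimic the layout of Figure 4.

```python
def tandem_units(left_ends, w):
    """Partition component indices into tandem units.

    Two components fall in the same unit when their left-ends are within w
    of each other, directly or through a chain of intermediate components.
    """
    if not left_ends:
        return []
    order = sorted(range(len(left_ends)), key=lambda x: left_ends[x])
    units = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if left_ends[cur] - left_ends[prev] <= w:
            units[-1].append(cur)  # still inside the running neighborhood chain
        else:
            units.append([cur])    # gap larger than w starts a new unit
    return units
```

With seven left-ends at 0, 100, 105, 110, 200, 206, and 300 and w = 10, the function returns four units that correspond to {1}, {2,3,4}, {5,6}, {7} in the 1-based numbering of Figure 4.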
After the first round of chaining. If the neighborhood list contained one or more chainable groups we enter another round of extending $M_i$. The extension process repeats starting with either link extension or by construction of a new neighborhood list. When the boundaries of $M_i$ no longer change, we classify any subset matches as either subsumed or outside of $M_i$ and treat them accordingly. We process novel subsets. Finally, we may begin extension in the opposite (rightward) direction. The rightward extension is accomplished in a similar manner, except that the neighborhood is constructed from $M_i.R_x$ instead of $M_i.L_x$, $d$ ranges over $-1, -2, \ldots, -w$, and ties are broken in favor of the largest $d$ value. Where left links were previously used, right links are now used and vice versa.
Chaining the next match. When the first match popped from the procrastination queue has been maximally extended, we pop the next match from the procrastination queue and consider it for extension. The process repeats until the procrastination queue is empty. Prior to extending any match removed from the procrastination queue, we check the match's subsuming match pointer. If the match has been subsumed, extension is unnecessary.
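The outer loop over the procrastination queue can be sketched as follows. The extension step itself is passed in as a callback, since it is described elsewhere in this section; the function name `process_queue`, the record layout, and the negated-multiplicity heap key are assumptions made for this illustration.

```python
import heapq


def process_queue(queue, records, extend):
    """Pop matches in order of decreasing multiplicity and extend each one.

    `queue` holds (-multiplicity, match_id) pairs in heapq order. `extend`
    is a caller-supplied function that may mark other records as subsumed
    and may push novel subset matches onto `queue`.
    Returns the ids of the matches that were actually extended.
    """
    extended = []
    while queue:
        _, match_id = heapq.heappop(queue)
        record = records[match_id]
        if record["subsumed_by"] is not None:
            continue  # already covered by another match; nothing to do
        extend(record)
        extended.append(match_id)
    return extended
```

Checking the pointer at pop time means subsumed matches never have to be removed from the middle of the heap.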
3.4 Link extension
To be considered for leftward link extension, $M_i$ must have a left superset link to another match, $M_j$. We first extend the boundaries of $M_i$ to include the region covered by $M_j$ and unlink $M_i$ from $M_j$. Then each of the left subset links in $M_j$ is examined in turn to identify links that $M_i$ may use for further extension. Recall that the link from $M_i$ to $M_j$ is of the form $\langle M_j, M_i, x_1, \ldots, x_{|M_i|} \rangle$. Likewise, a left subset link from $M_j$ to another match $M_k$ is of the form $\langle M_j, M_k, z_1, \ldots, z_{|M_k|} \rangle$. To evaluate whether $M_i$ may follow a given link in the left subsets of $M_j$, we take the set intersection of the $x$ and $z$ values for each $M_k$ that is a left subset of $M_j$. We can classify the results of the set intersection as:
– Superset: $\{x_1, \ldots, x_{|M_i|}\} \subset \{z_1, \ldots, z_{|M_k|}\}$. Here $M_k$ links to every component of $M_j$ that is linked by $M_i$, in addition to others.
– Chainable: $\{x_1, \ldots, x_{|M_i|}\} = \{z_1, \ldots, z_{|M_k|}\}$. Here $M_k$ links to the same set of components of $M_j$ that $M_i$ links.
– Subset: $\{x_1, \ldots, x_{|M_i|}\} \supset \{z_1, \ldots, z_{|M_k|}\}$. Here $M_i$ links to every component of $M_j$ that is linked by $M_k$, in addition to others.
– Novel Subset: $\{x_1, \ldots, x_{|M_i|}\} \cap \{z_1, \ldots, z_{|M_k|}\} \neq \emptyset$. Here $M_k$ is neither a superset, chainable, nor subset relative to $M_i$, but the intersection of their components in $M_j$ is non-empty. $M_k$ and $M_i$ form a novel subset.
Left subset links in $M_j$ are processed in the order given above. Supersets are never observed, because $M_k$ would have already unlinked itself from $M_j$ when it was processed (as described momentarily). When $M_k$ is a chainable match, we extend $M_i$ to include the region covered by $M_k$ and set the subsuming match pointer in $M_k$ to point to $M_i$. We unlink $M_k$ from $M_j$, and $M_i$ inherits any left superset link that $M_k$ may have. When $M_k$ is a subset of $M_i$ we unlink $M_k$ from $M_j$ and add it to the deferred subset list to be processed once $M_i$ has been fully extended. Finally, we never create novel subset matches during link extension because $M_k$ will never be a fully extended match.
If a chainable match was found during leftward link extension, we continue for another round of leftward extension. If not, we switch directions and begin rightward extension.
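The set comparison used during link extension maps directly onto Python's set operators. The sketch below is illustrative only: the function name `classify_link` is hypothetical, and the extra "unrelated" label for an empty intersection is an assumption added so that every input receives a label.

```python
def classify_link(x_components, z_components):
    """Classify M_k relative to M_i from the components of M_j each links to.

    x_components: components of M_j linked by M_i.
    z_components: components of M_j linked by M_k.
    """
    xs, zs = set(x_components), set(z_components)
    if xs == zs:
        return "chainable"
    if xs < zs:  # proper subset: M_k links to everything M_i does, and more
        return "superset"
    if xs > zs:  # proper superset: M_i links to everything M_k does, and more
        return "subset"
    if xs & zs:
        return "novel subset"
    return "unrelated"  # assumption: no shared components, nothing to follow
```

For example, `classify_link([1, 2], [2, 3])` yields "novel subset", since the two links share component 2 but neither contains the other.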
3.5 Time complexity
A neighborhood list may be constructed at most $w$ times per character of $S$, and construction uses sorting by key comparison, giving $O(wN \log wN)$ time and space. Similarly, we spend $O(wN \log wN)$ time performing link extension. The upper bound on the total number of components in the final set of matches is $O(wN)$. Thus, the overall time complexity for our filtration algorithm is $O(wN \log wN)$.
4 Results
We have created a program called procrastAligner for Linux, Windows, and Mac OS X that implements the described algorithm. Our open-source implementation is available as C++ source code licensed under the GPL.
We compare the performance of our method in finding Alu repeats in the human genome to an Eulerian path method for local multiple alignment [23]. The focus of our algorithm is efficient filtration, thus we use a scoring metric that evaluates the filtration sensitivity and specificity of the ungapped alignment chains produced by our method. We compute sensitivity as the number of Alu elements hit by a match, out of the total number of Alu elements. We compute specificity as the ratio of match components that hit an Alu to the sum of match multiplicity for all matches that hit an Alu. Thus, we do not penalize our method for finding legitimate repeats that are not in the Alu family.
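The two filtration scores defined above can be computed as in the following sketch. Several details are assumptions made for illustration: Alu annotations and match components are closed intervals `(start, end)`, a component "hits" an Alu when the two intervals overlap by at least one position, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def alu_sensitivity_specificity(alus, matches):
    """Return (sensitivity, specificity) of match chains against Alu annotations.

    alus: list of (start, end) intervals.
    matches: list of matches, each a list of component (start, end) intervals.
    """
    hit_alus = set()
    hitting_components = 0
    multiplicity_sum = 0
    for components in matches:
        hits = [any(overlaps(c, a) for a in alus) for c in components]
        if not any(hits):
            continue  # matches that hit no Alu are not penalized
        hitting_components += sum(hits)
        multiplicity_sum += len(components)
        for index, a in enumerate(alus):
            if any(overlaps(c, a) for c in components):
                hit_alus.add(index)
    sensitivity = len(hit_alus) / len(alus) if alus else 0.0
    specificity = hitting_components / multiplicity_sum if multiplicity_sum else 0.0
    return sensitivity, specificity
```

A match with no component on an Alu contributes to neither score, which reflects the statement that legitimate non-Alu repeats are not penalized.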
The comparison between procrastAligner and the Eulerian method is necessarily indirect, as each method was designed to solve different (but related) problems. The Eulerian method uses a de Bruijn graph for filtration, but goes beyond filtration to compute gapped alignments using banded dynamic programming. We report scores for a version of the Eulerian method that computes alignments only on regions identified by its de Bruijn filter. The results suggest that by using our filtration method, the sensitivity of the Eulerian path local multiple aligner could be significantly improved. A second important distinction is that our method reports all local multiple alignment chains in its allotted runtime, whereas the Eulerian method identifies only a single alignment.
We also test the ability of our method to provide accurate anchors for genome alignment. Using a manually curated alignment of 144 Hepatitis C virus genome sequences [27], we measure the anchoring sensitivity of our method as the fraction of pairwise positions aligned in the correct alignment that are also present in procrastAligner chains. We measure positive predictive value (PPV) as the number of match component pairs that contain correctly aligned positions out of the total number of match component pairs. procrastAligner may generate legitimate matches in the repeat regions of a single genome. The PPV score penalizes procrastAligner for identifying such legitimate repeats, which subsequent genome alignment would have to disambiguate. Using a seed size of 9 and w = 27, procrastAligner has a sensitivity of 63% and PPV of 67%.
5Discussion
We have described an efficient method for local multiple alignment filtration. The
chains of ungapped alignments that our filter outputs may serve as direct input
to multiple genome alignment algorithms. The test results of our prototype
implementation on Alu sequences demonstrate improved sensitivity over de Bruijn
filtration. A promising avenue of further research will be to couple our filtration
method with subsequent refinement using banded dynamic programming.
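To make the idea of banded refinement concrete, the following is a minimal sketch of banded global alignment scoring between two sequences. It is not part of procrastAligner; the function name, the unit scoring parameters, and the choice of a band centered on the main diagonal are assumptions for illustration. Only cells with |i - j| <= band are filled, so work grows with the band width rather than with the full product of the sequence lengths.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b using only cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")

    # Row 0: aligning the empty prefix of a against prefixes of b (all gaps).
    prev = {j: j * gap for j in range(0, min(m, band) + 1)}

    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                # Column 0: prefix of a against the empty prefix of b.
                cur[j] = i * gap
                continue
            # Diagonal move: the cell (i-1, j-1) is always inside the band here.
            step = match if a[i - 1] == b[j - 1] else mismatch
            best = prev[j - 1] + step
            # Vertical move (gap in b), only if (i-1, j) lies inside the band.
            if j in prev:
                best = max(best, prev[j] + gap)
            # Horizontal move (gap in a), only if (i, j-1) lies inside the band.
            if j - 1 in cur:
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur

    return prev[m]
```

In a filtration setting, the band would typically be placed around the diagonal implied by a chain of ungapped matches, so that dynamic programming is spent only where the filter already found evidence of homology.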
The alignment scoring scheme we use can rank alignments by information
content; however, a biological interpretation of the score remains difficult. If a
phylogeny and model of evolution for the sequences were known a priori, then a
biologically relevant scoring scheme could be used [28]. Unfortunately, the
phylogenetic relationship for arbitrary local alignments is rarely known, especially
among repetitive elements or gene families within a single genome and across
Accession  Length  Rep  Family  Alu (bp)  Div, %      Method    Sn %  Sp %  T (s)  Sw  w
AF435921   22 Kb    28      10  261 (69)  15.0 (6.4)  Eulerian  96.3  99.4      1
                                                      procrast  100   95.9      1   9  27
Z15025     38 Kb    52      13  245 (85)  15.7 (5.7)  Eulerian  98.6  96.7      4
                                                      procrast  100   82.5      2   9  27
AC034110   167 Kb   87      18  261 (72)  12.2 (5.9)  Eulerian  93.5  95.2     14
                                                      procrast  100   97.9      3  15  45
AC010145   199 Kb  118      13  277 (55)  15.0 (5.6)  Eulerian  85.2  93.7     32
                                                      procrast  99.1  99.2      3  15  45
Hs Chr 22  1 Mbp   404      32  252 (79)  15.2 (6.1)  Eulerian  72.4  99.4     85
                                                      procrast  98.3  97.3     20  15  45
Table 2. Performance of procrastAligner and the Eulerian path approach on Alu
repeats. Rep: total number of Alu elements; Family: number of Alu families; Alu: average
Alu length in bp (S.D.); Div: average Alu divergence (S.D.); Sn: sensitivity; Sp:
specificity; T: compute time; Sw: palindromic seed weight; w: max gap size. Alus were
identified by RepeatMasker [22]. We report data for the fast version of the Eulerian
path method as given by Table 1 of [23]. Sensitivity and specificity of procrastAligner
were computed as described in the text.
genomes. It may be possible to use simulation and MCMC methods to score
alignments where the phylogeny and model of evolution are unknown a priori,
but doing so would be computationally prohibitive for our application.
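Ranking by information content can be illustrated with a small sketch. The exact scoring formula is not given in this section, so the example below uses one common entropy-based form, the per-column relative entropy against a background distribution, summed over columns. The uniform background of 0.25 per nucleotide, the absence of pseudocounts, and the decision to ignore gap and ambiguity characters are all assumptions of the example.

```python
import math
from collections import Counter


def column_information(column, background=None):
    """Relative entropy (bits) of one alignment column against a background."""
    background = background or {b: 0.25 for b in "ACGT"}
    # Count only characters that have a background frequency (skips gaps, N, ...).
    counts = Counter(c for c in column.upper() if c in background)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2((n / total) / background[b])
               for b, n in counts.items())


def alignment_information(rows):
    """Sum of column information over an alignment given as equal-length strings."""
    return sum(column_information("".join(col)) for col in zip(*rows))
```

With a uniform background, a perfectly conserved column scores 2 bits and a column with all four nucleotides at equal frequency scores 0 bits, so alignments with many conserved columns rank highest. Such a score says nothing about the evolutionary process that produced the columns, which is the limitation discussed above.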
6 Acknowledgements
AED was supported by NLM Training Grant 5T15LM00735905. TJT was
supported by Spanish Ministry MECD Grant TIN200403382 and AGAUR Training
Grant FIIQUC2005. LZ was supported by AFT Grant 146000068112.
References
1. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology
search. Bioinformatics 18 (2002) 440–445
2. Brudno, M., Morgenstern, B.: Fast and sensitive alignment of large genomic
sequences. Proc IEEE Comput Soc Bioinform Conf 1 (2002) 138–147
3. Noé, L., Kucherov, G.: Improved hit criteria for DNA local alignment. BMC
Bioinformatics 5 (2004)
4. Kahveci, T., Ljosa, V., Singh, A.K.: Speeding up whole-genome alignment by
indexing frequency vectors. Bioinformatics 20 (2004) 2122–2134
5. Choi, K.P., Zeng, F., Zhang, L.: Good spaced seeds for homology search.
Bioinformatics 20 (2004) 1053–1059
6. Li, M., Ma, B., Zhang, L.: Superiority and complexity of the spaced seeds. In:
SODA. (2006) 444–453
7. Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity
search. J Comput Biol 12 (2005) 847–861
8. Xu, J., Brown, D.G., Li, M., Ma, B.: Optimizing multiple spaced seeds for homology
search. In: CPM. (2004) 47–58
9. Flannick, J., Batzoglou, S.: Using multiple alignments to improve seeded local
alignment algorithms. Nucleic Acids Res 33 (2005) 4563–4577
10. Li, L., Stoeckert, C.J., Roos, D.S.: OrthoMCL: identification of ortholog groups
for eukaryotic genomes. Genome Res 13 (2003) 2178–2189
11. Jaffe, D.B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J.P.,
Zody, M.C., Lander, E.S.: Whole-genome sequence assembly for mammalian
genomes: Arachne 2. Genome Res 13 (2003) 91–96
12. Ane, C., Sanderson, M.: Missing the forest for the trees: phylogenetic compression
and its implications for inferring complex evolutionary histories. Syst Biol 54
(2005) I311–I317
13. Margulies, M., 55 other authors: Genome sequencing in microfabricated
high-density picolitre reactors. Nature 437 (2005) 376–380
14. Darling, A.C.E., Mau, B., Blattner, F.R., Perna, N.T.: Mauve: multiple alignment
of conserved genomic sequence with rearrangements. Genome Res 14(7) (2004)
1394–403.
15. Hohl, M., Kurtz, S., Ohlebusch, E.: Efficient multiple genome alignment.
Bioinformatics 18 Suppl 1 (2002) S312–20.
16. Treangen, T., Messeguer, X.: MGCAT: Multiple Genome Comparison and
Alignment Tool. Submitted (2006)
17. Dewey, C.N., Pachter, L.: Evolution at the nucleotide level: the problem of multiple
whole-genome alignment. Hum Mol Genet 15 Suppl 1 (2006)
18. Sammeth, M., Heringa, J.: Global multiple-sequence alignment with repeats.
Proteins (2006)
19. Raphael, B., Zhi, D., Tang, H., Pevzner, P.: A novel method for multiple alignment
of sequences with repeated and shuffled elements. Genome Res 14(11) (2004)
2336–46.
20. Edgar, R.C., Myers, E.W.: PILER: identification and classification of genomic
repeats. Bioinformatics 21 Suppl 1 (2005)
21. Kurtz, S., Ohlebusch, E., Schleiermacher, C., Stoye, J., Giegerich, R.: Computation
and visualization of degenerate repeats in complete genomes. Proc Intell Syst Mol
Biol 8 (2000) 228–38.
22. Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O., Walichiewicz,
J.: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet
Genome Res 110 (2005) 462–467
23. Zhang, Y., Waterman, M.S.: An Eulerian path approach to local multiple alignment
for DNA sequences. PNAS 102 (2005) 1285–90.
24. Siddharthan, R., Siggia, E.D., van Nimwegen, E.: PhyloGibbs: a Gibbs sampling
motif finder that incorporates phylogeny. PLoS Comput Biol 1 (2005)
25. Nagarajan, N., Jones, N., Keich, U.: Computing the P-value of the information
content from an alignment of multiple sequences. Bioinformatics 21 Suppl 1
(2005)
26. Szklarczyk, R., Heringa, J.: Tracking repeats using significance and transitivity.
Bioinformatics 20 Suppl 1 (2004) I311–I317
27. Kuiken, C., Yusim, K., Boykin, L., Richardson, R.: The Los Alamos hepatitis C
sequence database. Bioinformatics 21 (2005) 379–84
28. Prakash, A., Tompa, M.: Statistics of local multiple alignments. Bioinformatics
21 (2005) i344–i350