Conference Paper

Novel Computational Methods for Large Scale Genome Comparison.

DOI: 10.1007/978-3-540-85861-4_9 Conference: 2nd International Workshop on Practical Applications of Computational Biology and Bioinformatics, IWPACBB 2008, Salamanca, Spain, 22nd-24th October 2008
Source: DBLP


The current wealth of available genomic data provides an unprecedented opportunity to compare and contrast evolutionary histories
of closely and distantly related organisms. The focus of this dissertation is on developing novel algorithms and software
for efficient global and local comparison of multiple genomes and the application of these methods for a biologically relevant
case study. The thesis research is organized into three successive phases, specifically: (1) multiple genome alignment of
closely related species, (2) local multiple alignment of interspersed repeats, and finally, (3) a comparative genomics case
study of Neisseria. In Phase 1, we first develop an efficient algorithm and data structure for maximal unique match search in multiple genome
sequences. We implement these contributions in an interactive multiple genome comparison and alignment tool, M-GCAT, that
can efficiently construct multiple genome comparison frameworks in closely related species. In Phase 2, we present a novel
computational method for local multiple alignment of interspersed repeats. The method features a new approach to gapped
extension of chained seed matches, joining global multiple alignment with a homology
test based on a hidden Markov model (HMM). In Phase 3, using the results from the previous two phases, we perform a case study
of neisserial genomes by tracking the propagation of repeat sequence elements in an attempt to understand why the important pathogens
of the neisserial group have sexual exchange of DNA by natural transformation. In conclusion, our global contributions in
this dissertation have focused on comparing and contrasting evolutionary histories of related organisms via multiple alignment
of genomes.
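
As an illustration of the maximal unique match (MUM) concept from Phase 1, the following is a minimal brute-force sketch in Python. It is not the efficient algorithm or data structure developed in the dissertation, which the abstract does not describe. The sketch assumes the usual definition of a multi-sequence MUM: a substring that occurs exactly once in every sequence and cannot be extended to the left or right while still matching in all sequences. The names find_mums, count_overlapping, and min_len are hypothetical.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(seqs, min_len=3):
    """Brute-force search for maximal unique matches (illustrative only).

    Returns a list of (substring, positions) pairs, where positions[k]
    is the 0-based start of the match in seqs[k]. The running time is
    polynomial but far too slow for real genomes.
    """
    ref = seqs[0]
    mums = []
    for i in range(len(ref)):
        for j in range(i + min_len, len(ref) + 1):
            sub = ref[i:j]
            counts = [count_overlapping(s, sub) for s in seqs]
            if min(counts) == 0:
                break  # longer substrings starting at i cannot match either
            if any(c != 1 for c in counts):
                continue  # not unique in every sequence
            positions = [s.find(sub) for s in seqs]
            # The match extends left if every sequence has the same
            # character immediately before its occurrence.
            can_left = all(p > 0 for p in positions) and len(
                {s[p - 1] for s, p in zip(seqs, positions)}
            ) == 1
            # Likewise for the character immediately after it.
            can_right = all(
                p + len(sub) < len(s) for s, p in zip(seqs, positions)
            ) and len(
                {s[p + len(sub)] for s, p in zip(seqs, positions)}
            ) == 1
            if not can_left and not can_right:
                mums.append((sub, positions))
    return mums


if __name__ == "__main__":
    print(find_mums(["GATTACAGG", "CATTACATT", "TATTACAC"]))
```

Practical tools avoid this exhaustive enumeration by indexing the sequences, for example with suffix trees or suffix arrays; the sketch only makes the definition concrete.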

  • ABSTRACT: We study the computational complexity of two popular problems in multiple sequence alignment: multiple alignment with SP-score and multiple tree alignment. It is shown that the first problem is NP-complete and the second is MAX SNP-hard. The complexity of tree alignment with a given phylogeny is also considered.
    Journal of Computational Biology 02/1994; 1(4):337-48. DOI:10.1089/cmb.1994.1.337 · 1.74 Impact Factor
  • ABSTRACT: Due to recent advances in whole genome shotgun sequencing and assembly technologies, the financial cost of decoding an organism's DNA has been drastically reduced, resulting in a recent explosion of genomic sequencing projects. This increase in related genomic data will allow for in-depth studies of evolution in closely related species through multiple whole genome comparisons. To facilitate such comparisons, we present an interactive multiple genome comparison and alignment tool, M-GCAT, that can efficiently construct multiple genome comparison frameworks in closely related species. M-GCAT is able to compare and identify highly conserved regions in up to 20 closely related bacterial species in minutes on a standard computer, and as many as 90 (containing 75 cloned genomes from a set of 15 published enterobacterial genomes) in an hour. M-GCAT also incorporates a novel comparative genomics data visualization interface allowing the user to globally and locally examine and inspect the conserved regions and gene annotations. M-GCAT is an interactive comparative genomics tool well suited for quickly generating multiple genome comparison frameworks and alignments among closely related species. M-GCAT is freely available for download for academic and non-commercial use at:
    BMC Bioinformatics 02/2006; 7(1):433. DOI:10.1186/1471-2105-7-433 · 2.58 Impact Factor
  • ABSTRACT: We describe a new method (T-Coffee) for multiple sequence alignment that provides a dramatic improvement in accuracy with a modest sacrifice in speed as compared to the most commonly used alternatives. The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm. With T-Coffee we pre-process a data set of all pair-wise alignments between the sequences. This provides us with a library of alignment information that can be used to guide the progressive alignment. Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other. This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition. Here, we illustrate the power of the approach by using a combination of local and global pair-wise alignments to generate the library. The resulting alignments are significantly more reliable, as determined by comparison with a set of 141 test cases, than any of the popular alternatives that we tried. The improvement, especially clear with the more difficult test cases, is always visible, regardless of the phylogenetic spread of the sequences in the tests.
    Journal of Molecular Biology 10/2000; 302(1):205-217. DOI:10.1006/jmbi.2000.4042 · 4.33 Impact Factor