Conference Paper

Common Intervals of Two Sequences.

DOI: 10.1007/978-3-540-39763-2_2
Conference: Algorithms in Bioinformatics, Third International Workshop, WABI 2003, Budapest, Hungary, September 15-20, 2003, Proceedings
Source: DBLP

ABSTRACT Looking for subsets of genes that appear consecutively in two or more genomes is a useful approach to identifying clusters
of functionally associated genes. A possible formalization of this problem is to model the order in which the genes appear
in all the considered genomes as permutations of their order in the first genome, and to find k-tuples of contiguous subsets of these permutations consisting of the same elements: the common intervals. A drawback of this
approach is that it cannot take into account paralogous genes and internal genomic duplications (each element occurs
only once in a permutation). To handle these cases, we need to model the order of genes by sequences that are not necessarily permutations.

In this work, we study some properties of common intervals between two general sequences. We bound the maximum number of common
intervals between two sequences of length $n$ by $n^2$ and present an $O(n^2 \log n)$ time complexity algorithm to enumerate their whole set of common intervals. This complexity does not depend on the size
of the alphabets of the sequences.
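
To make the notion concrete, here is a minimal brute-force sketch in Python. It assumes that a common interval of two sequences is a set of symbols that is exactly the symbol set of some contiguous substring in each sequence. The function names `interval_sets` and `common_intervals` are illustrative only. This is not the $O(n^2 \log n)$ algorithm of the paper: it builds up to $n^2$ symbol sets per sequence and intersects them, which is slower in the worst case, but it can serve as a reference for small inputs.

```python
def interval_sets(seq):
    """Return the distinct symbol sets of all contiguous substrings of seq."""
    found = set()
    for start in range(len(seq)):
        current = set()
        # Extend the substring one symbol at a time and record its symbol set.
        for end in range(start, len(seq)):
            current.add(seq[end])
            found.add(frozenset(current))
    return found


def common_intervals(seq1, seq2):
    """Brute-force enumeration: symbol sets that occur as a substring's
    symbol set in both sequences (assumed definition, see lead-in)."""
    return interval_sets(seq1) & interval_sets(seq2)


if __name__ == "__main__":
    # Sequences may contain repeated symbols (e.g. paralogous genes).
    for interval in sorted(common_intervals("abcab", "cabd"), key=sorted):
        print(sorted(interval))
```

Because repeated symbols collapse into the same set, a sequence with duplications can yield far fewer distinct sets than it has substrings; each sequence of length $n$ has at most $n(n+1)/2$ substrings, so this sketch never reports more than that many sets.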

  • Source
    ABSTRACT: The focus of this paper is the problem of finding all nested common intervals of two general sequences. Depending on the treatment one wants to apply to duplicate genes, Blin et al. introduced three models to define nested common intervals of two sequences: the uniqueness, the free-inclusion, and the bijection models. We consider all three models. For the uniqueness and the bijection models, we give $O(n + N_{out})$-time algorithms, where $N_{out}$ denotes the size of the output. For the free-inclusion model, we give an $O(n^{1+\varepsilon} + N_{out})$-time algorithm, where $\varepsilon > 0$ is an arbitrarily small constant. We also present an upper bound on the size of the output for each model. For the uniqueness and the free-inclusion models, we show that $N_{out} = O(n^2)$. Let $C = \sum_{g \in \Gamma} o_1(g)\,o_2(g)$, where $\Gamma$ is the set of distinct genes, and $o_1(g)$ and $o_2(g)$ are, respectively, the numbers of copies of $g$ in the two given sequences. For the bijection model, we show that $N_{out} = O(Cn)$. In this paper, we also study the problem of finding all approximate nested common intervals of two sequences on the bijection model. An $O(\delta n + N_{out})$-time algorithm is presented, where $\delta$ denotes the maximum number of allowed gaps. In addition, we show that for this problem $N_{out}$ is $O(\delta n^3)$.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 05/2012; · 1.62 Impact Factor
  • Source
    ABSTRACT: The automatic identification of syntenies across multiple species is a key step in comparative genomics that helps biologists shed light both on evolutionary and functional problems. In this paper, we present a versatile tool to extract all syntenies from multiple bacterial species based on a clear-cut and very flexible definition of the synteny blocks that allows for gene quorum, partial gene correspondence, gaps, and a partial or total conservation of the gene order. We apply this tool to two different kinds of studies. The first one is a search for functional gene associations. In this context, we compare our tool to a widely used heuristic--I-ADHORE--and show that at least up to ten genomes, the problem remains tractable with our exact definition and algorithm. The second application is linked to evolutionary studies: we verify in a multiple alignment setting that pairs of orthologs in synteny are more conserved than pairs outside, thus extending a previous pairwise study. We then show that this observation is in fact a function of the size of the synteny: the larger the block of synteny is, the more conserved the genes are.
    BMC Bioinformatics 01/2011; 12:193. · 3.02 Impact Factor
  • Source
    ABSTRACT: Common intervals have been defined as a model of gene clusters in genomes represented either as permutations or as sequences. Whereas optimal algorithms for finding common intervals in permutations exist even for an arbitrary number of permutations, no optimal algorithm has yet been proposed for sequences, even for only two sequences. Surprisingly enough, when sequences are reduced to permutations, the existing algorithms perform far from the optimum, showing that their performance does not depend, as it should, on the structural complexity of the input sequences. In this paper, we propose to characterize the structure of a sequence by the number $q$ of different dominating orders composing it (called the domination number), and to use a recent algorithm for permutations in order to devise a new algorithm for two sequences. Its running time is in $O(q_1q_2p+q_1n_1+q_2n_2+N)$, where $n_1, n_2$ are the sizes of the two sequences, $q_1,q_2$ are their respective domination numbers, $p$ is the alphabet size and $N$ is the number of solutions to output. This algorithm performs better as $q_1$ and/or $q_2$ decrease, and when the two sequences are reduced to permutations (i.e. when $q_1=q_2=1$) it has the same running time as the best algorithms for permutations. It is also the first algorithm for sequences whose running time involves the size of the solution as a parameter. As a counterpart, when $q_1$ and $q_2$ are in $O(n_1)$ and $O(n_2)$ respectively, the algorithm is less efficient than other approaches.
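
The first abstract in this list bounds the output size for the bijection model by $O(Cn)$, where $C$ is the sum over all distinct genes $g$ of the product of the copy numbers of $g$ in the two sequences. As a small illustration, the following Python sketch computes $C$ for two sequences given as strings or lists of gene labels. The function name `copy_number_product_sum` is a hypothetical helper introduced here, not something taken from that paper.

```python
from collections import Counter


def copy_number_product_sum(seq1, seq2):
    """Compute C = sum over distinct genes g of o1(g) * o2(g),
    where o1 and o2 count the copies of g in seq1 and seq2."""
    counts1 = Counter(seq1)
    counts2 = Counter(seq2)
    # A gene missing from one sequence has count 0 and contributes nothing.
    genes = counts1.keys() | counts2.keys()
    return sum(counts1[g] * counts2[g] for g in genes)


if __name__ == "__main__":
    print(copy_number_product_sum("aabc", "abbd"))
```

When both inputs are permutations of the same $n$ genes, every gene has one copy in each sequence, so $C = n$.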