Conference Paper

Common Intervals of Two Sequences

DOI: 10.1007/978-3-540-39763-2_2
Conference: Algorithms in Bioinformatics, Third International Workshop, WABI 2003, Budapest, Hungary, September 15-20, 2003, Proceedings
Source: DBLP


Looking for the subsets of genes that appear consecutively in two or more genomes is a useful approach to identifying clusters
of functionally associated genes. A possible formalization of this problem is to model the order in which the genes appear
in all the considered genomes as permutations of their order in the first genome, and to find k-tuples of contiguous subsets of these permutations consisting of the same elements: the common intervals. A drawback of this
approach is that it cannot take into account paralogous genes and internal genomic duplications (each element occurs
only once in a permutation). To handle these cases, we need to model the order of genes by sequences that are not necessarily permutations.

In this work, we study some properties of common intervals between two general sequences. We bound the maximum number of common
intervals between two sequences of length $n$ by $n^2$ and present an $O(n^2 \log n)$ time complexity algorithm to enumerate their whole set of common intervals. This complexity does not depend on the size
of the alphabets of the sequences.
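
The notion of a common interval of two sequences can be illustrated with a naive enumeration. The sketch below is not the algorithm of the paper: it assumes the set-based reading of the abstract, in which a common interval is a set of elements that is exactly the element set of some contiguous stretch of each sequence. The function names `interval_sets` and `common_intervals` are hypothetical and chosen only for this illustration.

```python
from typing import FrozenSet, Hashable, Sequence, Set


def interval_sets(seq: Sequence[Hashable]) -> Set[FrozenSet[Hashable]]:
    """Return the element sets of all contiguous stretches of seq."""
    found: Set[FrozenSet[Hashable]] = set()
    for start in range(len(seq)):
        current = set()
        for end in range(start, len(seq)):
            # A new set appears only when the stretch gains a new element,
            # so each start position contributes at most len(seq) sets.
            if seq[end] not in current:
                current.add(seq[end])
                found.add(frozenset(current))
    return found


def common_intervals(a: Sequence[Hashable],
                     b: Sequence[Hashable]) -> Set[FrozenSet[Hashable]]:
    """Return the element sets that occur as a contiguous stretch in both a and b."""
    return interval_sets(a) & interval_sets(b)


if __name__ == "__main__":
    # Two permutations of the same four genes.
    perm_a = [1, 2, 3, 4]
    perm_b = [2, 1, 4, 3]
    for interval in sorted(common_intervals(perm_a, perm_b), key=lambda s: (len(s), sorted(s))):
        print(sorted(interval))

    # Sequences with a duplicated gene (a paralog) are handled the same way.
    seq_a = [1, 2, 1, 3]
    seq_b = [3, 1, 1, 2]
    print(len(common_intervals(seq_a, seq_b)))
```

Because each start position yields at most $n$ distinct sets, this enumeration produces at most $n^2$ sets per sequence, which matches the bound stated above. Building every set explicitly makes it slower than the $O(n^2 \log n)$ algorithm the abstract describes.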

  • Source
    • "With these precisions, Didier's approach [8] consists then in considering each order $O_i$ and, in total time $O(n^2 \log n^2)$ (reducible to $O(n^2)$ according to [17]), verifying whether the intervals $Set(O_i[1..h])$ with $1 \le h \le ||O_i||$ are also intervals of $S$. Our approach avoids to consider each order $O_i$ by defining dominating orders which contain other orders, with the aim of focalising the search for common intervals on each dominating order rather than spreading it on each of the orders it dominates. We introduce now the supplementary notions needed by our algorithm. "
    ABSTRACT: Common intervals have been defined as a modelisation of gene clusters in genomes represented either as permutations or as sequences. Whereas optimal algorithms for finding common intervals in permutations exist even for an arbitrary number of permutations, in sequences no optimal algorithm has been proposed yet even for only two sequences. Surprisingly enough, when sequences are reduced to permutations, the existing algorithms perform far from the optimum, showing that their performances are not dependent, as they should be, on the structural complexity of the input sequences. In this paper, we propose to characterize the structure of a sequence by the number $q$ of different dominating orders composing it (called the domination number), and to use a recent algorithm for permutations in order to devise a new algorithm for two sequences. Its running time is in $O(q_1q_2p+q_1n_1+q_2n_2+N)$, where $n_1, n_2$ are the sizes of the two sequences, $q_1,q_2$ are their respective domination numbers, $p$ is the alphabet size and $N$ is the number of solutions to output. This algorithm performs better as $q_1$ and/or $q_2$ reduce, and when the two sequences are reduced to permutations (i.e. when $q_1=q_2=1$) it has the same running time as the best algorithms for permutations. It is also the first algorithm for sequences whose running time involves the parameter size of the solution. As a counterpart, when $q_1$ and $q_2$ are of $O(n_1)$ and $O(n_2)$ respectively, the algorithm is less efficient than other approaches.
    Journal of Discrete Algorithms 10/2013; 29. DOI:10.1016/j.jda.2014.10.004
  • Source
    • "In these applications, genomes may be represented either as permutations, when they do not contain duplicated genes, or as sequences. In sequences, duplicated genes usually play similar roles and lead to a more complex interval search [10] [20], but sometimes they are appropriately matched and renumbered so as to obtain permutations [9] [2]. "
    ABSTRACT: Common intervals of K permutations over the same set of n elements were first investigated by T. Uno and M. Yagiura (Algorithmica, 26:290-309, 2000), who proposed an efficient algorithm to find common intervals when K=2. Several particular classes of intervals have been defined since then, e.g. conserved intervals and nested common intervals, with applications mainly in genome comparison. Each such class, including common intervals, led to the development of a specific algorithmic approach for K=2, and - except for nested common intervals - for its extension to an arbitrary K. In this paper, we propose a common and efficient algorithmic framework for finding different types of common intervals in a set P of K permutations, with arbitrary K. Our generic algorithm is based on a global representation of the information stored in P, called the MinMax-profile of P, and an efficient data structure, called an LR-stack, that we introduce here. We show that common intervals (and their subclasses of irreducible common intervals and same-sign common intervals), nested common intervals (and their subclass of maximal nested common intervals) as well as conserved intervals (and their subclass of irreducible conserved intervals) may be obtained by appropriately setting the parameters of our algorithm in each case. All the resulting algorithms run in O(Kn+N)-time and need O(n) additional space, where N is the number of solutions. The algorithms for nested common intervals and maximal nested common intervals are new for K>2, in the sense that no other algorithm has been given so far to solve the problem with the same complexity, or better. The other algorithms are as efficient as the best known algorithms.
    Theoretical Computer Science 04/2013; 543. DOI:10.1016/j.tcs.2014.06.004 · 0.66 Impact Factor
  • Source
    • "Heber and Stoye [13] extended this work to find all common intervals of k permutations in $O(kn + N_{out})$ time, using $O(kn)$ space. In addition, Didier [9] extended this model to include paralogs by considering a sequence definition more general than a strict permutation, and gave an algorithm that finds all common intervals of two sequences in $O(n^2 \log n)$ time, using $O(n)$ space, on the extended model. Later, Schmidt and Stoye [23] improved this result to $O(n^2)$ time. "
    ABSTRACT: The focus of this paper is the problem of finding all nested common intervals of two general sequences. Depending on the treatment one wants to apply to duplicate genes, Blin et al. introduced three models to define nested common intervals of two sequences: the uniqueness, the free-inclusion, and the bijection models. We consider all the three models. For the uniqueness and the bijection models, we give $O(n + N_{out})$-time algorithms, where $N_{out}$ denotes the size of the output. For the free-inclusion model, we give an $O(n^{1+\varepsilon} + N_{out})$-time algorithm, where $\varepsilon > 0$ is an arbitrarily small constant. We also present an upper bound on the size of the output for each model. For the uniqueness and the free-inclusion models, we show that $N_{out} = O(n^2)$. Let $C = \sum_{g \in \Gamma} o_1(g) o_2(g)$, where $\Gamma$ is the set of distinct genes, and $o_1(g)$ and $o_2(g)$ are, respectively, the numbers of copies of $g$ in the two given sequences. For the bijection model, we show that $N_{out} = O(Cn)$. In this paper, we also study the problem of finding all approximate nested common intervals of two sequences on the bijection model. An $O(\delta n + N_{out})$-time algorithm is presented, where $\delta$ denotes the maximum number of allowed gaps. In addition, we show that for this problem $N_{out}$ is $O(\delta n^3)$.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 05/2012; 9(2-9):548 - 559. DOI:10.1109/TCBB.2011.112 · 1.44 Impact Factor