Looking for subsets of genes that appear consecutively in two or more genomes is a useful approach to identifying clusters
of functionally associated genes. One possible formalization of this problem is to model the order in which the genes appear
in all the considered genomes as permutations of their order in the first genome, and to find k-tuples of contiguous subsets of these permutations consisting of the same elements: the common intervals. A drawback of this
approach is that it cannot take into account paralogous genes and internal genomic duplications (each element occurs
only once in a permutation). To handle them, we need to model the order of genes by sequences that are not necessarily permutations.
In this work, we study some properties of common intervals between two general sequences. We bound the maximum number of common
intervals between two sequences of length $n$ by $n^2$ and present an $O(n^2 \log n)$ time algorithm to enumerate their whole set of common intervals. This complexity does not depend on the size
of the alphabets of the sequences.
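As an illustration of the definition above, the following is a minimal brute-force sketch, not the $O(n^2 \log n)$ algorithm of the paper. It assumes a common interval is identified by the set of elements shared by a contiguous stretch of each sequence; the function names `interval_sets` and `common_interval_sets` are hypothetical.

```python
def interval_sets(seq):
    """Return the element sets of all contiguous intervals of seq."""
    found = set()
    for start in range(len(seq)):
        seen = set()
        for end in range(start, len(seq)):
            seen.add(seq[end])
            # Duplicated elements do not change the set, so several
            # intervals may map to the same element set.
            found.add(frozenset(seen))
    return found


def common_interval_sets(seq1, seq2):
    """Element sets that occur as an interval in both sequences (naive)."""
    return interval_sets(seq1) & interval_sets(seq2)


# Two short gene orders with duplicated genes (illustrative data only).
s1 = [1, 2, 3, 1]
s2 = [3, 1, 2, 2]
common = common_interval_sets(s1, s2)
```

Here `{1, 3}` is common because it is the content of `s1[2..3]` and of `s2[0..1]`, whereas `{2, 3}` appears as an interval of `s1` only. Each sequence of length n has n(n+1)/2 intervals, so the sketch builds a quadratic number of sets and is meant for small inputs only.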
"With these precisions, Didier's approach consists then in considering each order O_i and, in total time O(n 2 log n 2 ) (reducible to O(n 2 ) according to ), verifying whether the intervals Set(O_i[1..h]) with 1 ≤ h ≤ ||O_i|| are also intervals of S. Our approach avoids to consider each order O_i by defining dominating orders which contain other orders, with the aim of focalising the search for common intervals on each dominating order rather than spreading it on each of the orders it dominates. We introduce now the supplementary notions needed by our algorithm."
ABSTRACT: Common intervals have been defined as a modelisation of gene clusters in
genomes represented either as permutations or as sequences. Whereas optimal
algorithms for finding common intervals in permutations exist even for an
arbitrary number of permutations, in sequences no optimal algorithm has been
proposed yet even for only two sequences. Surprisingly enough, when sequences
are reduced to permutations, the existing algorithms perform far from the
optimum, showing that their performances are not dependent, as they should be,
on the structural complexity of the input sequences.
In this paper, we propose to characterize the structure of a sequence by the
number $q$ of different dominating orders composing it (called the domination
number), and to use a recent algorithm for permutations in order to devise a
new algorithm for two sequences. Its running time is in
$O(q_1q_2p+q_1n_1+q_2n_2+N)$, where $n_1, n_2$ are the sizes of the two
sequences, $q_1,q_2$ are their respective domination numbers, $p$ is the
alphabet size and $N$ is the number of solutions to output. This algorithm
performs better as $q_1$ and/or $q_2$ reduce, and when the two sequences are
reduced to permutations (i.e. when $q_1=q_2=1$) it has the same running time as
the best algorithms for permutations. It is also the first algorithm for
sequences whose running time involves the parameter size of the solution. As a
counterpart, when $q_1$ and $q_2$ are of $O(n_1)$ and $O(n_2)$ respectively,
the algorithm is less efficient than other approaches.
Journal of Discrete Algorithms 10/2013; 29. DOI:10.1016/j.jda.2014.10.004
"In these applications, genomes may be represented either as permutations, when they do not contain duplicated genes, or as sequences. In sequences, duplicated genes usually play similar roles and lead to a more complex interval search, but sometimes they are appropriately matched and renumbered so as to obtain permutations."
ABSTRACT: Common intervals of K permutations over the same set of n elements were
first investigated by T. Uno and M. Yagiura (Algorithmica, 26:290-309, 2000),
who proposed an efficient algorithm to find common intervals when K=2. Several
particular classes of intervals have been defined since then, e.g. conserved
intervals and nested common intervals, with applications mainly in genome
comparison. Each such class, including common intervals, led to the development
of a specific algorithmic approach for K=2, and - except for nested common
intervals - for its extension to an arbitrary K.
In this paper, we propose a common and efficient algorithmic framework for
finding different types of common intervals in a set P of K permutations, with
arbitrary K. Our generic algorithm is based on a global representation of the
information stored in P, called the MinMax-profile of P, and an efficient data
structure, called an LR-stack, that we introduce here. We show that common
intervals (and their subclasses of irreducible common intervals and same-sign
common intervals), nested common intervals (and their subclass of maximal
nested common intervals) as well as conserved intervals (and their subclass of
irreducible conserved intervals) may be obtained by appropriately setting the
parameters of our algorithm in each case. All the resulting algorithms run in
O(Kn+N)-time and need O(n) additional space, where N is the number of
solutions. The algorithms for nested common intervals and maximal nested common
intervals are new for K>2, in the sense that no other algorithm has been given
so far to solve the problem with the same complexity, or better. The other
algorithms are as efficient as the best known algorithms.
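To make the notion of a common interval of K permutations concrete, here is a naive sketch. It does not use the MinMax-profile or the LR-stack described in the abstract. It relies on the standard observation that a stretch of the first permutation is a common interval exactly when, in every other permutation, the positions of its elements span a range of the same length. The function name `common_intervals` and the choice to report only intervals with at least two elements are assumptions made for this example.

```python
def common_intervals(perms):
    """Naive O(K n^2) enumeration of common intervals of K permutations.

    Returns index pairs (i, j), i < j, into perms[0] such that the elements
    perms[0][i..j] are contiguous in every permutation.
    """
    first = perms[0]
    n = len(first)
    # Position of each element in every other permutation.
    positions = [{v: idx for idx, v in enumerate(p)} for p in perms[1:]]
    result = []
    for i in range(n):
        lo = [pos[first[i]] for pos in positions]
        hi = list(lo)
        for j in range(i + 1, n):
            for k, pos in enumerate(positions):
                p = pos[first[j]]
                lo[k] = min(lo[k], p)
                hi[k] = max(hi[k], p)
            # Contiguous everywhere iff each position range has length j - i + 1.
            if all(h - l == j - i for l, h in zip(lo, hi)):
                result.append((i, j))
    return result


intervals = common_intervals([[1, 2, 3, 4], [2, 1, 4, 3]])
```

In this example the common intervals are the element sets {1, 2}, {3, 4} and {1, 2, 3, 4}. The quadratic scan is only for illustration; the abstract reports O(Kn+N) time for the algorithms it proposes.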
"Heber and Stoye extended this work to find all common intervals of k permutations in $O(kn + N_{out})$ time, using $O(kn)$ space. In addition, Didier extended this model to include paralogs by considering a sequence definition more general than a strict permutation, and gave an algorithm that finds all common intervals of two sequences in $O(n^2 \log n)$ time, using $O(n)$ space, on the extended model. Later, Schmidt and Stoye improved this result to $O(n^2)$ time."
ABSTRACT: The focus of this paper is the problem of finding all nested common intervals of two general sequences. Depending on the treatment one wants to apply to duplicate genes, Blin et al. introduced three models to define nested common intervals of two sequences: the uniqueness, the free-inclusion, and the bijection models. We consider all three models. For the uniqueness and the bijection models, we give $O(n + N_{out})$-time algorithms, where $N_{out}$ denotes the size of the output. For the free-inclusion model, we give an $O(n^{1+\varepsilon} + N_{out})$-time algorithm, where $\varepsilon > 0$ is an arbitrarily small constant. We also present an upper bound on the size of the output for each model. For the uniqueness and the free-inclusion models, we show that $N_{out} = O(n^2)$. Let $C = \sum_{g \in \Gamma} o_1(g) o_2(g)$, where $\Gamma$ is the set of distinct genes, and $o_1(g)$ and $o_2(g)$ are, respectively, the numbers of copies of $g$ in the two given sequences. For the bijection model, we show that $N_{out} = O(Cn)$. In this paper, we also study the problem of finding all approximate nested common intervals of two sequences on the bijection model. An $O(\delta n + N_{out})$-time algorithm is presented, where $\delta$ denotes the maximum number of allowed gaps. In addition, we show that for this problem $N_{out}$ is $O(\delta n^3)$.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 05/2012; 9(2-9):548-559. DOI:10.1109/TCBB.2011.112
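The quantity C used in the bijection-model bound of the abstract above can be computed directly from the copy counts of each gene. The sketch below is only an illustration of that formula; the function name `copy_product_sum` and the sample sequences are hypothetical.

```python
from collections import Counter


def copy_product_sum(seq1, seq2):
    """Compute C, the sum over distinct genes g of o1(g) * o2(g).

    o1(g) and o2(g) are the numbers of copies of g in seq1 and seq2.
    """
    o1, o2 = Counter(seq1), Counter(seq2)
    # Genes missing from one sequence contribute 0, so only shared
    # genes need to be visited.
    return sum(o1[g] * o2[g] for g in o1.keys() & o2.keys())


c_value = copy_product_sum("aabc", "abbc")
```

For these sequences, a contributes 2·1, b contributes 1·2 and c contributes 1·1. When both inputs are permutations of the same n genes, every product is 1 and C equals n.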