Common Intervals of Two Sequences.
ABSTRACT: Looking for subsets of genes that appear consecutively in two or more genomes is a useful approach to identifying clusters
of functionally associated genes. One possible formalization of this problem is to model the order in which the genes appear
in all the considered genomes as permutations of their order in the first genome, and to find k-tuples of contiguous subsets of
these permutations consisting of the same elements: the common intervals. A drawback of this approach is that it cannot take
into account paralogous genes and internal genomic duplications (each element occurs only once in a permutation). To handle
them, we need to model the order of genes by sequences that are not necessarily permutations.
In this work, we study some properties of common intervals between two general sequences. We bound the maximum number of common
intervals between two sequences of length $n$ by $n^2$ and present an $O(n^2 \log n)$ time algorithm to enumerate their whole
set of common intervals. This complexity does not depend on the size of the alphabets of the sequences.
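
To make the notion of a common interval concrete, here is a minimal brute-force sketch in Python. It assumes that a common interval is a set of elements that is exactly the content of some contiguous interval in each of the two sequences; the function names `interval_sets` and `common_intervals` are hypothetical. It is not the $O(n^2 \log n)$ algorithm of the paper, only a naive enumeration that is useful for checking small examples.

```python
def interval_sets(seq):
    """Return every distinct element set carried by a contiguous interval of seq."""
    sets = set()
    for i in range(len(seq)):
        current = set()
        for j in range(i, len(seq)):
            # Extend the interval seq[i..j] by one position and record its content.
            current.add(seq[j])
            sets.add(frozenset(current))
    return sets


def common_intervals(s1, s2):
    """Element sets that occur as the content of an interval in both sequences."""
    return interval_sets(s1) & interval_sets(s2)


if __name__ == "__main__":
    # "abac" contains the element a twice, so it is a sequence, not a permutation.
    for interval in sorted(common_intervals("abac", "cab"), key=lambda s: (len(s), sorted(s))):
        print(sorted(interval))
```

Because repeated elements are allowed, several different intervals of one sequence can carry the same element set; the sketch reports each set once.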

ABSTRACT: Common intervals have been defined as a modelisation of gene clusters in genomes represented either as permutations or as sequences. Whereas optimal algorithms for finding common intervals in permutations exist even for an arbitrary number of permutations, in sequences no optimal algorithm has been proposed yet even for only two sequences. Surprisingly enough, when sequences are reduced to permutations, the existing algorithms perform far from the optimum, showing that their performances are not dependent, as they should be, on the structural complexity of the input sequences. In this paper, we propose to characterize the structure of a sequence by the number $q$ of different dominating orders composing it (called the domination number), and to use a recent algorithm for permutations in order to devise a new algorithm for two sequences. Its running time is in $O(q_1q_2p+q_1n_1+q_2n_2+N)$, where $n_1, n_2$ are the sizes of the two sequences, $q_1,q_2$ are their respective domination numbers, $p$ is the alphabet size and $N$ is the number of solutions to output. This algorithm performs better as $q_1$ and/or $q_2$ reduce, and when the two sequences are reduced to permutations (i.e. when $q_1=q_2=1$) it has the same running time as the best algorithms for permutations. It is also the first algorithm for sequences whose running time involves the parameter size of the solution. As a counterpart, when $q_1$ and $q_2$ are of $O(n_1)$ and $O(n_2)$ respectively, the algorithm is less efficient than other approaches. Journal of Discrete Algorithms 10/2013; DOI:10.1016/j.jda.2014.10.004
ABSTRACT: An important model of a conserved gene cluster is called the gene team model, in which a chromosome is defined to be a permutation of distinct genes and a gene team is defined to be a set of genes that appear in two or more species, with the distance between adjacent genes in the team for each chromosome always no more than a certain threshold $\delta$. A gene team tree is a succinct way to represent all gene teams for every possible value of $\delta$. The previous fastest algorithm for constructing a gene team tree of two chromosomes requires $O(n \lg n \lg\lg n)$ time, which was given by Wang and Lin. Its bottleneck is a problem called the maximum-gap problem. In this paper, by presenting an improved algorithm for the maximum-gap problem, we reduce the upper bound of the gene team tree problem to $O(n \lg n \, \alpha(n))$. Since $\alpha$ grows extremely slowly, this result is almost as efficient as the current best upper bound, $O(n \lg n)$, for finding the gene teams of a fixed $\delta$ value. Our new algorithm is very efficient from both the theoretical and practical points of view. Wang and Lin's gene-team-tree algorithm can be extended to $k$ chromosomes with complexity $O(kn \lg n \lg\lg n)$. Similarly, our improved algorithm for the maximum-gap problem reduces this running time to $O(kn \lg n \, \alpha(n))$. In addition, it also provides new upper bounds for the gene team tree problem on general sequences, in which multiple copies of the same gene are allowed. IEEE/ACM Transactions on Computational Biology and Bioinformatics 01/2014; 11(1):142-153. DOI:10.1109/TCBB.2013.150 · 1.54 Impact Factor
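
The gene team definition for a fixed $\delta$ can be illustrated with a simple splitting procedure. The Python sketch below is not the algorithm of this paper nor that of Wang and Lin, and it does not build a gene team tree; it assumes each chromosome is given as a mapping from gene name to numeric position, and the names `split_by_gaps` and `gene_teams` are hypothetical. It repeatedly splits a gene set wherever two neighbouring genes are more than `delta` apart on either chromosome, until no further split is possible.

```python
def split_by_gaps(genes, pos, delta):
    """Split genes into maximal runs whose neighbouring positions differ by at most delta."""
    ordered = sorted(genes, key=lambda g: pos[g])
    runs, current = [], [ordered[0]]
    for g in ordered[1:]:
        if pos[g] - pos[current[-1]] > delta:
            runs.append(current)  # gap too large: close the current run
            current = [g]
        else:
            current.append(g)
    runs.append(current)
    return runs


def gene_teams(pos1, pos2, delta):
    """Return the gene teams of two chromosomes for a fixed threshold delta."""
    common = set(pos1) & set(pos2)  # only genes present on both chromosomes
    teams = []
    stack = [list(common)] if common else []
    while stack:
        genes = stack.pop()
        for pos in (pos1, pos2):
            runs = split_by_gaps(genes, pos, delta)
            if len(runs) > 1:
                stack.extend(runs)  # re-examine every piece on both chromosomes
                break
        else:
            # No split on either chromosome: the set is a team.
            teams.append(frozenset(genes))
    return teams


if __name__ == "__main__":
    chrom1 = {"a": 1, "b": 2, "c": 3, "d": 10, "e": 11}
    chrom2 = {"a": 1, "b": 2, "c": 9, "d": 10, "e": 11}
    for team in gene_teams(chrom1, chrom2, delta=2):
        print(sorted(team))
```

In the worst case this sketch re-sorts the same genes many times, which is why faster algorithms such as the $O(n \lg n)$ bound quoted above are of interest.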
Conference Paper: Parikh matching in the streaming model
ABSTRACT: Let S be a string over an alphabet Σ={σ1, σ2, …}. A Parikh mapping maps a substring S′ of S to a |Σ|-length vector that contains, in location i of the vector, the count of σi in S′. Parikh matching refers to the problem of finding all substrings of a text T that match a given input |Σ|-length count vector. In the streaming model one seeks space-efficient algorithms for problems in which there is one pass over the data. We consider Parikh matching in the streaming model. To make this viable we search for substrings whose Parikh mappings approximately match the input vector. In this paper we present upper and lower bounds on the problem of approximate Parikh matching in the streaming model. Proceedings of the 19th international conference on String Processing and Information Retrieval; 10/2012
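
As an illustration of the exact (non-approximate) problem defined above, the following Python sketch finds all substrings of a text whose Parikh mapping equals a given count vector. It is a plain sliding-window method that keeps a window of the text in memory, not the streaming or approximate algorithm of the paper; the function name `parikh_matches` and the choice to pass the count vector as a dictionary are assumptions made for the example. It relies on the fact that any matching substring must have length equal to the sum of the counts.

```python
from collections import Counter


def parikh_matches(text, target):
    """Return start indices of substrings of text whose character counts equal target."""
    # Drop zero entries so that comparison with the window counter is exact.
    target = Counter({c: k for c, k in target.items() if k > 0})
    m = sum(target.values())  # every match has exactly this length
    if m == 0 or m > len(text):
        return []
    window = Counter(text[:m])
    matches = [0] if window == target else []
    for i in range(m, len(text)):
        window[text[i]] += 1          # character entering the window
        out = text[i - m]
        window[out] -= 1              # character leaving the window
        if window[out] == 0:
            del window[out]           # keep the counter free of zero entries
        if window == target:
            matches.append(i - m + 1)
    return matches


if __name__ == "__main__":
    print(parikh_matches("abcab", {"a": 1, "b": 1}))
```

Each window comparison costs time proportional to the number of distinct characters involved, so this sketch favours clarity over speed.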