
Source Available from: psu.edu
ABSTRACT: Recent technologies for typing single nucleotide polymorphisms (SNPs) across a population are producing genome-wide genotype data for tens of thousands of SNP sites. The emergence of such large data sets underscores the importance of algorithms for large-scale haplotyping. Common haplotyping approaches first partition the SNPs into blocks of high linkage disequilibrium, and then infer haplotypes for each block separately. We investigate an integrated haplotyping approach where a partition of the SNPs into a minimum number of non-contiguous subsets is sought, such that each subset can be haplotyped under the perfect phylogeny model. We show that finding an optimum partition is NP-hard even if we are guaranteed that two subsets suffice. On the positive side, we show that a variant of the problem, in which each subset is required to admit a perfect path phylogeny haplotyping, is solvable in polynomial time. Discrete Mathematics 09/2009; 309(18):5610-5617. DOI:10.1016/j.disc.2008.04.002 · 0.57 Impact Factor

Source Available from: Avivit Levy
ABSTRACT: Consider the following optimization problem: given two strings over the same alphabet, transform one into the other by a succession of interchanges of two elements. In each interchange the two participating elements exchange positions. An interchange is given a weight that depends on the distance in the string between the two exchanged elements. The object is to minimize the total weight of the interchanges. This problem is a generalization of a classical problem on permutations (where every element appears once). The generalization considers general strings with possibly repeating elements, and a function assigning weights to the interchanges. The generalization to general strings (with unit weights) was mentioned by Cayley in the 19th century, and its complexity has been an open question since. We solve this open problem and consider various weight functions as well. SIAM Journal on Computing 01/2009; 39:1444-1461. DOI:10.1137/080712969 · 0.76 Impact Factor
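
As an illustration of the classical permutation case mentioned in the abstract above, the following minimal sketch computes the minimum number of unit-weight interchanges needed to sort a permutation, using the standard fact that this equals n minus the number of cycles. It does not cover the general-string or weighted variants that the paper studies; the function name `min_interchanges` and the 0-based encoding are assumptions made for this example.

```python
def min_interchanges(perm):
    """Minimum number of unit-weight interchanges that sort `perm`.

    Assumes `perm` is a permutation of 0..n-1 (hypothetical encoding).
    Uses the classical result: answer = n - (number of cycles).
    """
    n = len(perm)
    seen = [False] * n
    cycles = 0
    for start in range(n):
        if not seen[start]:
            # Walk the cycle containing `start`, marking its elements.
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return n - cycles
```

For example, `min_interchanges([1, 2, 0])` evaluates to 2, since a single 3-cycle needs two interchanges.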

Source Available from: Frances A. Rosamond
ABSTRACT: The haplotype inference problem (HIP) asks to find a set of haplotypes which resolve a given set of genotypes. This problem is important in practical fields such as the investigation of diseases or other types of genetic mutations. In order to find the haplotypes which are as close as possible to the real set of haplotypes that comprise the genotypes, two models have been suggested which are by now well-studied: the perfect phylogeny model and the pure parsimony model. All algorithms known so far for haplotype inference may find haplotypes that are not necessarily plausible, i.e., very rare haplotypes or haplotypes that were never observed in the population. In order to overcome this disadvantage, we study in this paper a new constrained version of HIP under the above-mentioned models. In this new version, a pool of plausible haplotypes H is given together with the set of genotypes G, and the goal is to find a subset H′ ⊆ H that resolves G. For constrained perfect phylogeny haplotyping (CPPH), we provide initial insights and polynomial-time algorithms for some restricted cases of the problem. For constrained parsimony haplotyping (CPH), we show that the problem is fixed-parameter tractable when parameterized by the size of the solution set of haplotypes. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 01/2009; 8(6):1692-9. DOI:10.1109/TCBB.2010.72 · 1.54 Impact Factor

Source Available from: Oren Kapah
ABSTRACT: The Longest Common Subsequence (LCS) is a well-studied problem, having a wide range of implementations. Its motivation is in comparing strings. It has long been of interest to devise a similar measure for comparing higher dimensional objects, and more complex structures. In this paper we study the Longest Common Substructure of two matrices and show that this problem is NP-hard. We also study the Longest Common Subforest problem for multiple trees, including a constrained version. We show NP-hardness for k>2 unordered trees in the constrained LCS. We also give polynomial time algorithms for ordered trees and prove a lower bound for any decomposition strategy for k trees. Theoretical Computer Science 12/2008; 409(3):438-449. DOI:10.1016/j.tcs.2008.08.037 · 0.52 Impact Factor
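
For reference, the one-dimensional string LCS that the abstract above takes as its starting point can be computed with the textbook dynamic program sketched below. This is only the classical string case, not the matrix or tree generalizations studied in the paper; the function name `lcs_length` is an assumption for this example.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of sequences `a` and `b`.

    Classical O(len(a) * len(b)) dynamic program, keeping one row at a time.
    """
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous prefix of `a`
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)   # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop x or drop y
        prev = cur
    return prev[-1]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` evaluates to 4.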

Source Available from: Braha Riva Shalom
ABSTRACT: The Longest Common Subsequence (LCS) is a well studied problem, having a wide range of implementations. Its motivation is
in comparing strings. It has long been of interest to devise a similar measure for comparing higher dimensional objects, and
more complex structures. In this paper we give, what is to our knowledge, the first inherently multidimensional definition
of LCS. We discuss the Longest Common Substructure of two matrices and the Longest Common Subtree problem for multiple trees
including a constrained version. Both problems cannot be solved by a natural extension of the original LCS solution. We investigate
the tractability of the above problems. For the first we prove NP-completeness. For the latter, NP-hardness holds for two general
unordered trees and for k trees in the constrained LCS. 09/2007: pages 50-61;

Source Available from: psu.edu
Tzvika Hartman
ABSTRACT: An important problem in genome rearrangements is sorting permutations by transpositions. Its complexity is still open, and
two rather complicated 1.5-approximation algorithms for sorting linear permutations are known (Bafna and Pevzner, 96 and Christie, 98). In this paper, we observe that the problem of sorting circular permutations by transpositions is equivalent to the problem
of sorting linear permutations by transpositions. Hence, all algorithms for sorting linear permutations by transpositions
can be used to sort circular permutations. Our main result is a new 1.5-approximation algorithm, which is considerably simpler
than the previous ones, and achieves running time which is equal to the best known. Moreover, the analysis of the algorithm
is significantly less involved, and provides a good starting point for studying related open problems. 03/2007: pages 156-169;

Source Available from: Oren Kapah
String Processing and Information Retrieval, 14th International Symposium, SPIRE 2007, Santiago, Chile, October 29-31, 2007, Proceedings; 01/2007

Source Available from: Avivit Levy
ABSTRACT: An underlying assumption in the classical sorting problem is that the sorter does not know the index of every element in the
sorted array. Thus, comparisons are used to determine the order of elements, while the sorting is done by interchanging elements.
In the closely related interchange rearrangement problem, final positions of elements are already given, and the cost of the
rearrangement is the cost of the interchanges. This problem was studied only for the limited case of permutation strings,
where every element appears once. This paper studies a generalization of the classical and well-studied problem on permutations
by considering general string inputs, thus solving an open problem of Cayley from 1849, and examining various cost models. Algorithms - ESA 2007, 15th Annual European Symposium, Eilat, Israel, October 8-10, 2007, Proceedings; 01/2007

Source Available from: citeseerx.ist.psu.edu
ABSTRACT: Recent technologies for typing single nucleotide polymorphisms (SNPs) across a population are producing genome-wide genotype
data for tens of thousands of SNP sites. The emergence of such large data sets underscores the importance of algorithms for
large-scale haplotyping. Common haplotyping approaches first partition the SNPs into blocks of high linkage disequilibrium,
and then infer haplotypes for each block separately. We investigate an integrated haplotyping approach where a partition of
the SNPs into a minimum number of non-contiguous subsets is sought, such that each subset can be haplotyped under the perfect
phylogeny model. We show that finding an optimum partition is NP-hard even if we are guaranteed that two subsets suffice.
On the positive side, we show that a variant of the problem, in which each subset is required to admit a perfect path phylogeny haplotyping, is solvable in polynomial time. 09/2006: pages 92-102;

ABSTRACT: We study the problems of sorting signed permutations by reversals (SBR) and sorting unsigned permutations by transpositions
(SBT), which are central problems in computational molecular biology. While a polynomial-time solution for SBR is known, the
computational complexity of SBT has been open for more than a decade and is considered a major open problem.
In the first efficient solution of SBR, Hannenhalli and Pevzner [HP99] used a graph-theoretic model for representing permutations,
called the interleaving graph. This model was crucial to their solution. Here, we define a new model for SBT, which is analogous to the interleaving graph.
Our model has some desirable properties that were lacking in earlier models for SBT. These properties make it extremely useful
for studying SBT.
Using this model, we give a linear-algebraic framework in which SBT can be studied. Specifically, for matrices over any algebraic
ring, we define a class of matrices called tight matrices. We show that an efficient algorithm which recognizes tight matrices over a certain ring,
$\mathbb{M}$, implies an efficient algorithm that solves SBT on an important class of permutations, called simple permutations. Such an
algorithm is likely to lead to an efficient algorithm for SBT that works on all permutations.
The problem of recognizing tight matrices is also a generalization of SBR and of a large class of other “sorting by rearrangements”
problems, and seems interesting in its own right as well. We give an efficient algorithm for recognizing tight symmetric matrices
over any field of characteristic 2. We leave as an open problem to find an efficient algorithm for recognizing tight matrices
over the ring $\mathbb{M}$. String Processing and Information Retrieval, 13th International Conference, SPIRE 2006, Glasgow, UK, October 11-13, 2006, Proceedings; 01/2006

Source Available from: biu.ac.il
ABSTRACT: Sorting permutations by transpositions is an important problem in genome rearrangements. A transposition is a rearrangement operation in which a segment is cut out of the permutation and pasted in a different location. The complexity of this problem is still open and it has been a 10-year-old open problem to improve the best known 1.5-approximation algorithm. In this paper, we provide a 1.375-approximation algorithm for sorting by transpositions. The algorithm is based on a new upper bound on the diameter of 3-permutations. In addition, we present some new results regarding the transposition diameter: we improve the lower bound for the transposition diameter of the symmetric group and determine the exact transposition diameter of simple permutations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 01/2006; 3(4):369-79. DOI:10.1109/TCBB.2006.44 · 1.54 Impact Factor
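
The transposition operation defined in the abstract above (cut a segment, paste it elsewhere) is equivalent to exchanging two adjacent segments, which the minimal sketch below implements. It only illustrates the operation itself, not the approximation algorithm; the function name `apply_transposition` and the 0-based half-open index convention are assumptions for this example.

```python
def apply_transposition(perm, i, j, k):
    """Return a copy of `perm` with adjacent segments perm[i:j] and perm[j:k] exchanged.

    Equivalent to cutting perm[i:j] out and pasting it just before position k.
    Indices are 0-based and half-open (hypothetical convention): 0 <= i < j < k <= len(perm).
    """
    if not 0 <= i < j < k <= len(perm):
        raise ValueError("require 0 <= i < j < k <= len(perm)")
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]
```

For example, `apply_transposition([3, 1, 2], 0, 1, 3)` moves the segment `[3]` to the end and returns `[1, 2, 3]`.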

Source Available from: citeseerx.ist.psu.edu
ABSTRACT: One of the most promising ways to determine evolutionary distance between two organisms is to compare the order of appearance of orthologous genes in their genomes. The resulting genome rearrangement problem calls for finding a shortest sequence of rearrangement operations that sorts one genome into the other. In this paper we provide a 1.5-approximation algorithm for the problem of sorting by transpositions and transreversals, improving on a five-year-old 1.75 ratio for this problem. Our algorithm is also faster than current approaches and requires time for n genes. Journal of Computer and System Sciences 05/2005; 70(3):300-320. DOI:10.1016/j.jcss.2004.12.006 · 1.09 Impact Factor

Source Available from: citeseerx.ist.psu.edu

Source Available from: Ron Shamir
ABSTRACT: An important problem in genome rearrangements is sorting permutations by transpositions. The complexity of the problem is still open, and two rather complicated 1.5-approximation algorithms for sorting linear permutations are known (Bafna and Pevzner, 98 and Christie, 99). The fastest known algorithm is the quadratic algorithm of Bafna and Pevzner. In this paper, we observe that the problem of sorting circular permutations by transpositions is equivalent to the problem of sorting linear permutations by transpositions. Hence, all algorithms for sorting linear permutations by transpositions can be used to sort circular permutations. Our main result is a new 1.5-approximation algorithm, which is considerably simpler than the previous ones, and whose analysis is significantly less involved. Information and Computation 05/2004; 204(2):275-290. DOI:10.1016/j.ic.2005.09.002 · 0.60 Impact Factor

Source Available from: Amir BenDor
ABSTRACT: We study a design and optimization problem that occurs, for example, when single nucleotide polymorphisms (SNPs) are to be genotyped using a universal DNA tag array. The problem of optimizing the universal array to avoid disruptive cross-hybridization between universal components of the system was addressed in previous work. Cross-hybridization can, however, also occur assay-specifically, due to unwanted complementarity involving assay-specific components. Here we examine the problem of identifying the most economic experimental configuration of the assay-specific components that avoids cross-hybridization. Our formalization translates this problem into the problem of covering the vertices of one side of a bipartite graph by a minimum number of balanced subgraphs of maximum degree 1. We show that the general problem is NP-complete. However, in the real biological setting, the vertices that need to be covered have degrees bounded by d. We exploit this restriction and develop an O(d)-approximation algorithm for the problem. We also give an O(d)-approximation for a variant of the problem in which the covering subgraphs are required to be vertex-disjoint. In addition, we propose a stochastic model for the input data and use it to prove a lower bound on the cover size. We complement our theoretical analysis by implementing two heuristic approaches and testing their performance on synthetic data as well as on simulated SNP data. Journal of Computational Biology 02/2004; 11(2-3):476-92. DOI:10.1089/1066527041410373 · 1.67 Impact Factor

Source Available from: citeseerx.ist.psu.edu
ABSTRACT: One of the most promising ways to determine evolutionary distance between two organisms is to compare the order of appearance of orthologous genes in their genomes. The resulting genome rearrangement problem calls for finding a shortest sequence of rearrangement operations that sorts one genome into the other. In this paper we provide a 1.5-approximation algorithm for the problem of sorting by transpositions and transreversals, improving on a five-year-old 1.75 ratio for this problem. Our algorithm is also faster than current approaches and requires Algorithms in Bioinformatics, 4th International Workshop, WABI 2004, Bergen, Norway, September 17-21, 2004, Proceedings; 01/2004

Source Available from: Amir BenDor
ABSTRACT: We study a design and optimization problem that occurs, for example, when single nucleotide polymorphisms (SNPs) are to be genotyped using a universal DNA tag array. The problem of optimizing the universal array to avoid disruptive cross-hybridization between universal components of the system was addressed in previous work. However, cross-hybridization can also occur assay-specifically, due to unwanted complementarity involving assay-specific components. Here we examine the problem of identifying the most economic experimental configuration of the assay-specific components that avoids cross-hybridization. Our formalization translates this problem into the problem of covering the vertices of one side of a bipartite graph by a minimum number of balanced subgraphs of maximum degree 1. We show that the general problem is NP-complete. However, in the real biological setting the vertices that need to be covered have degrees bounded by d. We exploit this restriction and develop an O(d)-approximation algorithm for the problem. We also give an O(d)-approximation for a variant of the problem in which the covering subgraphs are required to be vertex-disjoint. In addition, we propose a stochastic model for the input data and use it to prove a lower bound on the cover size. We complement our theoretical analysis by implementing two heuristic approaches and testing their performance on simulated and real SNP data. 04/2003

Source Available from: Ron Shamir
ABSTRACT: Sequencing by hybridization (SBH) is a DNA sequencing technique, in which the sequence is reconstructed using its k-mer content. This content, which is called the spectrum of the sequence, is obtained by hybridization to a universal DNA array. Standard universal arrays contain all k-mers for some fixed k, typically 8 to 10. Currently, in spite of its promise and elegance, SBH is not competitive with standard gel-based sequencing methods. This is due to two main reasons: lack of tools to handle realistic levels of hybridization errors and an inherent limitation on the length of uniquely reconstructible sequence by standard universal arrays. In this paper, we deal with both problems. We introduce a simple polynomial reconstruction algorithm which can be applied to spectra from standard arrays and has provable performance in the presence of both false negative and false positive errors. We also propose a novel design of chips containing universal bases that differs from the one proposed by Preparata et al. (1999). We give a simple algorithm that uses spectra from such chips to reconstruct with high probability random sequences of length lower only by a squared log factor compared to the information theoretic bound. Our algorithm is very robust to errors and has a provable performance even if there are both false negative and false positive errors. Simulations indicate that its sensitivity to errors is also very small in practice. Journal of Computational Biology 02/2003; 10(3-4):483-97. DOI:10.1089/10665270360688138 · 1.67 Impact Factor
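
To make the notion of a spectrum in the abstract above concrete, the sketch below computes the set of k-mers occurring in a sequence. It models an idealized, error-free spectrum that ignores multiplicities, which is an assumption of this example; it is not the reconstruction algorithm or chip design from the paper, and the function name `spectrum` is hypothetical.

```python
def spectrum(sequence, k):
    """Set of all length-k substrings (k-mers) of `sequence`.

    Idealized model: no hybridization errors, multiplicities ignored.
    Returns an empty set when the sequence is shorter than k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
```

For example, `spectrum("ACGTAC", 3)` evaluates to the set `{"ACG", "CGT", "GTA", "TAC"}`.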

ABSTRACT: Introduction The genome of a species can be thought of as a set of ordered sequences of genes (the ordering devices being the chromosomes), each gene having an orientation given by its location on the DNA double strand. Comparing two sets of similar genes appearing along a chromosome in two species yields two (signed) permutations. It is widely accepted that the reversal distance between these two permutations, defined as the minimal number of reversals that transform one permutation into the other, provides a good estimate of the evolutionary distance between the two species. Computing the reversal distance is now a well understood computational problem that has linear complexity [1]. However, reconstructing sequences of reversals that realize this distance raises some interesting issues. In recent months, the assessment of the difficulty of the problem shifted from "Finding an optimal sequence is non trivial" ([3], [4], [2]) to "There is a huge number of optimal sequences" (See, for

Source Available from: psu.edu
ABSTRACT: The sorting by reversals problem is classical in the field of whole genome comparison. In this paper, we provide experimental and theoretical evidence showing that, typically, there is a huge number of optimal sequences of reversals that sort a given signed permutation. We study these sets of optimal sequences using secondary sorting constraints, and using theoretical tools developed in the context of trace monoids. We show that most sorting strategies work well with random permutations, and identify combinatorial parameters, such as stack height, that can be used to classify sequences of reversals, and permutations.
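
As a small illustration of the reversal operation on signed permutations discussed above, the sketch below reverses a segment and flips the sign of each element in it. It shows only the elementary operation, not how optimal sorting sequences are found or counted; the function name `apply_reversal` and the 0-based half-open indexing are assumptions for this example.

```python
def apply_reversal(perm, i, j):
    """Return a copy of signed permutation `perm` with perm[i:j] reversed and negated.

    `perm` is a list of nonzero integers whose signs encode gene orientation.
    Indices are 0-based and half-open (hypothetical convention): 0 <= i < j <= len(perm).
    """
    if not 0 <= i < j <= len(perm):
        raise ValueError("require 0 <= i < j <= len(perm)")
    # Reversing the segment also flips the orientation of every gene in it.
    return perm[:i] + [-x for x in reversed(perm[i:j])] + perm[j:]
```

For example, `apply_reversal([1, -3, -2, 4], 1, 3)` returns `[1, 2, 3, 4]`, so that signed permutation is sorted by a single reversal.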