## About

104

Publications

16,659

Reads

2,249

Citations

Introduction

Additional affiliations

January 2002 - March 2013

Education

September 1991 - July 1996

## Publications

Publications (104)

SNP haplotyping problems have been the subject of extensive research in the last few years, and are one of the hottest areas of Computational Biology today. In this paper we report on our work of the last two years, whose preliminary results were presented at the European Symposium on Algorithms (Proceedings of the Annual European Symposium on Algo...

Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. They are of fundamental importance for a variety of applications including medical diagnostics and drug design. They also provide the highest-resolution genomic fingerprint for tracking disease genes. This paper is devoted to algorithmic problems related to...

We describe an exact algorithm for the problem of sorting a permutation by the minimum number of reversals, originating from evolutionary studies in molecular biology. Our approach is based on an integer linear programming formulation of a graph-theoretic relaxation of the problem, calling for a decomposition of the edge set of a bicolored graph in...

In this paper we study some open questions related to the smallest order $f({\cal C},\lnot {\cal H})$ of a 4-regular graph which has a connectivity property ${{\cal C}}$ but does not have a hamiltonian property ${\cal H}$. In particular, ${\cal C}$ is either connectivity, 2-connectivity or 1-toughness and ${\cal H}$ is hamiltonicity, homogeneously...

Logical Analysis of Data is a procedure aimed at identifying relevant features in data sets with both positive and negative samples. The goal is to build Boolean formulas, represented by strings over {0,1,-} called patterns, which can be used to classify new samples as positive or negative. Since a data set can be explained in alternative ways, man...

Given a Traveling Salesman Problem solution, the best 3-OPT move requires us to remove three edges and replace them with three new ones so as to shorten the tour as much as possible. No worst-case algorithm better than the Θ(n³) enumeration of all triples is likely to exist for this problem, but algorithms with average case O(n^{3−ε}) are not ruled o...

Given a Traveling Salesman Problem solution, the best 3-OPT move requires us to remove three edges and replace them with three new ones so as to shorten the tour as much as possible. No worst-case algorithm better than the Θ(n³) enumeration of all triples is likely to exist for this problem, but algorithms with average case O(n^{3−ϵ}) are not ruled ou...

It is known that there exist 4-regular, 1-tough graphs which are non-hamiltonian. The smallest such graph known has \(n=18\) nodes and was found by Bauer et al., who conjectured that all 4-regular, 1-tough graphs with \(n\le 17\) are hamiltonian. They in fact proved that this is true for \(n\le 15\), but left open the possibility of non-hamiltonian...

Finding the largest triangle in an $n$-node edge-weighted graph cannot be done in less than $\Theta(n^3)$ time unless a widely believed conjecture is false. This negative result does not rule out the possibility of algorithms whose average, rather than worst-case, running time is subcubic. In this work we describe a procedure which we show finds the la...
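For orientation, the cubic baseline that the abstract refers to can be sketched as a plain enumeration of all node triples. This is only the naive reference procedure, not the subcubic-on-average method of the paper; the function name `heaviest_triangle` and the symmetric weight-matrix input (a complete graph) are assumptions made for illustration.

```python
from itertools import combinations


def heaviest_triangle(w):
    """Return (weight, (i, j, k)) of a maximum-weight triangle.

    `w` is assumed to be a symmetric n x n matrix of edge weights on a
    complete graph. The loop inspects every triple, so it takes
    Theta(n^3) time. Returns None when the graph has fewer than 3 nodes.
    """
    best = None
    for i, j, k in combinations(range(len(w)), 3):
        total = w[i][j] + w[j][k] + w[i][k]
        if best is None or total > best[0]:
            best = (total, (i, j, k))
    return best
```

Any faster procedure has to avoid looking at most of these triples, which is what an average-case subcubic algorithm must achieve.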

Finding the largest triangle in an n-node edge-weighted graph belongs to a set of problems all equivalent under subcubic reductions. Namely, a truly subcubic algorithm for any one of them would imply that they are all subcubic. A recent strong conjecture states that none of them can be solved in less than Θ(n³) time, but this negative result does...

We describe a way to compute the edit distance of two strings without having to fill the whole dynamic programming (DP) matrix, through a sequence of increasing guesses on the edit distance. If the strings share a certain degree of similarity, the edit distance can be considerably smaller than the value of non-optimal solutions, and a large fraction (up t...
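The idea of guessing the distance and filling only part of the DP matrix can be illustrated with the classical band-doubling scheme: for a guess t, only cells with |i − j| ≤ t are computed, and the guess is doubled until the answer fits. The sketch below shows that textbook technique under the unit-cost (Levenshtein) model; it is not claimed to be the procedure of the paper, and the function names are hypothetical.

```python
def bounded_edit_distance(a, b, t):
    """Return the edit distance of a and b if it is <= t, else None.

    Only cells with |i - j| <= t are filled: any alignment leaving
    this band already costs more than t.
    """
    n, m = len(a), len(b)
    if abs(n - m) > t:
        return None
    inf = t + 1  # every value above the guess is capped at t + 1
    prev = [j if j <= t else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - t), min(m, i + t)
        if lo == 0:
            cur[0] = i  # column 0 is inside the band, so i <= t
            lo = 1
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1, inf)
        prev = cur
    return prev[m] if prev[m] <= t else None


def edit_distance(a, b):
    """Try increasing guesses t = 1, 2, 4, ... until one succeeds."""
    t = max(1, abs(len(a) - len(b)))
    while True:
        d = bounded_edit_distance(a, b, t)
        if d is not None:
            return d
        t *= 2
```

For example, `edit_distance("kitten", "sitting")` returns 3 after the guesses 1 and 2 fail and the guess 4 succeeds, so most of the matrix is never touched when the strings are similar.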

We describe a simple data structure for storing subsets of {0, …, N − 1}, with N a given integer, which has optimal time performance for all the main set operations, whereas previous data structures are non-optimal for at least one such operation. We report on the comparison of a Java implementation of our structure with other structures of th...

The 4-OPT neighborhood for the TSP contains Θ(n⁴) moves so that finding the best move effectively requires some ingenuity. Recently, de Berg et al. have given a Θ(n³) dynamic program, but the cubic complexity is still too large for using 4-OPT in practice. We describe a new procedure which behaves, on average, slightly worse than a quadratic algori...

We describe mathematical models and practical algorithms for a problem concerned with monitoring the air pollution in a large city. We have worked on this problem within a project for assessing the air quality in the city of Rome by placing a certain number of sensors on some of the city buses. We cast the problem as a facility location model. By r...

In this paper we consider the exact solution of the closest string problem (CSP). In general, exact algorithms for an NP-hard problem are either branch and bound procedures or dynamic programs. With respect to branch and bound, we give a new Integer Linear Programming formulation, improving over the standard one, and also suggest some combinatorial...

In this chapter we show a general compact extended formulation for the relaxation of the stable set polytope. For some graphs the stable set polytope can be given an exact representation, although with an exponential number of inequalities. When the graphs are perfect, however, compact extended formulations are possible for the stable set polytope....

This chapter focuses on the famous cutting stock/bin packing problem which was the first problem to be modeled by column generation. The compact equivalent counterpart of this problem has a very interesting structure of a particular flow problem. Two other packing problems for which we may show compact extended formulations are the robust knapsack...

This chapter is devoted to the Traveling Salesman Problem (TSP), one of the most famous problems of combinatorial optimization. Compact ILP models for this problem have long been proposed, but most of them are not effective for computational purposes. We describe an effective compact model which expresses the subtour inequalities via t...

This chapter illustrates the general way to build a compact extended formulation. The chapter also describes in detail how to use LP techniques to build a compact extended formulation. Some examples are immediately brought to the attention of the reader so that the technique can be better understood. Also the role of the nonnegative factorization o...

This chapter introduces the basic definitions and properties of polyhedra. Polyhedra are given an external description, in terms of a set of linear inequalities, and an internal description, in terms of vertices and extreme rays. The projection operator is described in detail. Other topics described are the union of polyhedra, Fourier elimination s...

This chapter provides an introduction to Integer Linear Programming (ILP). After reviewing the effective modeling of a problem via ILP, the chapter describes the two main solving procedures for integer programs, i.e., branch-and-bound and cutting planes. The theory of totally unimodular matrices is introduced to account for problems whose models ha...

This chapter describes ILP models of exponential-size, either in the number of constraints, the number of variables, or both. These are the models for which compact extended formulations are intended. The separation and pricing problems are introduced as a general paradigm for the solution of such large models.

This chapter is devoted to compact extended formulations of tree problems. First, we give a compact extended formulation for the relaxation of the Steiner tree problem. We then describe the well-known minimum spanning tree problem, for which there exist polynomial algorithms and exponential-size models. We use both LP techniques and nonnegative ran...

This chapter is focused on the parity polytope, i.e., the convex hull of all 0-1 vectors with an even number of ones. A closely related polytope is the convex hull of all 0-1 vectors with an odd number of ones. We show two alternative compact extended formulations for both polytopes, one based on the union of polyhedra and the other one on LP. Wher...

This chapter provides an introduction to Linear Programming theory. It discusses classical concepts such as duality, complementary slackness, complexity and algorithmic issues.

In this chapter we compare three popular models for the maximum cut problem and show the equivalence of their relaxations by using compact extended formulations. These problems are closely related to the subject of edge-induced and node-induced bipartite subgraphs, for which we give compact extended formulations as well.

This chapter deals with some combinatorial optimization problems arising in computational biology. We first survey the assessment of the evolutionary distance between two genomes. The problem is equivalent to packing the edges of a bi-colored graph into a maximum number of alternating cycles, which naturally leads to an exponential-size ILP model....

This chapter refers to a very interesting combinatorial object called the permutahedron. We provide three different compact extended formulations. The first one is based on LP techniques and the second one on a simple projection of a polyhedron whose vertices are integral. Both these formulations require a quadratic number of variables and inequalities...

Scheduling problems are notoriously difficult and ILP models have not yet shown adequate strength for them to be competitive with other approaches. This chapter presents a time-indexed model for the Job-Shop problem that can be solved either by column generation or by a compact equivalent formulation. We also present an interesting approach for a o...

A move of the 3-OPT neighborhood for the Traveling Salesman Problem consists in removing any three edges of the tour and replacing them with three new ones. The standard algorithm to find the best possible move is cubic, both in its worst and average time complexity. Since TSP instances of interest can have thousands of nodes, up to now it has bee...

This book provides a handy, unified introduction to the theory of compact extended formulations of exponential-size integer linear programming (ILP) models. Compact extended formulations are equally powerful, but polynomial-sized, models whose solutions do not require the implementation of separation and pricing procedures. The book is written in a...

We study the complexity of the problem of searching for a set of patterns that separate two given sets of strings. This problem has applications in a wide variety of areas, most notably in data mining, computational biology, and in understanding the complexity of genetic algorithms. We show that the basic problem of finding a small set of patterns...

The best formulations for some combinatorial optimization problems are integer linear programming models with an exponential number of rows and/or columns, which are solved incrementally by generating missing rows and columns only when needed. As an alternative to row generation, some exponential formulations can be rewritten in a compact extended...

Many biomedical experiments produce large data sets in the form of binary matrices, with features labeling the columns and individuals (samples) associated to the rows. An important case is when the rows are also labeled into two groups, namely the positive (or healthy) and the negative (or diseased) samples. The Logical Analysis of Data (LAD) is a...

This chapter addresses the haplotyping problem for both a single individual and a set of individuals (a population). In the former case the input is haplotype data inconsistent with the existence of exactly two parents for an individual. This inconsistency is due to experimental errors and/or missing data. In the latter case the input data specify...

Since its introduction in 2001, the Single Individual Haplotyping problem has received ever-increasing attention from the scientific community. In this paper we survey, in the form of an annotated bibliography, the developments in the study of the problem from its origin to the present day.

We illustrate how integer linear programming techniques can be applied to the popular game of poker Texas Hold’em in order to evaluate the strength of a hand. In particular, we give models aimed at (1) minimizing the number of features that a player should look at when estimating his winning probability (called his equity), (2) giving weights to su...

We describe a general method for deriving new inequalities for integer programming formulations of combinatorial optimization problems. The inequalities, motivated by local search algorithms, are valid for all optimal solutions but not necessarily for all feasible solutions. These local search inequalities can help in either pruning the search tree...

We describe an integer programming (IP) model that can be applied to the solution of all genome-rearrangement problems in the literature. No direct IP model for such problems had ever been proposed prior to this work. Our model employs an exponential number of variables, but it can be solved by column generation techniques. That is, we start with a s...

The past few years have seen the birth and the growth of a new research area in bioinformatics, called haplotyping. Haplotyping problems are combinatorial and optimization problems concerned with the analysis of human polymorphisms in populations, and with the study of common patterns for such polymorphisms. In this chapter we review the most import...

We discuss the effectiveness of integer programming for solving large instances of the independent set problem. Typical LP formulations, even strengthened by clique inequalities, yield poor bounds for this problem. We show that a strong bound can be obtained by the use of the so called “rank inequalities”, which generalize the clique inequalities....

This chapter presents and (comparatively) discusses the data structures that have been proposed in the context of next-generation sequencing (NGS) data processing. The authors classify algorithms and data structures specifically designed for the alignment of short nucleotide sequences produced by NGS instruments against a database. They propose a divi...

We illustrate how Integer Linear Programming techniques can be applied to the popular game of poker Texas Hold'em in order to evaluate the strength of a hand. In particular, we give models aimed at (i) minimizing the number of features that a player should look at when estimating his winning probability (called his equity); (ii) giving weights to s...

We address the Max Cut problem by developing a compact formulation from the model expressing the condition that cuts and circuits have even intersection. This formulation turns out to be effective on sparse graphs especially with respect to the model based on triples of nodes.

In this paper we propose two time-indexed IP formulations for job-shop scheduling problems with a min-sum objective. The first model has variables associated to job scheduling patterns. The exponential number of variables calls for a column generation scheme which is carried out by a dynamic programming procedure. The second model is of network flo...

Haplotype data play a relevant role in several genetic studies, e.g., mapping of complex disease genes, drug design, and evolutionary studies on populations. However, the experimental determination of haplotypes is expensive and time-consuming. This motivates the increasing interest in techniques for inferring haplotype data from genotypes, which c...

A tiling of a matrix is an exact cover of its elements by a set of row fragments, called tiles. A particular variant of the tiling problem has arisen in the context of computational biology for studying genetic variations between individuals, in which one wishes to find the minimum-cardinality tiling of a matrix whose rows correspond to genomic seq...

We introduce an exact algorithm, based on Integer Linear Programming, for the parsimony haplotyping problem (PHP). The PHP uses molecular data and is aimed at the determination of a smallest set of haplotypes that explain a given set of genotypes. Our approach is based on a Set Covering formulation of the problem, solved by branch and bound with bo...

Modern technology, where sophisticated instruments are coupled with the massive use of computers, has made molecular biology a science where the size of the data to be gathered and analyzed poses serious computational problems. Very large data sets are ubiquitous in computational molecular biology: The European Molecular Biology Laboratory (EMBL) n...

The field of computational biology has experienced a tremendous growth in the past 15 years. In this bibliography, we survey some of the most significant contributions that were made to the field and which employ mathematical programming techniques, while giving a broad overview of application areas of modern computational molecular biology. The ar...

We consider a combinatorial problem derived from haplotyping a population with respect to a genetic disease, either recessive or dominant. Given a set of individuals, partitioned into healthy and diseased, and the corresponding sets of genotypes, we want to infer "bad" and "good" haplotypes to account for these genotypes and for the disease. Assu...

In this paper we investigate logic classification and related feature selection algorithms for large biomedical data sets. When the data is in binary/logic form, the feature selection problem can be formulated as a Set Covering problem of very large dimensions, whose solution is computationally challenging. We propose an alternative approximated fo...

Combinatorial haplotyping problems have received great attention in the past few years. We review their definitions and the main results that were obtained for their solution. Haplotyping problems require one to determine a set H of binary vectors (called haplotypes) that explain a set G of ternary vectors (called genotypes). The number χ(G) of...

We consider a problem defined on strings and inspired by the way DNA encodes amino-acids as triplets of nucleotides. Given a string s on an alphabet Σ, a word-length k and a budget D, we want to determine the smallest number of distinct k-mers that can be left in s, if we are allowed to replace up to D letters of s. This problem has several paramet...

We describe SALSA (Sequence ALignment via Steiner Ancestors), a public-domain suite of programs for generating multiple alignments of a set of genomic sequences. We allow the use of either of the two popular objectives, Tree Alignment or Sum-of-Pairs. The main distinguishing feature of our method is that the alignment is obtained via a tree in w...

In this paper we propose a time-indexed IP formulation for job-shop scheduling problems. We first introduce a model with variables associated to job scheduling patterns and constraints associated to machine capacities and to job assignments. The exponential number of variables calls for a column generation scheme which is carried out by a dynamic...

The parsimony haplotyping problem was shown to be NP-hard when each genotype had k⩽3 ambiguous positions, while the case for k⩽2 was open. In this paper, we show that the case for k⩽2 is polynomial, and we give approximation and FPT algorithms for the general case of k⩾0 ambiguous positions.
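To make the role of the ambiguous positions concrete, the sketch below enumerates all haplotype pairs that explain a single genotype. It assumes a common encoding that the abstract itself does not spell out: genotype entries 0 and 1 are unambiguous, and 2 marks an ambiguous site where the two haplotypes must differ. The function name `resolutions` is hypothetical.

```python
from itertools import product


def resolutions(genotype):
    """List the unordered haplotype pairs that explain `genotype`.

    Assumed encoding: 0 or 1 means both haplotypes carry that value,
    2 marks an ambiguous site where the two haplotypes differ.
    """
    ambiguous = [i for i, x in enumerate(genotype) if x == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(ambiguous)):
        h1, h2 = list(genotype), list(genotype)
        for pos, b in zip(ambiguous, bits):
            h1[pos], h2[pos] = b, 1 - b
        # Sort the pair so that (h1, h2) and (h2, h1) are counted once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)
```

Under this encoding a genotype with k ≥ 1 ambiguous positions has 2^(k−1) distinct resolutions, e.g. `resolutions((0, 2, 1, 2))` yields two pairs, which is why the parameter k governs how hard an instance is.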

The String Barcoding (SBC) problem, introduced by Rash and Gusfield (RECOMB, 2002), consists in finding a minimum set of substrings that can be used to distinguish between all members of a set of given strings. In a computational biology context, the given strings represent a set of known viruses, while the substrings can be used as probes for an h...

In this paper we address the pure parsimony haplotyping problem: Find a minimum number of haplotypes that explains a given set of genotypes. We prove that the problem is APX-hard and present a 2^{k-1}-approximation algorithm for the case in which each genotype has at most k ambiguous positions. We further give a new integer-programming formulation tha...

In Combinatorial Optimization, one is frequently faced with linear programming (LP) problems with exponentially many constraints, which can be solved either using separation or what we call compact optimization. The former technique relies on a separation algorithm, which, given a fractional solution, tries to produce a violated valid inequality. C...

Recent years have seen an impressive increase in the use of Integer Programming models for the solution of optimization problems originating in Molecular Biology. In this survey, some of the most successful Integer Programming approaches are described, while a broad overview is given of application areas in modern Computational Molecular...

This is a survey designed for mathematical programming people who do not know molecular biology and want to learn the kinds of combinatorial optimization problems that arise. After a brief introduction to the biology, we present optimization models pertaining to sequencing, evolutionary explanations, structure prediction, and recognition. Additiona...

Protein structure comparison is a fundamental problem for structural genomics, with applications to drug design, fold prediction, protein clustering, and evolutionary studies. Despite its importance, there are very few rigorous methods and widely accepted similarity measures known for this problem. In this paper we describe the last few years of de...

We consider the following problem: Given n genotypes, does there exist a set H of haplotypes such that each genotype is generated by a pair from this set, and this set can be derived on a perfect phylogeny? Recently, Gusfield, 2002, presented a polynomial time algorithm to solve this problem that uses established results from matroid and graph theo...

The maximum contact map overlap (MAX-CMO) between a pair of protein structures can be used as a measure of protein similarity. It is a purely topological measure and does not depend on the sequence of the pairs involved in the comparison. More importantly, the MAX-CMO presents a very favorable mathematical structure which allows the formulation of i...

A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetically influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA c...

A protein is a complex molecule for which a simple linear structure, given by the sequence of its amino acids, determines a unique, often very beautiful, three-dimensional shape. Such shape (3D structure) is perhaps the most important of all of a protein's features, since it determines completely how the protein functions and interacts with other molecu...

Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation, of foremost importance for a variety of applications including medical diagnostics, phylogenies and drug design. The complete SNPs sequence information from each of the two copies of a given chromosome in a diploid genome is called a haplotype. The Hapl...

Given a set of points and distances between them, a basic problem in network design calls for selecting a graph connecting them at a minimum total routing cost, that is, the sum over all pairs of points of the length of their shortest path in the graph. In this paper, we describe some branch-and-bound algorithms for the exact solution of a relevant...

With the consensus human genome sequenced and many other sequencing projects at varying stages of completion, greater attention is being paid to the genetic differences among individuals and the abilities of those differences to predict phenotypes. A significant obstacle to such work is the difficulty and expense of determining haplotypes--sets of...

In this paper, we illustrate by means of examples a technique for formulating compact (i.e. polynomial-size) linear programming relaxations in place of exponential-size models requiring separation algorithms. In the same vein as a celebrated theorem by Grötschel, Lovász and Schrijver, we state the equivalence of compact separation and compact optim...

In this paper we describe exact and heuristic algorithms for the comparison of 3D structures via their contact maps. Given two contact maps, we consider the problem of finding the optimal sequence-order-dependent and sequence-order-independent alignments. We describe an integer programming formulation of the problems along with a Lagrangian relaxa...

We illustrate a new approach to the Contact Map Overlap problem for the comparison of protein structures. The approach is based on formulating the problem as an integer linear program and then relaxing in a Lagrangian way a suitable set of constraints. This relaxation is solved by computing a sequence of simple alignment problems, each in quadratic...

Structure comparison is a fundamental problem for structural genomics. A variety of structure comparison methods were proposed and several protein structure classification servers, e.g., SCOP, DALI, CATH, were designed based on them, and are extensively used in practice. This area of research continues to be very active, being energized bi-annually...

We consider the problem of sorting permutations by reversals (SBR), calling for the minimum number of reversals transforming a given permutation of {1,⋯,n} into the identity permutation. SBR was inspired by computational biology applications, in particular genome rearrangement. We propose an exact branch-and-bound algorithm for SBR. A lower bound i...
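The problem statement can be made concrete with a tiny exhaustive solver: a breadth-first search over permutations, where each step reverses one contiguous segment. This is only a brute-force illustration of the definition of SBR for unsigned permutations; it is not the branch-and-bound algorithm of the paper, it is practical only for very small n, and the name `reversal_distance` is hypothetical.

```python
from collections import deque


def reversal_distance(perm):
    """Minimum number of reversals turning `perm` into sorted order.

    Breadth-first search over all permutations reachable by reversing
    a contiguous segment; the state space has n! elements, so this is
    meant for tiny inputs only.
    """
    start = tuple(perm)
    target = tuple(sorted(perm))
    if start == target:
        return 0
    n = len(start)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for i in range(n - 1):
            for j in range(i + 1, n):
                # Reverse the segment p[i..j], keep the rest in place.
                q = p[:i] + p[i:j + 1][::-1] + p[j + 1:]
                if q not in dist:
                    dist[q] = dist[p] + 1
                    if q == target:
                        return dist[q]
                    queue.append(q)
    return None  # not reached: every permutation can be sorted by reversals
```

For instance, `reversal_distance([2, 4, 1, 3])` is 3, matching the breakpoint argument that each reversal removes at most two of the five breakpoints of that permutation.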

In this paper we deal with the problem of assigning a set of n jobs, with release dates and tails, to either one of two unrelated parallel machines and scheduling each machine so that the makespan is minimized. This problem will be denoted by R2|ri, qi|Cmax. The model generalizes the problem on one machine 1|ri, qi|Cmax, for which a very efficient...

Combinatorial Chemistry is a powerful new technology in drug design and molecular recognition. It is a wet-laboratory methodology aimed at "massively parallel" screening of chemical compounds for the discovery of compounds that have a certain biological activity. The power of the method comes from the interaction between experimental design and comp...

We describe GESTALT (GEnomic sequences STeiner ALignmenT), a public-domain suite of programs for generating multiple alignments of a set of biosequences. We allow the use of either of the two popular objectives, Tree Alignment or Sum-of-Pairs. The main distinguishing feature of our method is that the alignment is obtained via a tree in which the in...

Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle inequality. We show that the general case is in fac...
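The routing cost defined above can be evaluated for a given spanning tree without computing every path explicitly: an edge that splits the tree into parts of s and n − s vertices lies on exactly s(n − s) paths. The sketch below uses that observation and counts each unordered pair of vertices once, which is an assumption about the convention; the function name and the edge-list input format are hypothetical.

```python
def routing_cost(n, tree_edges):
    """Routing cost of a spanning tree on vertices 0..n-1.

    `tree_edges` is a list of (u, v, cost) triples assumed to form a
    spanning tree. Each unordered pair of vertices is counted once.
    """
    adj = [[] for _ in range(n)]
    for u, v, c in tree_edges:
        adj[u].append((v, c))
        adj[v].append((u, c))

    # Iterative DFS from vertex 0: parents appear before children.
    parent = [-1] * n
    parent_cost = [0] * n
    seen = [False] * n
    seen[0] = True
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, c in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v], parent_cost[v] = u, c
                stack.append(v)

    # Process children before parents to accumulate subtree sizes.
    size = [1] * n
    total = 0
    for u in reversed(order):
        if parent[u] != -1:
            total += parent_cost[u] * size[u] * (n - size[u])
            size[parent[u]] += size[u]
    return total
```

On the path 0-1-2 with edge costs 1 and 2, the three pairwise path costs are 1, 2 and 3, and `routing_cost(3, [(0, 1, 1), (1, 2, 2)])` returns their sum, 6.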