Conference Paper

Complete Parsimony Haplotype Inference Problem and Algorithms


Abstract

Haplotype inference by pure parsimony (HIPP) is a well-known paradigm for haplotype inference. In order to assess the biological significance of this paradigm, we generalize the problem of HIPP to the problem of finding all optimal solutions, which we call complete HIPP. We study intrinsic haplotype features, such as backbone haplotypes and fat genotypes, as well as equal columns and decomposability. We explicitly exploit these features in three computational approaches which are based on integer linear programming, depth-first branch-and-bound, and a hybrid algorithm that draws on the diverse strengths of the first two approaches. Our experimental analysis shows that our optimized algorithms are significantly superior to the baseline algorithms, often with orders of magnitude faster running time. Finally, our experiments provide some useful insights into the intrinsic features of this interesting problem.
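To make the problem statement concrete, here is a minimal sketch (not the authors' implementation; function names are ours): in the usual 0/1/2 genotype encoding, a haplotype pair explains a genotype when both haplotypes match it at homozygous sites and differ from each other at sites coded 2, and complete HIPP asks for every minimum-cardinality haplotype set explaining all genotypes. The brute-force enumeration below is exponential and only suitable for toy inputs.

```python
from itertools import combinations, product

def explains(h1, h2, g):
    """(h1, h2) explains g: equal to g at homozygous sites (0/1),
    differing from each other at heterozygous sites (coded 2)."""
    return all((a == b == s) if s in (0, 1) else (a != b)
               for a, b, s in zip(h1, h2, g))

def all_optimal_haplotype_sets(genotypes):
    """Enumerate every minimum-size haplotype set explaining all genotypes
    (the 'complete HIPP' notion). Brute force, for illustration only."""
    m = len(genotypes[0])
    candidates = list(product((0, 1), repeat=m))
    for k in range(1, 2 * len(genotypes) + 1):
        optima = [sub for sub in combinations(candidates, k)
                  if all(any(explains(h1, h2, g) for h1 in sub for h2 in sub)
                         for g in genotypes)]
        if optima:
            return k, optima  # all optima appear at the smallest feasible size
    return None

# Genotype (2, 0) forces the pair (0, 0)/(1, 0); genotype (0, 0) reuses (0, 0).
print(all_optimal_haplotype_sets([(2, 0), (0, 0)]))  # -> (2, [((0, 0), (1, 0))])
```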


... On the positive side, Sharan, Halldórsson, and Istrail [22] devised a fixed-parameter algorithm for MH, where the parameter is the number of distinct haplotypes in the solution. Moreover, algorithms based on linear programming [4], branch-and-bound algorithms [25], and a recent combination of both methods [17] are known for this problem. To increase the accuracy of the predicted haplotypes, the perfect phylogeny and the maximum parsimony assumptions have been combined, leading to the problem MPPH. ...
Conference Paper
Full-text available
Haplotyping, also known as haplotype phase prediction, is the problem of predicting likely haplotypes based on genotype data. One fast computational haplotyping method is based on an evolutionary model where a perfect phylogenetic tree is sought that explains the observed data. In their CPM'09 paper, Fellows et al. studied an extension of this approach that incorporates prior knowledge in the form of a set of candidate haplotypes from which the right haplotypes must be chosen. While this approach is attractive to increase the accuracy of haplotyping methods, it was conjectured that the resulting formal problem constrained perfect phylogeny haplotyping might be NP-complete. In the paper at hand we present a polynomial-time algorithm for it. Our algorithmic ideas also yield new fixed-parameter algorithms for related haplotyping problems based on the maximum parsimony assumption.
Article
Haplotype inference by pure parsimony (Hipp) is a well-known paradigm for haplotype inference. In order to assess the biological significance of this paradigm, we generalize the problem of Hipp to the problem of finding all optimal solutions, which we call Chipp. We study intrinsic haplotype features, such as backbone haplotypes and fat genotypes, as well as equal columns and decomposability. We explicitly exploit these features in three computational approaches that are based on integer linear programming, depth-first branch-and-bound, and Boolean satisfiability. Further, we introduce two hybrid algorithms that draw upon the diverse strengths of the approaches. Our experimental analysis shows that our optimized algorithms are significantly superior to the baseline algorithms, often with orders of magnitude faster running time. Finally, our experiments provide some useful insights into the intrinsic features of this important problem.
Conference Paper
Full-text available
Parsimony haplotyping is the problem of finding a smallest size set of haplotypes that can explain a given set of genotypes. The problem is NP-hard, and many heuristic and approximation algorithms as well as polynomial-time solvable special cases have been discovered. We propose improved fixed-parameter tractability results with respect to the parameter "size of the target haplotype set" k by presenting an O*(k^{4k})-time algorithm. This also applies to the practically important constrained case, where we can only use haplotypes from a given set. Furthermore, we show that the problem becomes polynomial-time solvable if the given set of genotypes is complete, i.e., contains all possible genotypes that can be explained by the set of haplotypes.
Conference Paper
Full-text available
It is widely anticipated that the study of variation in the human genome will provide a means of predicting risk of a variety of complex diseases. Single nucleotide polymorphisms (SNPs) are the most common form of genomic variation. Haplotypes have been suggested as one means for reducing the complexity of studying SNPs. In this paper we review some of the computational approaches that have been taken for determining haplotypes and suggest new approaches.
Article
Full-text available
In this paper we address the pure parsimony haplotyping problem: find a minimum number of haplotypes that explains a given set of genotypes. We prove that the problem is APX-hard and present a 2^{k-1}-approximation algorithm for the case in which each genotype has at most k ambiguous positions. We further give a new integer-programming formulation that has (for the first time) a polynomial number of variables and constraints. Finally, we give approximation algorithms, not based on linear programming, whose running times are almost linear in the input size.
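One concrete detail behind the "k ambiguous positions" parameter: a genotype with k heterozygous sites (coded 2) is resolved by exactly 2^(k-1) unordered haplotype pairs when k >= 1, and by a single trivial pair when k = 0. A small sketch, with a helper name of our own choosing:

```python
from itertools import product

def candidate_pairs(genotype):
    """All unordered haplotype pairs that explain a 0/1/2-coded genotype.
    With k heterozygous sites there are 2**(k-1) pairs for k >= 1."""
    het = [i for i, s in enumerate(genotype) if s == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b        # phase the heterozygous sites
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs

g = (2, 0, 2, 1)                      # k = 2 heterozygous sites
assert len(candidate_pairs(g)) == 2   # 2**(k-1) = 2
```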
Article
Full-text available
In recent years, there has been much interest in phase transitions of combinatorial problems. Phase transitions have been successfully used to analyze combinatorial optimization problems, characterize their typical-case features and locate the hardest problem instances. In this paper, we study phase transitions of the asymmetric Traveling Salesman Problem (ATSP), an NP-hard combinatorial optimization problem that has many real-world applications. Using random instances of up to 1,500 cities in which intercity distances are uniformly distributed, we empirically show that many properties of the problem, including the optimal tour cost and backbone size, experience sharp transitions as the precision of intercity distances increases across a critical value. Our experimental results on the costs of the ATSP tours and the assignment problem agree with the theoretical result that the asymptotic cost of the assignment problem is π²/6 as the number of cities goes to infinity. In addition, we show that the average computational cost of the well-known branch-and-bound subtour elimination algorithm for the problem also exhibits a thrashing behavior, transitioning from easy to difficult as the distance precision increases. These results answer positively an open question regarding the existence of phase transitions in the ATSP, and provide guidance on how difficult ATSP problem instances should be generated.
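For reference, the assignment-problem limit invoked above can be stated precisely; this is the zeta(2) result conjectured by Mézard and Parisi and later proved by Aldous:

```latex
% Expected optimal cost of a random n x n assignment problem with
% i.i.d. uniform(0,1) edge costs C_{ij}:
\lim_{n \to \infty} \mathbb{E}\Bigl[\min_{\pi \in S_n} \sum_{i=1}^{n} C_{i\pi(i)}\Bigr]
  = \zeta(2) = \frac{\pi^2}{6}
```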
Article
Full-text available
A very challenging problem in the genetics domain is to infer haplotypes from genotypes. This process is expected to identify genes affecting health, disease and response to drugs. One of the approaches to haplotype inference aims to minimise the number of different haplotypes used, and is known as haplotype inference by pure parsimony (HIPP). The HIPP problem is computationally difficult, being NP-hard. Recently, a SAT-based method (SHIPs) has been proposed to solve the HIPP problem. This method iteratively considers an increasing number of haplotypes, starting from an initial lower bound. Hence, one important aspect of SHIPs is the lower bounding procedure, which reduces the number of iterations of the basic algorithm, and also indirectly simplifies the resulting SAT model. This paper describes the use of local search to improve existing lower bounding procedures. The new lower bounding procedure is guaranteed to be as tight as the existing procedures. In practice the new procedure is in most cases considerably tighter, allowing significant improvement of performance on challenging problem instances.
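To sketch the flavor of such lower bounds (a simplified rendition, not the SHIPs procedure itself; names are ours): two genotypes that are homozygous for different alleles at some site cannot share any haplotype, so a clique of pairwise-incompatible genotypes yields a valid lower bound on the number of haplotypes required. Local search can then hunt for better cliques than a single greedy pass finds.

```python
def incompatible(g1, g2):
    """g1 and g2 can share no haplotype iff some site is homozygous 0
    in one genotype and homozygous 1 in the other."""
    return any({a, b} == {0, 1} for a, b in zip(g1, g2))

def clique_lower_bound(genotypes):
    """Greedily grow a clique of pairwise-incompatible genotypes.
    Clique members need pairwise-disjoint haplotype sets: 2 haplotypes
    each if heterozygous anywhere, otherwise 1."""
    clique = []
    for g in genotypes:
        if all(incompatible(g, other) for other in clique):
            clique.append(g)
    return sum(2 if 2 in g else 1 for g in clique)

print(clique_lower_bound([(0, 2), (1, 2), (2, 0)]))  # -> 4
```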
Article
Full-text available
Motivation: Inference of haplotypes from genotype data is crucial and challenging for many vitally important studies. The first, and most critical, step is the ascertainment of a biologically sound model to be optimized. Many models that have been proposed rely partially or entirely on reducing the number of unique haplotypes in the solution. Results: This article examines the parsimony of haplotypes using known haplotypes as well as genotypes from the HapMap project. Our study reveals that the datasets with known solutions contain relatively few unique haplotypes, though not always the fewest possible. Furthermore, we show that there are frequently very large numbers of parsimonious solutions, and the number increases exponentially with increasing cardinality. Moreover, these solutions are quite varied, and most are not consistent with the true solutions. These results quantify the limitations of the Pure Parsimony model and demonstrate the imperative need to consider additional properties for haplotype inference models. At a higher level, and with broad applicability, this article illustrates the power of combinatorial methods to tease out imperfections in a given biological model.
Article
Full-text available
The difficulty of experimental determination of haplotypes from phase-unknown genotypes has stimulated the development of nonexperimental inferral methods. One well-known approach for a group of unrelated individuals involves using the trivially deducible haplotypes (those found in individuals with zero or one heterozygous sites) and a set of rules to infer the haplotypes underlying ambiguous genotypes (those with two or more heterozygous sites). Neither the manner in which this "rule-based" approach should be implemented nor the accuracy of this approach has been adequately assessed. We implemented eight variations of this approach that differed in how a reference list of haplotypes was derived and in the rules for the analysis of ambiguous genotypes. We assessed the accuracy of these variations by comparing predicted and experimentally determined haplotypes involving nine polymorphic sites in the human apolipoprotein E (APOE) locus. The eight variations resulted in substantial differences in the average number of correctly inferred haplotype pairs. More than one set of inferred haplotype pairs was found for each of the variations we analyzed, implying that the rule-based approach is not sufficient by itself for haplotype inferral, despite its appealing simplicity. Accordingly, we explored consensus methods in which multiple inferrals for a given ambiguous genotype are combined to generate a single inferral; we show that the set of these "consensus" inferrals for all ambiguous genotypes is more accurate than the typical single set of inferrals chosen at random. We also use a consensus prediction to divide ambiguous genotypes into those whose algorithmic inferral is certain or almost certain and those whose less certain inferral makes molecular inferral preferable.
Article
Full-text available
We study the impact of backbones in optimization and approximation problems. We show that some optimization problems like graph coloring resemble decision problems, with problem hardness positively correlated with backbone size. For other optimization problems like blocks world planning and traveling salesperson problems, problem hardness is weakly and negatively correlated with backbone size, while the cost of finding optimal and approximate solutions is positively correlated with backbone size. A third class of optimization problems like number partitioning have regions of both types of behavior. We find that to observe the impact of backbone size on problem hardness, it is necessary to eliminate some symmetries, perform trivial reductions and factor out the effective problem size.
Article
Full-text available
Backbone variables are the elements that are common to all optimal solutions of a problem instance. We call variables that are absent from every optimal solution fat variables. Identification of backbone and fat variables is a valuable asset when attempting to solve complex problems. In this paper, we demonstrate a method for identifying backbones and fat. Our method is based on an intuitive concept, which we refer to as limit crossing. Limit crossing occurs when we force the lower bound of a graph problem to exceed the upper bound by applying the lower-bound function to a constrained version of the graph. A desirable feature of this procedure is that it uses approximation functions to derive exact information about optimal solutions. In this paper, we prove the validity of the limit-crossing concept as well as other related properties. Then we exploit limit crossing and devise a preprocessing tool for discovering backbone and fat arcs for various instances of the Asymmetric Traveling Salesman Problem (ATSP). Our experimental results demonstrate the power of the limit-crossing method. We compare our pre-processor with the Carpaneto, Dell'Amico, and Toth pre-processor for several different classes of ATSP instances and reveal dramatic performance improvements.
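The following is a minimal sketch of limit crossing on a toy ATSP instance, assuming only a crude row-minima lower bound (far weaker than the assignment-based bounds the paper uses): constrain an arc, re-apply the lower-bound function, and compare against a known upper bound.

```python
import itertools

INF = float("inf")

def assignment_lb(cost):
    """Crude ATSP lower bound: every city must be left once,
    so the sum of row minima is a valid bound on any tour."""
    return sum(min(row) for row in cost)

def tour_cost(cost, tour):
    return sum(cost[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def backbone_and_fat_arcs(cost, ub):
    """Limit crossing: if excluding arc (i, j) drives the lower bound past
    the upper bound, (i, j) lies in every optimal tour (backbone); if
    forcing it does, (i, j) lies in no optimal tour (fat)."""
    n = len(cost)
    backbone, fat = [], []
    for i, j in itertools.product(range(n), repeat=2):
        if i == j or cost[i][j] == INF:
            continue
        without = [row[:] for row in cost]
        without[i][j] = INF                    # forbid the arc
        if assignment_lb(without) > ub:
            backbone.append((i, j))
        forced = [row[:] for row in cost]
        forced[i] = [INF] * n                  # leave city i only via j ...
        forced[i][j] = cost[i][j]
        for k in range(n):
            if k != i:
                forced[k][j] = INF             # ... and enter city j only from i
        if assignment_lb(forced) > ub:
            fat.append((i, j))
    return backbone, fat

cost = [[INF, 1, 9],
        [9, INF, 1],
        [1, 9, INF]]
ub = tour_cost(cost, [0, 1, 2])                # tour 0 -> 1 -> 2 -> 0, cost 3
print(backbone_and_fat_arcs(cost, ub))
```

On this instance the tour 0 -> 1 -> 2 -> 0 is optimal, and the sketch certifies its three arcs as backbone and the three reverse arcs as fat, illustrating how approximate bound functions can yield exact information about optimal solutions.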
Book
Contents: 1. Foundations: 1.1 Keep the Parameter Fixed; 1.2 Preliminaries and Agreements; 1.3 Parameterized Complexity, a Brief Overview (1.3.1 Basic Theory; 1.3.2 Interpreting Fixed-Parameter Tractability); 1.4 Vertex Cover, an Illustrative Example (1.4.1 Parameterize; 1.4.2 Specialize; 1.4.3 Generalize; 1.4.4 Count or Enumerate; ...)
Article
One of the main topics of research in genomics is determining the relevance of mutations, described in haplotype data, as causes of some genetic diseases. However, due to technological limitations, genotype data rather than haplotype data is usually obtained. The haplotype inference by pure parsimony (HIPP) problem consists in inferring haplotypes from genotypes such that the number of required haplotypes is minimum. Previous approaches to the HIPP problem have focused on integer programming models and branch-and-bound algorithms. In contrast, this paper proposes the utilization of Boolean Satisfiability (SAT). The proposed solution entails a SAT model, a number of key pruning techniques, and an iterative algorithm that enumerates the possible solution values for the target optimization problem. Experimental results, obtained on a wide range of instances, demonstrate that the SAT-based approach can be several orders of magnitude faster than existing solutions. Besides being more efficient, the SAT-based approach is also the only one capable of computing the solution for a large number of instances.
Article
Boolean satisfiability (SAT) and maximum satisfiability (Max-SAT) are difficult combinatorial problems that have many important real-world applications. In this paper, we first investigate the configuration landscapes of local minima reached by the WalkSAT local search algorithm, one of the most effective algorithms for SAT. A configuration landscape of a set of local minima is their distribution in terms of quality and structural differences relative to an optimal or a reference solution. Our experimental results show that local minima from WalkSAT form large clusters, and their configuration landscapes constitute big valleys, in that high-quality local minima typically share large partial structures with optimal solutions. Inspired by this insight into WalkSAT and the previous research on phase transitions and backbones of combinatorial problems, we propose and develop a novel method that exploits the configuration landscapes of such local minima. The new method, termed backbone-guided search, can be embedded in a local search algorithm, such as WalkSAT, to improve its performance. Our experimental results show that backbone-guided local search is effective on overconstrained random Max-SAT instances. Moreover, on large problem instances from a SAT library (SATLIB), the backbone-guided WalkSAT algorithm finds satisfiable solutions more often than WalkSAT on SAT problem instances, and obtains better solutions than WalkSAT on Max-SAT problem instances, improving solution quality by 20% on average.
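A sketch of the guiding idea in our own simplified rendering (the paper also biases WalkSAT's flip selection, which is omitted here): estimate per-variable pseudo-backbone frequencies from a sample of local minima, then use them in place of uniform random choices when seeding a new run.

```python
import random
from collections import Counter

def pseudo_backbone_freq(local_minima):
    """Fraction of sampled local minima (dicts var -> bool, e.g. collected
    from earlier WalkSAT runs) in which each variable is True. Frequencies
    near 0 or 1 suggest backbone variables; a heuristic, not a guarantee."""
    counts = Counter()
    for assignment in local_minima:
        counts.update(v for v, val in assignment.items() if val)
    return {v: counts[v] / len(local_minima) for v in local_minima[0]}

def biased_initial_assignment(freq):
    """Seed a new local-search run with backbone-guided probabilities
    in place of WalkSAT's uniform random initialization."""
    return {v: random.random() < p for v, p in freq.items()}

minima = [{1: True, 2: False}, {1: True, 2: True}]
print(biased_initial_assignment(pseudo_backbone_freq(minima)))
```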
Article
Similarity and diversity among individuals of the same species are expressed in small DNA variations called Single Nucleotide Polymorphisms. Knowledge of SNP phase gives rise to the haplotyping problem, which in its parsimonious version asks to infer the minimum number of haplotypes from a given set of genotype data. Integer linear programming (ILP) is a good solution strategy for this interesting combinatorial problem, whose main limitation lies in its NP-hardness. In this paper we present a new polynomial model for the haplotype inference by parsimony problem, characterized by the original use of a maximum formulation together with a good heuristic solution. This approach proved to be a robust basic model that can be used as a starting point for more sophisticated ILP techniques such as branch-and-cut and polyhedral studies.
Article
We give machine characterizations of most parameterized complexity classes, in particular, of W[P], of the classes of the W-hierarchy, and of the A-hierarchy. For example, we characterize W[P] as the class of all parameterized problems decidable by a nondeterministic fixed-parameter tractable algorithm whose number of nondeterministic steps is bounded in terms of the parameter. The machine characterizations suggest the introduction of a hierarchy Wfunc between the W- and the A-hierarchy. We study the basic properties of this hierarchy.
Conference Paper
The next high-priority phase of human genomics will involve the development and use of a full Haplotype Map of the human genome [7]. A critical, perhaps dominating, problem in all such efforts is the inference of large-scale SNP-haplotypes from raw genotype SNP data. This is called the Haplotype Inference (HI) problem. Abstractly, input to the HI problem is a set of n strings over a ternary alphabet. A solution is a set of at most 2n strings over the binary alphabet, so that each input string can be “generated” by some pair of the binary strings in the solution. For greatest biological fidelity, a solution should be consistent with, or evaluated by, properties derived from an appropriate genetic model. A natural model, that has been suggested repeatedly is called here the Pure Parsimony model, where the goal is to find a smallest set of binary strings that can generate the n input strings. The problem of finding such a smallest set is called the Pure Parsimony Problem. Unfortunately, the Pure Parsimony problem is NP-hard, and no paper has previously shown how an optimal Pure-parsimony solution can be computed efficiently for problem instances of the size of current biological interest. In this paper, we show how to formulate the Pure-parsimony problem as an integer linear program; we explain how to improve the practicality of the integer programming formulation; and we present the results of extensive experimentation we have done to show the time and memory practicality of the method, and to compare its accuracy against solutions found by the widely used general haplotyping program PHASE. We also formulate and experiment with variations of the Pure-Parsimony criteria, that allow greater practicality. The results are that the Pure Parsimony problem can be solved efficiently in practice for a wide range of problem instances of current interest in biology. Both the time needed for a solution, and the accuracy of the solution, depend on the level of recombination in the input strings. The speed of the solution improves with increasing recombination, but the accuracy of the solution decreases with increasing recombination.
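The general shape of this exponential-size integer program can be written compactly (a sketch consistent with the description above; Gusfield's reduced formulation adds preprocessing that keeps the pair enumeration practical). For genotype g_i let P_i be its candidate haplotype pairs, x_h = 1 iff haplotype h is used, and y_{i,p} = 1 iff pair p resolves g_i:

```latex
\min \sum_{h} x_h
\quad \text{s.t.} \quad
\sum_{p \in P_i} y_{i,p} = 1 \;\; \forall i,
\qquad
y_{i,p} \le x_h \;\; \forall i,\; p \in P_i,\; h \in p,
\qquad
x_h,\, y_{i,p} \in \{0, 1\}
```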
Conference Paper
We present and investigate a new method for the Traveling Salesman Problem (TSP) that incorporates backbone information into the well known and widely applied Lin-Kernighan (LK) local search family of algorithms for the problem. We consider how heuristic backbone information can be obtained and develop methods to make biased local perturbations in the LK algorithm and its variants by exploiting heuristic backbone information to improve their efficacy. We present extensive experimental results, using large instances from the TSP Challenge suite and real-world instances in TSPLIB, showing the significant improvement that the new method can provide over the original algorithms.
Article
To solve NP-hard problems, polynomial-time preprocessing is a natural and promising approach. Preprocessing is based on data reduction techniques that take a problem's input instance and try to perform a reduction to a smaller, equivalent problem kernel. Problem kernelization is a methodology that is rooted in parameterized computational complexity. In this brief survey, we present data reduction and problem kernelization as a promising research field for algorithm and complexity theory.
Conference Paper
Mutation in DNA is the principal cause for differences among human beings, and Single Nucleotide Polymorphisms (SNPs) are the most common mutations. Hence, a fundamental task is to complete a map of haplotypes (which identify SNPs) in the human population. Associated with this effort, a key computational problem is the inference of haplotype data from genotype data, since in practice genotype data rather than haplotype data is usually obtained. Recent work has shown that a SAT-based approach is by far the most efficient solution to the problem of haplotype inference by pure parsimony (HIPP), being several orders of magnitude faster than existing integer linear programming and branch and bound solutions. This paper proposes a number of key optimizations to the original SAT-based model. The new version of the model can be orders of magnitude faster than the original SAT-based HIPP model, particularly on biological test data.
Article
Direct sequencing of genomic DNA from diploid individuals leads to ambiguities on sequencing gels whenever there is more than one mismatching site in the sequences of the two orthologous copies of a gene. While these ambiguities cannot be resolved from a single sample without resorting to other experimental methods (such as cloning in the traditional way), population samples may be useful for inferring haplotypes. For each individual in the sample that is homozygous for the amplified sequence, there are no ambiguities in the identification of the allele's sequence. The sequences of other alleles can be inferred by taking the remaining sequence after "subtracting off" the sequencing ladder of each known site. Details of the algorithm for extracting allelic sequences from such data are presented here, along with some population-genetic considerations that influence the likelihood for success of the method. The algorithm also applies to the problem of inferring haplotype frequencies of closely linked restriction-site polymorphisms.
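A minimal sketch of the "subtracting off" step in the 0/1/2 genotype encoding (illustrative code, not the original implementation): given a genotype and one haplotype known to be carried by the individual, the other haplotype is determined site by site.

```python
def subtract(genotype, known):
    """Return the haplotype complementary to `known` under `genotype`,
    or None if `known` is inconsistent with the genotype."""
    other = []
    for s, k in zip(genotype, known):
        if s == 2:            # heterozygous site: the other allele
            other.append(1 - k)
        elif s == k:          # homozygous site: both haplotypes carry s
            other.append(s)
        else:
            return None       # known haplotype contradicts the genotype
    return tuple(other)

assert subtract((2, 0, 2), (1, 0, 0)) == (0, 0, 1)
```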
Article
The next phase of human genomics will involve large-scale screens of populations for significant DNA polymorphisms, notably single nucleotide polymorphisms (SNPs). Dense human SNP maps are currently under construction. However, the utility of those maps and screens will be limited by the fact that humans are diploid and it is presently difficult to get separate data on the two "copies." Hence, genotype (blended) SNP data will be collected, and the desired haplotype (partitioned) data must then be (partially) inferred. A particular nondeterministic inference algorithm was proposed and studied by Clark (1990) and extensively used by Clark et al. (1998). In this paper, we more closely examine that inference method and the question of whether we can obtain an efficient, deterministic variant to optimize the obtained inferences. We show that the problem is NP-hard and, in fact, Max-SNP complete; that the reduction creates problem instances conforming to a severe restriction believed to hold in real data (Clark, 1990); and that even if we first use a natural exponential-time operation, the remaining optimization problem is NP-hard. However, we also develop, implement, and test an approach based on that operation and (integer) linear programming. The approach works quickly and correctly on simulated data.
Article
Haplotypes have been attracting increasing attention because of their importance in the analysis of many fine-scale molecular-genetics data. Since direct sequencing of haplotypes via experimental methods is both time-consuming and expensive, haplotype inference methods that infer haplotypes based on genotype samples are attractive alternatives. (1) We design and implement an algorithm for an important computational model of haplotype inference that has been suggested before in several places. The model finds a minimum-size set of haplotypes that explains the genotype samples. (2) Strong support for this computational model is given based on computational results on both real data and simulation data. (3) We also performed a comparative study to show the strengths and weaknesses of this computational model using our program. The software HAPAR is free for non-commercial use. Available upon request (lwang@cs.cityu.edu.hk).
Article
In 2003, Gusfield introduced the Haplotype Inference by Pure Parsimony (HIPP) problem and presented an integer program (IP) that quickly solved many simulated instances of the problem. Although it performs well on small instances, Gusfield's IP can be of exponential size in the worst case. Several authors have presented polynomial-sized IPs for the problem. In this paper, we further the work on IP approaches to HIPP. We extend the existing polynomial-sized IPs by introducing several classes of valid cuts for the IP. We also present a new polynomial-sized IP formulation that is a hybrid between two existing IP formulations and inherits many of the strengths of both. Many problems that are too complex for the exponential-sized formulations can still be solved in our new formulation in a reasonable amount of time. We provide a detailed empirical comparison of these IP formulations on both simulated and real genotype sequences. Our formulation can also be extended in a variety of ways to allow errors in the input or model the structure of the population under consideration.
Article
Statistical methods for haplotype inference from multi-site genotypes of unrelated individuals have important application in association studies and population genetics. Understanding the factors that affect the accuracy of this inference is important, but their assessment has been restricted by the limited availability of biological data with known phase. We created hybrid cell lines monosomic for human chromosome 19 and produced single-chromosome complete sequences of a 48 kb genomic region in 39 individuals of African American (AA) and European American (EA) origin. We employ these phase-known genotypes and coalescent simulations to assess the accuracy of statistical haplotype reconstruction by several algorithms. Accuracy of phase inference was considerably low in our biological data even for regions as short as 25-50 kb, suggesting that caution is needed when analyzing reconstructed haplotypes. Moreover, the reliability of estimated confidence in phase inference is not high enough to allow for a reliable incorporation of site-specific uncertainty information in subsequent analyses. We show that, in samples of certain mixed ancestry (AA and EA populations), the most accurate haplotypes are probably obtained when increasing sample size by considering the largest, pooled sample, despite the hypothetical problems associated with pooling across those heterogeneous samples. Strategies to improve confidence in reconstructed haplotypes, and realistic alternatives to the analysis of inferred haplotypes, are discussed.
Conference Paper
Many real-world problems involve constraints that cannot all be satisfied. Solving an overconstrained problem then means finding solutions minimizing the number of constraints violated, which is an optimization problem. In this research, we study the behavior of the phase transitions and backbones of constraint optimization problems. We first investigate the relationship between the phase transitions of Boolean satisfiability, or precisely 3-SAT (a well-studied NP-complete decision problem), and the phase transitions of MAX 3-SAT (an NP-hard optimization problem). To bridge the gap between the easy-hard-easy phase transitions of 3-SAT and the easy-hard transitions of MAX 3-SAT, we analyze bounded 3-SAT, in which solutions of bounded quality, e.g., solutions with at most a constant number of constraints violated, are sufficient.
The International HapMap Consortium: A Haplotype Map of the Human Genome