New Optimization Model and Algorithm for Sibling Reconstruction from Genetic Markers

INFORMS Journal on Computing (Impact Factor: 1.08). 05/2010; 22(2):180-194. DOI: 10.1287/ijoc.1090.0322
Source: DBLP


With improved tools for collecting genetic data from natural and experimental populations, new opportunities arise to study fundamental biological processes, including behavior, mating systems, adaptive trait evolution, and dispersal patterns. Full use of the newly available genetic data often depends upon reconstructing genealogical relationships of individual organisms, such as sibling reconstruction. This paper presents a new optimization framework for sibling reconstruction from single-generation microsatellite genetic data. Our framework is based on assumptions of parsimony and combinatorial concepts of Mendel's inheritance rules. We develop a novel optimization model for sibling reconstruction as a large-scale mixed-integer program (MIP), shown to be a generalization of the set covering problem. We propose a new heuristic approach to efficiently solve this large-scale optimization problem. We test our approach on real biological data as presented in other studies as well as simulated data, and compare our results with those of other state-of-the-art sibling reconstruction methods. The empirical results show that our approaches are very efficient and outperform other methods while providing the most accurate solutions for two benchmark data sets. The results suggest that our framework can be used as an analytical and computational tool for biologists to better study ecological and evolutionary processes involving knowledge of familial relationships in a wide variety of biological systems.
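The abstract relates sibling reconstruction to the set covering problem under a parsimony assumption: choose as few candidate sibling groups as possible so that every individual belongs to at least one chosen group. The following is a minimal sketch of the classic greedy heuristic for plain set cover, not the paper's MIP model or its heuristic. The function name `greedy_set_cover`, the group labels, and the toy data are hypothetical, and the candidate groups are simply assumed to already satisfy the Mendelian constraints.

```python
from typing import Dict, Hashable, Iterable, List, Set


def greedy_set_cover(
    universe: Iterable[Hashable],
    candidates: Dict[str, Set[Hashable]],
) -> List[str]:
    """Pick candidate groups greedily until every individual is covered.

    This is the textbook greedy heuristic for set cover (an assumption
    made for illustration), not the algorithm proposed in the paper.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Select the group that covers the most still-uncovered individuals.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("candidate groups do not cover every individual")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical toy instance: six individuals and four candidate sibling groups.
individuals = [1, 2, 3, 4, 5, 6]
groups = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6},
    "D": {1, 6},
}
cover = greedy_set_cover(individuals, groups)
print(cover)
```

The greedy rule gives no optimality guarantee in general, which is one reason exact formulations such as a MIP are of interest for this problem.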



Available from: Bhaskar Dasgupta
    • "The two-allele condition is tighter and more restricted than its four-allele counterpart, allowing a more accurate reconstruction. The mathematical constraints of the two-allele condition for sibling group were derived in Chaovalitwongse et al. (2010) and are used as the basis of mathematical models in this paper. From the example in Figure 2, shrimps a and b can be included in the same biologically consistent sibling group because they both satisfy constraints (i) and (ii). "
    ABSTRACT: Establishing family relationships, such as parentage and sibling relationships, is fundamental in biological research, especially in wild species, as they are often important to understanding evolutionary, ecological, and behavioral processes. Because it is commonly impossible to determine familial relationships from field observations alone, the reconstruction of sibling relationships often depends on informative genetic markers coupled with accurate sibling reconstruction algorithms. Most studies in the literature reconstruct sibling relationships using methods that are based on either statistical analyses (i.e., likelihood estimation) or combinatorial concepts (i.e., Mendelian inheritance laws) of genetic data. We present a novel computational framework that integrates both combinatorial concepts and statistical analyses into one sibling reconstruction optimization model. To solve this integrated model, we propose a column-generation approach with a branch-and-price method. Under the assumption of parsimonious reconstruction, the master problem is to find the minimum set of sibling groups to cover the tested population. Pricing subproblems, which include both statistical similarity and combinatorial concepts of genetic data, are iteratively solved to generate high-quality sibling group candidates. Tested on real biological data sets, our approach efficiently provides reconstruction results that are more accurate than those provided by other state-of-the-art reconstruction algorithms.
    Full-text · Article · Feb 2015 · INFORMS Journal on Computing
    • "KINALYZER [11] seeks a minimum set cover, by using an integer programming (IP) formulation where each set is subject to the restrictions of Mendelian compatibility for full siblings. KINALYZER yields decent results [12]; however, like the COLONY programs, does not scale well with population size. The minimum set cover objective used by KINALYZER is NP-hard [12]. "
    ABSTRACT: Kinship inference is the task of identifying genealogically related individuals. Kinship information is important for determining mating structures, notably in endangered populations. Although many solutions exist for reconstructing full sibling relationships, few exist for half-siblings. We consider the problem of determining whether a proposed half-sibling population reconstruction is valid under Mendelian inheritance assumptions. We show that this problem is NP-complete and provide a 0/1 integer program that identifies the minimum number of individuals that must be removed from a population in order for the reconstruction to become valid. We also present SibJoin, a heuristic-based clustering approach based on Mendelian genetics, which is strikingly fast. The software is available at git:// Our SibJoin algorithm is reasonably accurate and thousands of times faster than existing algorithms. The heuristic is used to infer a half-sibling structure for a population which was, until recently, too large to evaluate.
    Full-text · Article · Jul 2013 · Algorithms for Molecular Biology
    • "Our approaches enumerated all possible sibling groups by following Mendel's laws and solved a set covering problem to find a minimum set of representative sibling groups, which is based on the parsimony assumption when the actual number of sibling groups is not known a priori. Most recently, Chaovalitwongse et al. (2010) proposed an iterative heuristic approach, IMCS, to solve a new optimization model (2AOM) with the combinatorial constraints to find a partition of maximal sibling groups. "
    ABSTRACT: The capacitated clustering problem (CCP) has been studied in a wide range of applications. In this study, we investigate a challenging CCP in computational biology, namely, the sibling reconstruction problem (SRP). The goal of SRP is to establish the sibling relationships (i.e., groups of siblings) of a population from genetic data. The SRP has attracted growing interest from computational biologists over the past decade, as it is an important and necessary keystone for studies in genetic and population biology. We propose a large-scale mixed-integer formulation of the CCP for SRP that is based on both combinatorial and statistical genetic concepts. The objective is not only to find the minimum number of sibling groups, but also to maximize the degree of similarity of individuals in the same sibling groups while each sibling group is subject to genetic constraints derived from Mendel's laws. We develop a new randomized greedy optimization algorithm to effectively and efficiently solve this SRP. The algorithm consists of two key phases: construction and enhancement. In the construction phase, a greedy approach with randomized perturbation is applied to construct multiple sibling groups iteratively. In the enhancement phase, a two-stage local search with a memory function is used to improve the solution quality with respect to the similarity measure. We demonstrate the effectiveness of the proposed algorithm using real biological data sets and compare it with state-of-the-art approaches in the literature. We also test it on larger simulated data sets. The experimental results show that the proposed algorithm provides the best reconstruction solutions.
    Full-text · Article · Mar 2012 · Computers & Operations Research