Article
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

The minimum error correction (MEC) model is one of the important computational models for single individual single nucleotide polymorphism (SNP) haplotyping. Because the model is NP-hard, Qian et al. presented a particle swarm optimization (PSO) algorithm to solve it, in which the particle code length equals the number of SNP fragments. However, practical applications involve hundreds or thousands of SNP fragments, and a PSO algorithm based on such a long particle code cannot obtain a high reconstruction rate efficiently. In this paper, a practical heuristic algorithm, PGA-MEC, based on the parthenogenetic algorithm (PGA) is presented to solve the model. A short chromosome code and an effective recombination operator are designed for the algorithm. A number of experiments show that PGA-MEC achieves a higher reconstruction rate and a shorter running time than the PSO algorithm.
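As a rough illustration of the objective that the MEC model minimizes, the sketch below scores a candidate haplotype pair against a set of SNP fragments. It assumes a diploid individual and fragments encoded as strings over '0', '1' and '-' (gap); the encoding and the names `distance` and `mec_score` are illustrative assumptions, not taken from the paper, and the sketch does not reproduce PGA-MEC itself.

```python
from typing import Sequence

GAP = "-"  # assumed symbol for a SNP site that a fragment does not cover


def distance(fragment: str, haplotype: str) -> int:
    """Count covered sites where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != GAP and f != h)


def mec_score(fragments: Sequence[str], h1: str, h2: str) -> int:
    """Total corrections needed when each fragment is assigned to the
    closer of the two candidate haplotypes."""
    return sum(min(distance(f, h1), distance(f, h2)) for f in fragments)


# Toy instance: five fragments over four SNP sites.
fragments = ["01-0", "0100", "10-1", "1011", "-111"]
print(mec_score(fragments, "0100", "1011"))
```

A heuristic such as a PSO or genetic algorithm would call a function like this as its fitness evaluation, searching for the haplotype pair with the smallest score.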


... The code length of algorithms W_GA and Q_PSO equals the number of SNP fragments, which is very large in realistic applications. The reconstruction rates obtained by these algorithms are not high due to their long codes [7]. With the increase of chromosome ploidy, the number of SNP fragments increases. ...
... With regard to fragment data, as far as we know, real DNA fragment data are not available in the public domain. Therefore, in the experiments, the widely used sequence simulator CELSIM [7,9,12] was adopted to produce simulated fragments. m_1 single SNP fragments and m_2 mate-pair SNP fragments were produced. ...
... The execution time of algorithm GTIHR is less sensitive to variations in the parameter [f_min, f_max] than those of algorithms W_GA and Q_PSO. When [f_min, f_max] changes from [3,7] to [1,2], the execution times of algorithms GTIHR, W_GA and Q_PSO increase by about 0.25, 2.56 and 0.73 times respectively. Table 5 shows the comparison results with different Hamming distances d, and c=10, p_s=0.05, f_min=3, f_max=7, n=100. ...
Article
The minimum error correction model is an important combinatorial model for haplotyping a single individual. In this article, the triploid individual haplotype reconstruction problem is studied using this model. A genetic algorithm based method, GTIHR, is presented for reconstructing the triploid individual haplotype. A novel coding method and an effective hill-climbing operator are introduced for the GTIHR algorithm. The relatively short chromosome code leads to a smaller solution space, which helps speed up convergence. The hill-climbing operator ensures that algorithm GTIHR converges to a good solution quickly while preventing premature convergence. The experimental results show that algorithm GTIHR can be implemented efficiently and achieves a higher reconstruction rate than previous algorithms.
... In the case of RNA viruses such as influenza, the target nucleic acid will be viral RNA and any PCR assay will perforce be preceded by a reverse transcription (RT) step resulting in a linear DNA template. While the sensitivity of such an assay will be heavily dependent upon the efficiency of the RT step, we have shown that PrimerHunter primers are functional and specific under a wide range of template concentrations and thus are likely to be robust under a variety of experimental conditions including viral subtyping by RT-PCR in the clinic and in the field [14,88,101,103]. ...
... Given two strings f_1, f_2 of length n, we say that ... analyzed by [78,70] and several algorithms have been proposed for MEC [6,33,97,101]. If weights are available for each allele call on each fragment, the model called (WMLF) described by [102] tries to minimize the sum of weights of corrected alleles. ...
Article
The availability of large databases of genomic information has enabled research efforts focused on refining methods for diagnosis and treatment of human diseases. However, proper use of genomic databases cannot be achieved without the development of sophisticated data analysis methods, which is by itself a challenging task due to the size and heterogeneity of the data. The focus of the research proposed in this document is on developing computational methods and software tools for diagnosis and treatment of human diseases. We describe PrimerHunter, a primer design tool for rapid virus subtype identification, applied to Avian Influenza, which takes as input sets of both target and non-target sequences and selects primers that efficiently amplify any one of the targets and none of the non-targets. PrimerHunter ensures the desired amplification properties by using accurate estimates of melting temperature with mismatches, computed based on the nearest-neighbor model via an efficient fractional programming algorithm. We also present a bioinformatics pipeline for detection of immunogenic cancer mutations by high throughput mRNA sequencing. As part of this pipeline, we developed and integrated novel algorithms and strategies for mRNA reads mapping, SNV detection, genotyping and haplotyping. We show through validations on real data that our methods identify expressed mutations more accurately than existing methods and that our haplotyping algorithm is more efficient than other solutions with comparable accuracy levels. Our pipeline predicted more than a thousand candidate epitopes for six different mouse cancer tumor cell lines, which are currently used to find stable protocols for immunotherapy.
... Since the MEC problem is NP-hard, metaheuristic algorithms such as GA and PSO have been applied to solve it. In this case, the objective function has been designed based on the MEC model, and the method attempts to enhance it iteratively [6,15,29-32]. A major shortcoming of these approaches is their high time complexity in evaluating objective functions. ...
Article
Studying human genetic evolution has attracted considerable attention. Haplotype determination provides key information about human genetics and facilitates understanding probable causal relations between traits and diseases. In general, experimental methods of haplotype reconstruction are costly in terms of time and resources. State-of-the-art high-throughput sequencing enables computational methods to be leveraged for this task. However, current algorithms suffer from reduced accuracy as the error rate of their input fragments increases. In this article, we put forward FCMHap, an efficient and accurate method, which involves two steps. In the first step, it constructs a weighted fuzzy conflict graph based on the similarities of the input fragments and divides the input fragments into two clusters by partitioning the graph iteratively. Since the input fragments contain noise and gaps, in the next step it adapts the cluster centers using the fuzzy c-means (FCM) algorithm. The proposed method has been evaluated on several real datasets and compared with a selected set of current approaches. The evaluation results indicate that this method can complement those approaches.
... Since the MEC problem is NP-hard, metaheuristic algorithms such as GA and PSO have been applied to solve this problem. In this case, the objective function has been designed based on the MEC model and the method attempts to enhance it iteratively [6,15,27-30]. Existing gaps and errors in the input data encouraged some researchers to propose probabilistic models to solve this problem. ...
Preprint
Evolution of human genetics is one of the most interesting areas for researchers. Determination of haplotypes not only provides valuable information for this purpose but also plays a major role in investigating the probable relation between diseases and genomes. Determining haplotypes by experimental methods is a time-consuming and expensive task. Recent progress in high throughput sequencing allows researchers to use computational methods for this purpose. Although several algorithms have been proposed, they are less accurate when the error rate of input fragments increases. In this paper, first, a fuzzy conflict graph is constructed based on the similarities of all input fragments and next, the cluster centers are used as initial centers by the fuzzy c-means (FCM) algorithm. The proposed method has been tested on several real datasets and compared with some current methods. The comparison with the existing approaches shows that our method can play a complementary role among the others.
... This standard process needs two or more parents to produce children. The PGA is a variant of the GA [42-44] and it simulates the partheno-genetic process of primary species in nature [21]. Compared with the SGA, the PGA acts on a single chromosome [19]. ...
Article
The multidimensional knapsack problem (MKP) is a well-known, NP-hard combinatorial optimization problem. Several metaheuristics and exact algorithms have been proposed to solve the stationary MKP. This study aims to solve this difficult problem under dynamic conditions by testing a new evolutionary algorithm. In the present study, the partheno-genetic algorithm (PGA) is tested with parameters that evolve over time. The originality of the study lies in comparing performance under static and dynamic conditions. First, the effectiveness of the PGA is tested on both the stationary and the dynamic MKP. Then, the improvements obtained with different random restarting schemes are observed. The PGA results are presented through statistical and graphical analysis.
... Different strategies to remove conflicts lead to optimization objectives studied in previous works like finding the minimum number of fragments to remove (MFR), the minimum number of loci to remove (MSR) or the minimum number of allele calls to correct (MEC). Computational properties of these problems have been analyzed by [16, 15] and several algorithms have been proposed for MEC [1, 6, 23, 26]. If weights are available for each allele call on each fragment, the model called (WMLF) described by [27] tries to minimize the sum of weights of corrected alleles. ...
Conference Paper
Full human genomic sequences have been published in the last two years for a growing number of individuals. Most of them are a mixed consensus of the two real haplotypes because it is still very expensive to separate information coming from the two copies of a chromosome. However, recent improvements and new experimental approaches promise to solve these issues and provide enough information to reconstruct the sequences for the two copies of each chromosome through bioinformatics methods such as single individual haplotyping. Full haploid sequences provide a complete understanding of the structure of the human genome, allowing accurate predictions of translation in protein coding regions and increasing the power of association studies. In this paper we present a novel problem formulation for single individual haplotyping. We start by assigning a score to each pair of fragments based on their common allele calls and then we use these scores to formulate the problem as finding the cut of fragments that maximizes an objective function, similar to the well known max-cut problem. Our algorithm initially finds the best cut based on a heuristic algorithm for max-cut and then builds haplotypes consistent with that cut. We have compared both accuracy and running time of ReFHap with other heuristic methods on both simulated and real data and found that ReFHap performs significantly faster than previous methods without loss of accuracy.
Article
In this paper, a method for single individual haplotype (SIH) reconstruction using asexual reproduction optimization (ARO) is proposed. Haplotypes, as a set of genetic variations in each chromosome, contain vital information such as the relationship between the human genome and diseases. Finding haplotypes in diploid organisms is a challenging task. Experimental methods are expensive and require special equipment. In the SIH problem, we are given several fragments, each of which covers some part of the desired haplotype. The main goal is to bi-partition the fragments with minimum error correction (MEC). This problem is known to be NP-hard, and several attempts have been made to solve it using heuristic methods. The current method, AROHap, has two main phases. In the first phase, most of the fragments are clustered based on a practical distance metric. In the second phase, the ARO algorithm, a fast-converging bio-inspired method, is used to improve the initial bi-partitioning of the fragments from the previous step. AROHap is evaluated on several benchmark datasets. The experimental results are satisfactory, indicating that AROHap can be used for the SIH reconstruction problem.
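To make the link between a bi-partition of fragments and a haplotype pair concrete, the following sketch derives one consensus haplotype per cluster by per-site majority vote. This is a generic step under assumed conventions (fragments as strings over '0', '1', '-'; ties resolved to '0'), not the clustering metric or the ARO phase of AROHap; the name `consensus` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def consensus(cluster: Sequence[str], n_sites: int) -> str:
    """Majority-vote haplotype for one cluster of fragments.

    Sites covered by no fragment stay '-'; ties go to '0' (an arbitrary
    choice made for this sketch).
    """
    out = []
    for j in range(n_sites):
        calls = Counter(f[j] for f in cluster if f[j] != "-")
        if not calls:
            out.append("-")
        else:
            out.append("1" if calls["1"] > calls["0"] else "0")
    return "".join(out)


# A bi-partition of six toy fragments over four SNP sites.
cluster_a = ["01-0", "0100", "0110"]
cluster_b = ["10-1", "1011", "-111"]
h1 = consensus(cluster_a, 4)
h2 = consensus(cluster_b, 4)
print(h1, h2)
```

Under this view, searching over bi-partitions and searching over haplotype pairs are two encodings of the same optimization, which is why partition-based heuristics can be scored with the MEC objective.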
Article
The maximum fragment length (MFL) model is an important computational model for solving the founder sequence reconstruction problem. Benedettini et al. presented a meta-heuristic algorithm, BACKFORTH, based on an iterative greedy method. The BACKFORTH algorithm starts with a single initial solution and iteratively alternates between partial destruction and reconstruction in order to obtain a final solution. This kind of optimization mechanism, which is based on a single initial solution, may make the performance of the BACKFORTH algorithm sensitive to the quality of the initialization. In this paper, a practical parthenogenetic algorithm, PGMFL, which is a population-based meta-heuristic method, is proposed. The PGMFL algorithm can search multiple regions of a solution space simultaneously. A novel genetic operator is introduced based on the presented heuristic algorithm HF, which takes advantage of a look-ahead mechanism and some potential information, i.e., the proportions of 0 and 1 entries in a column of the recombinant matrix and those in the corresponding column of the founder matrix, together with other heuristic information, to compute the column values. A number of experiments show that the PGMFL algorithm obtains fewer breakpoints and a longer average fragment length than the BACKFORTH algorithm.
Article
In the context of energy saving and carbon emission reduction, the electric vehicle (EV) has been identified as a promising alternative to traditional fossil fuel-driven vehicles. Due to a different refueling manner and driving characteristic, the introduction of EVs to the current logistics system can make a significant impact on the vehicle routing and the associated operation costs. Based on the traveling salesman problem, this paper proposes a new optimal EV route model considering the fast-charging and regular-charging under the time-of-use price in the electricity market. The proposed model aims to minimize the total distribution costs of the EV route while satisfying the constraints of battery capacity, charging time and delivery/pickup demands, and the impact of vehicle loading on the unit electricity consumption per mile. To solve the proposed model, this paper then develops a learnable partheno-genetic algorithm with integration of expert knowledge about EV charging station and customer selection. A comprehensive numerical test is conducted on the 36-node and 112-node systems, and the results verify the feasibility and effectiveness of the proposed model and solution algorithm.
Article
A haplotype is a single nucleotide polymorphism (SNP) sequence and a representative genetic marker describing the diversity of biological organisms. Haplotypes have a wide range of applications, for example in pharmacology and medicine. In particular, haplotypes of Apis mellifera (the honeybee), a highly social species, benefit human health and medicine in diverse areas, including venom toxicology, infectious disease, and allergic disease. For this reason, assembling a pair of haplotypes from individual SNP fragments drives research and has generated various computational models for this problem. The minimum error correction (MEC) model is an important computational model for the individual haplotype assembly problem. However, the MEC model has been proved to be NP-hard; therefore, no efficient algorithm is available to address this problem. In this study, we propose an improved version of a branch and bound algorithm that can assemble a pair of haplotypes with an optimal solution from SNP fragments of a honeybee specimen within a practical time bound. First, we designed a local search algorithm to calculate a good initial upper bound on feasible solutions to enhance the efficiency of the branch and bound algorithm. Furthermore, to accelerate the algorithm, we made use of the recursive property of the bounding function together with a lookup table. After conducting extensive experiments on honeybee SNP data released by the Human Genome Sequencing Center, we showed that our method is highly accurate and efficient for assembling haplotypes.
Article
The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.
Article
Variation within genes has important implications for all biological traits. We identified 3899 single nucleotide polymorphisms (SNPs) that were present within 313 genes from 82 unrelated individuals of diverse ancestry, and we organized the SNPs into 4304 different haplotypes. Each gene had several variable SNPs and haplotypes that were present in all populations, as well as a number that were population-specific. Pairs of SNPs exhibited variability in the degree of linkage disequilibrium that was a function of their location within a gene, distance from each other, population distribution, and population frequency. Haplotypes generally had more information content (heterozygosity) than did individual SNPs. Our analysis of the pattern of variation strongly supports the recent expansion of the human population.
Article
With the consensus human genome sequenced and many other sequencing projects at varying stages of completion, greater attention is being paid to the genetic differences among individuals and the abilities of those differences to predict phenotypes. A significant obstacle to such work is the difficulty and expense of determining haplotypes--sets of variants genetically linked because of their proximity on the genome--for large numbers of individuals for use in association studies. This paper presents some algorithmic considerations in a new approach for haplotype determination: inferring haplotypes from localised polymorphism data gathered from short genome 'fragments.' Formalised models of the biological system under consideration are examined, given a variety of assumptions about the goal of the problem and the character of optimal solutions. Some theoretical results and algorithms for handling haplotype assembly given the different models are then sketched. The primary conclusion is that some important simplified variants of the problem yield tractable problems while more general variants tend to be intractable in the worst case.
Article
The proliferation of large-scale DNA-sequencing projects in recent years has driven a search for alternative methods to reduce time and cost. Here we describe a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel fibre-optic slide of individual wells and is able to sequence 25 million bases, at 99% or better accuracy, in one four-hour run. To achieve an approximately 100-fold increase in throughput over current Sanger sequencing technology, we have developed an emulsion method for DNA amplification and an instrument for sequencing by synthesis using a pyrosequencing protocol optimized for solid support and picolitre-scale volumes. Here we show the utility, throughput, accuracy and robustness of this system by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage at 99.96% accuracy in one run of the machine.
Article
In genetic studies of complex diseases, haplotypes provide more information than genotypes. However, haplotyping is much more difficult than genotyping using biological techniques. Therefore effective computational techniques have been in demand. The individual haplotyping problem is the computational problem of inducing a pair of haplotypes from an individual's aligned SNP fragments. Based on various optimal criteria and including different extra information, many models for the problem have been proposed. Higher accuracy of the models has been an important issue in the study of haplotype reconstruction. The current article proposes a highly accurate model for the single individual haplotyping problem based on weighted fragments and genotypes with errors. The model is proved to be NP-hard even with gapless fragments. Based on the characteristics of Single Nucleotide Polymorphism (SNP) fragments, a parameterized algorithm of time complexity $O(n k_2 2^{k_2} + m \log m + m k_1)$ is developed, where $m$ is the number of fragments, $n$ is the number of SNP sites, $k_1$ is the maximum number of SNP sites that a fragment covers (no more than $n$ and usually smaller than 10) and $k_2$ is the maximum number of the fragments covering a SNP site (usually no more than 19). Extensive experiments show that this model is more accurate in haplotype reconstruction than other models. The program of the parameterized algorithm can be obtained by sending an email to the corresponding author.
Article
Genetic algorithms (GA) using ordinal strings must use special crossover operators such as PMX, OX and CX instead of general crossover operators. To address this deficiency, this paper proposes a partheno-genetic algorithm (PGA) that uses ordinal strings and abandons crossover operators, introducing instead particular genetic operators, such as the gene exchange operator, that serve the same function as crossover operators. As a result, the genetic operation of PGA is simple, its initial population need not be diverse, and PGA does not suffer from premature convergence. Computational examples show the efficiency of PGA.
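A gene exchange operator of the kind described can be sketched as a swap of two positions within a single parent, which keeps an ordinal string a valid permutation without any repair step. The function below is a minimal sketch under that reading; the name `gene_exchange` and the choice to swap exactly one pair of genes are assumptions, not details confirmed by the abstract.

```python
import random
from typing import List, Optional


def gene_exchange(chromosome: List[int],
                  rng: Optional[random.Random] = None) -> List[int]:
    """Return a child made by swapping two randomly chosen genes of one parent.

    Only one parent is involved, so the child of a permutation is always a
    permutation. Requires at least two genes.
    """
    rng = rng or random.Random()
    child = list(chromosome)  # copy so the parent is left unchanged
    i, j = rng.sample(range(len(child)), 2)  # two distinct positions
    child[i], child[j] = child[j], child[i]
    return child


parent = [0, 1, 2, 3, 4]
child = gene_exchange(parent, random.Random(1))
print(parent, child)
```

By contrast, a naive one-point crossover between two permutations can duplicate or drop elements, which is the reason operators such as PMX, OX and CX exist.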
Article
SNP haplotyping problems have been the subject of extensive research in the last few years, and are one of the hottest areas of Computational Biology today. In this paper we report on our work of the last two years, whose preliminary results were presented at the European Symposium on Algorithms (Proceedings of the Annual European Symposium on Algorithms (ESA), Vol. 2161. Lecture Notes in Computer Science, Springer, 2001, pp. 182–193.) and Workshop on Algorithms in Bioinformatics (Proceedings of the Annual Workshop on Algorithms in Bioinformatics (WABI), Vol. 2452. Lecture Notes in Computer Science, Springer, 2002, pp. 29–43.). We address the problem of reconstructing two haplotypes for an individual from fragment assembly data. This problem will be called the Single Individual Haplotyping Problem. On the positive side, we prove that the problem can be solved effectively for gapless data, and give practical, dynamic programming algorithms for its solution. On the negative side, we show that it is unlikely that polynomial algorithms exist, even to approximate the solution arbitrarily well, when the data contain gaps. We remark that both the gapless and gapped data arise in different real-life applications.
Conference Paper
We study the single individual SNP haplotype reconstruction problem. We introduce a simple heuristic, Fast Hare, and show experimentally that it is very fast and accurate. In particular, when compared with the dynamic programming algorithm of [8], it is much faster and also more accurate. We expect Fast Hare to be very useful in practical applications. We also introduce a combinatorial problem related to the SNP haplotype reconstruction problem that we call Min Element Removal. We prove its NP-hardness in the gapless case and its O(log n)-approximability in the general case.
Conference Paper
Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. They are of fundamental importance for a variety of applications including medical diagnostics and drug design. They also provide the highest-resolution genomic fingerprint for tracking disease genes. This paper is devoted to algorithmic problems related to computational SNPs validation based on genome assemblies of diploid organisms. In diploid genomes, there are two copies of each chromosome. A description of the SNPs sequence information from one of the two chromosomes is called a SNPs haplotype. The basic problem addressed here is haplotyping, i.e., given a set of SNPs prospects inferred from the assembly alignment of a genomic region of a chromosome, find the maximally consistent pair of SNPs haplotypes by removing data “errors” related to DNA sequencing errors, repeats, and paralogous recruitment. We introduce several versions of the problem from a computational point of view. We show that the general SNPs Haplotyping Problem is NP-hard for mate-pairs assembly data, and design polynomial time algorithms for fragment assembly data. We give a network-flow based polynomial algorithm for the Minimum Fragment Removal Problem, and we show that the Minimum SNPs Removal problem amounts to finding the largest independent set in a weakly triangulated graph.
Article
Haplotype reconstruction based on aligned single nucleotide polymorphism (SNP) fragments aims to infer a pair of haplotypes from localized polymorphism data. A known computational model of this problem is minimum error correction (MEC), which has been proved to be NP-complete by Lippert et al., but there are few practical algorithms for it. In this paper, we design a heuristic algorithm based on particle swarm optimization (PSO), which was proposed by Kennedy and Eberhart, to solve the problem. Extensive computational experiments indicate that the designed PSO algorithm achieves a higher accuracy than the genetic algorithm (GA) designed by Ruisheng Wang for the MEC model in most cases.
Article
Direct sequencing of genomic DNA from diploid individuals leads to ambiguities on sequencing gels whenever there is more than one mismatching site in the sequences of the two orthologous copies of a gene. While these ambiguities cannot be resolved from a single sample without resorting to other experimental methods (such as cloning in the traditional way), population samples may be useful for inferring haplotypes. For each individual in the sample that is homozygous for the amplified sequence, there are no ambiguities in the identification of the allele's sequence. The sequences of other alleles can be inferred by taking the remaining sequence after "subtracting off" the sequencing ladder of each known site. Details of the algorithm for extracting allelic sequences from such data are presented here, along with some population-genetic considerations that influence the likelihood for success of the method. The algorithm also applies to the problem of inferring haplotype frequencies of closely linked restriction-site polymorphisms.
Article
Simulated data sets have been found to be useful in developing software systems because (1) they allow one to study the effect of a particular phenomenon in isolation, and (2) one has complete information about the true solution against which to measure the results of the software. In developing a software suite for assembling a whole human genome shotgun data set, we have developed a simulator, celsim, that permits one to describe and stochastically generate a target DNA sequence with a variety of repeat structures, to further generate polymorphic variants if desired, and to generate a shotgun data set that might be sampled from the target sequence(s). We have found the tool invaluable and quite powerful, yet the design is extremely simple, employing a special type of stochastic grammar.
Article
The next phase of human genomics will involve large-scale screens of populations for significant DNA polymorphisms, notably single nucleotide polymorphisms (SNPs). Dense human SNP maps are currently under construction. However, the utility of those maps and screens will be limited by the fact that humans are diploid and it is presently difficult to get separate data on the two "copies." Hence, genotype (blended) SNP data will be collected, and the desired haplotype (partitioned) data must then be (partially) inferred. A particular nondeterministic inference algorithm was proposed and studied by Clark (1990) and extensively used by Clark et al. (1998). In this paper, we more closely examine that inference method and the question of whether we can obtain an efficient, deterministic variant to optimize the obtained inferences. We show that the problem is NP-hard and, in fact, Max-SNP complete; that the reduction creates problem instances conforming to a severe restriction believed to hold in real data (Clark, 1990); and that even if we first use a natural exponential-time operation, the remaining optimization problem is NP-hard. However, we also develop, implement, and test an approach based on that operation and (integer) linear programming. The approach works quickly and correctly on simulated data.
Article
Motivation: Haplotype reconstruction based on aligned single nucleotide polymorphism (SNP) fragments is to infer a pair of haplotypes from localized polymorphism data gathered through short genome fragment assembly. An important computational model of this problem is the minimum error correction (MEC) model, which has been discussed in several works in the literature. The model retrieves a pair of haplotypes by correcting a minimum number of SNPs in given genome fragments coming from an individual's DNA. Results: In the first part of this paper, an exact algorithm for the MEC model is presented. Owing to the NP-hardness of the MEC model, we also design a genetic algorithm (GA). The designed GA is intended to solve large size problems and has very good performance. The strengths and weaknesses of the MEC model are shown using experimental results on real data and simulation data. In the second part of this paper, to improve the MEC model for haplotype reconstruction, a new computational model is proposed, which simultaneously employs genotype information of an individual in the process of SNP correction, and is called MEC with genotype information (MEC/GI for short). Computational results on extensive datasets show that the new model has much higher accuracy in haplotype reconstruction than the pure MEC model.
Algorithms for SNP haplotype assembly problem
  • Wang