Content uploaded by Minzhu Xie

Author content

All content in this area was uploaded by Minzhu Xie on Mar 18, 2014

Content may be subject to copyright.

The individual haplotyping problem Minimum Letter Flip (MLF) is a computational problem that, given a set of aligned DNA sequence
fragment data of an individual, induces the corresponding haplotypes by flipping a minimum number of SNPs. There has been no practical
exact algorithm to solve the problem. In DNA sequencing experiments, due to technical limits, the maximum length of a fragment
sequenced directly is about 1 kb. Consequently, with a genome-average SNP density of 1.84 SNPs per 1 kb of DNA sequence,
the maximum number k_1 of SNP sites that a fragment covers is usually small. Moreover, in order to save time and money, the
maximum number k_2 of fragments that cover a SNP site is usually no more than 19. Based on these properties of fragment data,
the current paper introduces a new parameterized algorithm with running time O(nk_2 2^{k_2} + m log m + mk_1), where m is the
number of fragments and n is the number of SNP sites. The algorithm solves the MLF problem efficiently even if m and n are large, and is more practical in real biological applications.
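
To make the MLF objective concrete, the following is a minimal brute-force sketch, not the parameterized algorithm the abstract describes. It assumes each fragment is a string over `0`, `1`, and `-` (where `-` marks an uncovered SNP site), and that a fragment's cost is the number of covered sites where it disagrees with the closer of two candidate haplotypes. The function names `flips_to` and `mlf_brute_force` are hypothetical, and the search is exponential in n, so it is only usable on toy inputs.

```python
from itertools import product


def flips_to(fragment, haplotype):
    """Count covered sites where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != "-" and f != h)


def mlf_brute_force(fragments, n):
    """Try every unordered pair of length-n haplotypes and keep the cheapest.

    Returns (min_flips, h1, h2). Exponential in n: an illustration of the
    objective only, not a practical solver.
    """
    candidates = ["".join(bits) for bits in product("01", repeat=n)]
    best = None
    for i, h1 in enumerate(candidates):
        for h2 in candidates[i:]:
            # Each fragment is charged the flips needed to match its closer haplotype.
            cost = sum(min(flips_to(f, h1), flips_to(f, h2)) for f in fragments)
            if best is None or cost < best[0]:
                best = (cost, h1, h2)
    return best


# Toy instance: the last fragment carries one sequencing error.
fragments = ["010-", "0101", "-010", "1010", "1110"]
result = mlf_brute_force(fragments, 4)
print(result)
```

On this toy instance a single flip (in the last fragment) is enough to make every fragment agree with one of the two haplotypes.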

We examine exact algorithms for the NP-complete Graph Bipartization problem that asks for a minimum set of vertices to delete from a graph to make it bipartite. Based on the “iterative compression”
method recently introduced by Reed, Smith, and Vetta, we present new algorithms and experimental results. The worst-case time
complexity is improved from O(3^k · kmn) to O(3^k · mn), where n is the number of vertices, m is the number of edges, and k is the number of vertices to delete. Our best algorithm can solve all problems from a testbed from computational biology
within minutes, whereas established methods are only able to solve about half of the problems within reasonable time.
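
The problem statement above can be illustrated with a small exhaustive search. This is a sketch of the problem definition only: it tries deletion sets in order of increasing size and checks bipartiteness by breadth-first 2-coloring, and it is not the iterative compression method the abstract refers to. The names `is_bipartite` and `min_bipartization` are hypothetical.

```python
from collections import deque
from itertools import combinations


def is_bipartite(vertices, edges):
    """2-color the subgraph induced by `vertices` using BFS."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        if u in adj and v in adj:  # ignore edges touching deleted vertices
            adj[u].append(v)
            adj[v].append(u)
    color = {}
    for start in vertices:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in color:
                    color[w] = 1 - color[u]
                    queue.append(w)
                elif color[w] == color[u]:
                    return False  # odd cycle found
    return True


def min_bipartization(vertices, edges):
    """Return a smallest vertex set whose deletion leaves a bipartite graph."""
    vertices = list(vertices)
    for k in range(len(vertices) + 1):
        for removed in combinations(vertices, k):
            kept = [v for v in vertices if v not in removed]
            if is_bipartite(kept, edges):
                return set(removed)


# Two triangles sharing vertex 2: deleting that vertex breaks both odd cycles.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
deleted = min_bipartization(range(5), edges)
print(deleted)
```

The search examines every subset of size up to k, so it scales far worse than the O(3^k · mn) bound quoted above; it is meant only to pin down what is being minimized.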

Using current technology, large consecutive stretches of DNA (such as whole chromosomes) are usually assembled from short fragments obtained by shotgun sequencing, or from fragments and mate-pairs, if a "double-barreled" shotgun strategy is employed. The positioning of the fragments (and mate-pairs, if available) in an assembled sequence can be used to evaluate the quality of the assembly and also to compare two different assemblies of the same chromosome, even if they are obtained from two different sequencing projects. This paper describes some simple and fast methods of this type that were developed to evaluate and compare different assemblies of the human genome. Additional applications are in "feature-tracking" from one version of an assembly to the next, comparisons of different chromosomes within the same genome and comparisons between similar chromosomes from different species.

This is a survey designed for mathematical programming people who do not know molecular biology and want to learn the kinds of combinatorial optimization problems that arise. After a brief introduction to the biology, we present optimization models pertaining to sequencing, evolutionary explanations, structure prediction, and recognition. Additional biology is given in the context of the problems, including some motivation for disease diagnosis and drug discovery. Open problems are cited with an extensive bibliography, and we offer a guide to getting started in this exciting frontier.

Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.

With the consensus human genome sequenced and many other sequencing projects at varying stages of completion, greater attention
is being paid to the genetic differences among individuals and abilities of those differences to predict phenotypes. A significant
obstacle to such work is the difficulty and expense of determining haplotypes - sets of variants genetically linked because
of their proximity on the genome - for large numbers of individuals for use in association studies. This paper presents some
algorithmic considerations in a new approach for haplotype determination: inferring haplotypes from localised polymorphism
data gathered from short genome 'fragments'. Formalised models of the biological system under consideration are examined,
given a variety of assumptions about the goal of the problem and the character of optimal solutions. Some theoretical results
and algorithms for handling haplotype assembly given these models are then sketched. The primary conclusion is that some important
simplified variants of the problem yield tractable problems while more general variants tend to be intractable in the worst
case.

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

We study the single individual SNP haplotype reconstruction problem. We introduce a simple heuristic, Fast Hare, and show experimentally
that it is very fast and accurate. In particular, when compared with the dynamic programming algorithm of [8] it is much faster and also
more accurate. We expect Fast Hare to be very useful in practical applications. We also introduce a combinatorial problem
related to the SNP haplotype reconstruction problem that we call Min Element Removal. We prove its NP-hardness in the gapless
case and its O(log n)-approximability in the general case.

Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. They are of fundamental importance for a variety of applications including medical diagnostics and drug design. They also provide the highest-resolution genomic fingerprint for tracking disease genes. This paper is devoted to algorithmic problems related to computational SNPs validation based on genome assemblies of diploid organisms. In diploid genomes, there are two copies of each chromosome. A description of the SNPs sequence information from one of the two chromosomes is called SNPs haplotype. The basic problem addressed here is the haplotyping, i.e., given a set of SNPs prospects inferred from the assembly alignment of a genomic region of a chromosome, find the maximally consistent pair of SNPs haplotypes by removing data “errors” related to DNA sequencing errors, repeats, and paralogous recruitment. We introduce several versions of the problem from a computational point of view. We show that the general SNPs Haplotyping Problem is NP-hard for mate-pairs assembly data, and design polynomial time algorithms for fragment assembly data. We give a network-flow based polynomial algorithm for the Minimum Fragment Removal Problem, and we show that the Minimum SNPs Removal problem amounts to finding the largest independent set in a weakly triangulated graph.
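
The notion of a "consistent pair of haplotypes" can be sketched with a fragment conflict graph. The definition used here is an assumption based on the common formulation of this problem and is not quoted from the abstract: two fragments conflict if they show different alleles at a SNP site that both cover. If the conflict graph can be 2-colored, the fragments split into two groups that each agree with a single haplotype; an odd cycle means some data must be removed or corrected. The names `in_conflict` and `two_color_fragments` are hypothetical, and this is not the network-flow algorithm mentioned above.

```python
from collections import deque


def in_conflict(f, g):
    """True if two fragments disagree at a site both cover ('-' = not covered)."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def two_color_fragments(fragments):
    """Split fragment indices into two conflict-free groups, or return None.

    None means the conflict graph contains an odd cycle, so the fragments
    cannot all be explained by one pair of haplotypes without removals.
    """
    m = len(fragments)
    adj = [
        [j for j in range(m) if j != i and in_conflict(fragments[i], fragments[j])]
        for i in range(m)
    ]
    side = [None] * m
    for start in range(m):
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in adj[i]:
                if side[j] is None:
                    side[j] = 1 - side[i]
                    queue.append(j)
                elif side[j] == side[i]:
                    return None  # odd cycle in the conflict graph
    group0 = [i for i in range(m) if side[i] == 0]
    group1 = [i for i in range(m) if side[i] == 1]
    return group0, group1


clean = ["01-", "-10", "10-", "-01"]
noisy = clean + ["11-"]  # conflicts with fragments 0 and 2, which also conflict
print(two_color_fragments(clean))
print(two_color_fragments(noisy))
```

In the noisy instance, removing the added fragment restores a 2-colorable conflict graph, which is the kind of correction that the Minimum Fragment Removal problem asks to minimize.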