[Show abstract][Hide abstract] ABSTRACT: Copy number variations (CNVs) are abundant in the human genome. They have been associated with complex traits in genome-wide association studies (GWAS) and expected to continue playing an important role in identifying the etiology of disease phenotypes. As a result of current high throughput whole-genome single-nucleotide polymorphism (SNP) arrays, we currently have datasets that simultaneously have integer copy numbers in CNV regions as well as SNP genotypes. At the same time, haplotypes that have been shown to offer advantages over genotypes in identifying disease traits even though available for SNP genotypes are largely not available for CNV/SNP data due to insufficient computational tools. We introduce a new framework for inferring haplotypes in CNV/SNP data using a sequential Monte Carlo sampling scheme 'Tree-Based Deterministic Sampling CNV' (TDSCNV). We compare our method with polyHap(v2.0), the only currently available software able to perform inference in CNV/SNP genotypes, on datasets of varying number of markers. We have found that both algorithms show similar accuracy but TDSCNV is an order of magnitude faster while scaling linearly with the number of markers and number of individuals and thus could be the method of choice for haplotype inference in such datasets. Our method is implemented in the TDSCNV package which is available for download at http://www.ee.columbia.edu/~anastas/tdscnv.
EURASIP Journal on Bioinformatics and Systems Biology 01/2014; 2014(1):7.
[Show abstract][Hide abstract] ABSTRACT: DNA pooling constitutes a cost effective alternative in genome wide association studies. In DNA pooling, equimolar amounts of DNA from different individuals are mixed into one sample and the frequency of each allele in each position is observed in a single genotype experiment. The identification of haplotype frequencies from pooled data in addition to single locus analysis is of separate interest within these studies as haplotypes could increase statistical power and provide additional insight.
We developed a method for maximum-parsimony haplotype frequency estimation from pooled DNA data based on the sparse representation of the DNA pools in a dictionary of haplotypes. Extensions to scenarios where data is noisy or even missing are also presented. The resulting method is first applied to simulated data based on the haplotypes and their associated frequencies of the AGT gene. We further evaluate our methodology on datasets consisting of SNPs from the first 7Mb of the HapMap CEU population. Noise and missing data were further introduced in the datasets in order to test the extensions of the proposed method. Both HIPPO and HAPLOPOOL were also applied to these datasets to compare performances.
We evaluate our methodology on scenarios where pooling is more efficient relative to individual genotyping; that is, in datasets that contain pools with a small number of individuals. We show that in such scenarios our methodology outperforms state-of-the-art methods such as HIPPO and HAPLOPOOL.
[Show abstract][Hide abstract] ABSTRACT: BACKGROUND: Typically, the first phase of a genome wide association study (GWAS) includes genotyping across hundreds of individuals and validation of the most significant SNPs. Allelotyping of pooled genomic DNA is a common approach to reduce the overall cost of the study. Knowledge of haplotype structure can provide additional information to single locus analyses. Several methods have been proposed for estimating haplotype frequencies in a population from pooled DNA data. RESULTS: We introduce a technique for haplotype frequency estimation in a population from pooled DNA samples focusing on datasets containing a small number of individuals per pool (2 or 3 individuals) and a large number of markers. We compare our method with the publicly available state-of-the-art algorithms HIPPO and HAPLOPOOL on datasets of varying number of pools and marker sizes. We demonstrate that our algorithm provides improvements in terms of accuracy and computational time over competing methods for large number of markers while demonstrating comparable performance for smaller marker sizes. Our method is implemented in the "Tree-Based Deterministic Sampling Pool" (TDSPool) package which is available for download at www.ee.columbia.edu/~anastas/tdspool CONCLUSIONS: Using a tree-based determinstic sampling technique we present an algorithm for haplotype frequency estimation from pooled data. Our method demonstrates superior performance in datasets with large number of markers and could be the method of choice for haplotype frequency estimation in such datasets.
[Show abstract][Hide abstract] ABSTRACT: Many large genome-wide association studies include nuclear families with more than one child (trio families), allowing for analysis of differences between siblings (sib pair analysis). Statistical power can be increased when haplotypes are used instead of genotypes. Currently, haplotype inference in families with more than one child can be performed either using the familial information or statistical information derived from the population samples but not both. Building on our recently proposed tree-based deterministic framework (TDS) for trio families, we augment its applicability to general nuclear families. We impose a minimum recombinant approach locally and independently on each multiple children family, while resorting to the population-derived information to solve the remaining ambiguities. Thus our framework incorporates all available information (familial and population) in a given study. We demonstrate that using all the constraints in our approach we can have gains in the accuracy as opposed to breaking the multiple children families to separate trios and resorting to a trio inference algorithm or phasing each family in isolation. We believe that our proposed framework could be the method of choice for haplotype inference in studies that include nuclear families with multiple children. Our software (tds2.0) is downloadable from www.ee.columbia.edu/∼anastas/tds.
Annals of Human Genetics 05/2012; 76(4):312-25. · 2.22 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: In genome-wide association studies, thousands of individuals are genotyped in hundreds of thousands of single nucleotide polymorphisms (SNPs). Statistical power can be increased when haplotypes, rather than three-valued genotypes, are used in analysis, so the problem of haplotype phase inference (phasing) is particularly relevant. Several phasing algorithms have been developed for data from unrelated individuals, based on different models, some of which have been extended to father-mother-child "trio" data.
We introduce a technique for phasing trio datasets using a tree-based deterministic sampling scheme. We have compared our method with publicly available algorithms PHASE v2.1, BEAGLE v3.0.2 and 2SNP v1.7 on datasets of varying number of markers and trios. We have found that the computational complexity of PHASE makes it prohibitive for routine use; on the other hand 2SNP, though the fastest method for small datasets, was significantly inaccurate. We have shown that our method outperforms BEAGLE in terms of speed and accuracy for small to intermediate dataset sizes in terms of number of trios for all marker sizes examined. Our method is implemented in the "Tree-Based Deterministic Sampling" (TDS) package, available for download at http://www.ee.columbia.edu/~anastas/tds
Using a Tree-Based Deterministic sampling technique, we present an intuitive and conceptually simple phasing algorithm for trio data. The trade off between speed and accuracy achieved by our algorithm makes it a strong candidate for routine use on trio datasets.