A Comparison of Phasing Algorithms for Trios and Unrelated Individuals

Harvard University, Cambridge, Massachusetts, United States
The American Journal of Human Genetics (Impact Factor: 10.93). 04/2006; 78(3):437-50. DOI: 10.1086/500808
Source: PubMed


Knowledge of haplotype phase is valuable for many analysis methods in the study of disease, population, and evolutionary genetics. Considerable research effort has been devoted to the development of statistical and computational methods that infer haplotype phase from genotype data. Although a substantial number of such methods have been developed, they have focused principally on inference from unrelated individuals, and comparisons between methods have been rather limited. Here, we describe the extension of five leading algorithms for phase inference for handling father-mother-child trios. We performed a comprehensive assessment of the methods applied to both trios and to unrelated individuals, with a focus on genomic-scale problems, using both simulated data and data from the HapMap project. The most accurate algorithm was PHASE (v2.1). For this method, the percentages of genotypes whose phase was incorrectly inferred were 0.12%, 0.05%, and 0.16% for trios from simulated data, HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trios, and HapMap Yoruban trios, respectively, and 5.2% and 5.9% for unrelated individuals in simulated data and the HapMap CEPH data, respectively. The other methods considered in this work had comparable but slightly worse error rates. The error rates for trios are similar to the levels of genotyping error and missing data expected. We thus conclude that all the methods considered will provide highly accurate estimates of haplotypes when applied to trio data sets. Running times differ substantially between methods. Although it is one of the slowest methods, PHASE (v2.1) was used to infer haplotypes for the 1 million-SNP HapMap data set. Finally, we evaluated methods of estimating the value of r² between a pair of SNPs and concluded that all methods estimated r² well when the estimated value was ≥0.8.
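
To make two ideas from the abstract concrete, the following is a minimal illustrative sketch. It is not the method of PHASE or of any other program evaluated in the paper. It assumes biallelic SNPs with genotypes coded as alternate-allele counts (0, 1, or 2) and haplotypes coded as 0/1 alleles. The function names `transmissible`, `phase_child_site`, and `r_squared` are hypothetical. The first part shows why trio data resolve phase at most sites by Mendelian transmission alone; the second computes r² between two SNPs from phased haplotypes using the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)).

```python
def transmissible(genotype):
    """Alleles (0 = ref, 1 = alt) that a parent with this alt-allele count can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def phase_child_site(father, mother, child):
    """Return the child's (paternal allele, maternal allele) at one SNP.

    Returns None when transmission alone cannot resolve phase (for example,
    when all three family members are heterozygous). Raises ValueError when
    the genotypes are not consistent with Mendelian inheritance.
    """
    options = [
        (p, m)
        for p in sorted(transmissible(father))
        for m in sorted(transmissible(mother))
        if p + m == child
    ]
    if not options:
        raise ValueError("genotypes are inconsistent with Mendelian inheritance")
    return options[0] if len(options) == 1 else None


def r_squared(haplotypes):
    """Compute r^2 between two SNPs from phased haplotypes.

    `haplotypes` is a list of (allele_at_snp_a, allele_at_snp_b) pairs,
    one pair per chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(a * b for a, b in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    print(phase_child_site(0, 1, 1))  # father can only transmit the ref allele
    print(phase_child_site(1, 1, 1))  # all heterozygous: unresolved by transmission
    print(r_squared([(1, 1), (1, 1), (0, 0), (0, 0)]))
```

Sites left unresolved by transmission (the `None` case) are where a statistical method must fall back on linkage disequilibrium information, which is consistent with the much lower error rates the abstract reports for trios than for unrelated individuals.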

Available from: Zhaohui Qin
    • "Third, we phased the data for the parents of each trio using SHAPEIT2 with the pedigree information (Delaneau et al. 2012, 2013), thus providing a total of 64 unrelated individuals. This strategy increases phasing accuracy by combining both transmission and LD information, and it mirrors the approach used to generate the high-quality HapMap CEU haplotypes (Marchini et al. 2006). To compare sequence data for three regions including rs6822844, rs3184504, and rs12913832 between CG and CapSeq, we prepared the CG data that had the same coordinates as CapSeq. "
    ABSTRACT: Genetic variation harbors signatures of natural selection driven by selective pressures that are often unknown. Estimating the ages of selection signals may allow reconstructing the history of environmental changes that shaped human phenotypes and diseases. We have developed an approximate Bayesian computation (ABC) approach to estimate allele ages under a model of selection on new mutations and under demographic models appropriate for human populations. We have applied it to two resequencing data sets: An ultra-high depth data set from a relatively small sample of unrelated individuals and a lower depth data set in a larger sample with transmission information. In addition to evaluating the accuracy of our method based on simulations, for each SNP, we assessed the consistency between the posterior probabilities estimated by the ABC approach and the ancient DNA record, finding good agreement between the two types of data and methods. Applying this ABC approach to data for eight single nucleotide polymorphisms (SNPs), we were able to rule out an onset of selection prior to the dispersal out-of-Africa for three of them and more recent than the spread of agriculture for an additional three SNPs.
    Article · Nov 2015 · Molecular Biology and Evolution
    • "The genotype data obtained from 25 trio families and 75 unrelated individuals were used to estimate the haplotype frequency by use of FBAT and PHASE programs, respectively (Rabinowitz, 2000; Marchini et al., 2006). We also used 2LD program for the estimation of linkage disequilibrium (LD) between D5S351 and D5S1414 markers in the studied population (Zhao, 2004). "
    ABSTRACT: Spinal muscular atrophy (SMA) is a degenerative neuromuscular disease associated with progressive symmetric weakness and atrophy of the limb muscles. In view of the involvement of numerous point mutations and deletions associated with the disease, the application of polymorphic markers flanking the SMA critical region could be valuable in molecular diagnosis of the disease. In the present study, D5S351 and D5S1414 polymorphic markers located at the SMA critical region in the Iranian populations were characterized. Genotyping of the markers indicated the presence of six and nine different alleles for D5S351 and D5S1414, respectively. Haplotype frequency estimation in 25 trio families and 75 unrelated individuals indicated the presence of six informative haplotypes with frequency higher than 0.05 in the studied population. Furthermore, the D′ coefficient and the χ2 value for D5S351 and D5S1414 markers revealed the presence of linkage disequilibrium between the two markers in the Iranians. These data suggest that D5S351 and D5S1414 could serve as informative markers for linkage analysis and molecular diagnosis of SMA in the Iranian population.
    Article · Nov 2015 · Meta Gene
    • "We report the public availability of high resolution HLA typing in the samples of the 1000 Genomes Project and describe the ancestry specific content of HLA allele and SNP variant haplotypes of the MHC. The data complements the resource made available by the 1000 Genomes Project and other collaborative efforts on those samples [23], [40]. The MHC region can be described as a “genome within the genome,” able to identify the ancestral history of the individual. "
    ABSTRACT: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation by sequencing at a level that should allow the genome-wide detection of most variants with frequencies as low as 1%. However, in the major histocompatibility complex (MHC), only the top 10 most frequent haplotypes are in the 1% frequency range whereas thousands of haplotypes are present at lower frequencies. Given the limitation of both the coverage and the read length of the sequences generated by the 1000 Genomes Project, the highly variable positions that define HLA alleles may be difficult to identify. We used classical Sanger sequencing techniques to type the HLA-A, HLA-B, HLA-C, HLA-DRB1 and HLA-DQB1 genes in the available 1000 Genomes samples and combined the results with the 103,310 variants in the MHC region genotyped by the 1000 Genomes Project. Using pairwise identity-by-descent distances between individuals and principal component analysis, we established the relationship between ancestry and genetic diversity in the MHC region. As expected, both the MHC variants and the HLA phenotype can identify the major ancestry lineage, informed mainly by the most frequent HLA haplotypes. To some extent, regions of the genome with similar genetic or similar recombination rate have similar properties. An MHC-centric analysis underlines departures between the ancestral background of the MHC and the genome-wide picture. Our analysis of linkage disequilibrium (LD) decay in these samples suggests that overestimation of pairwise LD occurs due to a limited sampling of the MHC diversity. This collection of HLA-specific MHC variants, available on the dbMHC portal, is a valuable resource for future analyses of the role of MHC in population and disease studies.
    Article · Jul 2014 · PLoS ONE