A Comparison of Phasing Algorithms for Trios and Unrelated Individuals

Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom.
The American Journal of Human Genetics (Impact Factor: 10.93). 04/2006; 78(3):437-50. DOI: 10.1086/500808
Source: PubMed


Knowledge of haplotype phase is valuable for many analysis methods in the study of disease, population, and evolutionary genetics. Considerable research effort has been devoted to the development of statistical and computational methods that infer haplotype phase from genotype data. Although a substantial number of such methods have been developed, they have focused principally on inference from unrelated individuals, and comparisons between methods have been rather limited. Here, we describe the extension of five leading algorithms for phase inference for handling father-mother-child trios. We performed a comprehensive assessment of the methods applied both to trios and to unrelated individuals, with a focus on genomic-scale problems, using both simulated data and data from the HapMap project. The most accurate algorithm was PHASE (v2.1). For this method, the percentages of genotypes whose phase was incorrectly inferred were 0.12%, 0.05%, and 0.16% for trios from simulated data, HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trios, and HapMap Yoruban trios, respectively, and 5.2% and 5.9% for unrelated individuals in simulated data and the HapMap CEPH data, respectively. The other methods considered in this work had comparable but slightly worse error rates. The error rates for trios are similar to the levels of genotyping error and missing data expected. We thus conclude that all the methods considered will provide highly accurate estimates of haplotypes when applied to trio data sets. Running times differ substantially between methods. Although it is one of the slowest methods, PHASE (v2.1) was used to infer haplotypes for the 1 million-SNP HapMap data set. Finally, we evaluated methods of estimating the value of r² between a pair of SNPs and concluded that all methods estimated r² well when the estimated value was ≥0.8.
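To make the r² statistic mentioned in the abstract concrete, the following is a minimal Python sketch of the standard pairwise calculation from phased haplotypes, r² = D² / (pA(1 − pA) pB(1 − pB)) with D = pAB − pA·pB. It is an illustration only, not the estimation procedure evaluated in the paper; the function name `r_squared` and the 0/1 allele coding are assumptions made for this example.

```python
from typing import Sequence, Tuple


def r_squared(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """Pairwise r^2 between two biallelic SNPs.

    Each haplotype is a pair (a, b) giving the allele (0 or 1, an
    assumed coding) carried at the first and second SNP.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")

    # Frequency of allele 1 at each SNP, and of the 1-1 haplotype.
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")

    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom
```

In this sketch, a sample in which the two SNPs always carry the same allele gives r² = 1, and a sample in which all four haplotypes are equally frequent gives r² = 0. Because the input is a set of haplotypes, errors in inferred phase carry through to the estimate.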

Available from: Zhaohui Qin
    • "The genotype data obtained from 25 trio families and 75 unrelated individuals were used to estimate the haplotype frequency by use of the FBAT and PHASE programs, respectively (Rabinowitz, 2000; Marchini et al., 2006). We also used the 2LD program for the estimation of linkage disequilibrium (LD) between the D5S351 and D5S1414 markers in the studied population (Zhao, 2004)."
    ABSTRACT: Spinal muscular atrophy (SMA) is a degenerative neuromuscular disease associated with progressive symmetric weakness and atrophy of the limb muscles. In view of the involvement of numerous point mutations and deletions associated with the disease, the application of polymorphic markers flanking the SMA critical region could be valuable in molecular diagnosis of the disease. In the present study, the D5S351 and D5S1414 polymorphic markers located at the SMA critical region were characterized in the Iranian population. Genotyping of the markers indicated the presence of six and nine different alleles for D5S351 and D5S1414, respectively. Haplotype frequency estimation in 25 trio families and 75 unrelated individuals indicated the presence of six informative haplotypes with frequency higher than 0.05 in the studied population. Furthermore, the D′ coefficient and the χ² value for the D5S351 and D5S1414 markers revealed the presence of linkage disequilibrium between the two markers in Iranians. These data suggest that D5S351 and D5S1414 could serve as informative markers for linkage analysis and molecular diagnosis of SMA in the Iranian population.
    Meta Gene 11/2015; 7. DOI:10.1016/j.mgene.2015.10.006
    • "We report the public availability of high-resolution HLA typing in the samples of the 1000 Genomes Project and describe the ancestry-specific content of HLA allele and SNP variant haplotypes of the MHC. The data complement the resource made available by the 1000 Genomes Project and other collaborative efforts on those samples [23], [40]. The MHC region can be described as a “genome within the genome,” able to identify the ancestral history of the individual."
    ABSTRACT: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation by sequencing at a level that should allow the genome-wide detection of most variants with frequencies as low as 1%. However, in the major histocompatibility complex (MHC), only the top 10 most frequent haplotypes are in the 1% frequency range whereas thousands of haplotypes are present at lower frequencies. Given the limitation of both the coverage and the read length of the sequences generated by the 1000 Genomes Project, the highly variable positions that define HLA alleles may be difficult to identify. We used classical Sanger sequencing techniques to type the HLA-A, HLA-B, HLA-C, HLA-DRB1 and HLA-DQB1 genes in the available 1000 Genomes samples and combined the results with the 103,310 variants in the MHC region genotyped by the 1000 Genomes Project. Using pairwise identity-by-descent distances between individuals and principal component analysis, we established the relationship between ancestry and genetic diversity in the MHC region. As expected, both the MHC variants and the HLA phenotype can identify the major ancestry lineage, informed mainly by the most frequent HLA haplotypes. To some extent, regions of the genome with similar genetic or similar recombination rate have similar properties. An MHC-centric analysis underlines departures between the ancestral background of the MHC and the genome-wide picture. Our analysis of linkage disequilibrium (LD) decay in these samples suggests that overestimation of pairwise LD occurs due to a limited sampling of the MHC diversity. This collection of HLA-specific MHC variants, available on the dbMHC portal, is a valuable resource for future analyses of the role of MHC in population and disease studies.
    PLoS ONE 07/2014; 9(7):e97282. DOI:10.1371/journal.pone.0097282 · 3.23 Impact Factor
    • "If a SNP marker in the PASNP dataset was missing in HapMap2, the nearest neighboring position was chosen to represent the position. The resultant haplotypes across the 22 chromosomes for these samples are compared against those from HapMap, which we considered as the benchmark, as these have been phased with PHASE [37] and incorporated pedigree information in inferring the haplotypes for CEU and YRI trios [38]. The quality of the phasing was quantified by the switch error, obtained as the ratio of the number of switches in the SHAPEIT haplotypes that were needed to recover the HapMap-phased haplotypes to the total number of heterozygote markers minus one across the genome in each individual."
    ABSTRACT: Background: The HUGO Pan-Asian SNP Consortium (PASNP) has generated a genetic resource of almost 55,000 autosomal single nucleotide polymorphisms (SNPs) across more than 1,800 individuals from 73 urban and indigenous populations in Asia. This has offered valuable insights into the correlation between the genetic ancestry of these populations and major linguistic systems and geography. Here, we attempt to understand whether adaptation to local climate, diet and environment partly explains the genetic variation present in these populations by investigating the genomic signatures of positive selection. Results: To evaluate the impact on the selection analyses of the considerably lower SNP density compared with other population genetics resources such as the International HapMap Project (HapMap) or the Singapore Genome Variation Project, we assessed the extent of haplotype phasing switch errors and the consistency of selection signals from three haplotype-based approaches (iHS, XP-EHH, haploPS) when the HapMap data are thinned to a similar density as PASNP. We subsequently applied haploPS to detect and characterize positive selection in the PASNP populations, identifying 59 genomic regions that were selected in at least one PASNP population. A cluster analysis on the basis of these 59 signals showed that indigenous populations such as the Negrito from Malaysia and Philippines, the China Hmong, and the Taiwan Ami and Atayal shared more of these signals. We also reported evidence of a positive selection signal encompassing the beta globin gene in the Taiwan Ami and Atayal that was distinct from the signal in the HapMap Africans, suggesting the possibility of convergent evolution at this locus due to malarial selection. Conclusions: We established that the lower SNP content of the PASNP data conferred weaker ability to detect signatures of positive selection, but the new approach haploPS retained modest power. Out of all the populations in PASNP, we identified only 59 signals, suggesting a strong need for high-density population-level genotyping data or sequencing data in order to achieve a comprehensive survey of positive selection in Asian populations.
    BMC Genomics 05/2014; 15(1):332. DOI:10.1186/1471-2164-15-332 · 3.99 Impact Factor
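
The switch error quoted in the BMC Genomics excerpt above (the number of switches needed to recover the reference haplotypes, divided by the number of heterozygous markers minus one) can be illustrated with a short Python sketch. This is a hedged illustration for a single individual, not the code used in that study; the function name `switch_error`, the 0/1 allele coding, and the requirement that inferred and reference genotypes agree at every heterozygous site are assumptions made for this example.

```python
from typing import Sequence, Tuple

Haplotype = Sequence[int]


def switch_error(
    inferred: Tuple[Haplotype, Haplotype],
    reference: Tuple[Haplotype, Haplotype],
) -> float:
    """Switch error rate of one individual's inferred haplotype pair.

    Alleles are assumed to be coded 0/1. Only sites that are
    heterozygous in the reference contribute to the calculation.
    """
    inf1, inf2 = inferred
    ref1, ref2 = reference
    if not (len(inf1) == len(inf2) == len(ref1) == len(ref2)):
        raise ValueError("all haplotypes must have the same length")

    # For each heterozygous site, record whether the first inferred
    # haplotype carries the allele of the first reference haplotype.
    orientations = []
    for a1, a2, b1, b2 in zip(inf1, inf2, ref1, ref2):
        if b1 == b2:
            continue  # homozygous site: phase is not defined here
        if {a1, a2} != {b1, b2}:
            raise ValueError("inferred and reference genotypes disagree")
        orientations.append(a1 == b1)

    if len(orientations) < 2:
        raise ValueError("need at least two heterozygous sites")

    # A switch is a change of orientation between consecutive
    # heterozygous sites.
    switches = sum(
        1 for prev, cur in zip(orientations, orientations[1:]) if prev != cur
    )
    return switches / (len(orientations) - 1)
```

Because only changes in orientation are counted, swapping the labels of the two inferred haplotypes leaves the result unchanged, which matches the fact that the order of a haplotype pair carries no information.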