A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals

Department of Statistics, University of Auckland, Auckland 1142, New Zealand.
The American Journal of Human Genetics (Impact Factor: 10.93). 03/2009; 84(2):210-23. DOI: 10.1016/j.ajhg.2009.01.005
Source: PubMed


We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R(2), and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package.

1 Follower
44 Reads
    • "Allelic-R2 was defined as the squared correlation between imputed and true (e.g. genotyped) alleles multiplied by 100, as in Browning & Browning (2009). The 878 oldest sires were used as the training population, and the 143 youngest bulls were used as the test population. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Genotype imputation is routinely applied in a large number of cattle breeds. Imputation has become a need due to the large number of SNP arrays with variable density (currently, from 2900 to 777 962 SNPs). Although many authors have studied the effect of different statistical methods on imputation accuracy, the impact of a (likely) change in the reference genome assembly on imputation from lower to higher density has not been determined so far. In this work, 1021 Italian Simmental SNP genotypes were remapped on the three most recent reference genome assemblies. Four imputation methods were used to assess the impact of an update in the reference genome. As expected, the four methods behaved differently, with large differences in terms of accuracy. Updating SNP coordinates on the three tested cattle reference genome assemblies determined only a slight variation on imputation results within method.
    Animal Genetics 12/2014; DOI:10.1111/age.12251 · 2.21 Impact Factor
  • Source
    • "Imputation methods are widely used for inferring unobserved genotypes in a genotypic dataset using haplotypes from a more densely genotyped reference dataset (Browning, 2008; Howie et al., 2009, 2011, 2012; Li et al., 2009). This process is particularly important when combining or performing meta-analysis on data generated using multiple different genotyping platforms. "
    [Show abstract] [Hide abstract]
    ABSTRACT: The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
    Frontiers in Genetics 10/2014; 5. DOI:10.3389/fgene.2014.00370
  • Source
    • "We compared three commonly used imputation programs, including two population-based methods, Beagle [29] and Impute2 [30], and one method that combines LD and pedigree information, FImpute [31]. Methods were evaluated for two equine chromosomes (ECA), ECA16 and ECA31. "
    [Show abstract] [Hide abstract]
    ABSTRACT: A cost-effective strategy to increase the density of available markers within a population is to sequence a small proportion of the population and impute whole-genome sequence data for the remaining population. Increased densities of typed markers are advantageous for genome-wide association studies (GWAS) and genomic predictions. We obtained genotypes for 54 602 SNPs (single nucleotide polymorphisms) in 1077 Franches-Montagnes (FM) horses and Illumina paired-end whole-genome sequencing data for 30 FM horses and 14 Warmblood horses. After variant calling, the sequence-derived SNP genotypes (~13 million SNPs) were used for genotype imputation with the software programs Beagle, Impute2 and FImpute. The mean imputation accuracy of FM horses using Impute2 was 92.0%. Imputation accuracy using Beagle and FImpute was 74.3% and 77.2%, respectively. In addition, for Impute2 we determined the imputation accuracy of all individual horses in the validation population, which ranged from 85.7% to 99.8%. The subsequent inclusion of Warmblood sequence data further increased the correlation between true and imputed genotypes for most horses, especially for horses with a high level of admixture. The final imputation accuracy of the horses ranged from 91.2% to 99.5%. Using Impute2, the imputation accuracy was higher than 91% for all horses in the validation population, which indicates that direct imputation of 50k SNP-chip data to sequence level genotypes is feasible in the FM population. The individual imputation accuracy depended mainly on the applied software and the level of admixture.
    Genetics Selection Evolution 10/2014; 46(63):1-8. DOI:10.1186/s12711-014-0063-7 · 3.82 Impact Factor
Show more


44 Reads
Available from