A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals

Department of Statistics, University of Auckland, Auckland 1142, New Zealand.
The American Journal of Human Genetics (Impact Factor: 10.93). 03/2009; 84(2):210-23. DOI: 10.1016/j.ajhg.2009.01.005
Source: PubMed


We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R(2), and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package.

    • "First, imputation reliability per SNP was obtained from the allelic R 2 generated by Beagle, which is a prediction of the squared correlation between the allele dosage (number of B 2 alleles) of the most likely imputed genotype and the allele dosage of the true genotype. The estimated B 2 -allele dosage was obtained from the imputed posterior genotype probabilities as: 0 9 P(B 1 B 1 ) + 1 9 P(B 1 B 2 ) + 2 9 P(B 2 B 2 ) (Browning & Browning 2009). Second, we were interested in imputation reliability per animal (animal-specific imputation reliability). "
    [Show abstract] [Hide abstract]
    ABSTRACT: There is an increasing interest in using whole-genome sequence data in genomic selection breeding programmes. Prediction of breeding values is expected to be more accurate when whole-genome sequence is used, because the causal mutations are assumed to be in the data. We performed genomic prediction for the number of eggs in white layers using imputed whole-genome resequence data including ~4.6 million SNPs. The prediction accuracies based on sequence data were compared with the accuracies from the 60 K SNP panel. Predictions were based on genomic best linear unbiased prediction (GBLUP) as well as a Bayesian variable selection model (BayesC). Moreover, the prediction accuracy from using different types of variants (synonymous, non-synonymous and non-coding SNPs) was evaluated. Genomic prediction using the 60 K SNP panel resulted in a prediction accuracy of 0.74 when GBLUP was applied. With sequence data, there was a small increase (~1%) in prediction accuracy over the 60 K genotypes. With both 60 K SNP panel and sequence data, GBLUP slightly outperformed BayesC in predicting the breeding values. Selection of SNPs more likely to affect the phenotype (i.e. non-synonymous SNPs) did not improve the accuracy of genomic prediction. The fact that sequence data were based on imputation from a small number of sequenced animals may have limited the potential to improve the prediction accuracy. A small reference population (n = 1004) and possible exclusion of many causal SNPs during quality control can be other possible reasons for limited benefit of sequence data. We expect, however, that the limited improvement is because the 60 K SNP panel was already sufficiently dense to accurately determine the relationships between animals in our data.
    No preview · Article · Jan 2016 · Journal of Animal Breeding and Genetics
    • "Allelic-R2 was defined as the squared correlation between imputed and true (e.g. genotyped) alleles multiplied by 100, as in Browning & Browning (2009). The 878 oldest sires were used as the training population, and the 143 youngest bulls were used as the test population. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Genotype imputation is routinely applied in a large number of cattle breeds. Imputation has become a need due to the large number of SNP arrays with variable density (currently, from 2900 to 777 962 SNPs). Although many authors have studied the effect of different statistical methods on imputation accuracy, the impact of a (likely) change in the reference genome assembly on imputation from lower to higher density has not been determined so far. In this work, 1021 Italian Simmental SNP genotypes were remapped on the three most recent reference genome assemblies. Four imputation methods were used to assess the impact of an update in the reference genome. As expected, the four methods behaved differently, with large differences in terms of accuracy. Updating SNP coordinates on the three tested cattle reference genome assemblies determined only a slight variation on imputation results within method.
    No preview · Article · Dec 2014 · Animal Genetics
  • Source
    • "Imputation methods are widely used for inferring unobserved genotypes in a genotypic dataset using haplotypes from a more densely genotyped reference dataset (Browning, 2008; Howie et al., 2009, 2011, 2012; Li et al., 2009). This process is particularly important when combining or performing meta-analysis on data generated using multiple different genotyping platforms. "
    [Show abstract] [Hide abstract]
    ABSTRACT: The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
    Full-text · Article · Oct 2014 · Frontiers in Genetics
Show more