A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals

Department of Statistics, University of Auckland, Auckland 1142, New Zealand.
The American Journal of Human Genetics 03/2009; 84(2):210-223. DOI: 10.1016/j.ajhg.2009.01.005
Source: PubMed

ABSTRACT: We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R², and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package.
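As a toy illustration of the allelic R² accuracy measure: when true genotypes are available (e.g., from masked markers), allelic R² can be computed as the squared Pearson correlation between posterior-mean allele dosages and true allele counts. This is a minimal sketch with invented helper names, not the paper's exact estimator, which works from posterior probabilities alone:

```python
# Toy sketch of allelic R^2: squared correlation between posterior-mean
# allele dosages and true allele counts at one marker.

def dosages(posteriors):
    """Posterior-mean dosage per individual.
    posteriors: list of (p0, p1, p2) genotype probabilities."""
    return [p1 + 2.0 * p2 for (_p0, p1, p2) in posteriors]

def allelic_r2(posteriors, truth):
    """Squared Pearson correlation between imputed dosages and truth."""
    d = dosages(posteriors)
    n = len(d)
    mean_d = sum(d) / n
    mean_t = sum(truth) / n
    cov = sum((di - mean_d) * (ti - mean_t) for di, ti in zip(d, truth)) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / n
    var_t = sum((ti - mean_t) ** 2 for ti in truth) / n
    return cov * cov / (var_d * var_t)

# Fully confident, correct posteriors give allelic R^2 = 1.
post = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
true = [0, 1, 2, 1]
assert abs(allelic_r2(post, true) - 1.0) < 1e-12
```

Uncertain posteriors shrink the dosages toward the mean and pull the measure below 1, which is what makes it a useful per-marker accuracy summary.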

  • Source
ABSTRACT: The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
    Frontiers in Genetics 10/2014; 5. DOI:10.3389/fgene.2014.00370
  • Source
ABSTRACT: Recently, the "Common Disease-Multiple Rare Variants" hypothesis has received much attention, especially with the current availability of next-generation sequencing. Family-based designs are well suited to the discovery of rare variants, with large and carefully selected pedigrees enriching for multiple copies of such variants. However, sequencing a large number of samples is still prohibitive. Here, we evaluate a cost-effective strategy (pseudosequencing) to detect association with rare variants in large pedigrees. This strategy consists of sequencing a small subset of subjects, genotyping the remaining sampled subjects on a set of sparse markers, and imputing the untyped markers in the remaining subjects conditional on the sequenced subjects and pedigree information. We used a recent pedigree imputation method (GIGI), which efficiently handles large pedigrees and accurately imputes rare variants. We used the burden and kernel association tests famWS and famSKAT, both of which account for family relationships; famSKAT additionally accounts for heterogeneity of allelic effects. We simulated pedigree sequence data and compared the power of the association tests for the pseudosequence data, for the subset of sequence data used for imputation, and for all subjects sequenced. Within the pseudosequence data, we also compared the power of the association tests using best-guess genotypes and allelic dosages. Our results show that the pseudosequencing strategy considerably improves the power to detect association with rare variants. They also show that the use of allelic dosages results in much higher power than the use of best-guess genotypes in these family-based data. Moreover, famSKAT shows greater power than famWS in most of the scenarios we considered.
Genetic Epidemiology 01/2014; 38(1). DOI:10.1002/gepi.21776
  • Source
ABSTRACT: Single nucleotide polymorphism (SNP) data derived from array-based technology or massively parallel sequencing are often flawed with missing data. Missing SNPs can bias the results of association analyses. To maximize information usage, imputation is often adopted to compensate for the missing data by filling in the most probable values. To better understand the available tools for this purpose, we compare the imputation performance of BEAGLE, IMPUTE, BIMBAM, SNPMStat, MACH, and PLINK on data generated by randomly masking genotype data from the International HapMap Phase III project. In addition, we propose a new algorithm called simple imputation (Simpute) that benefits from the high resolution of the SNPs on the array platform. Simpute does not require any reference data. The best feature of Simpute is its computational efficiency, with complexity of order O(mw + n), where n is the number of missing SNPs, w is the number of positions of the missing SNPs, and m is the number of people considered. Simpute is suitable for regular screening of large-scale SNP genotyping, particularly when the sample size is large and efficiency is a major concern in the analysis.
    02/2013; 2013:813912. DOI:10.1155/2013/813912
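The contrast between best-guess genotypes and allelic dosages drawn in the pseudosequencing study above can be made concrete with a small sketch (hypothetical helper names, not code from any of the studies): the best guess keeps only the most probable genotype, while the dosage is the posterior-mean allele count and retains imputation uncertainty, which is why dosage-based tests tend to have higher power.

```python
# Contrast the two genotype encodings for downstream association tests:
# best-guess (most probable genotype) vs. allelic dosage (posterior mean).

def best_guess(p):
    """Most probable genotype (0, 1, or 2 alt-allele copies) from (p0, p1, p2)."""
    return max(range(3), key=lambda g: p[g])

def dosage(p):
    """Posterior-mean alt-allele count from (p0, p1, p2)."""
    return p[1] + 2.0 * p[2]

# For an uncertain posterior, the best guess discards the uncertainty
# while the dosage preserves it as a fractional allele count.
p = (0.10, 0.55, 0.35)
assert best_guess(p) == 1
assert abs(dosage(p) - 1.25) < 1e-12
```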

