Utilizing Genotype Imputation for the Augmentation of Sequence Data
Brooke L. Fridley1*, Gregory Jenkins1, Matthew E. Deyo-Svendsen1, Scott Hebbring2, Robert Freimuth1
1Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America, 2Department of Molecular Pharmacology and Experimental
Therapeutics, Mayo Clinic, Rochester, Minnesota, United States of America
Background: In recent years, capabilities for genotyping large sets of single nucleotide polymorphisms (SNPs) have increased
considerably, with the ability to genotype over 1 million SNP markers across the genome. This advancement in technology has
led to an increase in the number of genome-wide association studies (GWAS) for various complex traits. These GWAS have
implicated over 1500 SNPs as associated with disease traits. However, the SNPs identified from these GWAS
are not necessarily the functional variants. Therefore, the next phase in GWAS will involve the refining of these putative loci.
Methodology: A next step for GWAS would be to catalog all variants, especially rarer variants, within the detected loci,
followed by association analysis of the detected variants with the disease trait. However, sequencing a locus in a large
number of subjects is still relatively expensive. A more cost-effective approach would be to sequence a portion of the
individuals, followed by the application of genotype imputation methods to impute markers in the remaining individuals.
A potentially attractive alternative would be to impute based on the 1000 Genomes Project; however, this has the
drawback of using a reference population that does not necessarily match the disease status and LD pattern of the study
population. We explored a variety of approaches for carrying out the imputation with a reference panel consisting of
sequence data for a fraction of the study participants, using data from both a candidate gene sequencing study and the
1000 Genomes Project.
Conclusions: Imputation of genetic variation based on a proportion of sequenced samples is feasible. Our results suggest
the following sequencing study design guidelines, which take advantage of recent advances in genotype imputation
methodology: select the largest and most diverse reference panel for sequencing, and genotype as many “anchor” markers
as possible.
Citation: Fridley BL, Jenkins G, Deyo-Svendsen ME, Hebbring S, Freimuth R (2010) Utilizing Genotype Imputation for the Augmentation of Sequence Data. PLoS
ONE 5(6): e11018. doi:10.1371/journal.pone.0011018
Editor: Manfred Kayser, Erasmus University Medical Center, Netherlands
Received November 17, 2009; Accepted May 18, 2010; Published June 8, 2010
Copyright: © 2010 Fridley et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The research was supported by the National Institutes of Health U01 GM61388, R01 GM28157 and Minnesota Partnership for Biotechnology and
Medical Genomics grant H9046000431. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the
manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: firstname.lastname@example.org
Introduction
In the last five years, the capabilities and technology for
genotyping large sets of single nucleotide polymorphisms (SNPs)
have advanced significantly. Current genome-wide SNP arrays
can genotype over one million SNP markers across
the genome. This advancement in technology has led to an
increased number of completed and ongoing genome-wide
association studies (GWAS) for various complex disease and
drug-related phenotypes. These GWAS have resulted in more
than 350 publications and over 1500 SNPs implicated in
association with multiple (>80) disease phenotypes or traits.
However, the SNPs identified are not necessarily the functional
variants, and many GWAS are moving into the next phase
of disease mapping, which involves the validation, augmentation and
refining of these putative regions or loci. The task of
determining the “causative” variant(s) is difficult since 43% of
associated SNPs are located in intergenic regions, and 45% are
located within intronic regions of known genes.
Indirect association, as a result of linkage disequilibrium (LD), is
a key factor in the success of genetic association studies. As a result
of LD, a disease-susceptibility SNP need not be genotyped, as long
as it is “tagged” by a SNP or set of SNPs that are genotyped (i.e.,
SNPs in LD with the disease-susceptibility SNP are genotyped).
Recently this concept has been further exploited by the
introduction of methods to impute genotypes at untyped markers,
based on genotypes at typed markers and information about LD
within the region [3,4,5,6,7,8,9,10,11,12]. These methods are
particularly useful in the context of failed genotyping and
combining data across multiple platforms and recently have been
extended to untyped markers using a reference data set [8,10,11].
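The notion of a genotyped SNP “tagging” an untyped one can be made concrete with the standard pairwise LD measure r². The following is a minimal sketch, assuming phased haplotypes coded 0/1 at two biallelic SNPs; the function name and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def ld_r2(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """Return r^2 between two biallelic SNPs from phased haplotypes.

    Each haplotype is a pair (allele at SNP A, allele at SNP B), coded 0/1.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n  # frequency of allele 1 at SNP A
    p_b = sum(h[1] for h in haplotypes) / n  # frequency of allele 1 at SNP B
    # frequency of haplotypes carrying allele 1 at both SNPs
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


# Hypothetical toy data: SNP B always travels with SNP A (a perfect tag) ...
perfect_tag = [(1, 1)] * 3 + [(0, 0)] * 7
# ... versus two SNPs whose alleles are independent of each other.
no_tag = [(1, 1), (1, 0), (0, 1), (0, 0)]

r2_perfect = ld_r2(perfect_tag)
r2_none = ld_r2(no_tag)
```

An r² near 1 means the typed SNP carries almost all of the information about the untyped SNP, which is the situation in which indirect association and imputation work well; an r² near 0 means the typed SNP is uninformative about it.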
One approach for following up replicated findings from a
GWAS would be to determine all genetic variation within the
locus, especially rarer variants not currently included on GWAS
SNP arrays, as they may play an important role in the etiology of
the disease. This could be accomplished using data from the 1000
Genomes Project. However, one limitation of using the 1000
Genomes Project for imputation of markers in a locus of interest is
that the possible “deleterious” or “protective” alleles may not be
represented in this relatively “healthy” cohort. An alternative
approach would be to catalog all variants by sequencing the locus
in the study subjects, followed by association analysis of each
variant in the locus. However, sequencing is still relatively
expensive, and it may be cost-prohibitive to sequence a region
in a large set of individuals. A more cost-effective approach would
be to sequence a portion of the individuals, possibly selected based
on the distribution of the phenotype and/or haplotypes, and then
employ genotype imputation methods [15,16,17,18] to impute
the sequenced markers in the remaining individuals. This
approach could also be augmented with the additional inclusion
of data from the 1000 Genomes Project.
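The design described above (sequence a subset, then impute the sequenced markers into everyone else) can be illustrated with a toy example. The sketch below is a deliberately simplified similarity-weighted scheme and is not the Markov model implemented in MACH or any of the cited methods; the function name, the `mismatch_penalty` parameter and the toy panel are all hypothetical.

```python
from typing import List, Sequence


def impute_dosage(
    target_anchor: Sequence[int],
    ref_genotypes: Sequence[Sequence[int]],
    anchor_idx: Sequence[int],
    mismatch_penalty: float = 0.1,
) -> List[float]:
    """Expected allele dosage at every marker for one unsequenced subject.

    target_anchor    : 0/1/2 genotypes of the subject at the anchor markers only
    ref_genotypes    : 0/1/2 genotypes at all markers for the sequenced subjects
    anchor_idx       : column indices of the anchor markers in ref_genotypes
    mismatch_penalty : factor applied to a reference subject's weight for each
                       allele of disagreement at an anchor (assumed value)
    """
    weights = []
    for ref in ref_genotypes:
        # Count allele-level disagreements at the markers typed in everyone.
        mismatches = sum(abs(ref[j] - g) for j, g in zip(anchor_idx, target_anchor))
        weights.append(mismatch_penalty ** mismatches)
    total = sum(weights)
    if total == 0:
        raise ValueError("no reference subject received a non-zero weight")
    n_markers = len(ref_genotypes[0])
    # Weighted average of reference genotypes gives a dosage between 0 and 2.
    return [
        sum(w * ref[m] for w, ref in zip(weights, ref_genotypes)) / total
        for m in range(n_markers)
    ]


# Hypothetical toy panel: 3 sequenced subjects, 3 markers; markers 0 and 2 are anchors.
reference = [[0, 0, 0], [2, 1, 2], [1, 1, 1]]
dosage = impute_dosage(target_anchor=[2, 2], ref_genotypes=reference, anchor_idx=[0, 2])
```

Here the unsequenced subject matches the second reference subject at both anchors, so the imputed dosage at the untyped middle marker is pulled close to that subject's genotype. The sketch also makes the reference-panel concern visible: an allele that is absent from the sequenced subset can never receive a non-zero dosage.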
In this manuscript we explore the use of the recently developed
genotype imputation method implemented in MACH for
sequencing studies with the goal of localizing possible functional
variants through statistical analysis. In doing so, we explore a
variety of approaches for carrying out the imputation of untyped
markers using a reference panel consisting of sequencing data for a
fraction of the study participants. The various approaches are
implemented using data from a candidate gene sequencing study
conducted at the Mayo Clinic and data from the 1000 Genomes
Project (http://www.1000genomes.org).
Materials and Methods
Mayo Sequencing Study: GENE1
To explore various approaches for imputation of untyped
markers using a reference panel determined from sequencing data,
we utilized a recently completed sequencing study for a gene
which we will denote as GENE1 (unpublished data). Little is known
in regard to common genetic variations within GENE1, and even
Table 1. Summary of sequence data for GENE1 for variants with MAF>1% or in HapMap.
                     African American    White non-Hispanic American    Han Chinese American
Marker   Position    ObsHET    MAF       ObsHET    MAF                  ObsHET    MAF
25*      4467        0.052     0.026     0         0                    0.146     0.083
29*      5031        0         0         0.052     0.026                0         0
37       5974        0.021     0.011     0.042     0.021                0         0
60       8230        0.021     0.01      0         0                    0         0
*SNP marker in HapMap; used as typed genotypes in all samples (i.e., markers on a GWAS SNP array).
MAF=minor allele frequency based on imputed “dosage” or expected genotype, position=physical base-pair location of the SNP based on build 36,
ObsHET=observed heterozygote rate.
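The two summary statistics defined in the footnote can be computed as follows. This is a minimal sketch under the assumption that dosages are expected allele counts on a 0 to 2 scale and that genotypes are coded 0/1/2; the function names and the toy values are hypothetical.

```python
from typing import Sequence


def maf_from_dosage(dosages: Sequence[float]) -> float:
    """Minor allele frequency from per-subject dosages (expected allele counts, 0 to 2)."""
    freq = sum(dosages) / (2 * len(dosages))  # each subject carries two alleles
    return min(freq, 1 - freq)  # report the rarer allele's frequency


def observed_het_rate(genotypes: Sequence[int]) -> float:
    """Fraction of subjects whose genotype (coded 0/1/2) is heterozygous."""
    return sum(1 for g in genotypes if g == 1) / len(genotypes)


# Hypothetical toy values for four subjects at one marker.
toy_maf = maf_from_dosage([0, 0, 1, 0.2])
toy_het = observed_het_rate([0, 1, 1, 2])
```

For a rare variant carried only by heterozygotes, the observed heterozygote rate is about twice the MAF, which is a useful sanity check when reading per-population summaries of this kind.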
Sequence Data and Imputation
phase haplotypes to account for the uncertainty in haplotype
(3) Genotype as many “anchor” markers as possible, because the
number of markers genotyped on all subjects affects imputation
accuracy. Additional genotyping of a few common
SNP markers not already genotyped on all subjects, using a
cost-effective platform such as TaqMan, may therefore be needed if the
GWAS SNP array does not provide adequate coverage of the
locus to be sequenced.
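One simple way to judge whether an array provides adequate anchor coverage of a locus is to count the anchors that fall inside it and find the largest stretch with no anchor. The sketch below is illustrative only: the function name, the positions and the locus boundaries are hypothetical, and what counts as an acceptable gap is a study-specific choice that the text does not specify.

```python
from typing import Sequence, Tuple


def anchor_coverage(
    anchor_positions: Sequence[int], locus_start: int, locus_end: int
) -> Tuple[int, int]:
    """Return (number of anchors inside the locus, largest unanchored gap in bp)."""
    inside = sorted(p for p in anchor_positions if locus_start <= p <= locus_end)
    # Include the locus boundaries so gaps at either end are measured too.
    edges = [locus_start] + inside + [locus_end]
    largest_gap = max(b - a for a, b in zip(edges, edges[1:]))
    return len(inside), largest_gap


# Hypothetical example: three array markers, two of which fall in a 5 kb locus.
n_anchors, largest_gap = anchor_coverage([1000, 1500, 9000], locus_start=0, locus_end=5000)
```

A large gap relative to the extent of LD in the region would suggest adding a few common SNPs there before relying on imputation.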
Acknowledgments
We would like to thank Linda Pelleymounter and Irene Moon for their
contribution to the sequencing study of GENE1 at the Mayo Clinic.
Author Contributions
Conceived and designed the experiments: BLF GDJ MDS RF. Analyzed
the data: BLF GDJ MDS SH. Contributed reagents/materials/analysis
tools: BLF GDJ MDS. Wrote the paper: BLF SH.
References
1. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009)
Potential etiologic and functional implications of genome-wide association loci
for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362–9367.
2. Ioannidis JP, Thomas G, Daly MJ (2009) Validating, augmenting and refining
genome-wide association signals. Nat Rev Genet 10: 318–329.
3. Sun YV, Kardia SL (2008) Imputing missing genotypic data of single-nucleotide
polymorphisms using neural networks. Eur J Hum Genet 16: 487–495.
4. Foulkes AS, Yucel R, Reilly MP (2007) Mixed modeling and multiple
imputation for unobservable genotype clusters. Stat Med.
5. Servin B, Stephens M (2007) Imputation-based analysis of association studies:
candidate regions and quantitative traits. PLoS Genet 3: e114.
6. Roberts A, McMillan L, Wang W, Parker J, Rusyn I, et al. (2007) Inferring
missing genotypes in large SNP panels using fast nearest-neighbor searches over
sliding windows. Bioinformatics 23: i401–407.
7. Dai JY, Ruczinski I, LeBlanc M, Kooperberg C (2006) Imputation methods to
improve inference in SNP association studies. Genet Epidemiol 30: 690–702.
8. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint
method for genome-wide association studies by imputation of genotypes. Nat
Genet 39: 906–913.
9. Yu Z, Schaid DJ (2007) Methods to impute missing genotypes for population
data. Hum Genet 122: 495–504.
10. Nicolae DL (2006) Testing untyped alleles (TUNA)-applications to genome-wide
association studies. Genet Epidemiol 30: 718–727.
11. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2008) Markov Model for Rapid
Haplotyping and Genotype Imputation in Genome Wide Studies. Ann Arbor:
University of Michigan School of Public Health.
12. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
Series B (Methodological) 39: 1–38.
13. Gorlov IP, Gorlova OY, Sunyaev SR, Spitz MR, Amos CI (2008) Shifting
paradigm of association studies: value of rare single-nucleotide polymorphisms.
Am J Hum Genet 82: 100–112.
14. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26:
15. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint
method for genome-wide association studies by imputation of genotypes.
Nat Genet 39: 906–913.
16. Li Y, Willer CJ, Ding J, Scheet P, Abecasis G (2008) Markov Model for Rapid
Haplotyping and Genotype Imputation in Genome Wide Studies. Ann Arbor:
University of Michigan School of Public Health.
17. Browning BL, Browning SR (2009) A unified approach to genotype imputation
and haplotype-phase inference for large data sets of trios and unrelated
individuals. Am J Hum Genet 84: 210–223.
18. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype
imputation method for the next generation of genome-wide association studies.
PLoS Genet 5: e1000529.
19. Kuehn BM (2008) 1000 Genomes Project promises closer look at variation in
human genome. JAMA 300: 2715.
20. (2009) 1000 Genomes Project to Sequence Nearly 1,000 More Samples by Early
2010; New Samples Collected. GenomeWeb: In Sequence.
21. Biernacka JM, Tang R, Li J, McDonnell SK, Rabe KG, et al. (2008) Assessment
of Genotype Imputation Methods. Rochester, MN, USA: Department of Health
Sciences Research, Mayo Clinic.
22. Nothnagel M, Ellinghaus D, Schreiber S, Krawczak M, Franke A (2008) A
comprehensive evaluation of SNP genotype imputation. Hum Genet.
23. Pei YF, Li J, Zhang L, Papasian CJ, Deng HW (2008) Analyses and comparison
of accuracy of different genotype imputation methods. PLoS ONE 3: e3551.
24. Scheet P, Stephens M (2006) A fast and flexible statistical model for large-
scale population genotype data: applications to inferring missing genotypes
and haplotypic phase. Am J Hum Genet 78: 629–644.
25. Barrett JC, Fry B, Maller J, Daly MJ (2005) Haploview: analysis and
visualization of LD and haplotype maps. Bioinformatics 21: 263–265.
26. International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, et al.
(2007) A second generation human haplotype map of over 3.1 million SNPs.
Nature 449: 851–861.
27. Huang L, Li Y, Singleton AB, Hardy JA, Abecasis G, et al. (2009) Genotype-
imputation accuracy across worldwide human populations. Am J Hum Genet