Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India.

Institute for Genetic Medicine, University of Southern California, 2250 Alcazar St., Los Angeles, California 90033, USA.
Annals of Human Genetics (Impact Factor: 2.22). 08/2008; 72(Pt 4):535-46. DOI:10.1111/j.1469-1809.2008.00457.x
Source: PubMed

ABSTRACT: When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases from other populations for study design and analysis, for example in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.
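The idea of a database mixture can be illustrated with a minimal sketch. It is not the procedure used in the paper: it assumes, purely for illustration, that similarity is measured by the squared difference in allele frequencies and that weights are chosen by a coarse grid search. The function name `mixture_weights` and all frequencies below are hypothetical.

```python
from itertools import product


def mixture_weights(ref_freqs, target_freqs, steps=20):
    """Grid-search non-negative weights (summing to 1) over reference panels.

    ref_freqs:    one row per SNP, one column per reference panel
                  (e.g. three HapMap-like panels); values are allele frequencies.
    target_freqs: one allele frequency per SNP in the population of interest.
    Returns (weights, squared_error). Illustrative assumption only: the
    least-squares criterion is not taken from the paper.
    """
    n_panels = len(ref_freqs[0])
    best_w, best_err = None, float("inf")
    # Enumerate every weight vector on a grid with spacing 1/steps.
    for combo in product(range(steps + 1), repeat=n_panels - 1):
        last = steps - sum(combo)
        if last < 0:
            continue  # weights would exceed 1 in total
        w = [c / steps for c in combo] + [last / steps]
        err = 0.0
        for row, t in zip(ref_freqs, target_freqs):
            mix = sum(wi * fi for wi, fi in zip(w, row))
            err += (mix - t) ** 2
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Made-up allele frequencies for four SNPs in three reference panels.
ref = [
    [0.10, 0.60, 0.30],
    [0.80, 0.20, 0.50],
    [0.40, 0.40, 0.90],
    [0.05, 0.70, 0.20],
]
# Made-up target built as a known mixture, so the search has a clear answer.
true_w = [0.1, 0.6, 0.3]
target = [sum(w * f for w, f in zip(true_w, row)) for row in ref]

weights, err = mixture_weights(ref, target)
print(weights, err)
```

In practice the weights would be estimated from observed genotype data rather than from a constructed target, and a proper constrained optimizer would replace the grid search.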

  • ABSTRACT: Genotype imputation, used in genome-wide association studies to expand coverage of single nucleotide polymorphisms (SNPs), has performed poorly in African Americans compared to less admixed populations. Overall, imputation has typically relied on HapMap reference haplotype panels from Africans (YRI), European Americans (CEU), and Asians (CHB/JPT). The 1000 Genomes project offers a wider range of reference populations, such as African Americans (ASW), but their imputation performance has had limited evaluation. Using 595 African Americans genotyped on Illumina's HumanHap550v3 BeadChip, we compared imputation results from four software programs (IMPUTE2, BEAGLE, MaCH, and MaCH-Admix) and three reference panels consisting of different combinations of 1000 Genomes populations (February 2012 release): (1) 3 specifically selected populations (YRI, CEU, and ASW); (2) 8 populations of diverse African (AFR) or European (EUR) descent; and (3) all 14 available populations (ALL). Based on chromosome 22, we calculated three performance metrics: (1) concordance (percentage of masked genotyped SNPs with imputed and true genotype agreement); (2) imputation quality score (IQS; concordance adjusted for chance agreement, which is particularly informative for low minor allele frequency [MAF] SNPs); and (3) average r2hat (estimated correlation between the imputed and true genotypes, for all imputed SNPs). Across the reference panels, IMPUTE2 and MaCH had the highest concordance (91%-93%), but IMPUTE2 had the highest IQS (81%-83%) and average r2hat (0.68 using YRI+ASW+CEU, 0.62 using AFR+EUR, and 0.55 using ALL). Imputation quality for most programs was reduced by the addition of more distantly related reference populations, due entirely to the introduction of low frequency SNPs (MAF≤2%) that are monomorphic in the more closely related panels. While imputation was optimized by using IMPUTE2 with reference to the ALL panel (average r2hat = 0.86 for SNPs with MAF>2%), use of the ALL panel for African American studies requires careful interpretation of the population specificity and imputation quality of low frequency SNPs.
    PLoS ONE 01/2012; 7(11):e50610. · 3.73 Impact Factor
  • ABSTRACT: ALFRED is a free, web-accessible, curated compilation of allele frequency data on DNA sequence polymorphisms in anthropologically defined human populations. Currently, ALFRED has allele frequency tables on over 663,400 polymorphic sites; 170 of them have frequency tables for more than 100 different population samples. In ALFRED, a population may have multiple samples, with each 'sample' consisting of many individuals on which an allele frequency is based. There are 3566 population samples from 710 different populations with allele frequency tables on at least one polymorphism. Fifty of those population samples have allele frequency data for over 650,000 polymorphisms. Records also have active links to relevant resources (dbSNP, PharmGKB, OMIM, Ethnologue, etc.). The flexible search options and data display and download capabilities available through the web interface allow easy access to the large quantity of high-quality data in ALFRED.
    Nucleic Acids Research 01/2012; 40(Database issue):D1010-5. · 8.28 Impact Factor
  • ABSTRACT: Imputation in admixed populations is an important problem but challenging due to the complex linkage disequilibrium (LD) pattern. The emergence of large reference panels such as that from the 1,000 Genomes Project enables more accurate imputation in general, and in particular for admixed populations and for uncommon variants. To efficiently benefit from these large reference panels, one key issue to consider in a modern genotype imputation framework is the selection of effective reference panels. In this work, we consider a number of methods for effective reference panel construction inside a hidden Markov model and specific to each target individual. These methods fall into two categories: identity-by-state (IBS) based and ancestry-weighted approaches. We evaluated the performance on individuals from recently admixed populations. Our target samples include 8,421 African Americans and 3,587 Hispanic Americans from the Women's Health Initiative, which allow assessment of imputation quality for uncommon variants. Our experiments include both large and small reference panels; large, medium, and small target samples; and genome regions of varying levels of LD. We also include BEAGLE and IMPUTE2 for comparison. Experimental results with a large reference panel suggest that our novel piecewise IBS method yields consistently higher imputation quality than other methods/software. The advantage is particularly noteworthy among uncommon variants, where we observe up to 5.1% information gain with the difference being highly significant (Wilcoxon signed rank test P-value < 0.0001). Our work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison.
    Genetic Epidemiology 10/2012; · 4.02 Impact Factor

Available from: Trevor J Pemberton