Pemberton TJ, Jakobsson M, Conrad DF, Coop G, Wall JD, Pritchard JK, et al. Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India. Ann Hum Genet 72: 535-546.

Institute for Genetic Medicine, University of Southern California, 2250 Alcazar St., Los Angeles, California 90033, USA.
Annals of Human Genetics (Impact Factor: 2.21). 08/2008; 72(Pt 4):535-46. DOI: 10.1111/j.1469-1809.2008.00457.x
Source: PubMed


When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis, such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.
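
The idea of choosing a mixture of database samples that resembles the population of interest can be illustrated with a minimal sketch. It is not the procedure used in the paper: it assumes that similarity is measured by squared differences in allele frequencies, and it finds the mixture weights by a simple grid search. The panel labels follow the HapMap sample names, but all frequencies, the target values, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical allele frequencies at four SNPs for three database samples.
PANEL_FREQS = {
    "YRI": [0.10, 0.80, 0.45, 0.30],
    "CEU": [0.60, 0.20, 0.50, 0.70],
    "CHB+JPT": [0.90, 0.40, 0.15, 0.55],
}


def mixture_frequencies(weights, panel_freqs):
    """Allele frequencies of a weighted mixture of database samples."""
    n_snps = len(next(iter(panel_freqs.values())))
    return [
        sum(weights[p] * panel_freqs[p][i] for p in panel_freqs)
        for i in range(n_snps)
    ]


def squared_distance(a, b):
    """Sum of squared differences between two frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_mixture(target_freqs, panel_freqs, steps=20):
    """Grid-search non-negative weights summing to 1 that best match the target."""
    panels = list(panel_freqs)
    best = None
    # Enumerate integer counts for all but the last panel; the last panel
    # receives whatever remains so that the weights always sum to 1.
    for combo in product(range(steps + 1), repeat=len(panels) - 1):
        if sum(combo) > steps:
            continue
        counts = combo + (steps - sum(combo),)
        weights = {p: c / steps for p, c in zip(panels, counts)}
        dist = squared_distance(
            mixture_frequencies(weights, panel_freqs), target_freqs
        )
        if best is None or dist < best[0]:
            best = (dist, weights)
    return best[1], best[0]


if __name__ == "__main__":
    # Hypothetical frequencies for the population of interest.
    target = [0.58, 0.32, 0.40, 0.62]
    weights, dist = best_mixture(target, PANEL_FREQS)
    print(weights, dist)
```

A grid search keeps the sketch dependency-free; with more database samples or many SNPs, a constrained least-squares solver would be the more practical choice.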

  • Source
    • "In African Americans, combining YRI with at least one other reference population boosts imputation performance when compared to YRI alone [5], [7], [8], but an optimal imputation strategy is not well established. Two or more reference populations can be combined in their entirety [9], [10], combined in equal proportions [11], [12], or weighted to match the ancestral proportions of the study population [13], [14]. Alternatively, the imputation procedure can be conducted sequentially (once for each selected reference population) rather than as a combined population, followed by merging the imputed genotypes [15], [16]. "
    ABSTRACT: Genotype imputation, used in genome-wide association studies to expand coverage of single nucleotide polymorphisms (SNPs), has performed poorly in African Americans compared to less admixed populations. Overall, imputation has typically relied on HapMap reference haplotype panels from Africans (YRI), European Americans (CEU), and Asians (CHB/JPT). The 1000 Genomes project offers a wider range of reference populations, such as African Americans (ASW), but their imputation performance has had limited evaluation. Using 595 African Americans genotyped on Illumina's HumanHap550v3 BeadChip, we compared imputation results from four software programs (IMPUTE2, BEAGLE, MaCH, and MaCH-Admix) and three reference panels consisting of different combinations of 1000 Genomes populations (February 2012 release): (1) 3 specifically selected populations (YRI, CEU, and ASW); (2) 8 populations of diverse African (AFR) or European (EUR) descent; and (3) all 14 available populations (ALL). Based on chromosome 22, we calculated three performance metrics: (1) concordance (percentage of masked genotyped SNPs with imputed and true genotype agreement); (2) imputation quality score (IQS; concordance adjusted for chance agreement, which is particularly informative for low minor allele frequency [MAF] SNPs); and (3) average r2hat (estimated correlation between the imputed and true genotypes, for all imputed SNPs). Across the reference panels, IMPUTE2 and MaCH had the highest concordance (91%-93%), but IMPUTE2 had the highest IQS (81%-83%) and average r2hat (0.68 using YRI+ASW+CEU, 0.62 using AFR+EUR, and 0.55 using ALL). Imputation quality for most programs was reduced by the addition of more distantly related reference populations, due entirely to the introduction of low frequency SNPs (MAF≤2%) that are monomorphic in the more closely related panels. While imputation was optimized by using IMPUTE2 with reference to the ALL panel (average r2hat = 0.86 for SNPs with MAF>2%), use of the ALL panel for African American studies requires careful interpretation of the population specificity and imputation quality of low frequency SNPs.
    Full-text · Article · Nov 2012 · PLoS ONE
  • Source
    • "Over 2800 SNPs typed on the CEPH-HGDP panel and an additional two Indian populations (total of 55 samples) (14). "
    ABSTRACT: ALFRED is a free, web accessible, curated compilation of allele frequency data on DNA sequence polymorphisms in anthropologically defined human populations. Currently, ALFRED has allele frequency tables on over 663 400 polymorphic sites; 170 of them have frequency tables for more than 100 different population samples. In ALFRED, a population may have multiple samples with each ‘sample’ consisting of many individuals on which an allele frequency is based. There are 3566 population samples from 710 different populations with allele frequency tables on at least one polymorphism. Fifty of those population samples have allele frequency data for over 650 000 polymorphisms. Records also have active links to relevant resources (dbSNP, PharmGKB, OMIM, Ethnologue, etc.). The flexible search options and data display and download capabilities available through the web interface allow easy access to the large quantity of high-quality data in ALFRED.
    Full-text · Article · Jan 2012 · Nucleic Acids Research
  • Source
    • "associated with the allele frequencies. We note that for five of the Native American populations we report on here, data exist as part of the HGDP-CEPH dataset both for STRPs (Rosenberg et al., 2002; Zhivotovsky et al., 2003) and for SNPs (Li et al., 2008; Pemberton et al., 2008). However, in those cases we have SNP data on additional individuals in those populations and have used the larger dataset. "
    ABSTRACT: Autosomal DNA polymorphisms can provide new information and understanding of both the origins of and relationships among modern Native American populations. At the same time that autosomal markers can be highly informative, they are also susceptible to ascertainment biases in the selection of the markers to use. Identifying markers that can be used for ancestry inference among Native American populations can be considered separate from identifying markers to further the quest for history. In the current study, we are using data on nine Native American populations to compare the results based on a large haplotype-based dataset with relatively small independent sets of single nucleotide polymorphisms. We are interested in what types of limited datasets an individual laboratory might be able to collect are best for addressing two different questions of interest. First, how well can we differentiate the Native American populations and/or infer ancestry by assigning an individual to her population(s) of origin? Second, how well can we infer the historical/evolutionary relationships among Native American populations and their Eurasian origins? We conclude that only a large comprehensive dataset involving multiple autosomal markers on multiple populations will be able to answer both questions; different small sets of markers are able to answer only one or the other of these questions. Using our largest dataset, we see a general increasing distance from Old World populations from North to South in the New World except for an unexplained close relationship between our Maya and Quechua samples.
    Full-text · Article · Dec 2011 · American Journal of Physical Anthropology
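
The first two imputation metrics named in the PLoS ONE abstract above, concordance and a concordance adjusted for chance agreement, can be sketched as follows. This is an illustration under stated assumptions, not the cited study's code: genotypes are coded as minor allele counts (0, 1, 2), imputed genotypes are treated as best-guess hard calls, and the chance adjustment takes the Cohen's-kappa form (observed agreement minus chance agreement, divided by one minus chance agreement). The function names are hypothetical.

```python
import math
from collections import Counter


def concordance(true_genotypes, imputed_genotypes):
    """Fraction of genotypes where the imputed call equals the true call."""
    matches = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


def chance_adjusted_concordance(true_genotypes, imputed_genotypes):
    """Concordance corrected for the agreement expected by chance alone."""
    n = len(true_genotypes)
    p_observed = concordance(true_genotypes, imputed_genotypes)
    true_counts = Counter(true_genotypes)
    imputed_counts = Counter(imputed_genotypes)
    # Chance agreement: probability that two independent draws from the
    # true and imputed genotype distributions give the same genotype.
    p_chance = sum(true_counts[g] * imputed_counts[g] for g in true_counts) / (n * n)
    if p_chance == 1.0:
        return math.nan  # undefined when both sets are a single identical genotype
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # A rare SNP: two heterozygotes among 100 individuals, all imputed
    # as the common homozygote.
    true_calls = [0] * 98 + [1] * 2
    imputed_calls = [0] * 100
    print(concordance(true_calls, imputed_calls))
    print(chance_adjusted_concordance(true_calls, imputed_calls))
```

In the rare-SNP example, always imputing the common homozygote gives high raw concordance while the chance-adjusted value shows no information gained, which is why the adjustment matters for low-MAF SNPs.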