Review and Evaluation of Methods Correcting for Population Stratification with a Focus on Underlying Statistical Principles

Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
Human Heredity (Impact Factor: 1.47). 02/2008; 66(2):67-86. DOI: 10.1159/000119107
Source: PubMed


When two or more populations have been separated by geographic or cultural boundaries for many generations, drift, spontaneous mutations, differential selection pressures and other factors may lead to allele frequency differences among populations. If these 'parental' populations subsequently come together and begin intermating, disequilibrium among linked markers may span a greater genetic distance than it typically does in populations under panmixia [see glossary]. This extended disequilibrium can make association studies highly effective and more economical than disequilibrium mapping in panmictic populations, since fewer marker loci are needed to detect regions of the genome that harbor phenotype-influencing loci. However, under some circumstances, this process of intermating (as well as other processes) can produce disequilibrium between pairs of unlinked loci and thus create the possibility of confounding or spurious associations due to population stratification. Accordingly, researchers are advised to employ valid statistical tests for linkage disequilibrium mapping that allow genetic association studies to be conducted while controlling for such confounding. Many recent papers have addressed this need. We provide a comprehensive review of advances made in recent years in correcting for population stratification and then evaluate and synthesize these methods based on statistical principles such as (1) randomization, (2) conditioning on sufficient statistics, and (3) identifying whether the method is based on testing the genotype-phenotype covariance (conditional upon familial information) and/or testing departures of the marginal distribution from the expected genotypic frequencies.
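The confounding described above can be illustrated with a small simulation. The following is a minimal sketch and not a method from the review: it assumes two hypothetical subpopulations with made-up allele frequencies (0.1 and 0.9 at both loci), draws genotypes at two unlinked loci independently within each subpopulation, and then pools the samples. It does not model intermating across generations; it only shows how allele frequency differences between strata can induce correlation between unlinked loci in a pooled sample. The function name `simulate_genotypes` and all parameter values are illustrative assumptions.

```python
import numpy as np


def simulate_genotypes(n, freq_a, freq_b, rng):
    """Draw n diploid genotypes (0, 1 or 2 allele copies) at two unlinked loci.

    The two loci are sampled independently, so there is no disequilibrium
    between them within a single subpopulation.
    """
    locus_a = rng.binomial(2, freq_a, size=n)
    locus_b = rng.binomial(2, freq_b, size=n)
    return locus_a, locus_b


rng = np.random.default_rng(0)
n = 5000  # individuals per subpopulation (illustrative value)

# Hypothetical subpopulations with very different allele frequencies.
a1, b1 = simulate_genotypes(n, 0.1, 0.1, rng)
a2, b2 = simulate_genotypes(n, 0.9, 0.9, rng)

# Genotype correlation between the two loci within each subpopulation.
r_pop1 = np.corrcoef(a1, b1)[0, 1]
r_pop2 = np.corrcoef(a2, b2)[0, 1]

# Genotype correlation when the strata are pooled and ancestry is ignored.
r_pooled = np.corrcoef(np.concatenate([a1, a2]),
                       np.concatenate([b1, b2]))[0, 1]

print(f"within subpopulation 1: r = {r_pop1:.3f}")
print(f"within subpopulation 2: r = {r_pop2:.3f}")
print(f"pooled sample:          r = {r_pooled:.3f}")
```

Under these assumptions the within-subpopulation correlations should be close to zero, while the pooled correlation is large even though the loci are unlinked. If one of the two loci influenced a phenotype, the other would appear associated with it in the pooled sample unless the analysis accounted for subpopulation membership.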

Available from:
  • Source
    • "The approach of Li et al. combined the results from MDS and a phylogenetic analysis and found they were better able to capture population stratification. Overall, the literature is quite rich in extensions of methods to account for each type of population stratification (Tian et al., 2008; Tiwari et al., 2008; Zhang et al., 2008); however, there is no gold standard that can be applied to all stratification scenarios. "
    ABSTRACT: Genome-wide association (GWA) studies have become a standard approach for discovering and validating genomic polymorphisms putatively associated with phenotypes of interest. Accounting for population structure in GWA studies is critical to attain unbiased parameter measurements and control Type I error. One common approach to accounting for population structure is to include several principal components derived from the entire autosomal dataset, which reflects population structure signal. However, knowing which components to include is subjective and generally not conclusive. We examined how phylogenetic signal from mitochondrial DNA (mtDNA) and chromosome Y (chr:Y) markers is concordant with principal component data based on autosomal markers to determine whether mtDNA and chr:Y phylogenetic data can help guide principal component selection. Using HapMap and other original data from individuals of multiple ancestries, we examined the relationships of mtDNA and chr:Y phylogenetic signal with the autosomal PCA using best subset logistic regression. We show that while the two approaches agree at times, this is independent of the component order and not completely represented in the eigenvalues. Additionally, we use simulations to demonstrate that our approach leads to a slightly reduced Type I error rate compared to the standard approach. This approach provides preliminary evidence to support the theoretical concept that mtDNA and chr:Y data can be informative in locating the PCs that are most associated with evolutionary history of populations that are being studied, although the utility of such information will depend on the specific situation.
    Full-text · Article · Dec 2012 · Frontiers in Genetics
  • Source
    • "Hence, further study of these GWAS "hits" in different racial/ethnic groups is warranted. While it is now standard practice to test for population stratification (PS) and remove ancestral outliers from analysis in GWAS studies, further adjustment for PS may be necessary when studying a recently admixed population such as African Americans or Hispanic Americans (Barnholtz-Sloan et al., 2008; Tiwari et al., 2008). In addition, because risk allele frequencies can vary by ancestral group, interaction effects between a SNP of interest and PS may be needed in order to fully understand differences in potential genetic associations by race. "
    ABSTRACT: In this study, we assessed association of genome-wide association studies (GWAS) "hits" by race with adjustment for potential population stratification (PS) in two large, diverse study populations: the Carolina Breast Cancer Study (CBCS; N total = 3693 individuals) and the University of Pennsylvania Study of Clinical Outcomes, Risk, and Ethnicity (SCORE; N total = 1135 individuals). In both study populations, 136 ancestry information markers and GWAS "hits" (CBCS: FGFR2, 8q24; SCORE: JAZF1, MSMB, 8q24) were genotyped. Principal component analysis was used to assess ancestral differences by race. Multivariable unconditional logistic regression was used to assess differences in cancer risk with and without adjustment for the first ancestral principal component (PC1) and for an interaction effect between PC1 and the GWAS "hit" (SNP) of interest. PC1 explained 53.7% of the variance for CBCS and 49.5% of the variance for SCORE. European Americans and African Americans were similar in their ancestral structure between CBCS and SCORE, and cases and controls were well matched by ancestry. In the CBCS European Americans, 9/11 SNPs were significant after PC1 adjustment, but after adjustment for the PC1 by SNP interaction effect, only one SNP remained significant (rs1219648 in FGFR2); for CBCS African Americans, 6/11 SNPs were significant after PC1 adjustment and, after adjustment for the PC1 by SNP interaction effect, all six SNPs remained significant and an additional SNP became significant. In the SCORE European Americans, 0/9 SNPs were significant after PC1 adjustment and no changes were seen after additional adjustment for the PC1 by SNP interaction effect; for SCORE African Americans, 2/9 SNPs were significant after PC1 adjustment and, after adjustment for the PC1 by SNP interaction effect, only one SNP remained significant (rs16901979 at 8q24). We show that genetic associations by race are modified by interaction between individual SNPs and PS.
    Preview · Article · Jul 2011 · Frontiers in Genetics
  • Source
    • "Family-based designs play an important role in genetic association analysis, primarily by allowing the comparison of subjects that are matched for shared risk factors, such as population membership, that might confound analyses (Whittaker & Morris, 2001; Tiwari et al., 2008). There are many family structures that are informative for association, but the most popular are based on the nuclear family consisting of two parents and their full offspring. "
    ABSTRACT: A common design in family-based association studies consists of siblings without parents. Several methods have been proposed for analysis of sibship data, but they mostly do not allow for missing data, such as haplotype phase or untyped markers. On the other hand, general methods for nuclear families with missing data are computationally intensive when applied to sibships, since every family has missing parents that could have many possible genotypes. We propose a computationally efficient model for sibships by conditioning on the sets of alleles transmitted into the sibship by each parent. This means that the likelihood can be written only in terms of transmitted alleles and we do not have to sum over all possible untransmitted alleles when they cannot be deduced from the siblings. The model naturally accommodates missing data and admits standard theory of estimation, testing, and inclusion of covariates. Our model is quite robust to population stratification and can test for association in the presence of linkage. We show that our model has similar power to FBAT for single marker analysis and improved power for haplotype analysis. Compared to summing over all possible untransmitted alleles, we achieve similar power with considerable reductions in computation time.
    Full-text · Article · May 2011 · Annals of Human Genetics