Population Substructure and Control Selection in Genome-Wide Association Studies

Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, United States of America.
PLoS ONE (Impact Factor: 3.53). 02/2008; 3(7):e2551. DOI: 10.1371/journal.pone.0002551
Source: PubMed

ABSTRACT Determination of the relevance of both demanding classical epidemiologic criteria for control selection and robust handling of population stratification (PS) represents a major challenge in the design and analysis of genome-wide association studies (GWAS). Empirical data from two GWAS in European Americans of the Cancer Genetic Markers of Susceptibility (CGEMS) project were used to evaluate the impact of PS in studies with different control selection strategies. In each of the two original case-control studies nested in corresponding prospective cohorts, a minor confounding effect due to PS (inflation factor lambda of 1.025 and 1.005) was observed. In contrast, when the control groups were exchanged to mimic a cost-effective but theoretically less desirable control selection strategy, the confounding effects were larger (lambda of 1.090 and 1.062). A panel of 12,898 autosomal SNPs common to both the Illumina and Affymetrix commercial platforms and with low local background linkage disequilibrium (pair-wise r² < 0.004) was selected to infer population substructure with principal component analysis. A novel permutation procedure was developed for the correction of PS that identified a smaller set of principal components and achieved a better control of type I error (to lambda of 1.032 and 1.006, respectively) than currently used methods. The overlap between sets of SNPs in the bottom 5% of p-values based on the new test and the test without PS correction was about 80%, with the majority of discordant SNPs having both ranks close to the threshold. Thus, for the CGEMS GWAS of prostate and breast cancer conducted in European Americans, PS does not appear to be a major problem in well-designed studies. A study using suboptimal controls can have acceptable type I error when an effective strategy for the correction of PS is employed.
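
The inflation factor lambda quoted in the abstract can be illustrated with a short sketch. The abstract does not say how lambda was estimated, so this assumes the common median-based genomic-control definition: the median of the observed 1-df chi-square association statistics divided by the median of the null chi-square distribution. The function name `inflation_factor` and the hard-coded constant are illustrative choices, not taken from the paper.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom
# (approximately 0.4549); hard-coded so the sketch needs no third-party library.
CHI2_1DF_MEDIAN = 0.454936423119573


def inflation_factor(chi2_stats):
    """Median-based genomic-control lambda (assumed definition).

    chi2_stats: iterable of 1-df chi-square association statistics,
    one per SNP. Values near 1.0 suggest little inflation; larger
    values suggest confounding such as population stratification.
    """
    stats = list(chi2_stats)
    if not stats:
        raise ValueError("need at least one test statistic")
    return statistics.median(stats) / CHI2_1DF_MEDIAN
```

Under this definition, multiplying every test statistic by a constant multiplies lambda by the same constant, which is why lambda is often read as a summary of how far the bulk of the statistics drifts from the null.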

Available from: Zhaoming Wang, Aug 19, 2015
    • "Analyses of the newly genotyped Stage 2 data (i.e., all Stage 2 studies except ANECS/SEARCH or SECGS) were adjusted for study and the first four principal components. Principal components for Stage 1 were calculated using ~7,600 independent markers (Yu et al. 2008); principal components for Stage 2 were calculated using 47,097 common SNPs on the exome chip. Of the 1,818 SNPs selected for replication in Stage 2, 1,371 loci included additional in silico data from two previously reported GWAS (Spurdle et al. 2011; Long et al. 2012) in a total of 2,121 cases and 10,209 controls from SEARCH/ANECS and SECGS studies. "
    ABSTRACT: Endometrial cancer (EC), a neoplasm of the uterine epithelial lining, is the most common gynecological malignancy in developed countries and the fourth most common cancer among US women. Women with a family history of EC have an increased risk for the disease, suggesting that inherited genetic factors play a role. We conducted a two-stage genome-wide association study of Type I EC. Stage 1 included 5,472 women (2,695 cases and 2,777 controls) of European ancestry from seven studies. We selected independent single-nucleotide polymorphisms (SNPs) that displayed the most significant associations with EC in Stage 1 for replication among 17,948 women (4,382 cases and 13,566 controls) in a multiethnic population (African American, Asian, Latina, Hawaiian and European ancestry), from nine studies. Although no novel variants reached genome-wide significance, we replicated previously identified associations with genetic markers near the HNF1B locus. Our findings suggest that larger studies with specific tumor classification are necessary to identify novel genetic polymorphisms associated with EC susceptibility.
    Human Genetics 10/2013; DOI:10.1007/s00439-013-1369-1 · 4.52 Impact Factor
    • "Results from several GWA studies that have included public control genotype data on Caucasian samples have not revealed strong systematic differences in allele frequencies between previously genotyped public controls and study samples (Hom et al., 2008; Luca et al., 2008; Silverberg et al., 2009; Wrensch et al., 2009; Yu et al., 2008). However, results from a recent study that used public controls have raised concerns about the impact of batch genotype effects when cases and controls are genotyped on different platforms (Sebastiani et al., 2010). "
    ABSTRACT: Genome-wide association (GWA) studies are a powerful approach for identifying novel genetic risk factors associated with human disease. A GWA study typically requires the inclusion of thousands of samples to have sufficient statistical power to detect single nucleotide polymorphisms that are associated with only modest increases in risk of disease given the heavy burden of a multiple test correction that is necessary to maintain valid statistical tests. Low statistical power and the high financial cost of performing a GWA study remain prohibitive for many scientific investigators anxious to perform such a study using their own samples. A number of remedies have been suggested to increase statistical power and decrease cost, including the utilization of free publicly available genotype data and multi-stage genotyping designs. Herein, we compare the statistical power and relative costs of alternative association study designs that use cases and screened controls to study designs that are based only on, or additionally include, free public control genotype data. We describe a novel replication-based two-stage study design, which uses free public control genotype data in the first stage and follow-up genotype data on case-matched controls in the second stage that preserves many of the advantages inherent when using only an epidemiologically matched set of controls. Specifically, we show that our proposed two-stage design can substantially increase statistical power and decrease cost of performing a GWA study while controlling the type-I error rate that can be inflated when using public controls due to differences in ancestry and batch genotype effects.
    Human Genetics 12/2010; 128(6):597-608. DOI:10.1007/s00439-010-0880-x · 4.52 Impact Factor
  • ABSTRACT: Differences in genetic background within two or more populations are an important cause of disturbance in case–control association studies. In fact, when mixing together populations of different ethnic groups, different allele frequencies between case and control samples could be due to the ancestry rather than a real association with the disease under study. This can easily lead to a large amount of false positive and negative results in association study analysis. Moreover, the growing need to put together several data sets coming from different studies in order to increase the statistical power of the analysis makes this problem particularly important in recent statistical genetics research. To overcome these problems, different correction strategies have been proposed, but currently there is no consensus about a common powerful strategy to adjust for population stratification. In this chapter, we discuss the state-of-the-art of strategies used for correcting the statistics for genome-wide association analysis by taking into account the ancestral structure of the population. After a short review of the most important methods and tools available, we will show the results obtained in two real data sets and discuss them in terms of advantages and disadvantages of each algorithm.