Population Substructure and Control Selection in Genome-Wide Association Studies

Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, United States of America.
PLoS ONE (Impact Factor: 3.23). 02/2008; 3(7):e2551. DOI: 10.1371/journal.pone.0002551
Source: PubMed


Determination of the relevance of both demanding classical epidemiologic criteria for control selection and robust handling of population stratification (PS) represents a major challenge in the design and analysis of genome-wide association studies (GWAS). Empirical data from two GWAS in European Americans of the Cancer Genetic Markers of Susceptibility (CGEMS) project were used to evaluate the impact of PS in studies with different control selection strategies. In each of the two original case-control studies nested in corresponding prospective cohorts, a minor confounding effect due to PS (inflation factor lambda of 1.025 and 1.005) was observed. In contrast, when the control groups were exchanged to mimic a cost-effective but theoretically less desirable control selection strategy, the confounding effects were larger (lambda of 1.090 and 1.062). A panel of 12,898 autosomal SNPs common to both the Illumina and Affymetrix commercial platforms and with low local background linkage disequilibrium (pair-wise r² < 0.004) was selected to infer population substructure with principal component analysis. A novel permutation procedure was developed for the correction of PS that identified a smaller set of principal components and achieved a better control of type I error (to lambda of 1.032 and 1.006, respectively) than currently used methods. The overlap between sets of SNPs in the bottom 5% of p-values based on the new test and the test without PS correction was about 80%, with the majority of discordant SNPs having both ranks close to the threshold. Thus, for the CGEMS GWAS of prostate and breast cancer conducted in European Americans, PS does not appear to be a major problem in well-designed studies. A study using suboptimal controls can have acceptable type I error when an effective strategy for the correction of PS is employed.
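
The inflation factor lambda reported above can be illustrated with a minimal sketch. It assumes the common median-based definition (the median of the observed 1-degree-of-freedom chi-square association statistics divided by the median expected under the null); the abstract does not state which estimator the authors used. The function name `inflation_factor` and the simulated statistics are hypothetical and are not taken from the CGEMS data.

```python
import random
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Median-based inflation factor: observed median / null median.

    Assumes each entry is a 1-df chi-square association statistic.
    """
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated null statistics: a squared standard normal is chi-square (1 df).
    null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]
    # Hypothetical stratified study: every statistic inflated by 9%.
    inflated_stats = [1.09 * s for s in null_stats]
    print("lambda (null):", round(inflation_factor(null_stats), 3))
    print("lambda (inflated):", round(inflation_factor(inflated_stats), 3))
```

Under this definition a value near 1 indicates little confounding, and a uniform 9% inflation of the test statistics raises lambda by the same factor, which is the scale on which the values 1.025 and 1.090 in the abstract are compared.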



Available from: Zhaoming Wang, Oct 01, 2015
  • Source
    • "The genotypes were called using Illumina's cluster file “HumanOmniExpress-12v1_A.egt”. DNA was not available from the parents of Patients 1 and 2. The degree of relatedness for Patients 1 and 2 was determined using a set of population-informative SNPs in a data set that contained 199 samples [54]. Peripheral blood lymphocyte DNA from Patient 3 and his parents was performed using the Illumina Human OmniExpress BeadChip (Illumina). "
    ABSTRACT: Dubowitz syndrome is a rare disorder characterized by multiple congenital anomalies, cognitive delay, growth failure, an immune defect, and an increased risk of blood dyscrasia and malignancy. There is considerable phenotypic variability, suggesting genetic heterogeneity. We clinically characterized and performed exome sequencing and high-density array SNP genotyping on three individuals with Dubowitz syndrome, including a pair of previously-described siblings (Patients 1 and 2, brother and sister) and an unpublished patient (Patient 3). Given the siblings' history of bone marrow abnormalities, we also evaluated telomere length and performed radiosensitivity assays. In the siblings, exome sequencing identified compound heterozygosity for a known rare nonsense substitution in the nuclear ligase gene LIG4 (rs104894419, NM_002312.3:c.2440C>T) that predicts p.Arg814X (MAF:0.0002) and an NM_002312.3:c.613delT variant that predicts a p.Ser205Leufs*29 frameshift. The frameshift mutation has not been reported in 1000 Genomes, ESP, or ClinSeq. These LIG4 mutations were previously reported in the sibling sister; her brother had not been previously tested. Western blotting showed an absence of a ligase IV band in both siblings. In the third patient, array SNP genotyping revealed a de novo ∼3.89 Mb interstitial deletion at chromosome 17q24.2 (chr 17:62,068,463-65,963,102, hg18), which spanned the known Carney complex gene PRKAR1A. In all three patients, a median lymphocyte telomere length of ≤1st centile was observed and radiosensitivity assays showed increased sensitivity to ionizing radiation. Our work suggests that, in addition to dyskeratosis congenita, LIG4 and 17q24.2 syndromes also feature shortened telomeres; to confirm this, telomere length testing should be considered in both disorders. 
    Taken together, our work and other reports on Dubowitz syndrome, as currently recognized, suggest that it is not a unitary entity but instead a collection of phenotypically similar disorders. As a clinical entity, Dubowitz syndrome will need continual re-evaluation and re-definition as its constituent phenotypes are determined.
    PLoS ONE 06/2014; 9(6):e98686. DOI:10.1371/journal.pone.0098686 · 3.23 Impact Factor
  • Source
    • "Analyses of the newly genotyped Stage 2 data (i.e., all Stage 2 studies except ANECS/SEARCH or SECGS) were adjusted for study and the first four principal components. Principal components for Stage 1 were calculated using ~7,600 independent markers (Yu et al. 2008); principal components for Stage 2 were calculated using 47,097 common SNPs on the exome chip. Of the 1,818 SNPs selected for replication in Stage 2, 1,371 loci included additional in silico data from two previously reported GWAS (Spurdle et al. 2011; Long et al. 2012) in a total of 2,121 cases and 10,209 controls from SEARCH/ANECS and SECGS studies. "
    ABSTRACT: Endometrial cancer (EC), a neoplasm of the uterine epithelial lining, is the most common gynecological malignancy in developed countries and the fourth most common cancer among US women. Women with a family history of EC have an increased risk for the disease, suggesting that inherited genetic factors play a role. We conducted a two-stage genome-wide association study of Type I EC. Stage 1 included 5,472 women (2,695 cases and 2,777 controls) of European ancestry from seven studies. We selected independent single-nucleotide polymorphisms (SNPs) that displayed the most significant associations with EC in Stage 1 for replication among 17,948 women (4,382 cases and 13,566 controls) in a multiethnic population (African American, Asian, Latina, Hawaiian and European ancestry), from nine studies. Although no novel variants reached genome-wide significance, we replicated previously identified associations with genetic markers near the HNF1B locus. Our findings suggest that larger studies with specific tumor classification are necessary to identify novel genetic polymorphisms associated with EC susceptibility.
    Human Genetics 10/2013; 133(2). DOI:10.1007/s00439-013-1369-1 · 4.82 Impact Factor
  • Source
    • "It is important to acknowledge that it is unnecessary to correct for population structure if the structure is not associated with the outcome of interest. It has been suggested that components should only be included if such an association is demonstrated due to the resulting loss of power (Cox and McCullagh, 1982; Yu et al., 2008). Based on the scree plot in Figure 1B, it would be advisable to include only the first two components if they are necessary with respect to the outcome. "
    ABSTRACT: Genome-wide association (GWA) studies have become a standard approach for discovering and validating genomic polymorphisms putatively associated with phenotypes of interest. Accounting for population structure in GWA studies is critical to attain unbiased parameter measurements and control Type I error. One common approach to accounting for population structure is to include several principal components derived from the entire autosomal dataset, which reflects population structure signal. However, knowing which components to include is subjective and generally not conclusive. We examined how phylogenetic signal from mitochondrial DNA (mtDNA) and chromosome Y (chr:Y) markers is concordant with principal component data based on autosomal markers to determine whether mtDNA and chr:Y phylogenetic data can help guide principal component selection. Using HapMap and other original data from individuals of multiple ancestries, we examined the relationships of mtDNA and chr:Y phylogenetic signal with the autosomal PCA using best subset logistic regression. We show that while the two approaches agree at times, this is independent of the component order and not completely represented in the eigenvalues. Additionally, we use simulations to demonstrate that our approach leads to a slightly reduced Type I error rate compared to the standard approach. This approach provides preliminary evidence to support the theoretical concept that mtDNA and chr:Y data can be informative in locating the PCs that are most associated with evolutionary history of populations that are being studied, although the utility of such information will depend on the specific situation.
    Frontiers in Genetics 12/2012; 3:301. DOI:10.3389/fgene.2012.00301