Pitfalls of Merging GWAS Data: Lessons Learned in the eMERGE Network and Quality Control Procedures to Maintain High Data Quality

Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN, USA.
Genetic Epidemiology (Impact Factor: 2.6). 12/2011; 35(8):887-98. DOI: 10.1002/gepi.20639
Source: PubMed


Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient reuse of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of 14 phenotypes for extraction of study samples from each site's DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or at the Center for Inherited Disease Research using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample and marker quality and various batch effects. Upon completion of the genotyping and QC analyses for each site's primary study, eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset reentered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here, we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion to eMERGE QC pipeline for merged datasets. These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II, and also serve as a starting point for investigators merging multiple genotype datasets accessible through the National Center for Biotechnology Information in the database of Genotypes and Phenotypes. Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.

Download full-text


Available from: Mariza Andrade, Jan 16, 2015
18 Reads
  • Source
    • "Genotype data were acquired on one of three genotyping platforms: the Illumina Human660W-Quadv1_A genotyping platform (VGER-660), the Illumina Human1M-Duo (VGER-1M), or the Illumina Omni1_QUAD (VESPA). Each dataset was separately cleaned using the quality control pipeline developed by the eMERGE Genomics Working Group [28]. This entailed identifying gender mismatches, identifying SNPs failing concordance with HapMap, batch effects, and identification of duplicate and related individuals. "
    [Show abstract] [Hide abstract]
    ABSTRACT: The coupling of electronic medical records (EMR) with genetic data has created the potential for implementing reverse genetic approaches in humans, whereby the function of a gene is inferred from the shared pattern of morbidity among homozygotes of a genetic variant. We explored the feasibility of this approach to identify phenotypes associated with low frequency variants using Vanderbilt's EMR-based BioVU resource. We analyzed 1,658 low frequency non-synonymous SNPs (nsSNPs) with a minor allele frequency (MAF)<10% collected on 8,546 subjects. For each nsSNP, we identified diagnoses shared by at least 2 minor allele homozygotes and with an association p<0.05. The diagnoses were reviewed by a clinician to ascertain whether they may share a common mechanistic basis. While a number of biologically compelling clinical patterns of association were observed, the frequency of these associations was identical to that observed using genotype-permuted data sets, indicating that the associations were likely due to chance. To refine our analysis associations, we then restricted the analysis to 711 nsSNPs in genes with phenotypes in the On-line Mendelian Inheritance in Man (OMIM) or knock-out mouse phenotype databases. An initial comparison of the EMR diagnoses to the known in vivo functions of the gene identified 25 candidate nsSNPs, 19 of which had significant genotype-phenotype associations when tested using matched controls. Twleve of the 19 nsSNPs associations were confirmed by a detailed record review. Four of 12 nsSNP-phenotype associations were successfully replicated in an independent data set: thrombosis (F5,rs6031), seizures/convulsions (GPR98,rs13157270), macular degeneration (CNGB3,rs3735972), and GI bleeding (HGFAC,rs16844401). These analyses demonstrate the feasibility and challenges of using reverse genetics approaches to identify novel gene-phenotype associations in human subjects using low frequency variants. As increasing amounts of rare variant data are generated from modern genotyping and sequence platforms, model organism data may be an important tool to enable discovery.
    PLoS ONE 06/2014; 9(6):e100322. DOI:10.1371/journal.pone.0100322 · 3.23 Impact Factor
  • Source
    • "datasheet_infiniumhd.pdf). Genotype data were cleaned using the quality control (QC) pipeline developed by the eMERGE Genomics Working Group (Turner et al. 2011; Zuvich et al. 2011). The process includes evaluation of sample and marker call rate, gender mismatch and anomalies, duplicate and HapMap concordance, batch effects, Hardy-Weinberg equilibrium, sample relatedness, and population stratification. "
    [Show abstract] [Hide abstract]
    ABSTRACT: To identify novel genetic loci influencing inter-individual variation in red blood cell (RBC) traits in African-Americans, we conducted a genome-wide association study (GWAS) in 2315 individuals, divided into discovery (n=1904) and replication (n=411) cohorts. The traits included hemoglobin concentration (HGB), hematocrit (HCT), RBC count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and mean corpuscular hemoglobin concentration (MCHC). Patients were participants in the electronic MEdical Records and GEnomics (eMERGE) network and underwent genotyping of ~1.2 million SNPs on the Illumina Human1M-Duo array. Association analyses were performed adjusting for age, sex, site, and population stratification. Three loci previously associated with resistance to malaria - HBB (11p15.4), HBA1/HBA2 (16p13.3), and G6PD (Xq28) - were associated (P≤1×10(-6)) with RBC traits in the discovery cohort. The loci replicated in the replication cohort (P ≤ 0.02), and were significant at a genome-wide significance level (P<5×10(-8)) in the combined cohort. The proportions of variance in RBC traits explained by significant variants at these loci were as follows: rs7120391 (near HBB) 1.3% of MCHC, rs9924561 (near HBA1/A2) 5.5% of MCV, 6.9% of MCH and 2.9% of MCHC, and rs1050828 (in G6PD) 2.4% of RBC count, 2.9% of MCV, and 1.4% of MCH, respectively. We noted extended haplotype homozygosity around the non-synonymous SNP rs1050828 in G6PD, consistent with prior reports of natural selection at this locus. We were not able to replicate loci identified by a prior GWAS of RBC traits in a European ancestry cohort of similar sample size, suggesting that the genetic architecture underlying RBC traits differs by race. In conclusion, genetic variants that confer resistance to malaria are associated with RBC traits in African-Americans.
    G3-Genes Genomes Genetics 05/2013; 3(7). DOI:10.1534/g3.113.006452 · 3.20 Impact Factor
  • Source
    Public Health - Methodology, Environmental and Systems Issues, Edited by Jay Maddock, 05/2012: chapter 5: pages 77-106; InTech., ISBN: 978-953-51-0641-8
Show more