Phenotype harmonization and cross-study collaboration in GWAS consortia: the GENEVA experience

Collaborative Health Studies Coordinating Center, Department of Biostatistics, University of Washington, Seattle, Washington 98115, USA.
Genetic Epidemiology (Impact Factor: 2.6). 04/2011; 35(3):159-73. DOI: 10.1002/gepi.20564
Source: PubMed


Genome-wide association study (GWAS) consortia and collaborations formed to detect genetic loci for common phenotypes or investigate gene-environment (G*E) interactions are increasingly common. While these consortia effectively increase sample size, phenotype heterogeneity across studies represents a major obstacle that limits successful identification of these associations. Investigators are faced with the challenge of how to harmonize previously collected phenotype data obtained using different data collection instruments which cover topics in varying degrees of detail and over diverse time frames. This process has not been described in detail. We describe here some of the strategies and pitfalls associated with combining phenotype data from varying studies. Using the Gene Environment Association Studies (GENEVA) multi-site GWAS consortium as an example, this paper provides an illustration to guide GWAS consortia through the process of phenotype harmonization and describes key issues that arise when sharing data across disparate studies. GENEVA is unusual in the diversity of disease endpoints and so the issues it faces as its participating studies share data will be informative for many collaborations. Phenotype harmonization requires identifying common phenotypes, determining the feasibility of cross-study analysis for each, preparing common definitions, and applying appropriate algorithms. Other issues to be considered include genotyping timeframes, coordination of parallel efforts by other collaborative groups, analytic approaches, and imputation of genotype data. GENEVA's harmonization efforts and policy of promoting data sharing and collaboration, not only within GENEVA but also with outside collaborations, can provide important guidance to ongoing and new consortia.
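
The harmonization steps listed in the abstract (identify a common phenotype, check feasibility per study, prepare a common definition, apply a study-specific algorithm) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from GENEVA: the phenotype (smoking status), the two studies "A" and "B", their variable names (`smk_status`, `ever_smoked`, `smoke_now`), and the coding schemes are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical common definition: smoking status in three levels.
TARGET_LEVELS = {"never", "former", "current"}


def harmonize_study_a(record: dict) -> Optional[str]:
    """Hypothetical study A: one question coded 1/2/3."""
    mapping = {1: "never", 2: "former", 3: "current"}
    # Any other code (or a missing answer) cannot be mapped and stays missing.
    return mapping.get(record.get("smk_status"))


def harmonize_study_b(record: dict) -> Optional[str]:
    """Hypothetical study B: separate 'ever smoked' and 'smoke now' items."""
    ever = record.get("ever_smoked")
    now = record.get("smoke_now")
    if ever is None:
        return None
    if not ever:
        return "never"
    if now is None:
        # Ever-smoker with unknown current status: cannot tell former from current.
        return None
    return "current" if now else "former"


# One algorithm per study, all producing the same target coding.
ALGORITHMS: Dict[str, Callable[[dict], Optional[str]]] = {
    "A": harmonize_study_a,
    "B": harmonize_study_b,
}


def harmonize(study: str, records: List[dict]) -> List[Optional[str]]:
    """Apply the study-specific algorithm and check the output coding."""
    algorithm = ALGORITHMS[study]
    values = [algorithm(r) for r in records]
    assert all(v is None or v in TARGET_LEVELS for v in values)
    return values


def coverage(values: List[Optional[str]]) -> float:
    """Fraction of records with a non-missing harmonized value."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)
```

In a sketch like this, the per-study coverage is one simple way to judge whether cross-study analysis of a phenotype is feasible: a study whose instrument cannot distinguish the target levels will show up as mostly missing values.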

Available from: Peter Kraft, Oct 04, 2015
    • "These areas were targeted because we assumed that grant applications within each of these areas would study related topics, and that there would be a tendency for similarity in the constructs and measures. Furthermore, investigators in these areas increasingly recognize the importance of data comparability, interoperability, and integration across multiple studies (Curran and Hussong, 2009), particularly in the area of gene-environment interactions (Bennett et al., 2011; Bierut, 2011; Cornelis et al., 2010; Duncan and Keller, 2011), a research portfolio that is largely housed within the epidemiology domain. "
    ABSTRACT: The need for comprehensive analysis to compare and combine data across multiple studies in order to validate and extend results is widely recognized. This paper aims to assess the extent of data compatibility in the substance abuse and addiction (SAA) sciences through an examination of measure commonality, defined as the use of similar measures, across grants funded by the National Institute on Drug Abuse (NIDA) and the National Institute on Alcohol Abuse and Alcoholism (NIAAA). Data were extracted from applications of funded, active grants involving human-subjects research in four scientific areas (epidemiology, prevention, services, and treatment) and six frequently assessed scientific domains. A total of 548 distinct measures were cited across 141 randomly sampled applications. Commonality, as assessed by density (range of 0-1) of shared measurement, was examined. Results showed that commonality was low and varied by domain/area. Commonality was most prominent for (1) diagnostic interviews (structured and semi-structured) for substance use disorders and psychopathology (density of 0.88), followed by (2) scales to assess dimensions of substance use problems and disorders (0.70), (3) scales to assess dimensions of affect and psychopathology (0.69), (4) measures of substance use quantity and frequency (0.62), (5) measures of personality traits (0.40), and (6) assessments of cognitive/neurologic ability (0.22). The areas of prevention (density of 0.41) and treatment (0.42) had greater commonality than epidemiology (0.36) and services (0.32). To address the lack of measure commonality, NIDA and its scientific partners recommend and provide common measures for SAA researchers within the PhenX Toolkit.
    Drug and Alcohol Dependence 08/2014; 141. DOI:10.1016/j.drugalcdep.2014.04.029 · 3.42 Impact Factor
    • "Governments, funders, and researchers alike have been stressing the importance of harmonization and collaborative use of data and samples in the population health and biobanking fields over the past half-decade [13-21]. However, managing and harmonizing very large amounts of data from different sources is a significant challenge [20,22-24]. Further, ethical, legal, and consent-related restrictions associated with sharing or pooling of individual-level data represent a common dilemma faced by international research projects and networks [25,26]. "
    ABSTRACT: Individual-level data pooling of large population-based studies across research centres in international research projects faces many hurdles. The BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) project aims to address these issues by building a collaborative group of investigators and developing tools for data harmonization, database integration and federated data analyses. Eight population-based studies in six European countries were recruited to participate in the BioSHaRE project. Through workshops, teleconferences and electronic communications, participating investigators identified a set of 96 variables targeted for harmonization to answer research questions of interest. Using each study's questionnaires, standard operating procedures, and data dictionaries, harmonization potential was assessed. Whenever harmonization was deemed possible, processing algorithms were developed and implemented in an open-source software infrastructure to transform study-specific data into the target (i.e. harmonized) format. Harmonized datasets located on servers in each research centre across Europe were interconnected through a federated database system to perform statistical analysis. Retrospective harmonization led to the generation of common format variables for 73% of matches considered (96 targeted variables across 8 studies). Using the DataSHIELD method, authenticated investigators can now perform complex statistical analyses of harmonized datasets stored on distributed servers without actually sharing individual-level data. New Internet-based networking technologies and database management systems are providing the means to support collaborative, multi-center research in an efficient and secure manner. The results from this pilot project show that, given a strong collaborative relationship between participating studies, it is possible to seamlessly co-analyse internationally harmonized research databases while allowing each study to retain full control over individual-level data. We encourage additional collaborative research networks in epidemiology, public health, and the social sciences to make use of the open source tools presented herein.
    Emerging Themes in Epidemiology 11/2013; 10(1):12. DOI:10.1186/1742-7622-10-12 · 2.59 Impact Factor
    • "Genotype data for all cohorts except DRDR2 went through an extensive process of cleaning, imputation, and quality assurance, performed by the GENEVA consortium Coordinating Center at the University of Washington [14,20,21]. The entire cleaning procedure included but was not limited to, checks for gender identity, chromosomal anomalies, sample relatedness, population structure, missing call rates, plate effects, Mendelian errors, duplicate discordance, etc. Detailed cleaning reports are publicly available for each study at the above referenced dbGaP resource. "
    ABSTRACT: Background: Over 90% of adults aged 20 years or older with permanent teeth have suffered from dental caries leading to pain, infection, or even tooth loss. Although caries prevalence has decreased over the past decade, there are still about 23% of dentate adults who have untreated carious lesions in the US. Dental caries is a complex disorder affected by both individual susceptibility and environmental factors. Approximately 35-55% of caries phenotypic variation in the permanent dentition is attributable to genes, though few specific caries genes have been identified. Therefore, we conducted the first genome-wide association study (GWAS) to identify genes affecting susceptibility to caries in adults. Methods: Five independent cohorts were included in this study, totaling more than 7000 participants. For each participant, dental caries was assessed and genetic markers (single nucleotide polymorphisms, SNPs) were genotyped or imputed across the entire genome. Due to the heterogeneity among the five cohorts regarding age, genotyping platform, quality of dental caries assessment, and study design, we first conducted genome-wide association (GWA) analyses on each of the five independent cohorts separately. We then performed three meta-analyses to combine results for: (i) the comparatively younger, Appalachian cohorts (N = 1483) with well-assessed caries phenotype, (ii) the comparatively older, non-Appalachian cohorts (N = 5960) with inferior caries phenotypes, and (iii) all five cohorts (N = 7443). Top-ranking genetic loci within and across meta-analyses were scrutinized for biologically plausible roles in caries. Results: Different sets of genes were nominated across the three meta-analyses, especially between the younger and older age cohorts. In general, we identified several suggestive loci (P-value ≤ 10E-05) within or near genes with plausible biological roles for dental caries, including RPS6KA2 and PTK2B, involved in p38-dependent MAPK signaling, and RHOU and FZD1, involved in the Wnt signaling cascade. Both of these pathways have been implicated in dental caries. ADMTS3 and ISL1 are involved in tooth development, and TLR2 is involved in immune response to oral pathogens. Conclusions: As the first GWAS for dental caries in adults, this study nominated several novel caries genes for future study, which may lead to better understanding of cariogenesis, and ultimately, to improved disease predictions, prevention, and/or treatment.
    BMC Oral Health 12/2012; 12(1):57. DOI:10.1186/1472-6831-12-57 · 1.13 Impact Factor
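
The caries GWAS abstract above analyzes each cohort separately and then combines the per-cohort results by meta-analysis. The abstract does not say which meta-analysis method was used, so the following Python sketch shows one common choice, a fixed-effect inverse-variance combination of per-cohort effect estimates for a single SNP, purely as an assumed illustration. The function name and the input format (a list of effect estimate and standard error pairs) are hypothetical.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(estimates: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Combine per-cohort (beta, standard error) pairs for one SNP.

    Assumed method: fixed-effect inverse-variance weighting.
    Returns the pooled beta, its standard error, and a two-sided
    p-value from the normal approximation.
    """
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p
```

With weights of this form, larger or more precisely measured cohorts dominate the pooled estimate, which is one reason a consortium might also run separate meta-analyses on subsets of cohorts with comparable phenotype quality.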