Is rigorous retrospective harmonization possible? Application of the DataSHaPER approach across 53 large studies

Research Institute - McGill University Health Centre, Montreal, Quebec, Canada.
International Journal of Epidemiology (Impact Factor: 9.18). 07/2011; 40(5):1314-28. DOI: 10.1093/ije/dyr106
Source: PubMed


Methods: This article examines the value of using the DataSHaPER for retrospective harmonization of established studies. Using the DataSHaPER approach, the potential to generate 148 harmonized variables from the questionnaires and physical measures collected in 53 large population-based studies (6.9 million participants) was assessed. Variable and study characteristics that might influence the potential for data synthesis were also explored.

Results: Out of all assessment items evaluated (148 variables for each of the 53 studies), 38% could be harmonized. Certain characteristics of variables (i.e. relative importance, individual targeted, reference period) and of studies (i.e. observational units, data collection start date and mode of questionnaire administration) were associated with the potential for harmonization. For example, for variables deemed to be essential, 62% of assessment items paired could be harmonized.

Conclusion: The current article shows that the DataSHaPER provides an effective and flexible approach for the retrospective harmonization of information across studies. To implement data synthesis, some additional scientific, ethico-legal and technical considerations must be addressed. The success of the DataSHaPER as a harmonization approach will depend on its continuing development and on the rigour and extent of its use. The DataSHaPER has the potential to take us closer to a truly collaborative epidemiology and offers the promise of enhanced research potential generated through synthesized databases.
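To make the unit of analysis concrete, the following minimal Python sketch treats each (variable, study) pair as one assessment item and computes the share rated harmonizable. It is only an illustration under stated assumptions: the status labels, variable names, study names, and the `harmonization_rate` helper are hypothetical and are not taken from the DataSHaPER itself, whose actual pairing scale may be finer grained.

```python
from collections import Counter


def harmonization_rate(statuses):
    """Return the share of assessment items rated 'harmonizable'.

    `statuses` maps (variable, study) pairs to a status label.
    The labels used here are hypothetical, not the DataSHaPER scale.
    """
    counts = Counter(statuses.values())
    total = sum(counts.values())
    return counts["harmonizable"] / total if total else 0.0


# Hypothetical toy pairing: 3 variables x 2 studies = 6 assessment items.
toy = {
    ("sex", "study_A"): "harmonizable",
    ("sex", "study_B"): "harmonizable",
    ("smoking_status", "study_A"): "harmonizable",
    ("smoking_status", "study_B"): "not_harmonizable",
    ("waist_cm", "study_A"): "not_harmonizable",
    ("waist_cm", "study_B"): "not_harmonizable",
}

# Number of assessment items in the evaluation described in the abstract:
# 148 target variables, each assessed in 53 studies.
n_items_paper = 148 * 53

# Share of toy items rated harmonizable (3 of 6).
toy_rate = harmonization_rate(toy)
```

In the abstract's terms, the 38% figure is a rate of this kind computed over all 148 × 53 assessment items, and the 62% figure is the same rate restricted to variables deemed essential.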

Available from: Bartha Maria Knoppers, Oct 27, 2015
  • Source
    • "In response, P3G sought to harmonize such variables by developing tools that would ease the integration of data across biological studies (Fortier et al., 2010). In point of fact, one of these tools, DataSHaPER, demonstrated that harmonization was possible (Fortier et al., 2011) by retrospectively assessing 53 cohorts from 21 countries, which resulted in a harmonization rate of 62% of essential variables. This made possible the "virtual" aggregation of 6.9 million individuals on any of the 148 variables, thereby creating the necessary statistical significance (power)."
    ABSTRACT: Over the past ten years, the Public Population Project in Genomics and Society ("P3G") has grown as a consortium. It has expanded its range of services and resources to adapt to the ever-evolving needs of the research community. From its outset - when P3G first tackled the building of biobanks as resources as well as data cataloguing and harmonization for data integration - to its new mission and vision, it has continually developed the tools for the conceptualization and design of population biobanks from their inception to their use to their closure. In so doing, P3G has become key in fostering research infrastructures to facilitate transition to the clinic. The consortium has become a crucial stakeholder in the international scientific, ethical, legal, and social research communities.
    Applied and Translational Genomics 06/2014; 3(2). DOI:10.1016/j.atg.2014.04.004
  • Source
    • "The data harmonization and database federation methodology and infrastructure developed and piloted under BioSHaRE’s HOP is founded on the DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research) harmonization approach [22,37] and on information technology tools developed by OBiBa (Open Source Software for BioBanks) [38]. These have been recently integrated into a platform to support retrospective harmonization and integration of data [39] by the Maelstrom Research team [40]."
    ABSTRACT: Individual-level data pooling of large population-based studies across research centres in international research projects faces many hurdles. The BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) project aims to address these issues by building a collaborative group of investigators and developing tools for data harmonization, database integration and federated data analyses. Eight population-based studies in six European countries were recruited to participate in the BioSHaRE project. Through workshops, teleconferences and electronic communications, participating investigators identified a set of 96 variables targeted for harmonization to answer research questions of interest. Using each study's questionnaires, standard operating procedures, and data dictionaries, harmonization potential was assessed. Whenever harmonization was deemed possible, processing algorithms were developed and implemented in an open-source software infrastructure to transform study-specific data into the target (i.e. harmonized) format. Harmonized datasets located on servers in each of the research centres across Europe were interconnected through a federated database system to perform statistical analysis. Retrospective harmonization led to the generation of common format variables for 73% of matches considered (96 targeted variables across 8 studies). Authenticated investigators can now perform complex statistical analyses of harmonized datasets stored on distributed servers without actually sharing individual-level data using the DataSHIELD method. New Internet-based networking technologies and database management systems are providing the means to support collaborative, multi-center research in an efficient and secure manner. The results from this pilot project show that, given a strong collaborative relationship between participating studies, it is possible to seamlessly co-analyse internationally harmonized research databases while allowing each study to retain full control over individual-level data. We encourage additional collaborative research networks in epidemiology, public health, and the social sciences to make use of the open source tools presented herein.
    Emerging Themes in Epidemiology 11/2013; 10(1):12. DOI:10.1186/1742-7622-10-12 · 2.59 Impact Factor
  • Source
    • "This approach depends crucially on the ability to combine the data across studies. Even before genetic analyses can begin, it is necessary to develop and test methods for harmonizing data across studies (Bookman et al., 2011; Cornelis et al., 2010; Fortier et al., 2011). The National Institute on Drug Abuse (NIDA) and the National Cancer Institute (NCI) recognized both the promise and the problems of developmental GEWIS when they wrote the following in the Request for Applications: Over many years, NIDA, other NIH Institutes, and other organizations have funded numerous high-quality longitudinal and developmental studies that contain a wealth of data from individuals who are at risk for, or are in the course of development, progression, and desistance of substance abuse and related phenotypes."
    ABSTRACT: The importance of including developmental and environmental measures in genetic studies of human pathology is widely acknowledged, but few empirical studies have been published. Barriers include the need for longitudinal studies that cover relevant developmental stages and for samples large enough to deal with the challenge of testing gene-environment-development interaction. A solution to some of these problems is to bring together existing data sets that have the necessary characteristics. As part of the National Institute on Drug Abuse-funded Gene-Environment-Development Initiative, our goal is to identify exactly which genes, which environments, and which developmental transitions together predict the development of drug use and misuse. Four data sets were used of which common characteristics include (1) general population samples, including males and females; (2) repeated measures across adolescence and young adulthood; (3) assessment of nicotine, alcohol, and cannabis use and addiction; (4) measures of family and environmental risk; and (5) consent for genotyping DNA from blood or saliva. After quality controls, 2,962 individuals provided over 15,000 total observations. In the first gene-environment analyses, of alcohol misuse and stressful life events, some significant gene-environment and gene-development effects were identified. We conclude that in some circumstances, already collected data sets can be combined for gene-environment and gene-development analyses. This greatly reduces the cost and time needed for this type of research. However, care must be taken to ensure careful matching across studies and variables.
    Twin Research and Human Genetics 03/2013; 16(02):1-11. DOI:10.1017/thg.2013.6 · 2.30 Impact Factor