Presentation

PROV@TOS, a Java Wrapper To Capture Provenance for Talend Open Studio Jobs

Authors:
To read the file of this research, you can request a copy directly from the authors.

Abstract

Introduction: Clinical studies in medicine aim to derive knowledge from growing amounts of diverse data. Using these data frequently requires data integration processes, which directly affect the quality of the research outcome. Increasing the transparency and reproducibility of these processes supports trust in the outcomes and enables meta-analysis of the integration process. To achieve this, provenance – a record of the creation, transformation and all other influences regarding an object [1], [2] – can be captured and shared [3], [4], [5], which is an integral part of complying with the FAIR principles [6]. A commonly used data integration tool is Talend Open Studio for Data Integration (TOS). We aimed to make TOS provenance-aware in order to capture fine-grained provenance without modifying the data integration pipelines themselves.

Methods: To model and store provenance, we apply W3C-PROV [1]. The core of PROV consists of entities, activities, agents and the relations between them. Although PROV can be extended to tailor it to the needs of specific domains, we decided to use it without extensions. As part of our data curation toolset, TOS is central to our data integration pipelines, which enables us to capture fine-grained provenance at a coordination point [7]. Using visual components and connectors, TOS generates executable Java code (a job). Although the created jobs vary in their functionality, their general structure remains similar. To make TOS provenance-aware, the jobs are modified to capture provenance using a data model provided by ProvToolbox [8], which makes serializations such as XML and RDF available. We aimed for simplicity of use: existing jobs should not have to be modified.

Results: To capture provenance from running TOS jobs, we introduce PROV@TOS, a Java wrapper that executes exported TOS jobs and stores standardized provenance information.
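The PROV core concepts named in the Methods can be illustrated with a small, self-contained sketch. It does not use ProvToolbox and is not the PROV@TOS implementation: the class `ProvSketch`, its methods, and the `ex:` identifiers are hypothetical and serve only to show how one activity, its input and output entities, and an agent might be written down in PROV-N notation.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that collects PROV statements and renders them as PROV-N text. */
class ProvSketch {
    private final List<String> statements = new ArrayList<>();

    private ProvSketch add(String statement) {
        statements.add(statement);
        return this;
    }

    ProvSketch entity(String id) { return add("entity(" + id + ")"); }

    ProvSketch agent(String id) { return add("agent(" + id + ")"); }

    // An activity carries an optional start and end time in PROV-N.
    ProvSketch activity(String id, Instant start, Instant end) {
        return add("activity(" + id + ", " + start + ", " + end + ")");
    }

    // The trailing "-" marks an optional argument (time or plan) that is left out.
    ProvSketch used(String activity, String entity) {
        return add("used(" + activity + ", " + entity + ", -)");
    }

    ProvSketch wasGeneratedBy(String entity, String activity) {
        return add("wasGeneratedBy(" + entity + ", " + activity + ", -)");
    }

    ProvSketch wasAssociatedWith(String activity, String agent) {
        return add("wasAssociatedWith(" + activity + ", " + agent + ", -)");
    }

    /** Wraps all collected statements in a PROV-N document with an example namespace. */
    String toProvN() {
        StringBuilder sb = new StringBuilder("document\n  prefix ex <http://example.org/>\n");
        for (String statement : statements) {
            sb.append("  ").append(statement).append('\n');
        }
        return sb.append("endDocument\n").toString();
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2020-01-01T10:00:00Z");
        Instant end = Instant.parse("2020-01-01T10:00:05Z");
        String provN = new ProvSketch()
                .entity("ex:input_file")
                .entity("ex:output_file")
                .agent("ex:java_runtime")
                .activity("ex:component_1", start, end)
                .used("ex:component_1", "ex:input_file")
                .wasGeneratedBy("ex:output_file", "ex:component_1")
                .wasAssociatedWith("ex:component_1", "ex:java_runtime")
                .toProvN();
        System.out.print(provN);
    }
}
```

A library such as ProvToolbox provides a full data model and the serializations mentioned above; the string-based approach here is only meant to make the entity, activity and agent relations visible.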
A TOS component is modelled as an activity, input and output data are modelled as entities, and influencing factors such as the Java version are modelled as agents. Start and end times of components are recorded in HashMaps from java.util. We implemented extended HashMaps that store the relevant provenance data; these are injected using the Java Reflection API. Where possible, entities are accessed and identified by the hash of the referenced file. PROV@TOS is openly available at https://gitlab.gwdg.de/medinfpub/tos/provAtTos.

Discussion: PROV@TOS has been successfully tested on several well-established TOS jobs from different projects within our department [9]. We plan to put it into productive use within the UMG medical data integration center (UMG-MeDIC), making our data integration processes provenance-aware wherever TOS is used. Because PROV@TOS extends inherent TOS components, no further job modifications have to be performed. Because we use "plain vanilla" W3C-PROV, tools like PROV-O-VIZ [10] were readily available to visualize the output data. Identifying entities by their file hash enables provenance stitching across the whole integration pipeline [11]. In the future, we plan to extend this feature with a data store that provides persistent unique identifiers, in order to avoid collisions caused by identical file hashes. The PROV@TOS-based metadata extension of TOS jobs has the potential to increase the transparency and reproducibility of research data.

Acknowledgements: This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the research and funding concepts of the Medical Informatics Initiative (01ZZ1802B/HiGHmed), i:DSem (031L0024A/MyPathSem) and the DFG for the Collaborative Research Centre 1002 on Modulatory Units in Heart Failure, subproject INF.
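The two mechanisms described in the Results, injecting an extended HashMap through the Java Reflection API and identifying an entity by the hash of its file, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PROV@TOS code: `DemoJob`, its field `startTimes`, `RecordingMap`, `InjectionSketch` and the `sha256:` identifier format are all hypothetical, and SHA-256 is an assumed choice because the abstract does not name a hash function.

```java
import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical wrapper logic: swaps a job's timing map and derives file-hash identifiers. */
class InjectionSketch {
    /** Replaces the named map field of the job with a RecordingMap via reflection. */
    static RecordingMap inject(Object job, String fieldName) throws ReflectiveOperationException {
        Field field = job.getClass().getDeclaredField(fieldName);
        field.setAccessible(true); // the field is private in the job class
        RecordingMap map = new RecordingMap();
        field.set(job, map);
        return map;
    }

    /** Builds an entity identifier from the file content (SHA-256 is an assumption). */
    static String fileHashId(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder("sha256:");
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        DemoJob job = new DemoJob();
        RecordingMap starts = inject(job, "startTimes");
        job.run(); // the job itself is unchanged, yet its start times are captured
        System.out.println("captured components: " + starts.recorded.keySet());

        Path tmp = Files.createTempFile("prov", ".txt");
        Files.write(tmp, "abc".getBytes(StandardCharsets.UTF_8));
        System.out.println("entity id: " + fileHashId(tmp));
        Files.delete(tmp);
    }
}

/** Stand-in for an exported job; the class and field names are hypothetical. */
class DemoJob {
    private Map<String, Long> startTimes = new HashMap<>();

    void run() {
        startTimes.put("component_1", System.currentTimeMillis());
    }
}

/** Extended HashMap that keeps a copy of every put for later provenance export. */
class RecordingMap extends HashMap<String, Long> {
    final Map<String, Long> recorded = new LinkedHashMap<>();

    @Override
    public Long put(String key, Long value) {
        recorded.put(key, value);
        return super.put(key, value);
    }
}
```

Because the identifier depends only on the file content, two files with identical content receive the same identifier. This is what makes stitching across pipeline steps possible, and it is also the source of the collisions that the planned persistent identifiers are meant to avoid.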


Article
Background: Managing research data in biomedical informatics research requires solid data governance rules to guarantee sustainable operation, as it generally involves several professions and multiple sites. As every discipline involved in biomedical research applies its own set of tools and methods, research data as well as applied methods tend to branch out into numerous intermediate and output data objects, making it very difficult to reproduce research results.

Objectives: This article gives an overview of our implementation status applying the Findability, Accessibility, Interoperability and Reusability (FAIR) Guiding Principles for scientific data management and stewardship onto our research data management pipeline focusing on the software tools that are in use.

Methods: We analyzed our progress FAIRificating the whole data management pipeline, from processing non-FAIR data up to data usage. We looked at software tools for data integration, data storage, and data usage as well as how the FAIR Guiding Principles helped to choose appropriate tools for each task.

Results: We were able to advance the degree of FAIRness of our data integration as well as data storage solutions, but lack enabling more FAIR Guiding Principles regarding Data Usage. Existing evaluation methods regarding the FAIR Guiding Principles (FAIRmetrics) were not applicable to our analysis of software tools.

Conclusion: Using the FAIR Guiding Principles, we FAIRificated relevant parts of our research data management pipeline improving findability, accessibility, interoperability and reuse of datasets and research results. We aim to implement the FAIRmetrics to our data management infrastructure and—where required—to contribute to the FAIRmetrics for research data in the biomedical informatics domain as well as for software tools to achieve a higher degree of FAIRness of our research data management pipeline.
Missier P, Ludäscher B, Bowers S, Dey S, Sarkar A, Shrestha B, et al. Linking multiple workflow provenance traces for interoperable collaborative science. The 5th Workshop on Workflows in Support of Large-Scale Science, New Orleans, LA, USA: IEEE; 2010. doi:10.1109/WORKS.2010.5671861.
Chapman A, Blaustein BT, Seligman L, Allen MD. PLUS: A provenance manager for integrated information. 2011 IEEE International Conference on Information Reuse Integration, Las Vegas, NV, USA: IEEE; 2011, p. 269-75. doi:10.1109/IRI.2011.6009558.
Hoekstra R, Groth P. PROV-O-Viz - Understanding the Role of Activities in Provenance. Provenance and Annotation of Data and Processes, vol. 8628, Cologne, Germany: Springer-Verlag New York, Inc.; 2015, p. 215-220. doi:10.1007/978-3-319-16462-5_18.
Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 2016;3. doi:10.1038/sdata.2016.18.
Bauer CR, Umbach N, Baum B, Buckow K, Franke T, Grütz R, et al. Architecture of a Biomedical Informatics Research Data Management Pipeline. Stud Health Technol Inform 2016;228:262-6. doi:10.3233/978-1-61499-678-1-262.
Moreau L. ProvToolbox: Java library to create and convert W3C PROV data model representations. 2016.
Curcin V, Fairweather E, Danger R, Corrigan D. Templates as a method for implementing data provenance in decision support systems. Journal of Biomedical Informatics 2017;65:1-21. doi:10.1016/j.jbi.2016.10.022.
Parciak M. Provenancekonzept für Datenbestände aus einer heterogenen Forschungsinfrastuktur (am Beispiel einer klinischen Forschergruppe).
Baum B, Bauer CR, Franke T, Kusch H, Parciak M, Rottmann T, et al. Opinion paper: Data provenance challenges in biomedical research. it - Information Technology 2017;59:191-196. doi:10.1515/itit-2016-0031.
Herschel M, Diestelkämper R, Lahmar HB. A survey on provenance: What for? What form? What from? The VLDB Journal 2017;26:881-906. doi:10.1007/s00778-017-0486-1.
Pérez B, Rubio J, Sáenz-Adán C. A systematic review of provenance systems. Knowl Inf Syst 2018:1-49. doi:10.1007/s10115-018-1164-3.
Haarbrandt B, Schreiweis B, Rey S, Sax U, Scheithauer S, Rienhoff O, et al. HiGHmed - An Open Platform Approach to Enhance Care and Research across Institutional Boundaries. Methods of Information in Medicine 2018;57:e66-81. doi:10.3414/ME18-02-0002.