Introduction: Clinical studies in medicine aim to derive knowledge from growing amounts of diverse data. Using this data frequently requires data integration processes, which directly affect the quality of the research outcome. Increasing the transparency and reproducibility of these processes supports trust in the outcomes and enables meta-analysis of the integration process. To achieve this, provenance (a record of the creation, transformation and all other influences regarding an object) can be captured and shared, forming an integral part of complying with the FAIR principles.
A commonly used tool to integrate data is Talend Open Studio for Data Integration (TOS). We aimed to enhance TOS to make it provenance-aware, in order to capture fine-grained provenance without modification of the data integration pipelines themselves.
Methods: To model and store provenance, we apply W3C-PROV. The PROV core model comprises entities, activities and agents, together with the relations between them. Although PROV can be extended to fit the needs of specific domains, we decided to use PROV without extensions.
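The PROV core concepts can be illustrated with a small, self-contained Java sketch. It is a hand-rolled miniature for illustration only and not the ProvToolbox data model; the class names, the `describeStep` helper and the `ex:` identifiers are all hypothetical. It shows one activity that uses an input entity, generates an output entity and is associated with an agent.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a hand-rolled miniature of the PROV core concepts.
// It is not the ProvToolbox data model; all class and identifier names are hypothetical.
class ProvCoreSketch {

    // The three kinds of PROV core elements.
    enum Kind { ENTITY, ACTIVITY, AGENT }

    // Three of the PROV relations, with the PROV names used for printing.
    enum Relation {
        USED("used"),
        WAS_GENERATED_BY("wasGeneratedBy"),
        WAS_ASSOCIATED_WITH("wasAssociatedWith");

        final String provName;

        Relation(String provName) { this.provName = provName; }
    }

    static final class Element {
        final Kind kind;
        final String id;

        Element(Kind kind, String id) { this.kind = kind; this.id = id; }
    }

    static final class Statement {
        final Relation relation;
        final Element subject;
        final Element object;

        Statement(Relation relation, Element subject, Element object) {
            this.relation = relation;
            this.subject = subject;
            this.object = object;
        }

        // Compact, PROV-N-inspired rendering for readability; not validated PROV-N.
        @Override
        public String toString() {
            return relation.provName + "(" + subject.id + ", " + object.id + ")";
        }
    }

    // Describes one step that reads an input and writes an output under a given agent.
    static List<Statement> describeStep(String activityId, String inputId,
                                        String outputId, String agentId) {
        Element activity = new Element(Kind.ACTIVITY, activityId);
        Element input = new Element(Kind.ENTITY, inputId);
        Element output = new Element(Kind.ENTITY, outputId);
        Element agent = new Element(Kind.AGENT, agentId);

        List<Statement> statements = new ArrayList<>();
        statements.add(new Statement(Relation.USED, activity, input));
        statements.add(new Statement(Relation.WAS_GENERATED_BY, output, activity));
        statements.add(new Statement(Relation.WAS_ASSOCIATED_WITH, activity, agent));
        return statements;
    }

    public static void main(String[] args) {
        describeStep("ex:transform", "ex:input.csv", "ex:output.csv", "ex:runtime")
                .forEach(System.out::println);
    }
}
```

The direction of each relation follows PROV: an activity uses an entity, an entity is generated by an activity, and an activity is associated with an agent.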
As part of our data curation toolset, TOS is central to our data integration pipelines, enabling us to capture fine-grained provenance at a coordination point. Using visual components and connectors, TOS generates executable Java code (a job). Although the created jobs vary in functionality, their general structure remains similar.
In order to make TOS provenance-aware, the generated jobs are instrumented to capture provenance using a data model provided by ProvToolbox, which makes serializations such as XML and RDF available. We aimed at simplicity of use: existing job designs should not need to be changed.
Results: To capture provenance from running TOS jobs, we introduce PROV@TOS, a Java wrapper that executes exported TOS jobs and stores standardized provenance information. A TOS component is modelled as an activity, input and output data as entities, and influencers such as the Java version as agents. Start and end times of components are recorded in HashMaps from the java.util package. Extended HashMaps that store relevant provenance data have been implemented and are injected using the Java Reflection API. Where possible, entities are accessed and identified using the hash of the referenced file. PROV@TOS is openly available at https://gitlab.gwdg.de/medinfpub/tos/provAtTos.
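The injection idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the PROV@TOS implementation: `FakeJob`, its `startTimes` field, `RecordingMap` and `inject` are hypothetical names, and real exported TOS jobs look different. The sketch shows how a HashMap subclass can observe every timing entry once it has been swapped into an unmodified job object via the Java Reflection API.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not taken from PROV@TOS.
class ReflectionInjectionSketch {

    // Hypothetical stand-in for an exported job that keeps start times in a private map.
    static class FakeJob {
        private Map<String, Long> startTimes = new HashMap<>();

        void run() {
            startTimes.put("component_1", 1000L);
            startTimes.put("component_2", 2000L);
        }
    }

    // HashMap extension that additionally records every put in call order.
    static class RecordingMap extends HashMap<String, Long> {
        final List<String> events = new ArrayList<>();

        @Override
        public Long put(String key, Long value) {
            events.add(key + " started at " + value);
            return super.put(key, value);
        }
    }

    // Replaces the job's private map with a recording one, leaving the job's code untouched.
    static RecordingMap inject(FakeJob job) {
        try {
            Field field = FakeJob.class.getDeclaredField("startTimes");
            field.setAccessible(true); // the field is private
            RecordingMap recorder = new RecordingMap();
            field.set(job, recorder);
            return recorder;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not inject recording map", e);
        }
    }

    public static void main(String[] args) {
        FakeJob job = new FakeJob();
        RecordingMap recorder = inject(job);
        job.run();
        recorder.events.forEach(System.out::println);
    }
}
```

Because the job only ever sees a `Map`, it behaves as before, while the wrapper can read the recorded events after the run and translate them into provenance statements.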
Discussion: PROV@TOS has been successfully tested on several well-established TOS jobs from different projects within our department. We plan to put it into productive use within the UMG medical data integration center (UMG-MeDIC), making our data integration processes provenance-aware wherever TOS is used. Because inherent TOS components are extended, no further job modifications are required. Because we use "plain vanilla" W3C-PROV, tools like PROV-O-VIZ were readily available to visualize the output data. Identifying entities by file hash enables provenance stitching across the whole integration pipeline. In the future, we plan to extend this feature with a data store providing persistent unique identifiers to avoid collisions caused by identical file hashes.
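Hash-based entity identification can be sketched as below. The hash algorithm is not specified here, so SHA-256 is an assumption, as are the `sha256:` identifier prefix and the `entityIdFor` helper names. The sketch derives an identifier from file content alone, so that the output of one job and the input of the next resolve to the same entity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; SHA-256 and the "sha256:" prefix are assumptions.
class FileHashIdSketch {

    // Derives an entity identifier from raw content.
    static String entityIdFor(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder id = new StringBuilder("sha256:");
            for (byte b : digest.digest(content)) {
                id.append(String.format("%02x", b)); // two lowercase hex digits per byte
            }
            return id.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Derives an entity identifier from the content of a file, ignoring its name and path.
    static String entityIdFor(Path file) {
        try {
            return entityIdFor(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "id,value\n1,42\n".getBytes(StandardCharsets.UTF_8);
        Path file = Files.createTempFile("prov-sketch", ".csv");
        Files.write(file, content);

        // The same content yields the same identifier, whether read from disk or memory.
        System.out.println(entityIdFor(file).equals(entityIdFor(content)));
        Files.delete(file);
    }
}
```

Since the identifier depends only on content, two distinct files with identical content receive the same identifier. This is what makes stitching possible, and it is also the source of the collisions that persistent unique identifiers are intended to avoid.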
The PROV@TOS-based metadata extension of TOS jobs has the potential to increase the transparency and reproducibility of research data.
Acknowledgements: This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the research and funding concepts of the Medical Informatics Initiative (01ZZ1802B/HiGHmed), i:DSem (031L0024A/MyPathSem) and the DFG for the Collaborative Research Centre 1002 on Modulatory Units in Heart Failure, subproject INF.