Data Stewardship Wizard for Open Science

Marek Suchánek and Robert Pergl
Faculty of Information Technology
Czech Technical University in Prague
Thákurova 9, 160 00 Praha
{suchama4, perglr}@fit.cvut.cz
Abstract. Every year, the amount of data in science grows significantly
as information technologies are used more intensively in various domains
of human activity. Biologists, chemists, linguists, and others are not
data experts but often just regular users who need to capture and process
huge amounts of data. This is where serious problems emerge: bad data
management leads to losing important data, producing unverifiable
results, wasting funds, and so on. Thousands of qualified data stewards
will be needed in the following years to deal with these issues. At the
Faculty of Information Technology, CTU in Prague, we participate in the
European platform ELIXIR, in which we work on the Data Stewardship Wizard
to help researchers and data stewards build high-quality FAIR data
management plans that are accurate and helpful to their projects. We
cooperate on this challenging project with our colleagues from other
ELIXIR nodes.
Keywords: Data Stewardship · FAIR Data · Wizard · DM Plan
1 Introduction
Data Management comprises all disciplines related to managing data as a
valuable resource. Lately, a new term, Data Stewardship, has come into use
to emphasize that data management is not limited to the duration of a
scientific project but covers capturing, processing, (long-term)
preserving, and re-using data, from the planning stage of the project
until after the project has finished. Organizations should designate data
stewards: experts who know how to handle data correctly and who help
researchers organize their projects appropriately.
Data stewards face the demanding tasks of staying well trained themselves
and supervising the data management of projects in their organization,
i.e. communicating and collaborating intensively with researchers and
helping them with planning. The Data Stewardship Wizard supports this
whole process and thereby contributes to successful scientific projects.
Moreover, data management plans are increasingly required as an
attachment to funding applications; the European Union is one example of
this practice. The EU and many data experts regard Data Management Plans
(DMPs) as a key element of good data management. [1]
2 ELIXIR
ELIXIR is an international organization whose main goal is to coordinate
life science resources (such as databases, software tools, and training
materials) so that they form a single infrastructure. It operates mainly
in European states; since 2014 it has gathered 21 members (including the
Czech Republic, but also, for example, Israel) and over 180 research
organizations. Each member state represents a node, and there is a
central point called the Hub that is located in the UK. The activities of
ELIXIR are divided into five Platforms: Data, Tools, Interoperability,
Compute, and Training. [2]
ELIXIR and its nodes organize many events to bring scientists together
and share knowledge. ELIXIR also provides funding through so-called
Implementation Studies, which are an important part of the activities of
a particular Platform. They are proposed by Platforms, must then be agreed
with the ELIXIR Heads of Nodes committee, and are finally approved by the
ELIXIR Board. [2]
Our project targets the Interoperability and Training Platforms, since
the Data Platform is about specific data resources. FIT CTU in Prague is
part of ELIXIR CZ. We cooperate closely with IOCB, which is the Czech
ELIXIR national node, and with our colleagues from DTL (Dutch Techcentre
for Life Sciences), which acts as the Dutch node of ELIXIR. Thanks to the
ELIXIR infrastructure and events, we receive a lot of feedback and
attention from across Europe. [3,4]
3 FAIR Data Principles and Metrics
To support good and precise data stewardship and to promote open science,
a broad international community has developed the FAIR Data principles.
FAIR is an acronym of four pillars: Findable, Accessible, Interoperable,
and Reusable. Each pillar represents a set of three or four principles
(15 in total) that state the requirements. These principles are
applicable not just to life science datasets but to data and projects in
general. [5]
The GO FAIR Initiative focuses on the development of the Internet of FAIR
Data & Services (including the European Open Science Cloud) through many
activities such as training, tools, certification, and so on. The
difficulty lies in determining when you can say "my data are FAIR". For
this reason, the FAIR Metrics have been prepared so that you can measure
your FAIRness. [6]
There are 14 well-defined metrics developed by the FAIR Metrics Group.
Each metric relates to a FAIR principle and has its own identifier, name,
description, information about measurement and validation, and examples.
In most cases, evaluation consists of retrieving metadata from an entered
URI and running validation on it. For that purpose, the Metrics
Evaluator, developed mainly by Mark Wilkinson, is available together with
the complete metrics descriptions and manuals in the repository
https://github.com/FAIRMetrics/Metrics. [7]
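To make the shape of such a metric more concrete, the following Python
sketch models a metric record and the two-step evaluation (retrieve
metadata for a URI, then validate it). It is an illustration only and not
the Metrics Evaluator: the names Metric, evaluate, and fake_fetch, the
metric identifier, the example URIs, and the metadata values are all
assumptions made up for this example, and metadata retrieval is replaced
by an in-memory stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metadata = Dict[str, str]


@dataclass(frozen=True)
class Metric:
    identifier: str                    # hypothetical identifier, not an official one
    name: str
    principle: str                     # the FAIR principle the metric relates to
    description: str
    check: Callable[[Metadata], bool]  # validation run on the retrieved metadata


def evaluate(metric: Metric, uri: str, fetch: Callable[[str], Metadata]) -> bool:
    """Retrieve metadata for the URI, then run the metric's validation on it."""
    metadata = fetch(uri)
    return metric.check(metadata)


def fake_fetch(uri: str) -> Metadata:
    """Stand-in for real metadata retrieval; the records below are invented."""
    catalogue = {
        "https://example.org/dataset/1": {"identifier": "doi:10.1234/abc",
                                          "license": "CC-BY-4.0"},
        "https://example.org/dataset/2": {"identifier": "doi:10.1234/def"},
    }
    return catalogue.get(uri, {})


# An illustrative metric tied to principle R1.1 (clear data usage licence).
has_license = Metric(
    identifier="example-metric-1",
    name="Licence present",
    principle="R1.1",
    description="The metadata states a usage licence.",
    check=lambda md: bool(md.get("license")),
)

with_licence = evaluate(has_license, "https://example.org/dataset/1", fake_fetch)
without_licence = evaluate(has_license, "https://example.org/dataset/2", fake_fetch)
```

Passing the retrieval function as a parameter keeps the validation logic
independent of how metadata is obtained, which is convenient for testing.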
4 Project: Data Stewardship Wizard
After introducing the background, ELIXIR, and FAIR, the Data Stewardship
Wizard (https://dsw.fairdata.solutions) can be described.
4.1 Goals and Visions
The goal is to develop an open-source solution that allows data stewards
to work with the knowledge in order to build a structured questionnaire
with related information. After filling project information into this
questionnaire, researchers get a high-quality data management plan (DMP).
The produced DMP also contains detailed information such as advice,
contacts, and metrics.
There are two main differences from other solutions (such as DMPonline
or DMPTool). First, the knowledge model is easily extensible, so we are
able to incorporate changes as well as integrate entirely new concepts
and related applications. Second, other solutions try to simplify the
creation of DMPs by reducing the number of questions. Our vision is the
opposite: we want to ask many questions so that the DMP is fully tailored
to the project. On the other hand, we do not want to overwhelm the
researcher with 600 questions, so we show very few at the beginning and
guide the researcher through only the relevant ones.
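One way to picture "show very few questions and reveal only the relevant
ones" is a tree in which follow-up questions hang under answers. The
Python sketch below is a minimal illustration under that assumption; the
Wizard itself is written in Haskell and Elm, and the classes Question and
Answer, the function visible_questions, and the sample questions are
hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Answer:
    label: str
    follow_ups: List["Question"] = field(default_factory=list)


@dataclass
class Question:
    uid: str
    text: str
    answers: List[Answer] = field(default_factory=list)


def visible_questions(roots: List[Question], replies: Dict[str, str]) -> List[str]:
    """Return the uids of questions to show, descending only into chosen answers."""
    shown: List[str] = []
    for question in roots:
        shown.append(question.uid)
        chosen = replies.get(question.uid)
        for answer in question.answers:
            if answer.label == chosen:
                shown.extend(visible_questions(answer.follow_ups, replies))
    return shown


# Invented sample: two follow-ups appear only after answering "yes" to q1.
storage = Question("q2", "Where will the personal data be stored?")
consent = Question("q3", "Do you have informed consent?")
personal = Question(
    "q1",
    "Will you collect personal data?",
    [Answer("yes", [storage, consent]), Answer("no")],
)

at_start = visible_questions([personal], {})
after_yes = visible_questions([personal], {"q1": "yes"})
after_no = visible_questions([personal], {"q1": "no"})
```

With this structure the total number of questions can be large while the
number shown at any moment depends only on the answers given so far.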
4.2 Initialization and Participants
All of this started with the data stewardship mind map produced by Rob
Hooft (DTL) and the related book [8] by Barend Mons (GO FAIR). At first,
together with colleagues from the Netherlands, we encoded the mind map in
a machine-readable format (JSON files), provided a JSON schema, and
published it on GitHub. Then, at our faculty, we developed the first
version of the Wizard as an interactive browser of the knowledge model,
following our vision. New functionality (registrations, saving plans,
multiple plans per account, etc.) was added continually as we received
user feedback. The Wizard has been developed in Haskell with a
high-quality evolvable architecture that makes such extensions easy.
4.3 New Generation Wizard
This year, the Wizard is being developed as part of an 18-month ELIXIR
Implementation Study of the ELIXIR Training Platform. Our team at FIT CTU
gained two new skilled members. The study also includes preparing
training materials and setting up workshops. We now have a new extensible
portal solution with a Haskell backend and an Elm frontend that consists
of user and settings management, a knowledge model editor with migrations,
and, finally, the Data Stewardship Planner, which produces a DMP from a
filled questionnaire.
It basically works as depicted in Figure 1. A data steward uses the
editor to produce knowledge model (KM) packages by adding, deleting, or
editing entities such as chapters, questions, answers, references, and so
on. A KM can be updated in this way and have different versions, which
follow semantic versioning. A KM can also be based on another, parent KM
as a customization. If the parent KM is updated, a migration tool allows
updating the child KMs. Finally, in the DS Planner, a KM can be turned
into a questionnaire to be answered by a researcher, and the resulting
DMP can be exported to various formats. A researcher is able to provide
feedback on any question, so the data steward can continuously improve
the KM according to the needs of researchers.
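The relationship between versioned parent and child KMs can be sketched
as follows. This Python fragment is only an illustration of the idea that
a child KM records which parent version it customizes, so that a newer
parent version signals a pending migration. The class KMPackage, its
fields, and the function needs_migration are assumptions for this example
and do not reflect the Wizard's actual data model or migration tool.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split a semantic version 'MAJOR.MINOR.PATCH' into comparable integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


@dataclass(frozen=True)
class KMPackage:
    name: str
    version: str
    parent_name: Optional[str] = None     # set when the KM customizes a parent
    parent_version: Optional[str] = None  # parent version the KM was based on


def needs_migration(child: KMPackage, latest_parent: KMPackage) -> bool:
    """True if the child is based on an older version of the given parent."""
    if child.parent_name != latest_parent.name or child.parent_version is None:
        return False
    return parse_version(latest_parent.version) > parse_version(child.parent_version)


# Invented packages: a local KM customizing version 1.0.0 of a root KM.
root_old = KMPackage("root", "1.0.0")
root_new = KMPackage("root", "1.10.0")
local = KMPackage("local", "0.1.0", parent_name="root", parent_version="1.0.0")

up_to_date = needs_migration(local, root_old)
outdated = needs_migration(local, root_new)
```

Comparing versions as integer tuples rather than strings matters: as
plain text, "1.10.0" would sort before "1.9.0".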
[Figure 1: knowledge flow diagram. Labels: Data Steward, Researcher,
Trainee, DS Domain, Project, Editor, Migration, Package, KM, DS Planner,
DS Plan, knowledge, project info, feedback, import, export, continuous
improvement.]
Fig. 1. Data Stewardship Wizard – knowledge flow
4.4 Architecture and Deployment
Currently, there is one instance of the Wizard hosted by ELIXIR CZ, but
any organization or individual can easily deploy their own instance
thanks to Docker. The Wizard provides export and import of KMs as JSON
files. The JSON schema describing these files is public, together with a
guide on how to publicly share KMs from the Wizard. This setup, together
with the hierarchy of KMs, is depicted in Figure 2.
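Export and import of a KM as a JSON file can be illustrated with a small
round trip. The Python sketch below is hedged accordingly: the field
names (name, version, chapters) and the functions export_km and import_km
are hypothetical, and the simple required-field check merely stands in
for validation against the public JSON schema, whose actual content is
not reproduced here.

```python
import json
from typing import Any, Dict

# Hypothetical required fields; the real JSON schema of the Wizard may differ.
REQUIRED_FIELDS = {"name", "version", "chapters"}


def export_km(km: Dict[str, Any]) -> str:
    """Serialize a knowledge model to JSON text with a stable key order."""
    return json.dumps(km, indent=2, sort_keys=True)


def import_km(text: str) -> Dict[str, Any]:
    """Parse JSON text and reject documents lacking the required fields."""
    km = json.loads(text)
    if not isinstance(km, dict):
        raise ValueError("a KM file must contain a JSON object")
    missing = REQUIRED_FIELDS - km.keys()
    if missing:
        raise ValueError(f"KM file is missing fields: {sorted(missing)}")
    return km


# Invented example KM with a single empty chapter.
example_km = {
    "name": "example-km",
    "version": "1.0.0",
    "chapters": [{"title": "Data collection", "questions": []}],
}
restored = import_km(export_km(example_km))
```

Checking the structure on import is what makes files shared between
independently deployed instances safe to load.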
[Figure 2: architecture diagram. Labels: Global Wizard, Local Wizard, KM
Packages (root, v0.3.0, v1.5.1, ...), DS Planner, DM Plan, Questions,
Answers, Researcher, DS Domain, Project, Publish, Update, Export, Import,
Feedback, Discuss, Browse, Learn, Certify.]
Fig. 2. Data Stewardship Wizard – architecture
MongoDB is used as the storage for all the data in the Wizard. It can
also run in Docker, locally, or remotely. There are other integrations as
well, for example submitting question-related feedback as GitHub issues
directly from the Wizard, sending emails, or subscribing to a newsletter.
In the future, the number of such integrations is expected to grow
significantly.
4.5 Roadmap
Aside from the tasks defined in the Implementation Study, we focus on
community needs and align our development plans with them. We receive a
lot of valuable feedback from both non-commercial and commercial
environments through various channels. Our plans and projects are
publicly available at https://github.com/DataStewardshipWizard.
For Fall 2018, we are ready to deliver new features such as metrics (FAIR
but also others) connected to questions and answers, desirability phases
of questions, extended types of references, on-the-fly questionnaire
indications, summary reports, and DMP exports in various formats. In the
following period, we will focus on the already mentioned integrations,
especially in terms of evaluating the questionnaire (e.g., connecting to
a FAIRness evaluator or a cost evaluator), but also on integration with
other ELIXIR services such as the Training eSupport System (TeSS), the
EDAM Ontology, Bioschemas, or FAIRsharing.org.
5 Conclusion
The Data Stewardship Wizard is a smart but simple-to-use solution
covering the whole process of data management planning. A broad community
is using, testing, and exploring the details of the Wizard and providing
valuable feedback, for which we are always very thankful. We are happy
that our work helps people and science in general and that people are
genuinely interested in it. Improving and adjusting the Wizard will be a
long or even never-ending journey, but an undemanding one thanks to the
architecture and technologies used.
References
1. EUROPEAN COMMISSION: Open Science, https://ec.europa.eu/research/openscience. Last accessed 20 Jun 2018
2. ELIXIR: About us, https://www.elixir-europe.org/about-us. Last accessed 15 Jun 2018
3. ELIXIR-CZECH: Organisation/Structure, https://www.elixir-czech.cz/about-elixir-cz/organisation-structure. Last accessed 15 Jun 2018
4. DTL: ELIXIR-NL, https://www.dtls.nl/elixir-nl/. Last accessed 16 Jun 2018
5. WILKINSON, Mark D. et al.: The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, vol. 3, 2016
6. GO FAIR: GO FAIR Initiative, https://www.go-fair.org/go-fair-initiative/. Last accessed 17 Jun 2018
7. FAIR METRICS GROUP: FAIR Metrics, http://fairmetrics.org. Last accessed 15 Jun 2018
8. MONS, Barend: Data Stewardship for Open Science: Implementing FAIR Principles, 1st edn. CRC Press, 2018, ISBN 9781498753173
Acknowledgements: The project has been supported by ELIXIR and ELIXIR
CZ. This paper was supported by CTU research grant SGS17/211/OHK3/3T/18.