Article

The FAIR Guiding Principles for scientific data management and stewardship

Abstract

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.
Mark D. Wilkinson et al.#

# A full list of authors and their affiliations appears at the end of the paper. Correspondence and requests for materials should be addressed to B.M. (email: barend.mons@dtls.nl).

Received: 10 December 2015; Accepted: 12 February 2016; Published: 15 March 2016
Supporting discovery through good data management
Good data management is not a goal in itself, but rather is the key conduit leading to knowledge
discovery and innovation, and to subsequent data and knowledge integration and reuse by the
community after the data publication process. Unfortunately, the existing digital ecosystem
surrounding scholarly data publication prevents us from extracting maximum benefit from our
research investments (e.g., ref. 1). Partially in response to this, science funders, publishers and
governmental agencies are beginning to require data management and stewardship plans for data
generated in publicly funded experiments. Beyond proper collection, annotation, and archival, data
stewardship includes the notion of 'long-term care' of valuable digital assets, with the goal that they
should be discovered and re-used for downstream investigations, either alone, or in combination with
newly generated data. The outcomes from good data management and stewardship, therefore, are
high quality digital publications that facilitate and simplify this ongoing process of discovery, evaluation,
and reuse in downstream studies. What constitutes 'good data management' is, however, largely undefined, and is generally left as a decision for the data or repository owner. Therefore, bringing some clarity around the goals and desiderata of good data management and stewardship, and defining
simple guideposts to inform those who publish and/or preserve scholarly data, would be of great utility.
This article describes four foundational principles (Findability, Accessibility, Interoperability, and Reusability) that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added-value gained by contemporary, formal scholarly digital publishing. Importantly, it is our intent that the principles apply not only to 'data' in the conventional sense, but also to the algorithms, tools, and workflows that led to that data. All scholarly digital research objects [2], from data to analytical pipelines, benefit from application of
these principles, since all components of the research process must be available to ensure
transparency, reproducibility, and reusability.
There are numerous and diverse stakeholders who stand to benefit from overcoming these obstacles: researchers wanting to share, get credit, and reuse each other's data and interpretations; professional
data publishers offering their services; software and tool-builders providing data analysis and
processing services such as reusable workows; funding agencies (private and public) increasingly
concerned with long-term data stewardship; and a data science community mining, integrating and
analysing new and existing data to advance discovery. To facilitate the reading of this manuscript by
these diverse stakeholders, we provide definitions for common abbreviations in Box 1. Humans, however, are not the only critical stakeholders in the milieu of scientific data. Similar problems are
encountered by the applications and computational agents that we task to undertake data retrieval
and analysis on our behalf. These 'computational stakeholders' are increasingly relevant, and demand
as much, or more, attention as their importance grows. One of the grand challenges of data-intensive
science, therefore, is to improve knowledge discovery through assisting both humans, and their
computational agents, in the discovery of, access to, and integration and analysis of, task-appropriate
scientific data and other scholarly digital objects.
For certain types of important digital objects, there are well-curated, deeply-integrated, special-purpose repositories such as GenBank [3], Worldwide Protein Data Bank (wwPDB [4]), and UniProt [5] in the life sciences; and Space Physics Data Facility (SPDF; http://spdf.gsfc.nasa.gov/) and Set of Identifications, Measurements and Bibliography for Astronomical Data (SIMBAD [6]) in the space sciences.
sciences. These foundational and critical core resources are continuously curating and capturing high-
value reference datasets and fine-tuning them to enhance scholarly output, provide support for both
human and mechanical users, and provide extensive tooling to access their content in rich, dynamic
ways. However, not all datasets or even data types can be captured by, or submitted to, these
repositories. Many important datasets emerging from traditional, low-throughput bench science don't fit in the data models of these special-purpose repositories, yet these datasets are no less important
with respect to integrative research, reproducibility, and reuse in general. Apparently in response to
this, we see the emergence of numerous general-purpose data repositories, at scales ranging from institutional (for example, a single university), to open globally-scoped repositories such as Dataverse [7], FigShare (http://figshare.com), Dryad [8], Mendeley Data (https://data.mendeley.com/), Zenodo (http://zenodo.org/), DataHub (http://datahub.io), DANS (http://www.dans.knaw.nl/), and EUDAT [9]. Such
repositories accept a wide range of data types in a wide variety of formats, generally do not attempt
to integrate or harmonize the deposited data, and place few restrictions (or requirements) on the
descriptors of the data deposition. The resulting data ecosystem, therefore, appears to be moving
away from centralization, is becoming more diverse, and less integrated, thereby exacerbating the
discovery and re-usability problem for both human and computational stakeholders.
A specific example of these obstacles could be imagined in the domain of gene regulation and expression
analysis. Suppose a researcher has generated a dataset of differentially-selected polyadenylation sites in
a non-model pathogenic organism grown under a variety of environmental conditions that stimulate its
pathogenic state. The researcher is interested in comparing the alternatively-polyadenylated genes in
this local dataset, to other examples of alternative-polyadenylation, and to the expression levels of these genes, both in this organism and related model organisms, during the infection process. Given that
there is no special-purpose archive for differential polyadenylation data, and no model organism
database for this pathogen, where does the researcher begin?
We will consider the current approach to this problem from a variety of data discovery and integration
perspectives. If the desired datasets existed, where might they have been published, and how would
one begin to search for them, using what search tools? The desired search would need to filter based on specific species, specific tissues, specific types of data (Poly-A, microarray, NGS), specific conditions (infection), and specific genes. Is that information (metadata) captured by the repositories, and if so, what format is it in, is it searchable, and how? Once the data is discovered, can it be downloaded? In what format(s)? Can that format be easily integrated with private in-house data (the local dataset of alternative polyadenylation sites) as well as other data publications from third-parties and with the community's core gene/protein data repositories? Can this integration be
done automatically to save time and avoid copy/paste errors? Does the researcher have permission to use the data from these third-party researchers, under what license conditions, and who should be cited if a data-point is re-used? (A sketch of the kind of machine-driven search these questions presuppose follows Box 1 below.)

Box 1 | Terms and Abbreviations

BD2K: Big Data to Knowledge; a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximise community engagement.

DOI: Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

FAIR: Findable, Accessible, Interoperable, Reusable.

FORCE11: The Future of Research Communications and e-Scholarship; a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing, initiated in 2011.

Interoperability: the ability of data or tools from non-cooperating resources to integrate or work together with minimal effort.

JDDCP: Joint Declaration of Data Citation Principles; acknowledging data as a first-class research output, and to support good research practices around data re-use, the JDDCP proposes a set of guiding principles for citation of data within scholarly literature, another dataset, or any other research object.

RDF: Resource Description Framework; a globally-accepted framework for data and knowledge representation that is intended to be read and interpreted by machines.
Questions such as these highlight some of the barriers to data discovery and reuse, not only for
humans, but even more so for machines; yet it is precisely these kinds of deeply and broadly
integrative analyses that constitute the bulk of contemporary e-Science. The reason that we often
need several weeks (or months) of specialist technical effort to gather the data necessary to answer
such research questions is not the lack of appropriate technology; the reason is, that we do not pay
our valuable digital objects the careful attention they deserve when we create and preserve them.
Overcoming these barriers, therefore, necessitates that all stakeholders (including researchers, special-purpose repositories, and general-purpose repositories) evolve to meet the emergent challenges described above. The goal is for scholarly digital objects of all kinds to become 'first-class citizens' in the scientific publication ecosystem, where the quality of the publication, and more importantly the impact of the publication, is a function of its ability to be accurately and appropriately found, re-used, and cited over time, by all stakeholders, both human and mechanical.
With this goal in mind, a workshop was held in Leiden, The Netherlands, in 2014, named 'Jointly Designing a Data Fairport'. This workshop brought together a wide group of academic and private stakeholders, all of whom had an interest in overcoming data discovery and reuse obstacles. From the deliberations at the workshop the notion emerged that, through the definition of, and widespread support for, a minimal set of community-agreed guiding principles and practices, all stakeholders could more easily discover, access, appropriately integrate and re-use, and adequately cite, the vast quantities of information being generated by contemporary data-intensive science. The meeting concluded with a draft formulation of a set of foundational principles that were subsequently elaborated in greater detail; namely, that all research objects should be Findable, Accessible, Interoperable and Reusable (FAIR) both for machines and for people. These are now referred to as the FAIR Guiding Principles. Subsequently, a dedicated FAIR working group, established by several members of the FORCE11 community [10], fine-tuned and improved the Principles. The results of these efforts are reported here.
The significance of machines in data-rich research environments
The emphasis placed on FAIRness being applied to both human-driven and machine-driven activities is a specific focus of the FAIR Guiding Principles that distinguishes them from many peer initiatives (discussed in the subsequent section). Humans and machines often face distinct barriers when attempting to find and process data on the Web. Humans have an intuitive sense of 'semantics' (the meaning or intent of a digital object) because we are capable of identifying and interpreting a wide variety of contextual cues, whether those take the form of structural/visual/iconic cues in the layout of a Web page, or the content of narrative notes. As such, we are less likely to make errors in the selection of appropriate data or other digital objects, although humans will face similar difficulties if sufficient contextual metadata is lacking. The primary limitation of humans, however, is that we are unable to operate at the scope, scale, and speed necessitated by the volume of contemporary scientific data and the complexity of e-Science. It is for this reason that humans increasingly rely on computational agents to undertake discovery and integration tasks on their behalf. This necessitates that machines be capable of acting autonomously and appropriately when faced with the wide range of types, formats, and access-mechanisms/protocols that will be encountered during their self-guided exploration of the global data ecosystem. It also necessitates that the machines keep an exquisite record of provenance, such that the data they collect can be accurately and adequately cited. Assisting these agents, therefore, is a critical consideration for all participants in the data management and stewardship process, from researchers and data producers to data repository hosts.
Throughout this paper, we use the phrase 'machine-actionable' to indicate a continuum of possible states wherein a digital object provides increasingly more detailed information to an autonomously-acting, computational data explorer. This information enables the agent, to a degree dependent on the amount of detail provided, when faced with a digital object it has never encountered before, to: a) identify the type of object (with respect to both structure and intent); b) determine if it is useful within the context of the agent's current task by interrogating metadata and/or data elements; c) determine if it is usable, with respect to license, consent, or other accessibility or use constraints; and d) take appropriate action, in much the same manner that a human would.
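A minimal sketch of this triage logic follows, under the assumption that the object's metadata has already been retrieved as a simple dictionary; the field names and accepted values (`type`, `license`, `format`, and so on) are illustrative, not a published standard.

```python
# Illustrative decision logic for an autonomous agent meeting an unknown
# digital object; metadata keys and accepted values are assumptions.
PARSEABLE_FORMATS = {"text/csv", "application/rdf+xml", "application/json"}
ACCEPTABLE_LICENSES = {"CC0-1.0", "CC-BY-4.0"}

def triage(metadata: dict) -> str:
    if "type" not in metadata:
        return "reject: cannot identify the kind of object (step a)"
    if metadata.get("subject") != "alternative polyadenylation":
        return "skip: not useful for the current task (step b)"
    if metadata.get("license") not in ACCEPTABLE_LICENSES:
        return "defer: usable only after license review (step c)"
    if metadata.get("format") not in PARSEABLE_FORMATS:
        return "partial: relevant and permitted, but not parseable (step d fails)"
    return "accept: retrieve, parse, and record provenance for citation (step d)"

print(triage({"type": "dataset",
              "subject": "alternative polyadenylation",
              "license": "CC-BY-4.0",
              "format": "text/csv"}))
```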
For example, a machine may be capable of determining the data-type of a discovered digital object, but not capable of parsing it because the format is unknown; or it may be capable of processing the contained data, but not capable of determining the licensing requirements related to the retrieval and/or use of that data. The optimal state, where machines fully 'understand' and can autonomously and correctly operate on a digital object, may rarely be achieved. Nevertheless, the FAIR principles provide steps along a 'path' toward machine-actionability; adopting the FAIR principles, in whole or in part, leads the resource along the continuum towards this optimal state. In addition, the idea of being machine-actionable applies in two contexts: first, when referring to the contextual metadata surrounding a digital object ('what is it?'), and second, when referring to the content of the digital object itself ('how do I process it/integrate it?'). Either, or both, of these may be machine-actionable, and each forms its own continuum of actionability.
Finally, we wish to draw a distinction between data that is machine-actionable as a result of specific investment in software supporting that data-type (for example, bespoke parsers that understand life science wwPDB files or space science Space Physics Archive Search and Extract (SPASE) files), and data that is machine-actionable exclusively through the utilization of general-purpose, open technologies. To reiterate the earlier point: ultimate machine-actionability occurs when a machine can make a useful decision regarding data that it has not encountered before. This distinction is important when considering both (a) the rapidly growing and evolving data environment, with new technologies and new, more complex data-types continuously being developed, and (b) the growth of general-purpose repositories, where the data-types likely to be encountered by an agent are unpredictable. Creating bespoke parsers, in all computer languages, for all data-types and all analytical tools that require those data-types, is not a sustainable activity. As such, assisting machines in their discovery and exploration of data, through the application of more generalized interoperability technologies and standards at the data/repository level, becomes a first priority for good data stewardship.
The FAIR Guiding Principles in detail
Representatives of the interested stakeholder groups discussed above coalesced around four core desiderata (the FAIR Guiding Principles) and a limited elaboration of these, which have been refined (Box 2) from the meeting's original draft, available at https://www.force11.org/node/6062. A separate document that dynamically addresses community discussion relating to clarifications and explanations of the principles, and detailed guidelines for and examples of FAIR implementations, is currently being constructed (http://datafairport.org/fair-principles-living-document-menu). The FAIR Guiding Principles describe distinct considerations for contemporary data publishing environments with respect to supporting both manual and automated deposition, exploration, sharing, and reuse. While there have been a number of recent, often domain-focused publications advocating for specific improvements in practices relating to data management and archival [1,11,12], FAIR differs in that it describes concise, domain-independent, high-level principles that can be applied to a wide range of scholarly outputs. Throughout the Principles, we use the phrase '(meta)data' in cases where the Principle should be applied to both metadata and data.
The elements of the FAIR Principles are related, but independent and separable. The Principles define characteristics that contemporary data resources, tools, vocabularies and infrastructures should exhibit to assist discovery and reuse by third-parties. By minimally defining each guiding principle, the barrier-to-entry for data producers, publishers and stewards who wish to make their data holdings FAIR is purposely kept as low as possible. The Principles may be adhered to in any combination and incrementally, as data providers' publishing environments evolve to increasing degrees of FAIRness. Moreover, the modularity of the Principles, and their distinction between data and metadata, explicitly support a wide range of special circumstances. One such example is highly sensitive or personally-identifiable data, where publication of rich metadata to facilitate discovery, including clear rules regarding the process for accessing the data, provides a high degree of FAIRness
even in the absence of FAIR publication of the data itself. A second example involves the publication of non-data research objects. Analytical workflows, for example, are a critical component of the scholarly ecosystem, and their formal publication is necessary to achieve both transparency and scientific reproducibility. The FAIR principles can equally be applied to these non-data assets, which need to be identified, described, discovered, and reused in much the same manner as data. Specific exemplar efforts that provide varying levels of FAIRness are detailed later in this document.

Box 2 | The FAIR Guiding Principles

To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available

To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data

To be Reusable:
R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1 (meta)data are released with a clear and accessible data usage license
R1.2 (meta)data are associated with detailed provenance
R1.3 (meta)data meet domain-relevant community standards
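As a minimal illustration of how several Box 2 principles combine in a single record, consider a metadata description like the following sketch, expressed here as a Python dictionary with JSON-LD-style keys. The identifiers, key vocabulary, and field choices are illustrative assumptions, not a schema prescribed by the Principles.

```python
# Illustrative metadata record touching several FAIR principles; all values
# are placeholders, and the (schema.org-like) key vocabulary is an assumption.
record = {
    "@id": "https://doi.org/10.xxxx/example",   # F1: globally unique, persistent identifier
    "@type": "Dataset",
    "name": "Differential polyadenylation sites under infection-like conditions",
    "description": "Rich, searchable description of the deposit (F2, F4).",
    "identifier": "https://doi.org/10.xxxx/example",  # F3: metadata carries the data identifier
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",  # R1.1: clear usage license
    "provenance": {                                   # R1.2: detailed provenance
        "generatedBy": "reference to the protocol or workflow used",
        "creator": "ORCID identifier of the researcher",
    },
    "conformsTo": "a domain-relevant community standard (R1.3)",
}
```

Note that such a record can be published, indexed, and retrieved even when the underlying data are access-restricted, which is exactly the sensitive-data case described above.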
Additional issues, however, remain to be addressed. First, when community-endorsed vocabularies or other (meta)data standards do not include the attributes necessary to achieve rich annotation, there are two possible solutions: either publish an extension of an existing, closely related vocabulary, or, in the extreme case, create and explicitly publish a new vocabulary resource, following FAIR principles (I2). Second, to explicitly identify the standard chosen when more than one vocabulary or other (meta)data standard is available (given that, for instance, in the life sciences there are over 600 content standards), the BioSharing registry (https://biosharing.org/) can be of use, as it describes the standards in detail, including versions where applicable.
The Principles precede implementation
These high-level FAIR Guiding Principles precede implementation choices, and do not suggest any
specic technology, standard, or implementation-solution; moreover, the Principles are not,
themselves, a standard or a specication. They act as a guide to data publishers and stewards to
assist them in evaluating whether their particular implementation choices are rendering their digital
research artefacts Findable, Accessible, Interoperable, and Reusable. We anticipate that these high
level principles will enable a broad range of integrative and exploratory behaviours, based on a wide
range of technology choices and implementations. Indeed, many repositories are already implementing various aspects of FAIR using a variety of technology choices, and several examples are detailed in the next section; these include Scientific Data itself, whose narrative data articles are anchored to progressively FAIR structured metadata.
Examples of FAIRness, and the resulting value-added
Dataverse [7]: Dataverse is an open-source data repository software installed in dozens of institutions globally to support public community repositories or institutional research data repositories. Harvard Dataverse, with more than 60,000 datasets, is the largest of the current Dataverse repositories, and is open to all researchers from all research fields. Dataverse generates a formal citation for each deposit, following the standard defined by Altman and King [13]. Dataverse makes the Digital Object Identifier (DOI), or other persistent identifiers (Handles), public when the dataset is published (F). This resolves to a landing page, providing access to metadata, data files, dataset terms, waivers or licenses, and version information, all of which is indexed and searchable (F, A, and R). Deposits include metadata, data files, and any complementary files (such as documentation or code) needed to understand the data and analysis (R). Metadata is always public, even if the data are restricted or removed for privacy issues (F, A). This metadata is offered at three levels, extensively supporting the I and R FAIR principles: 1) data citation metadata, which maps to the DataCite schema or Dublin Core Terms; 2) domain-specific metadata, which, when possible, maps to metadata standards used within a scientific domain; and 3) file-level metadata, which can be deep and extensive for tabular data files (including column-level metadata). Finally, Dataverse provides public machine-accessible interfaces to search the data, access the metadata and download the data files, using a token to grant access when data files are restricted (A).
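Because Dataverse (like several of the repositories below) exposes persistent identifiers, generic DOI machinery already provides one machine-accessible route to the metadata. The sketch below uses standard DOI content negotiation, which is supported for DataCite-registered DOIs such as those Dataverse typically mints; the DOI shown is a placeholder, not a real deposit.

```python
import requests

# Placeholder DOI; any DataCite-registered dataset DOI should behave similarly.
doi = "https://doi.org/10.xxxx/placeholder"

# DOI content negotiation: ask the resolver for machine-readable citation
# metadata (CSL JSON) instead of the human-oriented landing page.
resp = requests.get(doi,
                    headers={"Accept": "application/vnd.citationstyles.csl+json"},
                    timeout=30)
resp.raise_for_status()
meta = resp.json()
print(meta.get("title"), meta.get("publisher"), meta.get("issued"))
```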
FAIRDOM (http://fair-dom.org/about): integrates the SEEK [14] and openBIS [15] platforms to produce a FAIR data and model management facility for Systems Biology. Individual research assets (or aggregates of data and models) are identified with unique and persistent HTTP URLs, which can be registered with DOIs for publication (F). Assets can be accessed over the Web in a variety of formats appropriate for individuals and/or their computers (RDF, XML) (I). Research assets are annotated with rich metadata, using community standards, formats and ontologies (I). The metadata is stored as RDF to enable interoperability, and assets can be downloaded for reuse (R).
ISA [16]: is a community-driven metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of life science datasets. ISA provides progressively FAIR structured metadata to Nature Scientific Data's Data Descriptor articles and many GigaScience data papers, and underpins the EBI MetaboLights database, among other data resources. At its heart is a general-purpose, extensible ISA model, originally only available as a tabular representation, but subsequently enhanced with an RDF-based representation [17] and JSON serializations to enable the I and R, becoming 'FAIR' when published as linked data (http://elixir-uk.org/node-events/“isa-as-a-fair-research-object”-hack-the-spec-event-1) and complementing other research objects [18].
Open PHACTS [19]: Open PHACTS is a data integration platform for information pertaining to drug discovery. Access to the platform is mediated through a machine-accessible interface [20] which provides multiple representations that are both human-readable (HTML) and machine-readable (RDF, JSON, XML, CSV, etc.), providing the A facet of FAIRness. The interface allows multiple URLs to be used to access information about a particular entity through a mappings service (F and A). Thus, a user can provide a ChEMBL URL to retrieve information sourced from, for example, ChemSpider or DrugBank. Each call provides a canonical URL in its response (A and I). All data sources used are described using standardized dataset descriptions, following the global VoID standard, with rich provenance (R and I). All interface features are described using RDF following the Linked Data API specification (A). Finally, a majority of the datasets are described using community-agreed ontologies (I).
wwPDB [4,21]: wwPDB is a special-purpose, intensively-curated data archive that hosts information about experimentally-determined 3D structures of proteins and nucleic acids. All wwPDB entries are stably hosted on an FTP server (A) and represented in machine-readable formats (text and XML); the latter are machine-actionable using the metadata provided by the wwPDB, conforming to the Macromolecular Information Framework (mmCIF [22]), a data standard of the International Union of Crystallography (IUCr) (F, I for humans; F, I for IUCr-aware machines). The wwPDB metadata contains cross-references to common identifiers such as PubMed and NCBI Taxonomy, and the wwPDB metadata are described in data dictionaries and schema documents (http://mmcif.wwpdb.org and http://pdbml.wwpdb.org) which conform to the IUCr data standard for the chemical and structural biology domains (R). A variety of software tools are available to interpret both wwPDB data and metadata (I, R for humans; I, R for machines with this software). Each entry is represented by a DOI (F, A for humans and machines). The DOI resolves to a zipped file which requires special software for further interrogation/interpretation. Other wwPDB access points [23–25] provide access to wwPDB records through URLs that are likely to be stable in the long-term (F), and all data and metadata are searchable through one or more of the wwPDB-affiliated websites (F).
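A sketch of how this kind of machine-actionability plays out in practice with bespoke but freely available tooling: downloading one entry in mmCIF format and reading it with Biopython's mmCIF tag/value parser. The entry ID and the RCSB download URL pattern reflect one wwPDB access point at the time of writing and may change; treat both as illustrative assumptions.

```python
import requests
from Bio.PDB.MMCIF2Dict import MMCIF2Dict  # Biopython's mmCIF tag/value parser

# Illustrative entry and URL pattern for one wwPDB access point (RCSB).
entry = "4HHB"
url = f"https://files.rcsb.org/download/{entry}.cif"

with open(f"{entry}.cif", "wb") as fh:
    fh.write(requests.get(url, timeout=60).content)

# mmCIF is machine-actionable because every field is a named tag defined
# in the IUCr data dictionaries; for example, the entry title:
tags = MMCIF2Dict(f"{entry}.cif")
print(tags["_struct.title"])
```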
UniProt [26]: UniProt is a comprehensive resource for protein sequence and annotation data. All entries are uniquely identified by a stable URL that provides access to the record in a variety of formats, including a web page, plain text, and RDF (F and A). The record contains rich metadata (F) that is both human-readable (HTML) and machine-readable (text and RDF), where the RDF-formatted response utilizes shared vocabularies and ontologies such as UniProt Core, FALDO, and ECO (I). Interlinking with more than 150 different databases, every UniProt record has extensive links into, for example, PubMed, enabling rich citation. These links are machine-actionable in the RDF representation (R). Finally, in the RDF representation, the UniProt Core Ontology explicitly types all records, leaving no ambiguity, for humans or machines, about what the data represents (R), enabling fully-automated retrieval of records and cross-referencing information.
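A sketch of retrieving one record through its stable URL in several representations. The accession is an arbitrary example, and the exact URL layout reflects UniProt's historical pattern and may have evolved since (requests follows any redirects the service issues).

```python
import requests

accession = "P05067"  # arbitrary example accession

# The same record, negotiated into human- and machine-oriented formats
# via a stable, predictable URL pattern (F, A, I).
for suffix in ("txt", "fasta", "rdf"):
    url = f"https://www.uniprot.org/uniprot/{accession}.{suffix}"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    print(url, "->", resp.headers.get("Content-Type"))
```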
In addition to, and in support of, communities and resources that are already pursuing FAIR objectives, the Data Citation Implementation Group of FORCE11 has published specific technical recommendations for how to implement many of the principles [27], with a particular focus on identifiers and their resolution, persistence, and metadata accessibility, especially as related to citation. In addition, the 'Skunkworks' group that emerged from the Lorentz Workshop has been creating software supporting infrastructures [28] that are, end-to-end, compatible with FAIR principles, and can be implemented over existing repositories. These code modules have a particular focus on metadata publication and searchability, compatibility in cases of strict privacy considerations, and the extremely difficult problem of data and metadata interoperability (manuscript in preparation). Finally, there are several emergent projects, some listed in Box 3, for which FAIR is a key objective. These projects may provide valuable advice and guidance for those wishing to become more FAIR.
FAIRness is a prerequisite for proper data management and data stewardship
The ideas within the FAIR Guiding Principles reflect, combine, build upon and extend previous work by both the Concept Web Alliance (https://conceptweblog.wordpress.com/) partners, who focused on machine-actionability and harmonization of data structures and semantics, and by the scientific and scholarly organizations that developed the Joint Declaration of Data Citation Principles (JDDCP [29]), who focused on primary scholarly data being made citable, discoverable and available for reuse, so as to be capable of supporting more rigorous scholarship. An attempt to define the similarities and overlaps between the FAIR Principles and the JDDCP is provided at https://www.force11.org/node/6062. The FAIR Principles are also complementary to the Data Seal of Approval (DSA) (http://datasealofapproval.org/media/filer_public/2013/09/27/guidelines_2014-2015.pdf), in that they share the general aim to render data re-usable for users other than those who originally generated them. While the DSA focuses primarily on the responsibilities and conduct of data producers and repositories, FAIR focuses primarily on the data itself. Clearly, the broader community of stakeholders is coalescing around a set of common, dovetailed visions spanning all facets of the scholarly data publishing ecosystem.

Box 3 | Emergent community/collaborative initiatives with FAIR as a core focus or activity

bioCADDIE (https://biocaddie.org): The NIH BD2K biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium works to develop a Data Discovery Index (DDI) prototype, which is set to be as transformative and impactful for data as PubMed has been for the biomedical literature [30]. The DDI focuses on finding (F) and accessing (A) the datasets stored across different sources, and progressively works to identify relevant metadata [31] (I) and map them to community standards (R), linking to BioSharing.

CEDAR [32]: The Center for Expanded Data Annotation and Retrieval (CEDAR) is an NIH BD2K-funded center of excellence developing tools and technologies that reduce the burden of authoring and enhancing metadata that meet community-based standards. CEDAR will enable the creation of metadata templates that implement community-based standards for experimental metadata, from BioSharing (https://biosharing.org), and that will be uniquely identifiable and retrievable with HTTP URIs, and annotated with vocabularies and ontologies drawn from BioPortal (http://bioportal.bioontology.org) (F, A, I, R). These templates will guide users to create rich metadata with unique and stable HTTP identifiers (F) that can be retrieved using HTTP (A) and accessed in a variety of formats (JSON-LD, TURTLE, RDF/XML, CSV, etc.) (I). These metadata will use community standards, as defined by the template, and include provenance and data usage information (R).

These two projects, among others, provide tools and/or collaborative opportunities for those who wish to improve the FAIRness of their data.
The end result, when implemented, will be more rigorous management and stewardship of these
valuable digital resources, to the benefit of the entire academic community. As stated at the outset,
good data management and stewardship is not a goal in itself, but rather a pre-condition supporting
knowledge discovery and innovation. Contemporary e-Science requires data to be Findable,
Accessible, Interoperable, and Reusable in the long-term, and these objectives are rapidly becoming
expectations of agencies and publishers. We demonstrate, therefore, that the FAIR Data Principles
provide a set of mileposts for data producers and publishers. They guide the implementation of
the most basic levels of good Data Management and Stewardship practice, thus helping researchers
adhere to the expectations and requirements of their funding agencies. We call on all data producers and publishers to examine and implement these principles, and to actively participate in the FAIR initiative by joining the FORCE11 working group. By working together towards shared, common
goals, the valuable data produced by our community will gradually achieve the critical goals
of FAIRness.
References

1. Roche, D. G., Kruuk, L. E. B., Lanfear, R. & Binning, S. A. Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLOS Biol. 13, e1002295 (2015).
2. Bechhofer, S. et al. Research Objects: Towards Exchange and Reuse of Digital Knowledge. Nat. Preced. doi:10.1038/npre.2010.4626.1 (2010).
3. Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36–D42 (2013).
4. Berman, H., Henrick, K. & Nakamura, H. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. 10, 980 (2003).
5. The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
6. Wenger, M. et al. The SIMBAD astronomical database: The CDS reference database for astronomical objects. Astron. Astrophys. Suppl. Ser. 143, 9–22 (2000).
7. Crosas, M. "The Dataverse Network®: An Open-Source Application for Sharing, Discovering and Preserving Data". D-Lib Mag. 17(1), p2 (2011).
8. White, H. C., Carrier, S., Thompson, A., Greenberg, J. & Scherle, R. The Dryad data repository: A Singapore framework metadata architecture in a DSpace environment. Univ. Göttingen, p157 (2008).
9. Lecarpentier, D. et al. EUDAT: A New Cross-Disciplinary Data Infrastructure for Science. Int. J. Digit. Curation 8, 279–287 (2013).
10. Martone, M. E. FORCE11: Building the Future for Research Communications and e-Scholarship. Bioscience 65, 635 (2015).
11. White, E. et al. Nine simple ways to make it easier to (re)use your data. Ideas Ecol. Evol. 6 (2013).
12. Sandve, G. K., Nekrutenko, A., Taylor, J. & Hovig, E. Ten Simple Rules for Reproducible Computational Research. PLoS Comput. Biol. 9, e1003285 (2013).
13. Altman, M. & King, G. in D-Lib Magazine 13, no. 3/4 (2007).
14. Wolstencroft, K. et al. SEEK: a systems biology data and model management platform. BMC Syst. Biol. 9, 33 (2015).
15. Bauch, A. et al. openBIS: a flexible framework for managing and analyzing complex data in biology research. BMC Bioinformatics 12, 468 (2011).
16. Sansone, S.-A. et al. Toward interoperable bioscience data. Nat. Genet. 44, 121–126 (2012).
17. González-Beltrán, A., Maguire, E., Sansone, S.-A. & Rocca-Serra, P. linkedISA: semantic representation of ISA-Tab experimental metadata. BMC Bioinformatics 15, S4 (2014).
18. González-Beltrán, A. et al. From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles of Data Models and Workflows in Bioinformatics. PLoS ONE 10, e0127612 (2015).
19. Harland, L. Open PHACTS: A Semantic Knowledge Infrastructure for Public and Commercial Drug Discovery Research. Knowl. Eng. Knowl. Manag. Lect. Notes Comput. Sci. 7603/2012, 1–7 (2012).
20. Groth, P. et al. API-centric Linked Data integration: The Open PHACTS Discovery Platform case study. Web Semant. Sci. Serv. Agents World Wide Web 29, 12–18 (2014).
21. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
22. Bourne, P. E., Berman, H. M., Watenpaugh, K., Westbrook, J. D. & Fitzgerald, P. M. D. The macromolecular crystallographic information file (mmCIF). Meth. Enzym. 277, 571–590 (1997).
23. Rose, P. W. et al. The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 43, D345–D356 (2015).
24. Kinjo, A. R. et al. Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format. Nucleic Acids Res. 40, D453–D460 (2012).
25. Gutmanas, A. et al. PDBe: Protein Data Bank in Europe. Nucleic Acids Res. 42, D285–D291 (2014).
26. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
27. Starr, J. et al. Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Comput. Sci. 1, e1 (2015).
28. Wilkinson, M., Dumontier, M. & Durbin, P. DataFairPort: The Perl libraries version 0.231. doi:10.5281/zenodo.33584 (2015).
29. Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. San Diego, CA: FORCE11. https://www.force11.org/datacitation (2014).
30. Ohno-Machado, L. et al. NIH BD2K bioCADDIE white paper: Data Discovery Index. http://dx.doi.org/10.6084/m9.figshare.1362572 (2015).
31. NIH BD2K bioCADDIE WG3 Members. WG3-MetadataSpecifications: NIH BD2K bioCADDIE Data Discovery Index WG3 Metadata Specification v1. doi:10.5281/zenodo.28019 (2015).
32. Musen, M. A. et al. The center for expanded data annotation and retrieval. J. Am. Med. Informatics Assoc. 22, 1148–1152 (2015).
Acknowledgements
The original Lorentz Workshop 'Jointly Designing a Data FAIRport' was organized by Barend Mons in collaboration with and co-sponsored by the Lorentz Center, the Dutch Techcenter for the Life Sciences and the Netherlands eScience Center. The principles and themes described in this manuscript represent the significant voluntary contributions and participation of the authors at, and/or subsequent to, this workshop, and from the wider FORCE11, BD2K and ELIXIR communities. We also acknowledge and thank the organizers and backers of the NBDC/DBCLS BioHackathon 2015, where several of the authors made significant revisions to the FAIR Principles.
Author Contributions
M.W. was the primary author of the manuscript, and participated extensively in the drafting and editing
of the FAIR Principles. M.D. was signicantly involved in the drafting of the FAIR Principles. B.M.
conceived of the FAIR Data Initiative, contributed extensively to the drafting of the principles, and to this
manuscript text. All other authors are listed alphabetically, and contributed to the manuscript either by
their participation in the initial workshop and/or by editing or commenting on the manuscript text.
Additional Information
Competing financial interests: M.A. is the Nature Genetics Editor-in-Chief; S.A.S. is Scientific Data's Honorary Academic Editor and consultant.

How to cite this article: Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0
Mark D. Wilkinson (1), Michel Dumontier (2), IJsbrand Jan Aalbersberg (3), Gabrielle Appleton (3), Myles Axton (4), Arie Baak (5), Niklas Blomberg (6), Jan-Willem Boiten (7), Luiz Bonino da Silva Santos (8), Philip E. Bourne (9), Jildau Bouwman (10), Anthony J. Brookes (11), Tim Clark (12), Mercè Crosas (13), Ingrid Dillo (14), Olivier Dumon (3), Scott Edmunds (15), Chris T. Evelo (16), Richard Finkers (17), Alejandra Gonzalez-Beltran (18), Alasdair J.G. Gray (19), Paul Groth (3), Carole Goble (20), Jeffrey S. Grethe (21), Jaap Heringa (22), Peter A.C. 't Hoen (23), Rob Hooft (24), Tobias Kuhn (25), Ruben Kok (22), Joost Kok (26), Scott J. Lusher (27), Maryann E. Martone (28), Albert Mons (29), Abel L. Packer (30), Bengt Persson (31), Philippe Rocca-Serra (18), Marco Roos (32), Rene van Schaik (33), Susanna-Assunta Sansone (18), Erik Schultes (34), Thierry Sengstag (35), Ted Slater (36), George Strawn (37), Morris A. Swertz (38), Mark Thompson (32), Johan van der Lei (39), Erik van Mulligen (39), Jan Velterop (40), Andra Waagmeester (41), Peter Wittenburg (42), Katherine Wolstencroft (43), Jun Zhao (44) & Barend Mons (45,46,47)

(1) Center for Plant Biotechnology and Genomics, Universidad Politécnica de Madrid, Madrid 28223, Spain.
(2) Stanford University, Stanford 94305-5411, USA.
(3) Elsevier, Amsterdam 1043 NX, The Netherlands.
(4) Nature Genetics, New York 10004-1562, USA.
(5) Euretos and Phortos Consultants, Rotterdam 2741 CA, The Netherlands.
(6) ELIXIR, Wellcome Genome Campus, Hinxton CB10 1SA, UK.
(7) Lygature, Eindhoven 5656 AG, The Netherlands.
(8) Vrije Universiteit Amsterdam, Dutch Techcenter for Life Sciences, Amsterdam 1081 HV, The Netherlands.
(9) Office of the Director, National Institutes of Health, Rockville 20892, USA.
(10) TNO, Zeist 3700 AJ, The Netherlands.
(11) Department of Genetics, University of Leicester, Leicester LE1 7RH, UK.
(12) Harvard Medical School, Boston, Massachusetts MA 02115, USA.
(13) Harvard University, Cambridge, Massachusetts MA 02138, USA.
(14) Data Archiving and Networked Services (DANS), The Hague 2593 HW, The Netherlands.
(15) GigaScience, Beijing Genomics Institute, Shenzhen 518083, China.
(16) Department of Bioinformatics, Maastricht University, Maastricht 6200 MD, The Netherlands.
(17) Wageningen UR Plant Breeding, Wageningen 6708 PB, The Netherlands.
(18) Oxford e-Research Center, University of Oxford, Oxford OX1 3QG, UK.
(19) Heriot-Watt University, Edinburgh EH14 4AS, UK.
(20) School of Computer Science, University of Manchester, Manchester M13 9PL, UK.
(21) Center for Research in Biological Systems, School of Medicine, University of California San Diego, La Jolla, California 92093-0446, USA.
(22) Dutch Techcenter for the Life Sciences, Utrecht 3501 DE, The Netherlands.
(23) Department of Human Genetics, Leiden University Medical Center, Dutch Techcenter for the Life Sciences, Leiden 2300 RC, The Netherlands.
(24) Dutch TechCenter for Life Sciences and ELIXIR-NL, Utrecht 3501 DE, The Netherlands.
(25) VU University Amsterdam, Amsterdam 1081 HV, The Netherlands.
(26) Leiden Center of Data Science, Leiden University, Leiden 2300 RA, The Netherlands.
(27) Netherlands eScience Center, Amsterdam 1098 XG, The Netherlands.
(28) National Center for Microscopy and Imaging Research, UCSD, San Diego 92103, USA.
(29) Phortos Consultants, San Diego 92011, USA.
(30) SciELO/FAPESP Program, UNIFESP Foundation, São Paulo 05468-901, Brazil.
(31) Bioinformatics Infrastructure for Life Sciences (BILS), Science for Life Laboratory, Dept of Cell and Molecular Biology, Uppsala University, S-751 24 Uppsala, Sweden.
(32) Leiden University Medical Center, Leiden 2333 ZA, The Netherlands.
(33) Bayer CropScience, Gent Area 1831, Belgium.
(34) Leiden Institute for Advanced Computer Science, Leiden University Medical Center, Leiden 2300 RA, The Netherlands.
(35) Swiss Institute of Bioinformatics and University of Basel, Basel 4056, Switzerland.
(36) Cray, Inc., Seattle 98164, USA.
(37) Unaffiliated.
(38) University Medical Center Groningen (UMCG), University of Groningen, Groningen 9713 GZ, The Netherlands.
(39) Erasmus MC, Rotterdam 3015 CE, The Netherlands.
(40) Independent Open Access and Open Science Advocate, Guildford GU1 3PW, UK.
(41) Micelio, Antwerp 2180, Belgium.
(42) Max Planck Compute and Data Facility, MPS, Garching 85748, Germany.
(43) Leiden Institute of Advanced Computer Science, Leiden University, Leiden 2333 CA, The Netherlands.
(44) Department of Computer Science, Oxford University, Oxford OX1 3QD, UK.
(45) Leiden University Medical Center, Leiden and Dutch TechCenter for Life Sciences, Utrecht 2333 ZA, The Netherlands.
(46) Netherlands eScience Center, Amsterdam 1098 XG, The Netherlands.
(47) Erasmus MC, Rotterdam 3015 CE, The Netherlands.
... The principle of metadata findability introduced by GO FAIR metric group [33] was further exemplified by Wilkinson et al. [58] by the metric measuring the existence of the globally unique and persistent ID for the data included in the metadata, registering in a searchable resource. 7. ...
... Wilkinson et al. [58] defined accessibility as the availability of the standardized access protocol, accessibility of the metadata without data, authentication, and authorization. ...
Article
Full-text available
The mission of biobanks is to provide biological material and data for medical research. Reproducible medical studies of high quality require material and data with established quality. Metadata, defined as data that provides information about other data, represents the content of biobank collections, particularly which data accompanies the stored samples and which quality the available data features. The quality of biobank metadata themselves, however, is currently neither properly defined nor investigated in depth. We list the properties of biobanks that are most important for metadata quality management and emphasize both the role of biobanks as data brokers, which are responsible not for the quality of the data itself but for the quality of its representation, and the importance of supporting the search for biobank collections when the sample data is not accessible. Based on an intensive review of metadata definitions and definitions of quality characteristics, we establish clear definitions of metadata quality attributes and their metrics in a design science approach. In particular, we discuss the quality measures accuracy, completeness, coverage, consistency, timeliness, provenance, reliability, accessibility, and conformance to expectations together with their respective metrics. These definitions are intended as a foundation for establishing metadata quality management systems for biobanks.
... We have conducted a comprehensive literature review of published descriptions of UP sedimentary combustion features between ~ 47,5 ka and ~ 13 ka BP. For the guidelines for our review, we adhered to FAIR (Findability, Accessibility, Interoperability, Reusability) guiding principles for scientific data management and stewardship, commonly used in the natural sciences (Wilkinson et al. 2016). Data was collected from May 2019 until July 2022 using open access or widely accessible online repositories and search engines, including Google Scholar, Web of Science, Academia and Research-Gate. ...
Article
Full-text available
Pyrotechnology, the ability for hominins to use fire as a tool, is considered to be one of the most important behavioural adaptations in human evolution. While several studies have focused on identifying the emergence of fire use and later Middle Palaeolithic Neanderthal combustion features, far fewer have focused on modern human fire use. As a result, we currently have more data characterizing the hominin fire use prior to 50,000 years before present (BP), than we do for Upper Palaeolithic of Europe. Here we review the available data on Upper Palaeolithic fire evidence between 48,000 and 13,000 years BP to understand the evolution of modern human pyrotechnology. Our results suggest regional clustering of feature types during the Aurignacian and further demonstrate a significant change in modern human fire use, namely in terms of the intensification and structural variation between 35,000 and 28,000 years BP. This change also corresponds to the development and spread of the Gravettian technocomplex throughout Europe and may correspond to a shift in the perception of fire. Additionally, we also show a significant lack of available high-resolution data on combustion features during the height of last glacial maximum. Furthermore, we highlight the need for more research into the effects of syn- and post-depositional processes on archaeological combustion materials and a need for more standardization of descriptions in the published literature. Overall, our review shows a significant and complex developmental process for Upper Palaeolithic fire use which in many ways mirrors the behavioural evolution of modern humans seen in other archaeological mediums.
... The FAIR principle 28 (see Wilkinson et al., 2016 for details) for data management assumes that data produced are: ...
Technical Report
Full-text available
During the last decades, digitalisation of representation, data and interactive processes underpinning current practices of infrastructures and biodiversity management have taken different tracks leading to the development of specific knowledge that now has to be mainstreamed in order to render transport infrastructure sustainable with the smallest possible impact on biodiversity. In this document, we explore opportunities for both sectors offered by the development of the operative continuum between Geographic Information Systems (GIS), Building Information Model (BIM) and Digital Twin (DT) implemented by transport and/or biodiversity infrastructure developers and managers. Such a continuum would require a Common Data Environment (CDE) which still have to be defined in a context where biodiversity theme is almost absent from the BIM environment. Thanks to the survey performed by the BISON project among stakeholders from transport infrastructure and biodiversity sectors, we showed that the digital technology subject of transport infrastructures as well as biodiversity management is still a topic which seems to be mainly handled independently by a small group of experts, researchers and practitioners from both the sectors. In addition, this shortage seems to be shared among the Member States and their related stakeholder network due to a limited permeability between the transport infrastructure, the biodiversity and the information technology sectors. This report thus points out the main digital technologies which uses tend to emerge in order to manage transport infrastructures as well as biodiversity. In this respect, the report follows the data value-chain and identifies at each step the main digital technologies involved, their current use, and what gaps and barriers are hindering their spread in the market, if relevant. Therefore, it identifies the future main trends in terms of new technologies, or changes in their use. These discussions are not turned only toward the benefits for the transport infrastructure sector or the biodiversity one but rather focused on the opportunities offered by the mainstreaming of biodiversity issues within all the infrastructure management life-cycle1. First, the deliverable addresses the general aspect of data collection. In this respect, the first technical section focuses on sensors issue with two complementary and non-exclusive scopes. Sensors are initially considered in a mobile context, where they are embedded in vehicles (satellites, common vehicles, drones, etc.) and are recording data along the vehicle trajectory permitting for large-scale recording or places difficult to access. Second, sensors are considered to be static and to monitor the infrastructure or biodiversity assets they have been aimed at tracking. These static sensors are thus expected to be connected and part of the Internet of Things (IoT) to operate as a network. Such a functioning offers the opportunity for long-term continuous monitoring of the transport infrastructure and its environmental assets. Growing especially in the environmental sector, citizen-based data is the subject of the third part of the data collection topic. Citizen-based data are largely used for biodiversity monitoring and should be considered with the mainstreaming of biodiversity in transport infrastructure. For now, citizen-based approaches are rare in the transport infrastructure management sector and including these new approaches might open several new challenges. 
This section continues with a part dedicated to modelling with a focus on engineering models which aim to produce realistic simulated data to solve engineering problems. Being largely used in the industry sector and in civil engineering, ecological models are developing but their use for solving biodiversity questions occurring in the context of transport infrastructure management is still quite rare. To close the data acquisition aspect, a transversal section dedicated to artificial intelligence (AI) techniques intends to highlight their catalytic effects when implemented with the different data collection techniques addressed in the deliverable. We, therefore, conclude that both biodiversity and transport sectors use these tools and data for specific purposes which can often be mutualised and offer opportunities for cost-efficient improvement of transport infrastructure and biodiversity management. The second technical section is turned toward data management and sharing issue. We show that transferring knowledge and know-how from the BIM sector, especially regarding processes, constitute a large field of research, development and innovation. This expansion of the BIM application field should be developed and promoted in order to ensure interoperability between the two current silos represented by the biodiversity management on one side, and the transport infrastructure management on the other whilst they are more and more intertwined. The two main keys to address interoperability problems are data structure and exchange file format interoperability between software. Regarding data structure, incorporating BIM-related concepts and methods developed in the industry or in the real estate management are a necessary step. We thus propose to develop good practices inspired by BIM processes which can be applied to data collection as well as data sharing at the EU scale in the context of mainstreaming biodiversity in transport infrastructure. Finally, this section makes a focus on the central challenge encountered to address data spatial and temporal heterogeneity, which is relevant for mainstreaming biodiversity in transport infrastructure management. Particularly, this section discusses some data interoperability challenges. They are related to managing data at large scale with 2D GIS commonly used for biodiversity management and linear transport infrastructure and with BIM with regards to the development of DT tools. Such an interoperability issue must also be put in perspective of the development of smart sustainable cities which have to be connected with transport and/or biodiversity actual and digital infrastructures. After addressing data collection and their management issues, the deliverable explores some integrative applications which are expected to emerge from the development of digital tools, allowing for the integration of biodiversity themes into transport infrastructure management. Thus, the report pledges for the development of an integrative GIS/BIM/DT continuum able to properly integrate biodiversity management into the complete life-cycle of transport infrastructures to ensure their sustainability and prevent them from being a source of biodiversity loss. Thus, this section first addresses the opportunities in terms of development of a practitioner community offered by the joint work of the biodiversity management, the transport infrastructure management and the computer science communities. 
It then addresses the software development required to ensure data interoperability and collaboration between the actors involved in mainstreaming biodiversity in transport infrastructures. Finally, we explore emerging practices enabled by an inclusive GIS/BIM/DT continuum for biodiversity and transport infrastructure, such as integrating biodiversity into the life-cycle assessment of transport infrastructure and developing virtual and augmented reality for infrastructure management and for relations with citizens or regulatory administrations. Such an integrative continuum would, however, require substantial research, development, innovation and capacity building, as it constitutes a new activity sector at the crossroads of civil engineering, ecology and computer science. Digital technologies are energy- and resource-intensive; we therefore provide recommendations to ensure the sustainability of mainstreaming biodiversity in transport infrastructure in a digital environment. Similarly, specific data security recommendations are provided to mitigate risks associated with biodiversity data, for instance to prevent the illegal trade of protected species.
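To make the GIS/BIM linkage concrete, the sketch below shows, in plain Python, how a biodiversity observation recorded as a GIS point might be attached to the nearest BIM asset inside a hypothetical Common Data Environment. All class names, fields and the distance threshold are illustrative assumptions, not part of any BISON or published BIM schema.

```python
# Minimal sketch of a GIS/BIM link inside a hypothetical Common Data
# Environment (CDE): a biodiversity observation captured as a GIS point
# is associated with the nearest BIM asset via its globally unique ID.
# Classes, field names and the distance threshold are illustrative.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class BimAsset:
    guid: str          # IFC-style globally unique identifier
    name: str
    lat: float         # asset reference point in WGS84
    lon: float

@dataclass
class Observation:
    species: str
    lat: float
    lon: float

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def link_observation(obs, assets, max_dist_m=250.0):
    """Return the (asset, distance) pair for the closest asset, or None."""
    best = min(assets, key=lambda a: haversine_m(obs.lat, obs.lon, a.lat, a.lon))
    d = haversine_m(obs.lat, obs.lon, best.lat, best.lon)
    return (best, d) if d <= max_dist_m else None

assets = [BimAsset("2N3xW1", "Wildlife underpass km 42", 48.8570, 2.3525),
          BimAsset("9fQz7k", "Noise barrier section B", 48.8600, 2.3600)]
obs = Observation("Bufo bufo", 48.8572, 2.3530)
print(link_observation(obs, assets))
```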
Article
Background: On 21–24 June 2022, the European Food Safety Authority, together with the European Centre for Disease Prevention and Control, the European Chemicals Agency, the European Environment Agency, the European Medicines Agency, and the Joint Research Centre of the European Commission, held the “ONE – Health, Environment & Society – Conference 2022”.
Scope and approach: The conference brought together experts and stakeholders to reflect on how scientific advice related to food safety and nutrition will need to develop to respond to a fast-changing world. The event also explored how institutions that provide such advice should best prepare for the challenges ahead, and how they can contribute to policy targets and societal demands for safe, nutritious and sustainable food.
Key findings and conclusions: Overall, participants concluded that food safety assessments must be further advanced to remain fit for purpose and increase their relevance to society. To address the growing complexity in science and society, new ways of working that connect and integrate knowledge, data and expertise across a wide range of disciplines, sectors and actors must be embraced. One Health provides a valuable conceptual framework for advancing food safety assessments by ensuring the delivery of more integrated, cross-sectoral and collaborative health assessments. These assessments may help to better inform policies that support the transition towards a sustainable food system. As such, One Health could serve as a steppingstone to sustainable food. Urgent action is now required to define how the One Health principles can be implemented in food safety and nutrition.
Preprint
Deep-sea fishery in the Mediterranean Sea has historically been driven by the commercial profitability of deep-water red shrimps, and understanding the spatio-temporal dynamics of fishing is key to comprehensively evaluating the status of these profitable resources and preventing stock collapse. We provide a four-year time series of AIS-based observed monthly patterns and the related frequency of trawling disturbance at a resolution of 0.01° × 0.01°, accounting for the spatial extent and temporal variability of deep-water bottom-contact fisheries during the period 2015–2018. The dataset was estimated from 370 fishing vessels found to trawl in deep water (400–800 m) during the study period, which represent a significant part of the real fleet exploiting these fishing grounds in the study area. The reconstructed deep-water trawling effort dataset is available at: https://doi.org/10.17882/89150 (Pulcinella et al., 2022). This large-scale, high-resolution dataset may help researchers in many scientific fields, as well as those involved in fishery management and in updating existing management plans for deep-water red shrimp fisheries, as foreseen in relevant General Fisheries Commission for the Mediterranean (GFCM) recommendations.
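As a rough illustration of how such a gridded effort product can be derived, the sketch below bins hypothetical AIS pings onto a 0.01° × 0.01° grid per month. The records and layout are invented and do not reflect the authors' actual processing chain.

```python
# Sketch: aggregate AIS pings presumed to be fishing (e.g. speed-filtered)
# onto a 0.01 x 0.01 degree grid, per month, matching the dataset's
# stated resolution. Sample records are illustrative only.
from collections import Counter

CELL = 0.01  # grid resolution in degrees

def cell_id(lat, lon):
    """Snap a position to the lower-left corner of its 0.01 deg cell."""
    return (round(lat // CELL * CELL, 2), round(lon // CELL * CELL, 2))

# (lat, lon, 'YYYY-MM') tuples; in practice millions of pings per year
pings = [(41.234, 17.881, "2016-05"),
         (41.236, 17.884, "2016-05"),
         (41.512, 18.002, "2016-06")]

effort = Counter((month, cell_id(lat, lon)) for lat, lon, month in pings)
for (month, cell), n in sorted(effort.items()):
    print(month, cell, n, "pings")
```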
Article
In neuroscience, cohort study projects revolve around the collection, analysis, and sharing of multi-modal data. Recent years have witnessed a host of efficient, high-quality toolkits published and employed to improve the quality of multi-modal data in cohort studies. In turn, gleaning answers to relevant questions from such a conglomeration of studies is a time-consuming task for cohort researchers. As part of our efforts to tackle this problem, we propose a hierarchical neuroscience knowledge base consisting of projects/organizations, multi-modal databases, and toolkits, to facilitate researchers' search for answers. We first classified studies conducted under the topic “Frontiers in Neuroinformatics” according to the multi-modal data life cycle, and from these studies extracted information objects such as projects/organizations, multi-modal databases, and toolkits. We then mapped these information objects into our proposed knowledge base framework. A Python-based query tool has also been developed in tandem for quicker access to the knowledge base (accessible at https://github.com/Romantic-Pumpkin/PDT_fninf). Finally, based on the constructed knowledge base, we discuss some key research issues and underlying trends at different stages of the multi-modal data life cycle.
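A minimal sketch of the hierarchy the authors describe (projects/organizations → databases → toolkits) and of a toy query helper is given below. The entries and function are invented for illustration; the authors' actual tool lives at the GitHub URL above.

```python
# Toy hierarchical knowledge base: projects/organizations at the top,
# multi-modal databases in the middle, toolkits at the leaves.
# All entries here are illustrative placeholders.
knowledge_base = {
    "Human Connectome Project": {
        "databases": {
            "ConnectomeDB": {"modalities": ["MRI", "fMRI"],
                             "toolkits": ["ConnectomeWorkbench"]},
        }
    },
}

def find_toolkits(kb, modality):
    """Return toolkits attached to any database covering `modality`."""
    hits = []
    for project, pdata in kb.items():
        for db, dbdata in pdata["databases"].items():
            if modality in dbdata["modalities"]:
                hits.extend(dbdata["toolkits"])
    return hits

print(find_toolkits(knowledge_base, "fMRI"))  # -> ['ConnectomeWorkbench']
```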
Article
Aesthetic aspects of drinking water, such as Taste and Odor (T&O), have significant effects on consumer perceptions and acceptability. Solving unpleasant water T&O episodes in water supplies is challenging, since it requires expertise and know-how in diagnosis, evaluation of impacts, and implementation of control measures. We present gaps, challenges and perspectives for advancing water T&O science and technology by identifying key areas in sensory and chemical analysis, risk assessment and water treatment, as articulated by WaterTOP (COST Action CA18225), an interdisciplinary European and international network of researchers, experts, and stakeholders.
Article
Recent years have seen a substantial increase in the application of machine learning (ML) for the automated analysis of nondestructive examination (NDE) data. One application of interest is the use of ML to analyse data from in-service inspection of welds in the nuclear power and other industries. These inspections are performed in accordance with criteria described in the ASME Boiler and Pressure Vessel Code and require the use of reliable NDE techniques. The rapid growth of ML methods and the diversity of possible approaches indicate a need to assess the current capabilities of ML and automated data analysis for NDE and to identify gaps or shortcomings in current ML technologies as applied to the automated analysis of NDE data. In particular, there is a need to determine the impact of ML on NDE reliability. This paper discusses the findings of a literature survey on the current state of ML for the automated analysis of data from ultrasonic NDE of weld flaws. It provides an overview of ultrasonic NDE as used for weld inspections in the nuclear power and other industries. Data sets and ML models used in the literature are summarized, along with a generally applicable workflow for ML. Findings on the capabilities, limitations and potential gaps in feature selection, data selection, and ML model optimization are discussed. The paper identifies several needs for quantifying and validating the performance of ML methods for ultrasonic NDE, including the need for common data sets.
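The kind of generally applicable workflow the survey refers to, features extracted from ultrasonic signals feeding a cross-validated classifier, might look like the following sketch. The features and labels are synthetic stand-ins; no real inspection data are implied.

```python
# Sketch of a generic ML workflow for ultrasonic NDE: hand-crafted
# features per indication -> scaled -> classifier -> cross-validated
# score. Data are synthetic; a real reliability study would also need
# flaw-level grouping and much larger, shared data sets.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                   # e.g. amplitude/TOF features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = flaw, 0 = geometry echo

model = Pipeline([("scale", StandardScaler()),
                  ("clf", RandomForestClassifier(n_estimators=200, random_state=0))])
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```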
Preprint
Artificial Neural Networks (ANN) are already heavily involved in methods and applications for frequent tasks in computational chemistry, such as the representation of potential energy surfaces (PES) and spectroscopic predictions. This perspective provides an overview of the foundations of neural-network-based full-dimensional potential energy surfaces: their architectures, underlying concepts, representation, and applications to chemical systems. Methods for data generation and training procedures for PES construction are discussed, and means for error assessment and refinement through transfer learning are presented. A selection of recent results illustrates the latest improvements in the accuracy of PES representations and in the system sizes accessible to dynamics simulations, as well as NN applications enabling the direct prediction of physical results without dynamics simulations. The aim is to provide an overview of the current state-of-the-art NN approaches in computational chemistry and to point out the current challenges in enhancing the reliability and applicability of NN methods at larger scale.
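As a toy illustration of the basic idea, a feed-forward network can be regressed onto reference energies to give a smooth analytic surrogate of a PES. Real NN-PES work uses symmetry-adapted descriptors, far more data and careful validation; this one-dimensional sketch only shows the fitting loop.

```python
# Minimal sketch of a feed-forward NN potential: map a coordinate to an
# energy and fit by regression against reference data (here a synthetic
# Morse-like curve standing in for ab initio energies).
import torch
from torch import nn

r = torch.linspace(0.8, 4.0, 200).unsqueeze(1)   # bond length grid
E = (1 - torch.exp(-(r - 1.5))) ** 2             # reference energies

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(r), E)
    loss.backward()
    opt.step()
print(float(loss))  # small MSE -> smooth analytic surrogate for the PES
```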
Article
Artificial intelligence (AI) has emerged as a fundamental component of global agricultural research that is poised to affect many aspects of plant science. In digital phenomics, AI is capable of learning intricate structure and patterns in large datasets. We provide a perspective and primer on AI applications to phenome research. We propose a novel human-centric explainable AI (X-AI) system architecture consisting of data architecture, technology infrastructure, and AI architecture design. We clarify the difference between post hoc models and 'interpretable by design' models, and include guidance for effectively using an interpretable-by-design model in phenomic analysis. We also provide directions to sources of tools and resources for making data analytics increasingly accessible. This primer is accompanied by an interactive online tutorial.
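A minimal example of an 'interpretable by design' model in the sense contrasted with post hoc explanation above is a shallow decision tree whose fitted rules can be printed verbatim. The trait names and data below are invented for illustration.

```python
# Sketch of an interpretable-by-design model: a depth-limited decision
# tree whose learned rules are directly readable, unlike a post hoc
# explanation bolted onto a black-box model. Data are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 3))                 # e.g. canopy, NDVI, height
y = 2.0 * X[:, 0] + (X[:, 2] > 0.5) + rng.normal(scale=0.1, size=150)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["canopy", "ndvi", "height"]))
```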
Article
Policies that mandate public data archiving (PDA) successfully increase accessibility to data underlying scientific publications. However, is the data quality sufficient to allow reuse and reanalysis? We surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong PDA policy. Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely prevented reuse. We suggest that cultural shifts facilitating clearer benefits to authors are necessary to achieve high-quality PDA and highlight key guidelines to help authors increase their data's reuse potential and compliance with journal data policies.
Article
Motivation: Reproducing the results of a scientific paper can be challenging due to the absence of the data and computational tools required for their analysis. In addition, details of the procedures used to obtain published results can be difficult to discern, because experiments are reported in natural language. The Investigation/Study/Assay (ISA), Nanopublications (NP), and Research Objects (RO) models are conceptual data modelling frameworks that can structure such information from scientific papers. Computational workflow platforms can also be used to reproduce analyses of data in a principled manner. We assessed the extent to which the ISA, NP, and RO models, together with the Galaxy workflow system, can capture the experimental processes and reproduce the findings of a previously published paper reporting on the development of SOAPdenovo2, a de novo genome assembler.
Results: Executable workflows were developed using Galaxy, which reproduced results consistent with the published findings. A structured representation of the information in the SOAPdenovo2 paper was produced by combining the ISA, NP, and RO models. By structuring the information in the published paper using these data and scientific workflow modelling frameworks, it was possible to explicitly declare elements of experimental design, variables, and findings. The models served as guides in the curation of scientific information, and this led to the identification of inconsistencies in the original published paper, allowing its authors to publish corrections in the form of an erratum.
Availability: SOAPdenovo2 scripts, data, and results are available through the GigaScience Database: http://dx.doi.org/10.5524/100044; the workflows are available from GigaGalaxy: http://galaxy.cbiit.cuhk.edu.hk; and the representations using the ISA, NP, and RO models are available through the SOAPdenovo2 case study website: http://isa-tools.github.io/soapdenovo2/.
Contact: philippe.rocca-serra@oerc.ox.ac.uk and susanna-assunta.sansone@oerc.ox.ac.uk.
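A minimal sketch of a nanopublication-style assertion with attached provenance, in the spirit of the NP model mentioned above, can be built with rdflib. The URIs and the metric value below are placeholders, not the identifiers actually used in the SOAPdenovo2 case study.

```python
# Sketch: one nanopublication-style assertion plus minimal provenance,
# expressed in RDF with rdflib. Namespaces and values are illustrative.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/np/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("ex", EX)
assertion = EX["assertion1"]
g.add((EX["soapdenovo2"], EX["assembles"], EX["genome_X"]))
g.add((assertion, RDF.type, EX["Assertion"]))
g.add((assertion, PROV.wasDerivedFrom,
       URIRef("http://dx.doi.org/10.5524/100044")))      # dataset cited above
g.add((assertion, EX["n50"], Literal(26247, datatype=XSD.integer)))  # illustrative value
print(g.serialize(format="turtle"))
```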
Article
Background: Systems biology research typically involves the integration and analysis of heterogeneous data types in order to model and predict biological processes. Researchers therefore require tools and resources to facilitate the sharing and integration of data, and the linking of data to systems biology models. There are many public repositories for storing biological data of a particular type, for example transcriptomics or proteomics, and there are several model repositories. However, this silo-type storage of data and models is not conducive to systems biology investigations: interdependencies between multiple omics datasets, and between datasets and models, are essential. Researchers require an environment that allows the management and sharing of heterogeneous data and models in the context of the experiments that created them.
Results: The SEEK is a suite of tools to support the management, sharing and exploration of data and models in systems biology. The SEEK platform provides an access-controlled, web-based environment for scientists to share and exchange data and models, for day-to-day collaboration and for public dissemination. A plug-in architecture allows the linking of experiments, their protocols, data, models and results in a configurable system that is available 'off the shelf'. Tools to run model simulations, plot experimental data, and assist with data annotation and standardisation combine to produce a collection of resources that support analysis as well as sharing. Underlying semantic web resources additionally extract and serve SEEK metadata in RDF (Resource Description Framework). SEEK RDF enables rich semantic queries, both within SEEK and between related resources in the web of Linked Open Data.
Conclusion: The SEEK platform has been adopted by many systems biology consortia across Europe. It is a data management environment with a low barrier to uptake that provides rich resources for collaboration. This paper provides an update on the functions and features of the SEEK software and describes its use in the SysMO consortium (Systems Biology for Micro-Organisms) and the VLN (Virtual Liver Network), two large systems biology initiatives with different research aims and different scientific communities.
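The kind of semantic query that SEEK's RDF export enables can be sketched with rdflib over a tiny in-memory graph, as below. The seek: terms are invented placeholders, not the actual SEEK vocabulary.

```python
# Sketch: a SPARQL query over SEEK-like RDF metadata, finding models
# linked to transcriptomics datasets. Terms are placeholders.
from rdflib import Graph, Namespace, Literal

SEEK = Namespace("http://example.org/seek/")
g = Graph()
g.add((SEEK["model42"], SEEK["linkedToData"], SEEK["dataset7"]))
g.add((SEEK["dataset7"], SEEK["assayType"], Literal("transcriptomics")))

q = """
PREFIX seek: <http://example.org/seek/>
SELECT ?model WHERE {
  ?model seek:linkedToData ?d .
  ?d seek:assayType "transcriptomics" .
}"""
for row in g.query(q):
    print(row.model)   # -> http://example.org/seek/model42
```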
Conference Paper
Technology advances in the last decade have led to a “digital revolution” in biomedical research. Much greater volumes of data can be generated in much less time, transforming the way researchers work [1]. Yet, for those seeking to develop new drugs to treat human disease, the task of assembling a coherent picture of existing knowledge, from molecular biology to clinical investigation, can be daunting and frustrating. Individual electronic resources remain mostly disconnected, making it difficult to follow information between them. Those that contain similar types of data can describe them very differently, compounding the confusion. It can also be difficult to understand exactly where specific facts or data points originated, or how to judge their quality or reliability. Finally, scientists routinely wish to ask questions that a system does not allow, or questions that span multiple different resources. Often the result is simply to abandon the enquiry, significantly diminishing the value to be gained from existing knowledge. Within pharmaceutical companies, such concerns have led to major programmes in data integration: downloading, parsing, mapping, transforming and presenting public, commercial and private data. Much of this work is redundant between companies, and significant resources could be saved by collaboration [2]. In an industry facing major economic pressures [3], the idea of combining forces to “get more for less” is very attractive and is arguably the only feasible route to dealing with the exponentially growing information landscape.
Chapter
The Protein Data Bank (PDB) is the single, freely available, global archive of structural data for biological macromolecules. It is maintained by the wwPDB consortium, consisting of the Research Collaboratory for Structural Bioinformatics (RCSB PDB), the Protein Data Bank in Europe (PDBe), the Protein Data Bank Japan (PDBj) and the BioMagResBank (BMRB). This chapter describes the organization of the wwPDB and the systems in place for data deposition, annotation and distribution, and summarizes the services provided by the wwPDB member sites. Keywords: Protein Data Bank
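A minimal sketch of programmatic access to the archive is shown below. The files.rcsb.org download pattern is a widely used public endpoint, though the exact URL layout should be treated as an assumption to verify against current wwPDB documentation.

```python
# Sketch: retrieve a PDB entry over HTTP and show its header records.
# The download URL pattern is assumed from common RCSB PDB usage.
import urllib.request

pdb_id = "4HHB"  # human deoxyhaemoglobin, a classic PDB entry
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
with urllib.request.urlopen(url) as resp:
    text = resp.read().decode("utf-8")
# HEADER and TITLE records summarise the deposited structure
print("\n".join(text.splitlines()[:4]))
```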