Figure - available from: PLOS One
Metadata standards in the (life) sciences obtained from re3data [57] and the RDA metadata standards catalog [58]. The number in brackets denotes the number of repositories supporting the standard (provided in re3data).
Source publication
The increasing amount of publicly available research data provides the opportunity to link and integrate data in order to create and prove novel hypotheses, to repeat experiments or to compare recent data to data collected at a different time or place. However, recent studies have shown that retrieving relevant data for data reuse is a time-consumi...
Citations
... Users might find data difficult to understand due to lack of standardized documentation and formatting, challenging data integration and analysis, and incomplete or unclear metadata (Reichman et al., 2011). Indeed, inadequately captured information in metadata is a significant barrier for data retrieval (Löffler et al., 2021). Moreover, considerable efforts are required to compile and analyze data, because they are spread over heterogeneous repositories with varying metadata standards (e.g., Vlah et al., 2023), and use different terminology and scales (Reichman et al., 2011). ...
Free use and redistribution of data (i.e., Open Data) increases the reproducibility, transparency, and pace of aquatic sciences research. However, barriers to both data users and data providers may limit the adoption of Open Data practices. Here, we describe common Open Data challenges faced by data users and data providers within the aquatic sciences community (i.e., oceanography, limnology, hydrology, and others). These challenges were synthesized from literature, authors’ experiences, and a broad survey of 174 data users and data providers across academia, government agencies, industry, and other sectors. Through this work, we identified seven main challenges: 1) metadata shortcomings, 2) variable data quality and reusability, 3) open data inaccessibility, 4) lack of standardization, 5) authorship and acknowledgement issues, 6) lack of funding, and 7) unequal barriers around the globe. Our key recommendation is to improve resources to advance Open Data practices. This includes dedicated funds for capacity building, hiring and maintaining of skilled personnel, and robust digital infrastructures for preparation, storage, and long-term maintenance of Open Data. Further, to incentivize data sharing we reinforce the need for standardized best practices to handle data acknowledgement and citations for both data users and data providers. We also highlight and discuss regional disparities in resources and research practices within a global perspective.
... KEYWORDS: research data; research data management; research data repository. Metadata management of research data is essential for securing grants, funding productivity, ensuring the future use of research data, and enabling collaboration [across the disciplines] (Doucette & Fyfe, 2013). The rising amount of publicly available research data allows linking and integrating data to make and prove novel hypotheses, repeat experiments, or compare recent data to data collected at a different time or place (Löffler et al., 2021). Concurrently, research practices reflect increasing datafication, creating new incentives for researchers to share their data and metadata. ...
... Publicly available research data allow researchers to compare recent data with previously deposited data to determine its relevance (Löffler et al., 2021). Furthermore, RDRs are one of the main actors in such a process (Assante et al., 2016). ...
Metadata is vital for information storage and retrieval from a database or repository. In the case of Research Data Repositories (RDRs), metadata can be a potent tool for describing and identifying data. Further, producing metadata is indispensable for fostering data reuse. A filtered view of the registry of research data repositories, re3data.org, shows no uniform patterns or standards for metadata across RDRs; the metadata elements and practices differ from RDR to RDR. The present study describes the features of a select number of RDRs and analyzes their metadata practices: Harvard Dataverse, Dryad, Figshare, Zenodo, and the Open Science Framework (OSF). It further examines the total number of metadata elements, common metadata elements, required metadata elements, and item-level metadata. Results indicate that even though Harvard Dataverse has the most metadata elements, Dryad provides rich metadata at the item level. This study suggests a common metadata framework, richer metadata elements, and more features to make research data interoperable from one RDR to another.
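The element comparison the study performs can be illustrated with a minimal Python sketch. The repository names are real, but the element sets below are invented placeholders, not the actual Dataverse, Dryad, or Zenodo schemas:

```python
# Hypothetical sketch: comparing metadata elements across research data
# repositories (RDRs). Element lists are illustrative, not real schemas.

RDR_ELEMENTS = {
    "Harvard Dataverse": {"title", "author", "description", "subject", "keyword", "license"},
    "Dryad": {"title", "author", "abstract", "keyword", "funding", "license"},
    "Zenodo": {"title", "creator", "description", "keyword", "license"},
}

def common_elements(schemas):
    """Intersect the element sets of all repositories."""
    sets = iter(schemas.values())
    common = set(next(sets))
    for s in sets:
        common &= s
    return common

print(sorted(common_elements(RDR_ELEMENTS)))  # elements shared by every RDR
```

Repeating the intersection over per-RDR sets of *required* elements would sketch the study's other comparison in the same way.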
... Vast archives of papers in scientific journals and preserved specimens in natural history museums contain invaluable information on biological diversity. However, this information has proven to be inaccessible for large data analysis [57] and its automatic extraction has been historically limited [58], thus hampering research into ongoing human-caused biodiversity crises. The efficient access and extraction of this information would give scientists access to a vast amount of data. ...
In this paper, we describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision. We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications, enhancing code development through automated syntax correction, and refining the scientific writing process. Simultaneously, we articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use. Our critical discussion extends to the varying impacts of LLMs across fields, from the natural sciences, where they help model complex biological sequences, to the social sciences, where they can parse large-scale qualitative data. We conclude by offering a nuanced perspective on how LLMs can be both a boon and a boundary to scientific progress.
... All genetic information of any organism is stored in its nucleic acids, DNA and RNA, which can be extracted from individuals of a given species, from entire communities, or even from environmental samples (e.g., from the water column or from a sediment core). These nucleic acid extracts can then be subjected to a variety of high-throughput sequencing approaches, ranging from whole (meta)genome sequencing to (meta)transcriptomic and epigenetic analysis (e.g., Mason et al., 2017). These and related approaches generate huge amounts of nucleotide data, which require quality checking prior to taxonomic or functional analysis. ...
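The quality-checking step mentioned above can be sketched in a few lines. This assumes Phred+33-encoded FASTQ qualities, and the cutoff of 20 is an arbitrary illustration rather than a recommended threshold:

```python
# Minimal sketch of sequence quality checking, assuming Phred+33-encoded
# FASTQ records; the record layout and threshold are illustrative.

def mean_phred(quality_line):
    """Mean Phred score of one FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in quality_line) / len(quality_line)

def filter_reads(fastq_records, min_quality=20):
    """Keep (header, sequence) pairs whose mean quality passes the cutoff."""
    kept = []
    for header, seq, qual in fastq_records:
        if mean_phred(qual) >= min_quality:
            kept.append((header, seq))
    return kept

reads = [
    ("@read1", "ACGT", "IIII"),   # 'I' = Phred 40: high quality, kept
    ("@read2", "ACGT", "!!!!"),   # '!' = Phred 0: discarded
]
print(filter_reads(reads))  # [('@read1', 'ACGT')]
```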
... As a prerequisite for big data analysis, curated and high-quality databases are essential, but the lack of metadata standards often presents issues (Stow et al., 2018), especially when the inability to taxonomically assign organisms hampers the usability of databases in, for example, biodiversity studies (Bayraktarov et al., 2019; Löffler et al., 2021). Many data archives also suffer from the fact that submitters do not always adhere even to minimal sequence metadata standards. ...
This paper was initiated by a multidisciplinary Topic Workshop in the frame of the Deutsche Forschungsgemeinschaft Priority Program 1158 “Antarctic Research with Comparative Investigations in Arctic Ice Areas”, and hence it represents only the national view without claiming to be complete, but is intended to provide awareness and suggestions for the current discussion on so-called big data in many scientific fields. The importance of the polar regions and their essential role for the Earth system are both undoubtedly recognized. However, dramatic changes in the climate and environment have been observed first in the Arctic and later in Antarctica over the past few decades. While important data have been collected and observation networks have been built in Antarctica and the Southern Ocean, this is a relatively data-scarce region due to the challenges of remote data acquisition, expensive labor, and harsh environmental conditions. There are many approaches crossing multiple scientific disciplines to better understand Antarctic processes; to evaluate ongoing climatic and environmental changes and their manifold ecological, physical, chemical, and geological consequences; and to make (improved) predictions. Together, these approaches generate very large, multivariate data sets, which can be broadly classified as “Antarctic big data”. For these large data sets, there is a pressing need for improved data acquisition, curation, integration, service, and application to support fundamental scientific research. Motivated by these cross-disciplinary deficiencies, and to attract further interest in big data in Antarctic sciences, this article will (i) describe and evaluate the current status of big data in various Antarctic-related scientific disciplines, (ii) identify current gaps, and (iii) provide solutions to fill these gaps.
... Hervey et al. (2020) studied how users use facet search interfaces in searching for geospatial data and the relationship between facet search interfaces and metadata. Löffler et al. (2021) identified several important search interests among biodiversity researchers with the intent of studying related data portals. They found that existing metadata standards in those portals lacked the complexity needed in biodiversity work. ...
... According to the DDI, the acronym also stands for three aims of the standard: Document, Discover, Interoperate (http://www.ddialliance.org). The DDI standard addresses metadata from questionnaires and surveys in the social, behavioral, economic, and health sciences (Löffler et al., 2021; Nie et al., 2021). Löffler et al. (2021) also pointed out that users in biodiversity fields use specific information categories like materials, chemicals, and biological and chemical processes to find relevant datasets. Those information categories should be covered in research data portals. ...
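A toy sketch of what covering such information categories enables: faceted retrieval over dataset records. The records, category names, and values here are invented for illustration:

```python
# Illustrative faceted search over dataset records tagged with the kinds of
# information categories biodiversity researchers use (materials, chemicals,
# processes). All records and values are made up.

DATASETS = [
    {"id": "ds1", "categories": {"material": "soil", "chemical": "nitrate"}},
    {"id": "ds2", "categories": {"material": "water", "process": "eutrophication"}},
    {"id": "ds3", "categories": {"chemical": "nitrate", "process": "nitrification"}},
]

def search(datasets, **wanted):
    """Return ids of datasets whose categories contain all requested values."""
    hits = []
    for ds in datasets:
        cats = ds["categories"]
        if all(cats.get(k) == v for k, v in wanted.items()):
            hits.append(ds["id"])
    return hits

print(search(DATASETS, chemical="nitrate"))  # ['ds1', 'ds3']
```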
... Reference bioimaging data were generated from raw microscopic images and linked to technical and expressive metadata using standardized semantics [41][42][43][44]. When extracting phenotypic traits from bioimaging data, it is possible to estimate both quantitative traits (i.e., leaf and stem area, length, width of leaves, stems and plants, specific leaf area, specific stem density) and qualitative traits (i.e., growth stature, vegetative propagule, or leaf shape and type) by combining elemental analysis with machine-learning-driven image analysis and computer vision [45,46]. ...
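As a hedged illustration of the quantitative-trait step, projected leaf area can be read off a segmentation mask by counting foreground pixels. The mask and pixel scale below are invented, and real pipelines obtain the mask from machine-learning-driven image segmentation first:

```python
# Toy sketch of extracting a quantitative trait (projected leaf area) from a
# segmented bioimage. Mask and pixel scale are illustrative only.

MASK = [  # 1 = leaf pixel, 0 = background (a 4x5 toy image)
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

def leaf_area(mask, mm_per_pixel=0.5):
    """Leaf area in mm^2: foreground pixel count times the area of one pixel."""
    pixels = sum(sum(row) for row in mask)
    return pixels * mm_per_pixel ** 2

print(leaf_area(MASK))  # 9 pixels * 0.25 mm^2 = 2.25 mm^2
```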
Integrative taxonomy is a fundamental part of biodiversity and combines traditional morphology with additional methods such as DNA sequencing or biochemistry. Here, we aim to establish untargeted metabolomics for use in chemotaxonomy. We used three thallose liverwort species, Riccia glauca, R. sorocarpa, and R. warnstorfii (order Marchantiales, Ricciaceae), with Lunularia cruciata (order Marchantiales, Lunulariaceae) as an outgroup. Liquid chromatography high-resolution mass spectrometry (UPLC/ESI-QTOF-MS) with data-dependent acquisition (DDA-MS) was integrated with DNA marker-based sequencing of the trnL-trnF region and high-resolution bioimaging. Our untargeted chemotaxonomy methodology enables us to distinguish taxa based on chemophenetic markers at different levels of complexity: (1) molecules, (2) compound classes, (3) compound superclasses, and (4) molecular descriptors. For the investigated Riccia species, we identified 71 chemophenetic markers at the molecular level, a characteristic composition in 21 compound classes, and 21 molecular descriptors largely indicating electron state, presence of chemical motifs, and hydrogen bonds. Our untargeted approach revealed many chemophenetic markers at different complexity levels that can provide more mechanistic insight into phylogenetic delimitation of species within a clade than genetic-based methods coupled with traditional morphology-based information. However, analytical and bioinformatics analysis methods still need to be better integrated to link the chemophenetic information at multiple scales.
... However, practical guidance on semantic annotation of biodiversity literature is few and far between and usually refers to English-language text corpora with a focus on taxonomy (see, e.g., Sautter et al., 2007). Beyond mere taxonomic tagging, more recent workflows also cover a much broader thematic range of biodiversity entities, but do not allow multi-label annotation, that is, (possibly) assigning more than one annotation tag to an annotation unit (Thessen et al., 2018; Nguyen and Ananiadou, 2019; Löffler et al., 2020). Enhancing content retrieval and information fusion by multi-label annotation has since found its way into the biomedical domain, e.g., to detect multiple core scientific concepts at the sentence level or multi-functional genes in cancer pathways (Ravenscroft et al., 2016; Guan et al., 2018). ...
Biodiversity information is contained in countless digitized and unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, there are still innumerable text sources that are poorly accessible and require a more advanced range of methods to extract relevant information. To improve the access to semantic biodiversity information, we have launched the BIOfid project (www.biofid.de) and have developed a portal to access the semantics of German-language biodiversity texts, mainly from the 19th and 20th century. However, to make such a portal work, a couple of methods had to be developed or adapted first. In particular, text-technological information extraction methods were needed, which extract the required information from the texts. Such methods draw on machine learning techniques, which in turn are trained by learning data. To this end, among others, we gathered the BIOfid text corpus, which is a cooperatively built resource, developed by biologists, text technologists, and linguists. A special feature of BIOfid is its multiple annotation approach, which takes into account both general and biology-specific classifications, and by this means goes beyond previous, typically taxon- or ontology-driven proper-name detection. We describe the design decisions and the genuine Annotation Hub Framework underlying the BIOfid annotations and present agreement results. The tools used to create the annotations are introduced, and the use of the data in the semantic portal is described. Finally, some general lessons, in particular from multiple annotation projects, are drawn.
... Soil data uses include a broad range of applications such as ecology, biogeochemistry (Iversen et al., 2017; Wieder et al., 2021b), soil engineering, soil taxonomy and classification, geochemistry (Nave et al., 2016; Hengl et al., 2017; Lawrence et al., 2020), micrometeorology (Cheah et al., 2018), agronomy (Lyons et al., 2020), and geomorphology. Datasets, defined as "a collection of scientific data including primary data and metadata organized and formatted for a particular purpose" (Löffler et al., 2021), are assembled by an equally diverse range of organizations. These organizations include government agencies, academic collaborations, nongovernmental organizations, and industry, reflecting a wide range of generators and users including farmers, land managers, students, technicians, scientists, and policy makers. ...
... For brevity, we assume that datasets of interest have already been identified. However, searchability of data repositories is an active area of research (Pampel et al., 2013;Löffler et al., 2021). We briefly review several common approaches to data acquisition, harmonization, curation, and publication. ...
In the age of big data, soil data are more available and richer than ever, but – outside of a few large soil survey resources – they remain largely unusable for informing soil management and understanding Earth system processes beyond the original study. Data science has promised a fully reusable research pipeline where data from past studies are used to contextualize new findings and reanalyzed for new insight. Yet synthesis projects encounter challenges at all steps of the data reuse pipeline, including unavailable data, labor-intensive transcription of datasets, incomplete metadata, and a lack of communication between collaborators. Here, using insights from a diversity of soil, data, and climate scientists, we summarize current practices in soil data synthesis across all stages of database creation: availability, input, harmonization, curation, and publication. We then suggest new soil-focused semantic tools to improve existing data pipelines, such as ontologies, vocabulary lists, and community practices. Our goal is to provide the soil data community with an overview of current practices in soil data and where we need to go to fully leverage big data to solve soil problems in the next century.
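The harmonization stage the abstract names can be sketched as a mapping from heterogeneous source columns onto one target vocabulary plus unit conversion. The column names, target terms, and unit factors below are illustrative, not a real soil standard:

```python
# Hedged sketch of dataset harmonization: rename source columns to a shared
# vocabulary and convert units. All names and factors are invented examples.

COLUMN_MAP = {
    "SOC_pct": "soil_organic_carbon",        # already in percent
    "org_carbon_gkg": "soil_organic_carbon", # g/kg, needs conversion
}
PER_PERCENT = {"SOC_pct": 1, "org_carbon_gkg": 10}  # divide to get percent

def harmonize(record):
    """Rename mapped columns and convert their values to percent."""
    out = {}
    for col, value in record.items():
        if col in COLUMN_MAP:
            out[COLUMN_MAP[col]] = value / PER_PERCENT[col]
        else:
            out[col] = value  # pass unmapped columns through unchanged
    return out

print(harmonize({"site": "A", "org_carbon_gkg": 24.0}))
# {'site': 'A', 'soil_organic_carbon': 2.4}
```

In practice this mapping table is exactly where the ontologies and vocabulary lists the authors propose would plug in.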
... Plantin et al. [37] stressed how data repositories, as exemplified by Figshare, help to reconfigure the logic of our traditional, paper-based scholarly communication system and introduce research data into this system. There is also empirical evidence that data repositories positively affect data search and sharing behavior [38][39][40] and that research with data deposited in repositories can receive more citations than research without [41]. ...
As the importance of research data gradually grows in sciences, data sharing has come to be encouraged and even mandated by journals and funders in recent years. Following this trend, the data availability statement has been increasingly embraced by academic communities as a means of sharing research data as part of research articles. This paper presents a quantitative study of which mechanisms and repositories are used to share research data in PLOS ONE articles. We offer a dynamic examination of this topic from the disciplinary and temporal perspectives based on all statements in English-language research articles published between 2014 and 2020 in the journal. We find a slow yet steady growth in the use of data repositories to share data over time, as opposed to sharing data in the paper or supplementary materials; this indicates improved compliance with the journal's data sharing policies. We also find that multidisciplinary data repositories have been increasingly used over time, whereas some disciplinary repositories show a decreasing trend. Our findings can help academic publishers and funders to improve their data sharing policies and serve as an important baseline dataset for future studies on data sharing activities.
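The mechanism classification the study performs at scale can be sketched as a keyword-rule classifier over statement text. The rules below are invented illustrations, not the paper's actual scheme:

```python
# Toy sketch: classifying data availability statements (DAS) by sharing
# mechanism with keyword rules. Rules are illustrative only.

RULES = [
    ("repository", ("zenodo", "dryad", "figshare", "dataverse", "osf")),
    ("paper or supplement", ("supporting information", "within the paper", "supplementary")),
    ("on request", ("upon request", "on request")),
]

def classify(statement):
    """Return the first rule label whose keywords appear in the statement."""
    s = statement.lower()
    for label, keywords in RULES:
        if any(k in s for k in keywords):
            return label
    return "other"

print(classify("All data are available from the Dryad Digital Repository."))
# 'repository'
```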
... Most of this work is so far focused on improving the FAIRness of biodiversity data. It includes work on improving the discoverability of data through better, semantic descriptions (Löffler et al. 2021, Pfaff et al. 2017). These investigations have shown which categories of concepts (e.g., organism, environment, process, event) are relevant to biodiversity research. ...
Developing a precise argument is not an easy task. In real-world argumentation scenarios, arguments presented in texts (e.g. scientific publications) often constitute the end result of a long and tedious process. A lot of work on computational argumentation has focused on analyzing and aggregating these products of argumentation processes, i.e. argumentative texts. In this project, we adopt a complementary perspective: we aim to develop an argumentation machine that supports users during the argumentation process in a scientific context, enabling them to follow ongoing argumentation in a scientific community and to develop their own arguments. To achieve this ambitious goal, we will focus on a particular phase of the scientific argumentation process, namely the initial phase of claim or hypothesis development. According to argumentation theory, the starting point of an argument is a claim, and also data that serves as a basis for the claim. In scientific argumentation, a carefully developed and thought-through hypothesis (which we see as Toulmin's "claim" in a scientific context) is often crucial for researchers to be able to conduct a successful study and, in the end, present a new, high-quality finding or argument. Thus, an initial hypothesis needs to be specific enough that a researcher can test it based on data, but, at the same time, it should also relate to previous general claims made in the community. We investigate how argumentation machines can (i) represent concrete and more abstract knowledge on hypotheses and their underlying concepts, (ii) model the process of hypothesis refinement, including data as a basis of refinement, and (iii) interactively support a user in developing her own hypothesis based on these resources. This project will combine methods from different disciplines: natural language processing, knowledge representation and semantic web, philosophy of science and -- as an example for a scientific domain -- invasion biology.
Our starting point is an existing resource in invasion biology that organizes and relates core hypotheses in the field and associates them with metadata for more than 1000 scientific publications; it was developed over the course of several years through manual analysis. This network, however, is currently static (i.e., it needs substantial manual curation to be extended to incorporate new claims) and, moreover, is not easily accessible for users who lack specific background and domain knowledge in invasion biology. Our goal is to develop (i) a semantic model for representing knowledge on concepts and hypotheses, such that non-expert users can also use the network; (ii) a tool that automatically computes links from publication abstracts (and data) to these hypotheses; and (iii) an interactive system that supports users in refining their initial, potentially underdeveloped hypothesis.
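The proposed linking step, computing links from publication abstracts to hypotheses, can be sketched as scoring vocabulary overlap (Jaccard similarity). The hypotheses and abstract below are invented, and the real system would use far richer NLP and semantic-web representations:

```python
# Illustrative sketch: match an abstract to the hypothesis whose description
# it shares the most vocabulary with. Texts are invented examples.

HYPOTHESES = {
    "enemy release": "invasive species succeed because they escape natural enemies",
    "propagule pressure": "introduction effort and propagule number drive invasion success",
}

def tokens(text):
    """Lowercased word set of a text."""
    return set(text.lower().split())

def best_match(abstract, hypotheses):
    """Return the hypothesis name with the highest Jaccard similarity."""
    a = tokens(abstract)
    def jaccard(name):
        h = tokens(hypotheses[name])
        return len(a & h) / len(a | h)
    return max(hypotheses, key=jaccard)

abstract = "we test whether escape from natural enemies explains invasive spread"
print(best_match(abstract, HYPOTHESES))  # 'enemy release'
```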