Figure: Metadata standards in the (life) sciences, obtained from re3data [57] and the RDA metadata standards catalog [58]. The number in brackets denotes the number of repositories supporting the standard (as provided by re3data).
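To make the provenance of such counts concrete, the sketch below queries the public re3data REST API and tallies the metadata standards declared by a small sample of repositories. The endpoint paths belong to re3data's documented API; matching XML element names by suffix is a simplifying assumption to sidestep namespace handling, so treat this as illustrative rather than authoritative.

```python
# Sketch: count how many re3data repositories declare a given metadata standard.
# Assumes the public re3data REST API (https://www.re3data.org/api/v1/...);
# the XML schema may change, so element names are matched loosely by suffix.
import itertools
import xml.etree.ElementTree as ET
from collections import Counter

import requests

BASE = "https://www.re3data.org/api/v1"

def repository_ids():
    """Yield repository identifiers from the re3data registry listing."""
    root = ET.fromstring(requests.get(f"{BASE}/repositories", timeout=30).content)
    for el in root.iter():
        if el.tag.endswith("id") and el.text:
            yield el.text.strip()

def metadata_standards(repo_id):
    """Return the metadata standard names declared by one repository."""
    root = ET.fromstring(
        requests.get(f"{BASE}/repository/{repo_id}", timeout=30).content
    )
    return [el.text.strip() for el in root.iter()
            if el.tag.endswith("metadataStandardName") and el.text]

counts = Counter()
for rid in itertools.islice(repository_ids(), 50):  # small, polite sample
    counts.update(metadata_standards(rid))
print(counts.most_common(10))
```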
Source publication
The increasing amount of publicly available research data provides the opportunity to link and integrate data in order to create and prove novel hypotheses, to repeat experiments, or to compare recent data to data collected at a different time or place. However, recent studies have shown that retrieving relevant data for data reuse is a time-consuming task …
Citations
... Findable biodiversity data are the first pillar of FAIR data sharing (Costello et al., 2013; Costello and Wieczorek, 2014; Reyserhove et al., 2020). Biodiversity scientists and institutions in SEA countries should ensure that biodiversity data are easily discoverable through standardised and harmonised metadata formats (e.g., the Darwin Core standard; Wieczorek et al., 2012) and indexing protocols that can quickly locate relevant information (Löffler et al., 2021a). Furthermore, governmental funding bodies might consider mandating that the research they support adheres to standardised formats, ensuring that the data are both discoverable and accessible. ...
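For illustration, the snippet below shows what a minimally Darwin Core-compliant occurrence record could look like. The field names are genuine Darwin Core terms (https://dwc.tdwg.org/terms/); the identifier and values are invented.

```python
# Sketch: a single occurrence record expressed with standard Darwin Core terms.
# Field names are real Darwin Core terms; the values are purely illustrative.
import csv

record = {
    "occurrenceID": "urn:example:occ:0001",  # stable identifier (hypothetical)
    "basisOfRecord": "HumanObservation",
    "scientificName": "Rhinolophus affinis",
    "eventDate": "2021-03-14",
    "country": "Thailand",
    "decimalLatitude": 18.7883,
    "decimalLongitude": 98.9853,
}

# Writing records as a Darwin Core-style CSV makes them harvestable by
# aggregators that expect standardised column names.
with open("occurrences.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)
```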
... Making biodiversity data accessible allows for more data utilisation and applications that will facilitate more robust evidence-based decision-making and maximise the utilisation of existing data for conservation planning exercises and management initiatives (e.g., developing species distribution models) (Cayuela et al., 2009; Gonzalez et al., 2023; Orr et al., 2022). As diverse stakeholders are spread across various sectors and geographical regions, accessible biodiversity data would promote transparency, foster public engagement, and empower local communities to actively participate in harmonised conservation efforts throughout the country (Costello et al., 2013; Lannom et al., 2020; Löffler et al., 2021a). Furthermore, ensuring that data are available in user-friendly formats and language translations improves inclusivity and allows for greater participation in biodiversity conservation activities. ...
The tropical Southeast Asian region, with its unique geographical features, is home to a multitude of distinct species that are affected by various human and natural activities. The availability of biodiversity data is crucial for understanding species distribution and responses to environmental changes to develop effective conservation priorities. In this perspective paper, I examined the patterns and trends of biodiversity in Southeast Asia within the Global Biodiversity Information Facility (GBIF) and highlighted important gaps, priorities, and opportunities for the region. Thailand accounted for 28 % of GBIF occurrence records in Southeast Asia, followed by Indonesia (19 %), Malaysia (18 %), and the Philippines (13 %). A significant portion of biodiversity data comes from citizen science platforms, such as eBird (56 %) and iNaturalist (6 %), highlighting the significance of the public in data mobilisation. Nonetheless, the biodiversity data for five of the 11 Southeast Asian countries are poorly represented by domestic researchers, with approximately 41 % of the region's GBIF occurrence data contributed by researchers or institutions from outside Southeast Asia. Furthermore, over the past 24 years (2000–2024), at least 30 % of terrestrial vertebrate occurrence records in Southeast Asia overlap with Protected Areas (PAs). In Southeast Asia, where species often span borders, I argue that open and FAIR data sharing should be considered standard practice in the biodiversity research community and integrated into biodiversity agendas and funding policies. Consequently, I propose the open-NOTE steps (Normalise, Organise, Train, and Engage) as a practical framework to promote open and FAIR data sharing in Southeast Asia and beyond.
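Country shares like those reported above can be reproduced against GBIF's public occurrence API. The sketch below is a minimal version using a small, illustrative subset of Southeast Asian countries; the endpoint and the "count" field are part of GBIF's documented API.

```python
# Sketch: per-country occurrence counts via the public GBIF API
# (https://api.gbif.org/v1/occurrence/search). The country list is
# illustrative, not the full set of 11 Southeast Asian countries.
import requests

SEA = {"TH": "Thailand", "ID": "Indonesia", "MY": "Malaysia", "PH": "Philippines"}

counts = {}
for code, name in SEA.items():
    resp = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"country": code, "limit": 0},  # limit=0 returns only the count
        timeout=30,
    )
    counts[name] = resp.json()["count"]

total = sum(counts.values())
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {n:,} records ({100 * n / total:.0f}% of this subset)")
```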
... Either elaborate tests or literature research are required to ensure good analytical results, which entails high costs for analysis consumables and equipment as well as considerable time expenditure. The lack of relevant settings and data for actual scientific use is a recurring problem in literature research and repository searches [18,19]. For example, the adoption of a method from another publication for validation purposes is not possible if essential settings (such as separation column diameter or length and eluent flow rate) are absent in the publication. ...
... Users might find data difficult to understand due to a lack of standardized documentation and formatting, challenges in data integration and analysis, and incomplete or unclear metadata (Reichman et al., 2011). Indeed, inadequately captured information in metadata is a significant barrier for data retrieval (Löffler et al., 2021). Moreover, considerable efforts are required to compile and analyze data, because they are spread over heterogeneous repositories with varying metadata standards (e.g., Vlah et al., 2023) and use different terminology and scales (Reichman et al., 2011). ...
Free use and redistribution of data (i.e., Open Data) increases the reproducibility, transparency, and pace of aquatic sciences research. However, barriers to both data users and data providers may limit the adoption of Open Data practices. Here, we describe common Open Data challenges faced by data users and data providers within the aquatic sciences community (i.e., oceanography, limnology, hydrology, and others). These challenges were synthesized from literature, authors’ experiences, and a broad survey of 174 data users and data providers across academia, government agencies, industry, and other sectors. Through this work, we identified seven main challenges: 1) metadata shortcomings, 2) variable data quality and reusability, 3) open data inaccessibility, 4) lack of standardization, 5) authorship and acknowledgement issues, 6) lack of funding, and 7) unequal barriers around the globe. Our key recommendation is to improve resources to advance Open Data practices. This includes dedicated funds for capacity building, hiring and maintaining of skilled personnel, and robust digital infrastructures for preparation, storage, and long-term maintenance of Open Data. Further, to incentivize data sharing, we reinforce the need for standardized best practices to handle data acknowledgement and citations for both data users and data providers. We also highlight and discuss regional disparities in resources and research practices within a global perspective.
... KEYWORDS: research data; research data management; research data repository. Metadata management of research data is essential for securing grants, funding productivity, ensuring the future use of research data, and enabling collaboration [across the disciplines] (Doucette & Fyfe, 2013). The rising amount of publicly available research data allows linking and integrating data to make and prove novel hypotheses, repeat experiments, or compare recent data to data collected at a different time or place (Löffler et al., 2021). Concurrently, research practices reflect increasing datafication, creating new incentives for researchers to share their data and metadata. ...
... Publicly available research data allow researchers to compare recent data with previously deposited data to determine its relevance (Löffler et al., 2021). Furthermore, RDRs are one of the main actors in such a process (Assante et al., 2016). ...
Metadata is vital for information storage and retrieval from a database or repository. In the case of Research Data Repositories (RDRs), metadata can be a potent tool for describing and identifying data. Further, producing the metadata is indispensable for fostering data reuse. A filtered view of the registry of research data repositories, re3data.org, depicts no uniform patterns or standards for metadata in the case of RDRs, and the metadata elements and practices differ from RDR to RDR. The present study describes the features of a select number of RDRs and analyzes their metadata practices: Harvard Dataverse, Dryad, Figshare, Zenodo, and the Open Science Framework (OSF). It further examines the total number of metadata elements, common metadata elements, required metadata elements, and item-level metadata. Results indicate that even though Harvard Dataverse has the most metadata elements, Dryad provides rich metadata at the item level. This study suggests a common metadata framework, richer metadata elements, and more features to make research data interoperable from one RDR to another.
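As a rough illustration of such comparisons, the sketch below pulls the metadata block of a single record from Zenodo's public REST API and lists its populated elements. The endpoint is real; the record identifier is a placeholder.

```python
# Sketch: inspect which metadata elements one repository exposes for a record,
# using Zenodo's public REST API (https://zenodo.org/api/records/<id>).
import requests

record_id = "1234567"  # placeholder; substitute a real Zenodo record ID
resp = requests.get(f"https://zenodo.org/api/records/{record_id}", timeout=30)
resp.raise_for_status()
metadata = resp.json().get("metadata", {})

# Listing the populated element names gives a quick, comparable picture of
# how rich one RDR's metadata is relative to another's.
for element in sorted(metadata):
    print(element)
```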
... Vast archives of papers in scientific journals and preserved specimens in natural history museums contain invaluable information on biological diversity. However, this information has proven to be inaccessible for large-scale data analysis [57], and its automatic extraction has historically been limited [58], thus hampering research into ongoing human-caused biodiversity crises. Efficient access to and extraction of this information would open up a vast amount of data to scientists. ...
In this paper, we describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision. We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications, enhancing code development through automated syntax correction, and refining the scientific writing process. Simultaneously, we articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use. Our critical discussion extends to the varying impacts of LLMs across fields, from the natural sciences, where they help model complex biological sequences, to the social sciences, where they can parse large-scale qualitative data. We conclude by offering a nuanced perspective on how LLMs can be both a boon and a boundary to scientific progress.
... All genetic information of any organism is stored in its nucleic acids, DNA and RNA, which can be extracted from individuals of a given species, from entire communities, or even from environmental samples (e.g., from the water column or from a sediment core). These nucleic acid extracts can then be subjected to a variety of high-throughput sequencing approaches, ranging from whole (meta)genome sequencing to (meta)transcriptomic and epigenetic analysis (e.g., Mason et al., 2017). These and related approaches generate huge amounts of nucleotide data, which require quality checking prior to taxonomic or functional analysis. ...
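A minimal illustration of such a quality check, assuming Biopython is available and using a placeholder file name; real pipelines rely on dedicated QC tools, but the principle is the same.

```python
# Sketch: filter high-throughput sequencing reads by mean Phred quality,
# using Biopython's FASTQ parser. "reads.fastq" is a placeholder file name.
from Bio import SeqIO

MIN_MEAN_PHRED = 25  # illustrative threshold

kept, dropped = 0, 0
for read in SeqIO.parse("reads.fastq", "fastq"):
    quals = read.letter_annotations["phred_quality"]
    if quals and sum(quals) / len(quals) >= MIN_MEAN_PHRED:
        kept += 1
    else:
        dropped += 1
print(f"kept {kept} reads, dropped {dropped} below mean Phred {MIN_MEAN_PHRED}")
```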
... As a prerequisite for big data analysis, curated and high-quality databases are essential, but the lack of metadata standards often presents issues (Stow et al., 2018), especially when the inability to taxonomically assign organisms hampers the usability of databases in, for example, biodiversity studies (Bayraktarov et al., 2019; Loeffler et al., 2021). Many data archives also suffer from the fact that submitters do not always adhere even to minimal sequence metadata standards. ...
This paper was initiated by a multidisciplinary Topic Workshop in the frame of the Deutsche Forschungsgemeinschaft Priority Program 1158 “Antarctic Research with Comparative Investigations in Arctic Ice Areas”, and hence it represents only the national view without claiming to be complete but is intended to provide awareness and suggestions for the current discussion on so-called big data in many scientific fields. The importance of the polar regions and their essential role for the Earth system are both undoubtedly recognized. However, dramatic changes in the climate and environment have been observed first in the Arctic and later in Antarctica over the past few decades. While important data have been collected and observation networks have been built in Antarctica and the Southern Ocean, this is a relatively data-scarce region due to the challenges of remote data acquisition, expensive labor, and harsh environmental conditions. There are many approaches crossing multiple scientific disciplines to better understand Antarctic processes; to evaluate ongoing climatic and environmental changes and their manifold ecological, physical, chemical, and geological consequences; and to make (improved) predictions. Together, these approaches generate very large, multivariate data sets, which can be broadly classified as “Antarctic big data”. For these large data sets, there is a pressing need for improved data acquisition, curation, integration, service, and application to support fundamental scientific research. Based on deficiencies in crossing disciplines and to attract further interest in big data in Antarctic sciences, this article will (i) describe and evaluate the current status of big data in various Antarctic-related scientific disciplines, (ii) identify current gaps, (iii) and provide solutions to fill these gaps.
... Hervey et al. (2020) studied how users use facet search interfaces when searching for geospatial data and the relationship between facet search interfaces and metadata. Löffler et al. (2021) identified several important search interests among biodiversity researchers with the intent of studying related data portals. They found that existing metadata standards in those portals lacked the complexity needed in biodiversity work. ...
... According to the DDI, the acronym also stands for three aims of the standard: Document, Discover, Interoperate (http://www.ddialliance.org). The DDI standard addresses metadata from questionnaires and surveys in the social, behavioral, economic and health sciences (Löffler et al., 2021; Nie et al., 2021). Löffler et al. (2021) also pointed out that users in biodiversity fields use specific information categories like materials, chemicals, and biological and chemical processes to find relevant datasets. Those information categories should be covered in research data portals. ...
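The sketch below illustrates how such information categories could serve as facets in a data portal; the records and category values are invented for illustration.

```python
# Sketch: faceted filtering over dataset metadata, using the biodiversity
# information categories named above as facets. All records are invented.
datasets = [
    {"title": "Soil microbiome survey", "material": "soil",
     "chemical": "nitrate", "process": "nitrification"},
    {"title": "Stream chemistry series", "material": "water",
     "chemical": "phosphate", "process": "eutrophication"},
    {"title": "Peat core archive", "material": "soil",
     "chemical": "carbon", "process": "decomposition"},
]

def facet_search(records, **facets):
    """Keep records whose metadata matches every requested facet value."""
    return [r for r in records
            if all(r.get(k) == v for k, v in facets.items())]

for hit in facet_search(datasets, material="soil"):
    print(hit["title"])
```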
... Reference bioimaging data were generated from raw microscopic images and linked to technical and expressive metadata using standardized semantics [41][42][43][44]. When extracting phenotypic traits from bioimaging data, it is possible to estimate both quantitative traits (e.g., leaf and stem area; length and width of leaves, stems, and plants; specific leaf area; specific stem density) and qualitative traits (e.g., growth stature, vegetative propagule, or leaf shape and type) by combining elemental analysis with machine-learning-driven image analysis and computer vision [45,46]. ...
Integrative taxonomy is a fundamental part of biodiversity research and combines traditional morphology with additional methods such as DNA sequencing or biochemistry. Here, we aim to establish untargeted metabolomics for use in chemotaxonomy. We used three thallose liverwort species, Riccia glauca, R. sorocarpa, and R. warnstorfii (order Marchantiales, Ricciaceae), with Lunularia cruciata (order Marchantiales, Lunulariaceae) as an outgroup. Liquid chromatography high-resolution mass spectrometry (UPLC/ESI-QTOF-MS) with data-dependent acquisition (DDA-MS) was integrated with DNA marker-based sequencing of the trnL-trnF region and high-resolution bioimaging. Our untargeted chemotaxonomy methodology enables us to distinguish taxa based on chemophenetic markers at different levels of complexity: (1) molecules, (2) compound classes, (3) compound superclasses, and (4) molecular descriptors. For the investigated Riccia species, we identified 71 chemophenetic markers at the molecular level, a characteristic composition in 21 compound classes, and 21 molecular descriptors largely indicating electron state, presence of chemical motifs, and hydrogen bonds. Our untargeted approach revealed many chemophenetic markers at different complexity levels that can provide more mechanistic insight into the phylogenetic delimitation of species within a clade than genetic-based methods coupled with traditional morphology-based information. However, analytical and bioinformatics methods still need to be better integrated to link chemophenetic information at multiple scales.
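For readers unfamiliar with molecular descriptors, the sketch below computes a few of the kinds mentioned above (hydrogen-bond capacity, polarity) with RDKit. The input compound is illustrative (caffeine), not one of the liverwort metabolites.

```python
# Sketch: molecular descriptors of the kind used as chemophenetic markers,
# computed with RDKit. The SMILES string is an illustrative compound only.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")  # caffeine

descriptors = {
    "MolWt": Descriptors.MolWt(mol),                   # molecular weight
    "TPSA": Descriptors.TPSA(mol),                     # topological polar surface area
    "NumHDonors": Descriptors.NumHDonors(mol),         # hydrogen-bond donors
    "NumHAcceptors": Descriptors.NumHAcceptors(mol),   # hydrogen-bond acceptors
}
for name, value in descriptors.items():
    print(f"{name}: {value:.2f}")
```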
... However, practical guidance on the semantic annotation of biodiversity literature is scarce and usually refers to English-language text corpora with a focus on taxonomy (see, e.g., Sautter et al., 2007). Beyond mere taxonomic tagging, more recent workflows also cover a much broader thematic range of biodiversity entities, but do not allow multi-label annotation, that is, (possibly) assigning more than one annotation tag to an annotation unit (Thessen et al., 2018; Nguyen and Ananiadou, 2019; Löffler et al., 2020). Enhancing content retrieval and information fusion by multi-label annotation has since found its way into the biomedical domain, e.g., to detect multiple core scientific concepts at the sentence level or multi-functional genes in cancer pathways (Ravenscroft et al., 2016; Guan et al., 2018). ...
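A minimal sketch of multi-label annotation in the sense described above, using scikit-learn's one-vs-rest strategy on toy sentences and tag sets (not BIOfid data):

```python
# Sketch: multi-label annotation, i.e. assigning more than one tag to a text
# unit, via one binary classifier per tag. All examples are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

sentences = [
    "Parus major was observed nesting in old oak stands.",
    "Soil nitrogen increased after the beech forest was cleared.",
    "The warbler population declined as wetland nitrogen rose.",
]
tags = [["Taxon", "Habitat"], ["Chemical", "Habitat"], ["Taxon", "Chemical"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)  # one binary column per annotation tag

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(sentences, y)

pred = model.predict(["Nitrogen levels in the oak forest affected Parus major."])
print(mlb.inverse_transform(pred))  # may yield several tags for one sentence
```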
Biodiversity information is contained in countless digitized and unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, there are still innumerable text sources that are poorly accessible and require a more advanced range of methods to extract relevant information. To improve access to semantic biodiversity information, we have launched the BIOfid project (www.biofid.de) and have developed a portal to access the semantics of German-language biodiversity texts, mainly from the 19th and 20th century. However, to make such a portal work, a couple of methods had to be developed or adapted first. In particular, text-technological information extraction methods were needed to extract the required information from the texts. Such methods draw on machine learning techniques, which in turn are trained on learning data. To this end, among others, we gathered the BIOfid text corpus, a cooperatively built resource developed by biologists, text technologists, and linguists. A special feature of BIOfid is its multiple annotation approach, which takes into account both general and biology-specific classifications and thereby goes beyond previous, typically taxon- or ontology-driven proper name detection. We describe the design decisions and the genuine Annotation Hub Framework underlying the BIOfid annotations and present agreement results. The tools used to create the annotations are introduced, and the use of the data in the semantic portal is described. Finally, some general lessons, in particular with multiple annotation projects, are drawn.
... Soil data uses include a broad range of applications such as ecology, biogeochemistry (Iversen et al., 2017;Wieder et al., 2021b), soil engineering, soil taxonomy and classification, geochemistry (Nave et al., 2016;Hengl et al., 2017;Lawrence et al., 2020), micrometeorology (Cheah et al., 2018), agronomy (Lyons et al., 2020), and geomorphology. Datasets, defined as "a collection of scientific data including primary data and metadata organized and formatted for a particular purpose" (Löffler et al., 2021), are assembled by an equally diverse range of organizations. These organizations include government agencies, academic collaborations, nongovernmental organizations, and industry, reflecting a wide range of generators and users including farmers, land managers, students, technicians, scientists, and policy makers. ...
... For brevity, we assume that datasets of interest have already been identified. However, searchability of data repositories is an active area of research (Pampel et al., 2013;Löffler et al., 2021). We briefly review several common approaches to data acquisition, harmonization, curation, and publication. ...
In the age of big data, soil data are more available and richer than ever, but – outside of a few large soil survey resources – they remain largely unusable for informing soil management and understanding Earth system processes beyond the original study. Data science has promised a fully reusable research pipeline where data from past studies are used to contextualize new findings and reanalyzed for new insight. Yet synthesis projects encounter challenges at all steps of the data reuse pipeline, including unavailable data, labor-intensive transcription of datasets, incomplete metadata, and a lack of communication between collaborators. Here, using insights from a diversity of soil, data, and climate scientists, we summarize current practices in soil data synthesis across all stages of database creation: availability, input, harmonization, curation, and publication. We then suggest new soil-focused semantic tools to improve existing data pipelines, such as ontologies, vocabulary lists, and community practices. Our goal is to provide the soil data community with an overview of current practices in soil data and where we need to go to fully leverage big data to solve soil problems in the next century.
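A minimal sketch of the vocabulary-list idea mentioned above, assuming invented column synonyms rather than an actual community ontology:

```python
# Sketch: vocabulary-based harmonization of heterogeneous soil datasets with
# pandas. The synonym mappings are invented; real efforts would draw on
# community ontologies and agreed vocabulary lists.
import pandas as pd

# Shared vocabulary: canonical name -> known synonyms across source datasets
VOCAB = {
    "soil_organic_carbon_pct": ["SOC", "org_C", "soil_oc_percent"],
    "ph_h2o": ["pH", "pH_water", "ph"],
    "depth_cm": ["depth", "sample_depth_cm"],
}
SYNONYM_TO_CANONICAL = {syn: canon for canon, syns in VOCAB.items() for syn in syns}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns to the shared vocabulary, keeping unknown columns as-is."""
    return df.rename(columns=SYNONYM_TO_CANONICAL)

site_a = pd.DataFrame({"SOC": [1.2, 2.4], "pH": [5.5, 6.1], "depth": [10, 30]})
site_b = pd.DataFrame({"org_C": [0.8], "pH_water": [6.8], "sample_depth_cm": [20]})

combined = pd.concat([harmonize(site_a), harmonize(site_b)], ignore_index=True)
print(combined)
```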