Figure (available from PLOS ONE): Annotator’s agreement with the QUALITY correction, overall and for one, two, three or more terms per artifact.

Source publication
Article
The increasing amount of publicly available research data provides the opportunity to link and integrate data in order to create and prove novel hypotheses, to repeat experiments or to compare recent data to data collected at a different time or place. However, recent studies have shown that retrieving relevant data for data reuse is a time-consuming ...

Citations

... The parameters and types of data constitute critical categories of information that greatly affect effective data management. Inadequately defined metadata structures can impede data retrieval and exploration, thereby providing insufficient support for researchers in their endeavours [39,40]. Two primary standards for biodiversity informatics are widely recognised and utilised by major networks: Darwin Core [18] and Access to Biological Collections Data [41]. ...
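As a concrete illustration of what such a standard looks like in practice, here is a minimal sketch of one occurrence record expressed with a few widely used Darwin Core terms and written out as a CSV row; the term selection and all values are illustrative, not a complete or mandated Darwin Core profile.

```python
import csv

# A minimal occurrence record using a handful of widely used Darwin Core terms.
# The term selection is illustrative, not a complete or mandated profile.
occurrence = {
    "occurrenceID": "urn:example:occ:0001",   # hypothetical identifier
    "basisOfRecord": "HumanObservation",
    "scientificName": "Riccia glauca",
    "eventDate": "2021-06-15",
    "country": "Germany",
    "decimalLatitude": 50.927,
    "decimalLongitude": 11.586,
}

# Darwin Core data are commonly exchanged as simple text tables; here we just
# write a single CSV row with the term names as column headers.
with open("occurrences.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(occurrence))
    writer.writeheader()
    writer.writerow(occurrence)
```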
Article
Numerous institutions engaged in the management of Natural History Collections (NHC) are embracing the opportunity to digitise their holdings. The primary objective is to enhance the accessibility of specimens for interested individuals and to integrate them into the global community by contributing to an international specimen database. This initiative demands a comprehensive digitisation process and the development of an IT infrastructure that adheres to stringent functionality, reliability, and security standards. This endeavour focuses on the procedural and operational dimensions associated with accurately storing and managing taxonomic, biogeographic, and ecological data about biological specimens digitised within a conventional NHC framework. The authors suggest categorising the IT challenges into four distinct areas: requirements, digitisation, design, and technology. Each category discusses several selected topics, highlighting often underestimated essentials for implementing the NHC system. The presented analysis is supported by numerous examples of specific implementations, enabling a better understanding of the given topic. This document serves as a resource for teams developing their systems for online collections, offering post factum insights derived from implementation experiences.
... Findable biodiversity data are the first pillar of FAIR data sharing (Costello et al., 2013;Costello and Wieczorek, 2014;Reyserhove et al., 2020). Biodiversity scientists and institutions in SEA countries should ensure that biodiversity data are easily discoverable through standardised and harmonised metadata formats (e.g., Darwin Core standard (Wieczorek et al., 2012)) and indexing protocols that can quickly locate relevant information (Löffler et al., 2021a). Furthermore, governmental funding bodies might consider mandating that the research they support adheres to standardised formats, ensuring that the data are both discoverable and accessible. ...
... Making biodiversity data accessible allows for more data utilisation and applications that will facilitate more robust evidence-based decision-making and maximise the utilisation of existing data for conservation planning exercises and management initiatives (e.g., developing species distribution models) (Cayuela et al., 2009;Gonzalez et al., 2023;Orr et al., 2022). As diverse stakeholders spread across various sectors and geographical regions, accessible biodiversity data would promote transparency, foster public engagement, and empower local communities to actively participate in harmonised conservation efforts throughout the country (Costello et al., 2013;Lannom et al., 2020;Löffler et al., 2021a). Furthermore, ensuring that data are available in user-friendly formats and language translations improves inclusivity and allows for greater participation in biodiversity conservation activities. ...
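The first excerpt above mentions harmonised metadata and indexing protocols that can quickly locate relevant information; the toy sketch below shows the underlying idea of an inverted index over dataset titles. The dataset identifiers and titles are invented for illustration.

```python
from collections import defaultdict

# Toy illustration of an inverted index over harmonised metadata: map each
# lower-cased token to the identifiers of the datasets whose title contains it.
datasets = {
    "ds1": "Bird occurrence records from protected areas in Thailand",
    "ds2": "Soil invertebrate abundance along a land-use gradient",
    "ds3": "Occurrence data for freshwater fish in the Mekong basin",
}

index = defaultdict(set)
for dataset_id, title in datasets.items():
    for token in title.lower().split():
        index[token].add(dataset_id)

def search(term: str) -> set[str]:
    """Return the identifiers of datasets whose title contains the term."""
    return index.get(term.lower(), set())

print(search("occurrence"))  # {'ds1', 'ds3'}
```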
Article
The tropical Southeast Asian region, with its unique geographical features, is home to a multitude of distinct species that are affected by various human and natural activities. The availability of biodiversity data is crucial for understanding species distribution and responses to environmental changes to develop effective conservation priorities. In this perspective paper, I examined the patterns and trends of biodiversity in Southeast Asia within the Global Biodiversity Information Facility (GBIF) and highlighted important gaps, priorities, and opportunities for the region. Thailand accounted for 28 % of GBIF occurrence records in Southeast Asia, followed by Indonesia (19 %), Malaysia (18 %), and the Philippines (13 %). A significant portion of biodiversity data comes from citizen science platforms, such as eBird (56 %) and iNaturalist (6 %), highlighting the significance of the public in data mobilisation. Nonetheless, the biodiversity data for five of the 11 Southeast Asian countries are poorly represented by domestic researchers, with approximately 41 % of the region's GBIF occurrence data contributed by researchers or institutions from outside Southeast Asia. Furthermore, over the past 24 years (2000–2024), at least 30 % of terrestrial vertebrate occurrence records in Southeast Asia overlap with Protected Areas (PAs). In Southeast Asia, where species often span borders, I argue that open and FAIR data sharing should be considered standard practice in the biodiversity research community, integrated into biodiversity agendas and funding policies. Consequently, I propose the open-NOTE steps (Normalise, Organise, Train, and Engage) as a practical framework to promote open and FAIR data sharing in Southeast Asia and beyond.
... Either elaborate tests or literature research is required to ensure good analytical results, which is associated with high costs for analysis consumables and equipment as well as considerable time expenditure. The lack of relevant settings and data for actual scientific use is a recurring problem in literature research and repository searches [18,19]. For example, the adoption of a method from another publication for validation purposes is not possible if essential settings (such as separation column diameter or length and eluent flow rate) are absent from the publication. ...
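The point about absent method settings can be made concrete with a small sketch that records analytical settings as structured metadata and flags missing required fields; the field names and values here are hypothetical, not drawn from any particular standard or from the cited publication.

```python
# Required analytical settings for method reuse; names are hypothetical examples.
REQUIRED_SETTINGS = {
    "column_length_mm",
    "column_inner_diameter_mm",
    "eluent_flow_rate_ml_min",
}

method = {
    "technique": "UPLC/ESI-QTOF-MS",
    "column_length_mm": 150,
    "column_inner_diameter_mm": 2.1,
    # "eluent_flow_rate_ml_min" is deliberately left out to trigger the check below
}

missing = REQUIRED_SETTINGS - method.keys()
if missing:
    print(f"Method description incomplete, missing: {sorted(missing)}")
```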
... Users might find data difficult to understand due to lack of standardized documentation and formatting, challenging data integration and analysis, and incomplete or unclear metadata (Reichman et al., 2011). Indeed, inadequately captured information in metadata is a significant barrier for data retrieval (Löffler et al., 2021). Moreover, considerable efforts are required to compile and analyze data, because they are spread over heterogeneous repositories with varying metadata standards (e.g., Vlah et al., 2023), and use different terminology and scales (Reichman et al., 2011). ...
Article
Free use and redistribution of data (i.e., Open Data) increases the reproducibility, transparency, and pace of aquatic sciences research. However, barriers to both data users and data providers may limit the adoption of Open Data practices. Here, we describe common Open Data challenges faced by data users and data providers within the aquatic sciences community (i.e., oceanography, limnology, hydrology, and others). These challenges were synthesized from literature, authors’ experiences, and a broad survey of 174 data users and data providers across academia, government agencies, industry, and other sectors. Through this work, we identified seven main challenges: 1) metadata shortcomings, 2) variable data quality and reusability, 3) open data inaccessibility, 4) lack of standardization, 5) authorship and acknowledgement issues, 6) lack of funding, and 7) unequal barriers around the globe. Our key recommendation is to improve resources to advance Open Data practices. This includes dedicated funds for capacity building, hiring and maintaining skilled personnel, and robust digital infrastructures for preparation, storage, and long-term maintenance of Open Data. Further, to incentivize data sharing, we reinforce the need for standardized best practices to handle data acknowledgement and citations for both data users and data providers. We also highlight and discuss regional disparities in resources and research practices within a global perspective.
... However, the effectiveness of these systems depends on the user's ability to evaluate the relevance of the retrieved data (Koesten et al., 2017). Data retrieval systems provide users with metadata, structured information that describes salient attributes of the data, to inform their evaluation of its relevance (Brickley et al., 2019;Löffler et al., 2021). Within these systems' search results interfaces, metadata includes elements that describe the data's attributes related to content (e.g., title and abstract), provenance (e.g., creator and repository), and popularity (e.g., download frequency and citation frequency). ...
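A minimal sketch of the three metadata categories named above (content, provenance, popularity), assuming invented field names and an invented example record rather than any specific repository's schema:

```python
from dataclasses import dataclass

# Illustrative grouping of search-result metadata into content, provenance,
# and popularity elements; field names do not follow any particular schema.
@dataclass
class SearchResultMetadata:
    # content
    title: str
    abstract: str
    # provenance
    creator: str
    repository: str
    # popularity
    downloads: int
    citations: int

hit = SearchResultMetadata(
    title="Annual plankton counts, Lake X",  # hypothetical dataset
    abstract="Weekly plankton counts collected 2010-2020.",
    creator="Doe, J.",
    repository="ExampleRepo",
    downloads=412,
    citations=7,
)
print(hit.title, hit.repository, hit.downloads)
```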
Article
Integrating diverse cues from metadata to make sense of retrieved data during relevance evaluation is a crucial yet challenging task for data searchers. However, this integrative task remains underexplored, impeding the development of effective strategies to address metadata's shortcomings in supporting this task. To address this issue, this study proposes the “Integrative Use of Metadata for Data Sense‐Making” (IUM‐DSM) model. This model provides an initial framework for understanding the integrative tasks performed by data searchers, focusing on their integration patterns and associated challenges. Experimental data were analyzed using an interpretable deep learning‐based prediction approach to validate this model. The findings offer preliminary support for the model, revealing that data searchers engage in integrative tasks to utilize metadata effectively for data sense‐making during relevance evaluation. They construct coherent mental representations of retrieved data by integrating systematic and heuristic cues from metadata through two distinct patterns: within‐category integration and across‐category integration. This study identifies key challenges: within‐category integration entails comparing, classifying, and connecting systematic or heuristic cues, while across‐category integration necessitates considerable effort to integrate cues from both categories. To support these integrative tasks, this study proposes strategies for mitigating these challenges by optimizing metadata layouts and developing intelligent data retrieval systems.
... KEYWORDS: research data; research data management; research data repository. Metadata management of research data is essential for securing grants, funding productivity, ensuring the future use of research data, and enabling collaboration [across the disciplines] (Doucette & Fyfe, 2013). The rising amount of publicly available research data allows linking and integrating data to make and prove novel hypotheses, repeat experiments, or compare recent data to data collected at a different time or place (Löffler et al., 2021). Concurrently, research practices underline increasing datafication, creating new incentives for researchers to share their data and metadata. ...
... Publicly available research data allow researchers to compare recent data with previously deposited data to determine its relevance (Löffler et al., 2021). Furthermore, RDRs are one of the main actors in such a process (Assante et al., 2016). ...
Article
Metadata is vital for information storage and retrieval from a database or repository. In the case of Research Data Repositories (RDRs), metadata can be a potent tool for describing and identifying data. Further, producing the metadata is indispensable for fostering data reuse. A filtered view of the registry of research data repositories, re3data.org, depicts no uniform patterns or standards for metadata in the case of RDRs, and the metadata elements and practices differ from RDR to RDR. The present study describes the features of a select number of RDRs and analyzes their metadata practices: Harvard Dataverse, Dryad, Figshare, Zenodo, and the Open Science Framework (OSF). It further examines the total number of metadata elements, common metadata elements, required metadata elements, and item-level metadata. Results indicate that even though Harvard Dataverse has the most metadata elements, Dryad provides rich metadata concerning item level. This study suggests a common metadata framework, richer metadata elements, and more features to make the research data’s interoperability possible from one RDR to another.
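The comparison of element counts described in the abstract can be sketched as simple set operations; the repository names and element lists below are invented stand-ins, not the actual schemas of Harvard Dataverse, Dryad, Figshare, Zenodo, or OSF.

```python
# Toy comparison of metadata element sets across repositories, in the spirit of
# counting total and common elements; the element lists are illustrative only.
elements = {
    "repo_a": {"title", "creator", "date", "license", "keywords", "funder"},
    "repo_b": {"title", "creator", "date", "description", "doi"},
    "repo_c": {"title", "creator", "description", "subject", "doi"},
}

common = set.intersection(*elements.values())
totals = {repo: len(els) for repo, els in elements.items()}

print("elements per repository:", totals)
print("elements common to all repositories:", sorted(common))
```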
... Vast archives of papers in scientific journals and preserved specimens in natural history museums contain invaluable information on biological diversity. However, this information has proven to be inaccessible for large data analysis [57] and its automatic extraction has been historically limited [58], thus hampering research into ongoing human-caused biodiversity crises. The efficient access and extraction of this information would give scientists access to a vast amount of data. ...
Preprint
In this paper, we describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision. We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications, enhancing code development through automated syntax correction, and refining the scientific writing process. Simultaneously, we articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use. Our critical discussion extends to the varying impacts of LLMs across fields, from the natural sciences, where they help model complex biological sequences, to the social sciences, where they can parse large-scale qualitative data. We conclude by offering a nuanced perspective on how LLMs can be both a boon and a boundary to scientific progress.
... All genetic information of any organism is stored in its nucleic acids, DNA and RNA, which can be extracted from individuals of a given species, from entire communities, or even from environmental samples (e.g., from the water column or from a sediment core). These nucleic acid extracts can then be subjected to a variety of high-throughput sequencing approaches, ranging from whole (meta)genome sequencing to (meta)transcriptomic and epigenetic analysis (e.g., Mason et al., 2017). These and related approaches generate huge amounts of nucleotide data, which require quality checking prior to taxonomic or functional analysis. ...
... As a prerequisite for big data analysis, curated and high-quality databases are essential, but the lack of metadata standards often presents issues (Stow et al., 2018), especially when the inability to taxonomically assign organisms hampers the usability of databases in, for example, biodiversity studies (Bayraktarov et al., 2019; Loeffler et al., 2021). Many data archives also suffer from the fact that submitters do not always adhere even to minimal sequence metadata standards. ...
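A small sketch of what checking submissions against a minimal metadata checklist could look like; the required fields are loosely inspired by minimum-information checklists such as MIxS but are illustrative, not an official standard, and the sample records are invented.

```python
# Illustrative minimal metadata checklist for sequence submissions.
MINIMAL_FIELDS = {"collection_date", "geo_loc_name", "env_medium", "target_gene"}

submissions = [
    {"sample": "S1", "collection_date": "2020-07-01", "geo_loc_name": "Antarctica",
     "env_medium": "sea water", "target_gene": "16S rRNA"},
    {"sample": "S2", "collection_date": "2020-07-02"},  # incomplete on purpose
]

for record in submissions:
    missing = MINIMAL_FIELDS - record.keys()
    status = "ok" if not missing else f"missing {sorted(missing)}"
    print(record["sample"], status)
```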
Article
This paper was initiated by a multidisciplinary Topic Workshop in the frame of the Deutsche Forschungsgemeinschaft Priority Program 1158 “Antarctic Research with Comparative Investigations in Arctic Ice Areas”, and hence it represents only the national view without claiming to be complete, but is intended to provide awareness and suggestions for the current discussion on so-called big data in many scientific fields. The importance of the polar regions and their essential role for the Earth system are both undoubtedly recognized. However, dramatic changes in the climate and environment have been observed first in the Arctic and later in Antarctica over the past few decades. While important data have been collected and observation networks have been built in Antarctica and the Southern Ocean, this is a relatively data-scarce region due to the challenges of remote data acquisition, expensive labor, and harsh environmental conditions. There are many approaches crossing multiple scientific disciplines to better understand Antarctic processes; to evaluate ongoing climatic and environmental changes and their manifold ecological, physical, chemical, and geological consequences; and to make (improved) predictions. Together, these approaches generate very large, multivariate data sets, which can be broadly classified as “Antarctic big data”. For these large data sets, there is a pressing need for improved data acquisition, curation, integration, service, and application to support fundamental scientific research. Based on deficiencies in crossing disciplines and to attract further interest in big data in Antarctic sciences, this article will (i) describe and evaluate the current status of big data in various Antarctic-related scientific disciplines, (ii) identify current gaps, and (iii) provide solutions to fill these gaps.
... Hervey et al. (2020) studied how users use facet search interfaces in searching for geospatial data and the relationship between facet search interfaces and metadata. Löffler et al. (2021) identified several important search interests among biodiversity researchers with the intent of studying related data portals. They found that existing metadata standards in those portals lacked the complexity needed in biodiversity work. ...
... According to the DDI, the acronym also stands for three aims of the standard: Document, Discover, Interoperate (http://www.ddialliance.org). The DDI standard addresses metadata from questionnaires and surveys in the social, behavioral, economic and health sciences (Löffler et al., 2021; Nie et al., 2021). Löffler et al. (2021) also pointed out that users in biodiversity fields use specific information categories like materials, chemicals, and biological and chemical processes to find relevant datasets. Those information categories should be covered in research data portals. ...
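The idea that such information categories should be covered in data portals can be sketched as a faceted filter over dataset metadata; the categories, datasets, and values below are hypothetical examples, not an actual portal's schema.

```python
# Toy faceted filter over dataset metadata using information categories such as
# material, chemical, and process; records and values are invented examples.
datasets = [
    {"id": "d1", "material": "soil", "chemical": "nitrate", "process": "nitrification"},
    {"id": "d2", "material": "leaf litter", "chemical": "phenol", "process": "decomposition"},
    {"id": "d3", "material": "soil", "chemical": "phosphate", "process": "weathering"},
]

def facet_filter(records, **facets):
    """Keep records whose metadata matches every requested facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]

print([r["id"] for r in facet_filter(datasets, material="soil")])  # ['d1', 'd3']
print([r["id"] for r in facet_filter(datasets, material="soil", process="nitrification")])  # ['d1']
```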
... Reference bioimaging data were generated from raw microscopic images and linked to technical and expressive metadata using standardized semantics [41][42][43][44]. When extracting phenotypic traits from bioimaging data, it is possible to estimate both quantitative traits (i.e., leaf and stem area, length, width of leaves, stems and plants, specific leaf area, specific stem density) and qualitative traits (i.e., growth stature, vegetative propagule, or leaf shape and type) by combining elemental analysis with machine-learning-driven image analysis and computer vision [45,46]. ...
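As a toy illustration of deriving a quantitative trait from segmented image data, the sketch below computes a projected leaf area from a synthetic binary mask; the mask, pixel calibration, and trait choice are assumptions for illustration, not the cited workflow.

```python
import numpy as np

# Derive a simple quantitative trait (projected leaf area) from a segmented
# bioimage; the mask and pixel size are synthetic stand-ins for the output of
# an actual image-segmentation step.
pixel_size_mm = 0.05                       # assumed spatial calibration
mask = np.zeros((200, 200), dtype=bool)    # background
mask[60:140, 50:150] = True                # pretend this region is the leaf

leaf_pixels = int(mask.sum())
leaf_area_mm2 = leaf_pixels * pixel_size_mm ** 2
print(f"projected leaf area: {leaf_area_mm2:.1f} mm^2")
```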
Article
Integrative taxonomy is a fundamental part of biodiversity research and combines traditional morphology with additional methods such as DNA sequencing or biochemistry. Here, we aim to establish untargeted metabolomics for use in chemotaxonomy. We used three thallose liverwort species, Riccia glauca, R. sorocarpa, and R. warnstorfii (order Marchantiales, Ricciaceae), with Lunularia cruciata (order Marchantiales, Lunulariaceae) as an outgroup. Liquid chromatography high-resolution mass spectrometry (UPLC/ESI-QTOF-MS) with data-dependent acquisition (DDA-MS) was integrated with DNA marker-based sequencing of the trnL-trnF region and high-resolution bioimaging. Our untargeted chemotaxonomy methodology enables us to distinguish taxa based on chemophenetic markers at different levels of complexity: (1) molecules, (2) compound classes, (3) compound superclasses, and (4) molecular descriptors. For the investigated Riccia species, we identified 71 chemophenetic markers at the molecular level, a characteristic composition in 21 compound classes, and 21 molecular descriptors largely indicating electron state, presence of chemical motifs, and hydrogen bonds. Our untargeted approach revealed many chemophenetic markers at different complexity levels that can provide more mechanistic insight into the phylogenetic delimitation of species within a clade than genetic-based methods coupled with traditional morphology-based information. However, analytical and bioinformatics analysis methods still need to be better integrated to link the chemophenetic information at multiple scales.