
Leyla Jael Castro
ZB MED - Information Centre for Life Sciences | ZBMED
Doctor of Engineering
About
122
Publications
23,873
Reads
10,755
Citations
Introduction
Additional affiliations
January 2010 - March 2011
Education
January 2007 - December 2007
January 2004 - September 2005
January 1992 - September 1997
Publications (122)
Supervised machine learning (ML) is used extensively in biology and deserves closer scrutiny. The Data Optimization Model Evaluation (DOME) recommendations aim to enhance the validation and reproducibility of ML research by establishing standards for key aspects such as data handling and processing, optimization, evaluation, and model interpretabil...
The term Research Software Engineer, or RSE, emerged a little over 10 years ago as a way to represent individuals working in the research community but focusing on software development. The term has been widely adopted and there are a number of high-level definitions of what an RSE is. However, the roles of RSEs vary depending on the institutional...
Camera traps and passive acoustic devices are particularly useful in providing non-invasive methods to document wildlife diversity, ecology, behavior, and conservation. The application of autonomous Internet of Things (IoT) sensors is constantly developing and opens up new application possibilities for research and nature conservation such as taxon...
Supervised machine learning (ML) is used extensively in biology and deserves closer scrutiny. The DOME recommendations aim to enhance the validation and reproducibility of ML research by establishing standards for key aspects such as data handling and processing, optimization, evaluation, and model interpretability. The recommendations help to ensu...
This report provides an overview of our activities and accomplishments concerning machine-actionable Software Management Plans (SMPs) and the Software Management Wizard (SMW) during the ELIXIR BioHackathon Europe 2023. ELIXIR acknowledges the critical role of effective software management in facilitating sustainable and reproducible research outcom...
This paper presents the work executed on BioHackrXiv during the international ELIXIR BioHackathon Europe in Paris, France, 2022. BioHackrXiv is a scholarly publication service for BioHackathons and codefests that target biology and the biomedical sciences in the spirit of pre-publishing platforms.
Nowadays scientists massively produce diverse datasets in many communities. They need to combine them to answer scientific or novel questions. To do so, these diverse computational resources need first to be found by search engines. Bioschemas provides a simple and lightweight mechanism to annotate online resources in a standardized way and expose...
As part of the BioHackathon Europe 2023, we here report on the progress of the hacking team preparing a resource index and knowledge graph based on the JSON-LD Bioschemas markup from several resources in the life- and natural sciences, predominantly from the fields of plant- and (bio)chemistry research. This preliminary analysis will allow us to be...
As part of the BioHackathon Europe 2023, we here report on the progress of the hackathon project #15: "Enabling FAIR Digital Objects with RO-Crate, Signposting and Bioschemas". We added Signposting to three existing resources, and made a Chrome browser extension to show Signposting headers. We added RO-Crate to two existing resources, and explore...
Reproducible research and open science practices have the potential to accelerate scientific progress by allowing others to reuse research outputs, and by promoting rigorous research that is more likely to yield trustworthy results. However, these practices are uncommon in many fields, so there is a clear need for training that helps and encourages...
Schema.org is a controlled vocabulary that makes it easier for web pages to describe their actual content in a semantic, structured and machine-processable way. It is recognized by major search engines and data aggregators, making it easier for researchers to expose metadata describing their research outcomes. Here we present how Schema.org is used...
Research data is on its way to be recognized as a first-class citizen in research; however, and despite its importance for science, software still has a long way to go. Recent initiatives are paving the way, including FAIR for Research Software and Software Management Plans. A step further towards machine-actionability is adding a structured metada...
The FAIR principles were introduced to enhance data reuse by providing guidelines for effective data management practices. In the broader context of research, assets encompass not only data but also artifacts such as code, software, and publications. FAIRifying these artifacts is as essential as FAIRifying data, given the increasing complexity of c...
The collection of metadata for research data is an important aspect in the FAIR principles. The schema.org and Bioschemas initiatives created a vocabulary to embed markup for many different types, including BioChemEntity, ChemicalSubstance, Gene, MolecularEntity, Protein, and others relevant in the Natural and Life Sciences with immediate benefits...
RO-Crate makes it easier to package research digital objects together with their metadata so both dependencies and context can be captured. Combined with FAIR good practices such as the use of persistent identifiers, inclusion of license, clear object provenance, and adherence to community standards, RO-Crate provides a way to increase FAIRness i...
Although FAIR Research Data Principles are targeted at and implemented by different communities, research disciplines, and research stakeholders (data stewards, curators, etc.), there is no conclusive way to determine the level of FAIRness intended or required to make research artefacts (including, but not limited to, research data) Findable, Acces...
Machine learning (ML) methods are becoming ever more prevalent across all domains of life sciences. However, a key component of effective ML is the availability of large datasets that are diverse and representative. In the context of health systems, with significant heterogeneity of clinical phenotypes and diversity of healthcare systems, there exists...
Stand-alone life science training events and e-learning solutions are among the most sought-after modes of training because they address both point-of-need learning and the limited timeframes available for “upskilling.” Yet, finding relevant life sciences training courses and materials is challenging because such resources are not marked up for int...
Across disciplines, researchers increasingly recognize that open science and reproducible research practices may accelerate scientific progress by allowing others to reuse research outputs and by promoting rigorous research that is more likely to yield trustworthy results. While initiatives, training programs, and funder policies encourage research...
Bioschemas is a grassroots community effort to improve FAIRness of resources in the Life sciences by defining specific Life Science metadata schemas and exposing that metadata from resources that have adopted it. Now that some initial types have been adopted directly into schema.org, an improved mechanism is required to reignite community engagemen...
This article details a correction to the article: Caracciolo, C., Aubin, S., Jonquet, C., Amdouni, E., David, R., Garcia, L., Whitehead, B., Roussey, C., Stellato, A. and Villa, F., 2020. 39 Hints to Facilitate the Use of Semantics for Data on Agriculture and Nutrition. 'Data Science Journal', 19(1), p.47. DOI: http://doi.org/10.5334/dsj-2020-047
In this paper we present the work executed on BioHackrXiv during the international ELIXIR BioHackathon in Barcelona, Spain, 2021.
NFDI4DataScience (NFDI4DS) is a consortium founded to support researchers in all stages of the research data lifecycle in order to conduct their research in line with the FAIR principles. The infrastructure developed targets researchers from a wide range of disciplines working in the field of data science and artificial intelligence. NFDI4DS contri...
The ever-increasing amount of research output through scientific articles requires means to enable transparency and a better understanding of key entities of the research lifecycle, referred to as research artifacts, such as methods, software, datasets, etc. Research Knowledge Graphs (RKG) make research artifacts findable, accessible, interoperable...
The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity and use different evaluation modalities. The OAEI 2023 campaign offered 15 tracks and was attended by 16 participants. This paper is an overall...
biohackrxiv.org is a scholarly publication service for BioHackathons and Codefests where papers are generated from Markdown templates where the header is a YAML/JSON record that includes the title, authors, affiliations and tags. Many projects in BioHackathons are about using FAIR data. Because the current setup is lacking in the findable (F) and acces...
Stand-alone life science training events and e-learning solutions are amongst the most sought-after modes of training because they address both point-of-need learning and the limited timeframes available for 'upskilling'. Yet, finding relevant life sciences training courses and materials is challenging because such resources are not marked up for I...
The COVID-19 crisis demonstrates a critical requirement for rapid and efficient sharing of data to facilitate the global response to this and future pandemics. Our project aims are to enhance interoperability between health and research data by mapping Phenopackets and OMOP schemas, and representing COVID-19 metadata using the FAIR principles to en...
Involving users in early phases of software development has become a common strategy as it enables developers to consider user needs from the beginning. Once a system is in production, new opportunities to observe, evaluate and learn from users emerge as more information becomes available. Gathering information from users to continuously evaluate t...
Research software is a fundamental and vital part of research, yet significant challenges to discoverability, productivity, quality, reproducibility, and sustainability exist. Improving the practice of scholarship is a common goal of the open science, open source, and FAIR (Findable, Accessible, Interoperable and Reusable) communities and research...
Background
The FAIR principles (Wilkinson et al. 2016) are fundamental for data discovery, sharing, consumption and reuse; however their broad interpretation and many ways to implement can lead to inconsistencies and incompatibility (Jacobsen et al. 2020).
The European Open Science Cloud (EOSC) has been instrumental in maturing and encouraging FAIR...
RO-Crate (Soiland-Reyes et al. 2022) is a lightweight method to package research outputs along with their metadata, based on Linked Data principles (Bizer et al. 2009) and W3C standards. RO-Crate provides a flexible mechanism for researchers archiving and publishing rich data packages (or any other research outcome) by capturing their dependencies...
In academic research virtually every field has increased its use of digital and computational technology, leading to new scientific discoveries, and this trend is likely to continue. Reliable and efficient scholarly research requires researchers to be able to validate and extend previously generated research results. In the digital era, this implie...
Academic research requires careful handling of data plus any means to collect, transform and publish it, activities commonly supported by research software (from scripts to end-user applications). Data Management Plans (DMPs) are nowadays commonly requested by funders as part of good research practices. A DMP describes the data management lifecycle...
The increased number of data repositories has greatly increased the availability of open data. To enable broad discovery and access to research datasets, some data repositories have begun leveraging data discovery services from commercial search engines by embedding structured metadata markup in dataset web landing pages using vocabularies from Sche...
The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity and use different evaluation modalities. The OAEI 2022 campaign offered 14 tracks and was attended by 18 participants. This paper is an overall...
The concept of Data Management Plan (DMP) has emerged as a fundamental tool to help researchers through the systematic management of data. The Research Data Alliance DMP Common Standard (DCS) working group developed a set of universal concepts characterising a DMP so it can be represented as a machine-actionable artefact, i.e., machine-actionable...
The concept of Data Management Plan (DMP) has emerged as a fundamental tool to help researchers through the systematic management of data. The Research Data Alliance DMP Common Standard (DCS) working group developed a core set of universal concepts characterising a DMP in the pursuit of producing a DMP as a machine-actionable information artefact...
The Living Labs for Academic Search (LiLAS) lab aims to strengthen the concept of user-centric living labs for academic search. The methodological gap between real-world and lab-based evaluation should be bridged by allowing lab participants to evaluate their retrieval approaches in two real-world academic search systems from life sciences and soci...
Background
Drug repurposing can improve the return of investment as it finds new uses for existing drugs. Literature-based analyses exploit factual knowledge on drugs and diseases, e.g. from databases, and combine it with information from scholarly publications. Here we report the use of the Open Discovery Process on scientific literature to identi...
An increasing number of researchers support reproducibility by including pointers to and descriptions of datasets, software and methods in their publications. However, scientific articles may be ambiguous, incomplete and difficult to process by automated systems. In this paper we introduce RO-Crate, an open, community-driven, and lightweight approa...
The common standard for machine-actionable Data Management Plans (DMPs) allows for automatic exchange, integration, and validation of information provided in DMPs. In this paper, we report on the hackathon organised by the Research Data Alliance in which a group of 89 participants from 21 countries worked collaboratively on use cases exploring the...
Data Management Plans are now considered a key element of Open Science. They describe the data management life cycle for the data to be collected, processed and/or generated within the lifetime of a particular project or activity. A Software Management Plan (SMP) plays the same role but for software. Beyond its management perspective, the main adv...
One of the recurring questions when it comes to BioHackathons is how to measure their impact, especially when funded and/or supported by the public purse (e.g., research agencies, research infrastructures, grants). In order to do so, we first need to understand the outcomes from a BioHackathon, which can include software, code, publications, new or...
Background: The coronavirus disease 2019 (COVID-19) global pandemic required a rapid and effective response. This included ethical and legally appropriate sharing of data. The European Commission (EC) called upon the Research Data Alliance (RDA) to recruit experts worldwide to quickly develop recommendations and guidelines for COVID-related data sh...
The systemic challenges of the COVID-19 pandemic require cross-disciplinary collaboration in a global and timely fashion. Such collaboration needs open research practices and the sharing of research outputs, such as data and code, thereby facilitating research and research reproducibility and timely collaboration beyond borders. The Research Data A...
Knowledge graphs have successfully been adopted by academia, government and industry to represent large-scale knowledge bases. Open and collaborative knowledge graphs such as Wikidata capture knowledge from different domains and harmonize them under a common format, making it easier for researchers to access the data while also supporting Open Sci...
Meta-evaluation studies of system performances in controlled offline evaluation campaigns, like TREC and CLEF, show a need for innovation in evaluating IR-systems. The field of academic search is no exception to this. This might be related to the fact that relevance in academic search is multi-layered and therefore the aspect of user-centric evalua...
In this paper, we report on the outputs and adoption of the Agrisemantics Working Group of the Research Data Alliance (RDA), consisting of a set of recommendations to facilitate the adoption of semantic technologies and methods for the purpose of data interoperability in the field of agriculture and nutrition. From 2016 to 2019, the group gathered...
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately...
Copy number variations (CNVs) are major causative contributors both in the genesis of genetic diseases and human neoplasias. While “High-Throughput” sequencing technologies are increasingly becoming the primary choice for genomic screening analysis, their ability to efficiently detect CNVs is still heterogeneous and remains to be developed. The aim...
Academic Search is a timeless challenge that the field of Information Retrieval has been dealing with for many years. Even today, the search for academic material is a broad field of research that recently started working on problems like the COVID-19 pandemic. However, test collections and specialized data sets like CORD-19 only allow for system-o...
This report is based on the discussions and presentations that took place at the Workshop on Sustainable Software Sustainability (www.software.ac.uk/wosss19) in April 2019 (WOSSS19). It captures the state of the art for a range of Software Sustainability themes that were brought up by the organisers and attendees of the workshop.
Author summary
Everything we do today is becoming more and more reliant on the use of computers. The field of biology is no exception; but most biologists receive little or no formal preparation for the increasingly computational aspects of their discipline. In consequence, informal training courses are often needed to plug the gaps; and the demand...
Glycoinformatics plays a major role in glycobiology research, and the development of a comprehensive glycoinformatics knowledgebase is critical. This application note describes the GlyGen data model, processing workflow and the data access interfaces featuring programmatic use case example queries based on specific biological questions. The GlyGen...
Validating RDF data becomes necessary in order to ensure data compliance against the conceptualization model it follows, e.g., schema or ontology behind the data, and improve data consistency and completeness. There are different approaches to validate RDF data, for instance, JSON schema, particularly for data in JSONLD format, as well as Shape Exp...
UniProt continues to support the ongoing process of making scientific data FAIR. Here we contribute to this process with a FAIRness assessment of our UniProtKB dataset followed by a critical reflection on the challenges and future directions of the adoption and validation of the FAIR principles and metrics.
Publishing databases in the Resource Description Framework (RDF) model is becoming widely accepted to maximize the syntactic and semantic interoperability of open data in life sciences. Here we report advancements made in the 6th and 7th annual BioHackathons which were held in Tokyo and Miyagi respectively. This review consists of two major section...
The total number of scholarly publications grows day by day, making it necessary to explore and use simple yet effective ways to expose their metadata. Schema.org supports adding structured metadata to web pages via markup, making it easier for data providers but also for search engines to provide the right search results. Bioschemas is based on th...
This report is based on the discussions and presentations that took place at the Workshop on Sustainable Software Sustainability in April 2019 in The Hague (WOSSS19). It captures the state of the art for a range of software sustainability themes that were brought up by the organisers and attendees of the workshop.
A significant portion of biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies. In this paper we present the se...
Using Blockchain in order to preserve the digital life cycle