Tarek Al Mustafa’s research while affiliated with Friedrich Schiller University Jena and other places


Publications (5)


Reflections from the 2025 EcoHack: AI & LLM Hackathon for Applications in Evidence-based Ecological Research & Practice
  • Preprint

April 2025 · 87 Reads · Tarek Al Mustafa · [...]

This paper presents outcomes from the inaugural "EcoHack: AI & LLM Hackathon for Applications in Evidence-based Ecological Research & Practice," which convened participants from across Europe and beyond, culminating in 11 team submissions. These submissions highlighted six broad application areas of AI for ecology: (1) AI-enhanced decision support and automation, (2) scientific search and communication, (3) knowledge extraction and reasoning, (4) AI for ecological modeling, forecasting, and simulation, (5) causal inference and ecological reasoning, and (6) AI for biodiversity monitoring and conservation. Each team’s project is summarized in a consolidated table—complete with links to source code—and described in brief papers in the appendix. Beyond summarizing technical results, this paper offers insights into the hackathon’s hybrid structure, featuring an in-person gathering in Bielefeld, Germany, alongside a global online hub that facilitated both local and virtual engagement. Throughout the event, participants showcased how large language models (LLMs) can serve as both robust tools for diverse machine learning tasks and flexible platforms for rapidly prototyping novel research applications. These efforts underscore the importance of stronger technological bridges among stakeholders in ecology, including practitioners, local farmers, and policymakers. Overall, the EcoHack outcomes highlight the transformative potential of AI in driving scientific discovery and fostering interdisciplinary collaboration in ecology.


Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them

March 2025 · 5 Reads

We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.
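The ontology-free variant described above relates automatically extracted concepts through distributional statistics. The abstract does not say which statistic is used, so the following is only a minimal sketch that assumes pointwise mutual information (PMI) over abstract-level co-occurrence. The function name `concept_pmi` and the substring-based concept matching are hypothetical stand-ins for illustration, not the authors' pipeline.

```python
import math
from collections import Counter
from itertools import combinations


def concept_pmi(abstracts, concepts):
    """Score concept pairs by PMI over co-occurrence within an abstract.

    Assumption: a concept "occurs" in an abstract if its lowercase form
    is a substring of the lowercased text. Pairs that never co-occur are
    omitted from the result.
    """
    n_docs = len(abstracts)
    doc_freq = Counter()   # number of abstracts mentioning each concept
    pair_freq = Counter()  # number of abstracts mentioning both concepts

    for text in abstracts:
        lowered = text.lower()
        # Sort so that each pair is always stored in the same order.
        present = sorted(c for c in set(concepts) if c.lower() in lowered)
        doc_freq.update(present)
        pair_freq.update(combinations(present, 2))

    scores = {}
    for (a, b), n_ab in pair_freq.items():
        p_ab = n_ab / n_docs
        p_a = doc_freq[a] / n_docs
        p_b = doc_freq[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Pairs with a high score co-occur more often than their individual frequencies would predict, which makes them candidates for related concepts in place of ontology links.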


Figure 1: Distribution of papers in our corpus with abstracts and with full text over the past 20 years.
Figure 2: Distribution of papers, in our corpus, by abstract and full-text availability across the top ten publishers.
Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models

March 2025 · 13 Reads

This paper presents an exploratory study that harnesses the capabilities of large language models (LLMs) to mine key ecological entities from invasion biology literature. Specifically, we focus on extracting species names, their locations, associated habitats, and ecosystems, information that is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Traditional text mining approaches often struggle with the complexity of ecological terminology and the subtle linguistic patterns found in these texts. By applying general-purpose LLMs without domain-specific fine-tuning, we uncover both the promise and limitations of using these models for ecological entity extraction. In doing so, this study lays the groundwork for more advanced, automated knowledge extraction tools that can aid researchers and practitioners in understanding and managing biological invasions.


Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models

January 2025 · 15 Reads · 1 Citation



Fig. 1 Simple Provenance Example with two Entities, an Activity, and an Agent
Fig. 4 Visualisation of Approach 1 with an example KG. Nodes attached to the left of sub-KG show the KG itself. Nodes on the right, attached to prov-sub-KG, contain provenance information
Fig. 5 Visualisation of provenance data in the KG. Shown here are data points associated with the phases of KG generation
Managing Provenance Data in Knowledge Graph Management Platforms

February 2024 · 100 Reads · Datenbank-Spektrum

Knowledge Graphs (KGs) present factual information about domains of interest. They are used in a wide variety of applications and in different domains, serving as powerful backbones for organizing and extracting knowledge from complex data. In both industry and academia, a variety of platforms have been proposed for managing Knowledge Graphs. To use the full potential of KGs within these platforms, it is essential to have proper provenance management to understand where certain information in a KG stems from. This plays an important role in increasing trust and supporting open science principles. It enables reproducibility and updatability of KGs. In this paper, we propose a framework for provenance management of KG generation within a web portal. We present how our framework captures, stores, and retrieves provenance information. Our provenance representation is aligned with the standardized W3C Provenance Ontology. Through our framework, we can rerun the KG generation process over the same or different source data. With this, we support four applications: reproducibility, altered rerun, undo operation, and provenance retrieval. In summary, our framework aligns with the core principles of open science. By promoting transparency and reproducibility, it enhances the reliability and trustworthiness of research outcomes.
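As a rough illustration of capturing and retrieving provenance for KG generation, the sketch below records one generation step as triples using relation names from the W3C Provenance Ontology (`prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`, `prov:wasDerivedFrom`). The class `ProvenanceStore`, its methods, and the `ex:` identifiers are hypothetical and are not taken from the paper's framework, which may store and query provenance quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory store of (subject, predicate, object) triples."""

    triples: set = field(default_factory=set)

    def record_generation(self, source, output, activity, agent):
        """Capture one KG generation step: two entities, an activity, an agent."""
        self.triples.update({
            (activity, "prov:used", source),
            (output, "prov:wasGeneratedBy", activity),
            (activity, "prov:wasAssociatedWith", agent),
            (output, "prov:wasDerivedFrom", source),
        })

    def sources_of(self, entity):
        """Retrieve every entity that `entity` was derived from, transitively."""
        found, frontier = set(), [entity]
        while frontier:
            current = frontier.pop()
            for subj, pred, obj in self.triples:
                if (subj == current and pred == "prov:wasDerivedFrom"
                        and obj not in found):
                    found.add(obj)
                    frontier.append(obj)
        return found
```

With a record like this, a rerun over the same or different source data could start from the `prov:used` entry of the recorded activity, and provenance retrieval reduces to walking `prov:wasDerivedFrom` links back to the original sources.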

Citations (1)


... As citizen science initiatives develop and species trait characterization becomes more comprehensive, there is potential for our RAG pipeline to harness those improvements to better assist in identification and characterization of unknown species. Additionally, LLMs have recently been recruited to mine and curate data sources for biodiversity research (D'Souza et al., 2025), which would automate data curation and could soon boost the relevancy of RAG systems in Arthropoda taxonomy. ...

Reference:

Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification
Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models