
Sören Auer, Prof. Dr.
Professor at Leibniz Universität Hannover
Working on organizing the flood of research with the Open Research Knowledge Graph: https://www.orkg.org/
About
681 Publications
321,735 Reads
24,135 Citations
Introduction
My key research question is: "How can we digitize the work and information flows in science and technology?" I serve as director of TIB – Leibniz Information Centre for Science and Technology, the German National Library of Science and Technology. My research interests include social and semantic technologies, knowledge representation, engineering and management, usability, agile methodologies, as well as databases and information systems.
Publications (681)
This whitepaper gives an overview of the aims and architecture of the Industrial Data Space. Additionally, some use cases and the Industrial Data Space Association are introduced.
The management and analysis of large-scale datasets – described with the term Big Data – involves the three classic dimensions volume, velocity and variety. While the former two are well supported by a plethora of software components, the variety dimension is still rather neglected. We present the BDE platform – an easy-to-deploy, easy-to-use and a...
In the engineering and manufacturing domain, there is currently an atmosphere of departure to a new era of digitized production. In different regions, initiatives in these directions are known under different names, such as industrie du futur in France, industrial internet in the US or Industrie 4.0 in Germany. While the vision of digitizing produc...
The search for information on the Web of Data is becoming increasingly difficult due to its dramatic growth. Especially novice users need to acquire both knowledge about the underlying ontology structure and proficiency in formulating formal queries (e. g. SPARQL queries) to retrieve information from Linked Data sources. So as to simplify and autom...
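As a hedged illustration of the kind of formal query that novice users struggle to formulate over Linked Data sources, the following sketch sends a hand-written SPARQL query to the public DBpedia endpoint via the SPARQLWrapper library; the endpoint and query are example choices, not part of the approach summarized above.

```python
# Minimal illustration of querying a Linked Data source with SPARQL
# (example endpoint and query only; not the system described above).
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?person ?name WHERE {
        ?person a dbo:Scientist ;
                rdfs:label ?name .
        FILTER (lang(?name) = "en")
    } LIMIT 5
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["name"]["value"])
```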
With Linked Data, a very pragmatic approach towards achieving the vision of the Semantic Web has recently gained much traction. The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web. While many standards, methods and technologies developed within the Semantic Web community are applicabl...
Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an itera...
Ontology Alignment (OA) is fundamental for achieving semantic interoperability across diverse knowledge systems. We present OntoAligner, a comprehensive, modular, and robust Python toolkit for ontology alignment, designed to address current limitations with existing tools faced by practitioners. Existing tools are limited in scalability, modularity...
Every scientific discovery starts with an idea inspired by prior work, interdisciplinary concepts, and emerging challenges. Recent advancements in large language models (LLMs) trained on scientific corpora have driven interest in AI-supported idea generation. However, generating context-aware, high-quality, and innovative ideas remains challenging....
Scientific hypothesis generation is a fundamentally challenging task in research, requiring the synthesis of novel and empirically grounded insights. Traditional approaches rely on human intuition and domain expertise, while purely large language model (LLM) based methods often struggle to produce hypotheses that are both innovative and reliable. T...
This article presents the implementation of a neuro-symbolic system within the Open Research Knowledge Graph (ORKG), a platform designed to collect and organize scientific knowledge in a structured, machine-readable format. Our approach leverages the strengths of symbolic knowledge representation to encode complex relationships and domain-specific...
A major challenge in scholarly information retrieval is the semantic description of research contributions. While Generative Artificial Intelligence (AI) can assist in this regard, we need minimally invasive approaches for engaging users in the process. We introduce an innovative approach to annotating research articles directly within the browser...
Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may n...
Detecting hallucinations in Large Language Models (LLMs) remains a critical challenge for their reliable deployment in real-world applications. To address this, we introduce SelfCheckAgent, a novel framework integrating three different agents: the Symbolic Agent, the Specialized Detection Agent, and the Contextual Consistency Agent. These agents pr...
The number of published scholarly articles is growing at a significant rate, making scholarly knowledge organization increasingly important. Various approaches have been proposed to organize scholarly information, including describing scholarly knowledge semantically leveraging knowledge graphs. Transforming unstructured knowledge, presented within...
We are witnessing major advancements largely due to the capabilities of LLMs. LLMs have not only redefined the boundaries of text generation and comprehension but have paved the way for innovative applications in the synthesis of scientific research. The LLMs4Synthesis framework was introduced to automate and accurately generate scientific synthesi...
Domain-specific named entity recognition (NER) on Computer Science (CS) scholarly articles is an information extraction task that is arguably more challenging than NER in the general domain, owing to the various annotation aims that can hamper the task, and it has been studied less. Given that significant progress has been made on NER, we anticipate that sch...
There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. The SemEval-2021 Shared Task NLPC...
We describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) noting salient aspects of an article’s contribution in its title; and (ii) pattern regularities capturing the salient terms that could be expressed...
Background: The rapid advancements in Large Language Models (LLMs) have opened new avenues for automating complex tasks in AI research. This work investigates the efficacy of different LLMs (Mistral 7B, Llama-2, GPT-4-Turbo, and GPT-4o) in extracting leaderboards from AI research articles. We explored three types of contextual inputs to the mode...
We present a large-scale empirical investigation of the zero-shot learning phenomena in a specific recognizing textual entailment (RTE) task category, i.e., the automated mining of leaderboards for Empirical AI Research. The prior reported state-of-the-art models for leaderboards extraction formulated as an RTE task in a non-zero-shot setting are p...
In this talk about our research paper, we champion the use of structured and semantic content representation of discourse-based scholarly communication, inspired by tools like Wikipedia infoboxes or structured Amazon product descriptions. These representations provide users with a concise overview, aiding scientists in navigating the dense academic...
This presentation is a walk-through of the Large Language Models for Ontology Learning (LLMs4OL) Challenge 2024 at the 23rd International Semantic Web Conference (ISWC). This talk defines the challenge, introduces our benchmark dataset, presents the teams, showcases the leaderboard results, and offers takeaway insights.
Purpose: Finding scholarly articles is a time-consuming and cumbersome activity, yet crucial for conducting science. Due to the growing number of scholarly articles, new scholarly search systems are needed to effectively assist researchers in finding relevant literature. Methodology: We take a neuro-symbolic approach to scholarly search and explora...
Mathematical reasoning has proven to be a critical yet challenging task for large language models (LLMs), as they often struggle with complex multi-step problems. To address these limitations, we introduce the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST) algorithm, an enhancement of the Monte Carlo Tree Self-Refine (MCTSr) approach. By i...
Exponentially increasing interrelated scholarly knowledge is being published on multiple scholarly communication infrastructures. Retrieving data from a single scholarly communication infrastructure is not sufficient to meet users' complex requirements. Moreover, the manual linking of scholarly knowledge to produce interrelated outputs is a cumberso...
The reuse of research software is central to research efficiency and academic exchange. The application of software enables researchers to reproduce, validate, and expand upon study findings. The analysis of open-source code aids in the comprehension, comparison, and integration of approaches. Often, however, no further use occurs because relevant...
The growing volume of biomedical scholarly document abstracts presents an increasing challenge in efficiently retrieving accurate and relevant information. To address this, we introduce a novel approach that integrates an optimized topic modelling framework, OVB-LDA, with the BI-POP CMA-ES optimization technique for enhanced scholarly document abst...
Academic Search Engines (ASEs) are crucial for navigating the vast landscape of scholarly literature. Traditionally, these engines rely on keyword-based search, supplemented by predefined facets encompassing metadata such as research field, publication year, type, authors, and language. However, ASEs are limited in their ability to generate dynamic...
Ontology learning (OL) from unstructured data has evolved significantly, with recent advancements integrating large language models (LLMs) to enhance various aspects of the process. The paper introduces the LLMs4OL 2024 datasets, developed to benchmark and advance research in OL using LLMs. The LLMs4OL 2024 dataset as a key component of the LLMs4OL...
We are pleased to present the proceedings of the “1st Large Language Models for Ontology Learning Challenge (LLMs4OL 2024)”, held at the 23rd International Semantic Web Conference (ISWC). This challenge marks a significant advancement in utilizing Large Language Models (LLMs) for Ontology Learning (OL), a key Semantic Web component that facilitates...
This paper outlines the LLMs4OL 2024, the first edition of the Large Language Models for Ontology Learning Challenge. LLMs4OL is a community development initiative collocated with the 23rd International Semantic Web Conference (ISWC) to explore the potential of Large Language Models (LLMs) in Ontology Learning (OL), a vital process for enhancing th...
In response to the growing complexity and volume of scientific literature, this paper introduces the LLMs4Synthesis framework, designed to enhance the capabilities of Large Language Models (LLMs) in generating high-quality scientific syntheses. This framework addresses the need for rapid, coherent, and contextually rich integration of scientific in...
The number of published scholarly articles, exceeding 2.5 million yearly, makes it increasingly challenging for researchers to follow scientific progress. Integrating the contributions from scholarly articles into a novel type of cognitive knowledge graph (CKG) will be a crucial element for accessing and organizing scholarly knowledge, surpassing...
This paper introduces a scholarly Question Answering (QA) system on top of the NFDI4DataScience Gateway, employing a Retrieval Augmented Generation-based (RAG) approach. The NFDI4DS Gateway, as a foundational framework, offers a unified and intuitive interface for querying various scientific databases using federated search. The RAG-based scholarly...
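The retrieval-augmented generation pattern mentioned above can be sketched roughly as follows, assuming a sentence-transformers embedding model and a stubbed generation step; this is an illustrative sketch, not the Gateway's actual implementation.

```python
# Sketch of a Retrieval Augmented Generation (RAG) loop for scholarly QA.
# The embedding model and the answer_with_llm stub are illustrative choices,
# not the components used by the NFDI4DataScience Gateway.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Paper A introduces a knowledge graph for scholarly contributions.",
    "Paper B benchmarks LLMs for ontology learning.",
    "Paper C studies reproducibility in empirical AI research.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q          # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [documents[i] for i in top]

def answer_with_llm(question: str, context: list[str]) -> str:
    # Placeholder: a real system would call a generative model
    # with the retrieved passages as grounding context.
    return f"[LLM answer to '{question}' grounded in {len(context)} passages]"

question = "Which work evaluates LLMs for ontology learning?"
print(answer_with_llm(question, retrieve(question)))
```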
Food information engineering relies on statistical and AI techniques (e.g., symbolic, connectionist, and neurosymbolic AI) for collecting, storing, processing, diffusing, and putting food information in a form exploitable by humans and machines. Food information is collected manually and automatically. Once collected, food information is organized...
Authoring survey or review articles still requires significant tedious manual effort, despite many advancements in research knowledge management having the potential to improve efficiency, reproducibility, and reuse. However, these advancements bring forth an increasing number of approaches, tools, and systems, which often cover only specific stage...
Our study explores how well the state-of-the-art Large Language Models (LLMs), like GPT-4 and Mistral, can assess the quality of scientific summaries or, more fittingly, scientific syntheses, comparing their evaluations to those of human annotators. We used a dataset of 100 research questions and their syntheses made by GPT-4 from abstracts of five...
The rapid advancements in Large Language Models (LLMs) have opened new avenues for automating complex tasks in AI research. This paper investigates the efficacy of different LLMs (Mistral 7B, Llama-2, GPT-4-Turbo, and GPT-4o) in extracting leaderboard information from empirical AI research articles. We explore three types of contextual inputs to the...
This paper explores the impact of context selection on the efficiency of Large Language Models (LLMs) in generating Artificial Intelligence (AI) research leaderboards, a task defined as the extraction of (Task, Dataset, Metric, Score) quadruples from scholarly articles. By framing this challenge as a text generation objective and employing instruct...
This presentation is a walk-through of LLMs4OM: Matching Ontologies with Large Language Models. This framework utilizes two modules for retrieval and matching, respectively, enhanced by zero-shot prompting across three ontology representations: concept, concept-parent, and concept-children. Through comprehensive evaluations using 20...
Ontology Matching (OM), is a critical task in knowledge integration, where aligning heterogeneous ontologies facilitates data interoperability and knowledge sharing. Traditional OM systems often rely on expert knowledge or predictive models, with limited exploration of the potential of Large Language Models (LLMs). We present the LLMs4OM framework,...
The ORKG has opened a new era in the way scholarly knowledge is curated, managed, and disseminated. By transforming vast arrays of unstructured narrative text into structured, machine-processable knowledge, the ORKG has emerged as an essential service with sophisticated functionalities. Over the past five years, our team has developed the ORKG into...
Everyone agrees on the importance of objective scientific information. However, relevant scientific documents tend to be inherently difficult to find and understand either because of intricate terminology or the potential absence of prior knowledge among their readers. Can we improve accessibility for everyone? This paper introduces the SimpleText...
In this paper, we champion the use of structured and semantic content representation of discourse-based scholarly communication, inspired by tools like Wikipedia infoboxes or structured Amazon product descriptions. These representations provide users with a concise overview, aiding scientists in navigating the dense academic landscape. Our novel au...
The Open Research Knowledge Graph (ORKG) is a digital library for machine-actionable scholarly knowledge, with a focus on structured research comparisons obtained through expert crowdsourcing. While the ORKG has attracted a community of more than 1,000 users, the curated data has not been subject to an in-depth quality assessment so far. Here, prop...
One of the pillars of the scientific method is reproducibility: the ability to replicate the results of a prior study if the same procedures are followed. A...
The value of structured scholarly knowledge for research and society at large is well understood, but producing scholarly knowledge (i.e., knowledge traditionally published in articles) in structured form remains a challenge. We propose an approach for automatically extracting scholarly knowledge from published software packages by static analysis...
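As a rough illustration of what static analysis of a research software package can surface (not the extraction pipeline of the work above), the sketch below uses Python's standard ast module to list the libraries a source file imports and the functions it defines.

```python
# Illustrative static analysis of a Python source file: collect imported
# libraries and defined functions, the kind of raw facts from which
# structured descriptions of research software could be built.
import ast

def describe_source(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)

    imports, functions = set(), []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split(".")[0])
        elif isinstance(node, ast.FunctionDef):
            functions.append(node.name)

    return {"imports": sorted(imports), "functions": functions}

if __name__ == "__main__":
    # Analyse this very file as a quick self-test.
    print(describe_source(__file__))
```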
Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the influential scholarly document prediction task. In this paper, we describe our work that attempts to go beyond bibliometric metadata to predict influential scholarly documents. Furthermore, this work also examines the influential scholarly documen...
Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results, a v...
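For orientation, a minimal schema.org Dataset description of the kind such search technologies harvest from a dataset's landing page might look like the following; all field values are invented placeholders.

```python
# Illustrative schema.org "Dataset" description (placeholder values) of the
# kind that dataset search engines pick up from structured page markup.
import json

dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Survey Responses 2023",  # placeholder title
    "description": "Anonymised survey responses collected for an example study.",
    "url": "https://example.org/datasets/survey-2023",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Person", "name": "Jane Doe"},
}

# Typically embedded on the landing page inside
# <script type="application/ld+json"> ... </script>
print(json.dumps(dataset_markup, indent=2))
```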
This presentation is a walk-through of the Large Language Models for Ontology Learning (LLMs4OL) Paradigm.
We propose the LLMs4OL approach, which utilizes Large Language Models (LLMs) for Ontology Learning (OL). LLMs have shown significant advancements in natural language processing, demonstrating their ability to capture complex language patterns in different knowledge domains. Our LLMs4OL paradigm investigates the following hypothesis: Can LLMs effect...
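To make the prompting idea concrete, here is a rough zero-shot term-typing sketch; the prompt wording and the stubbed model call are assumptions for illustration and do not reproduce the LLMs4OL experimental setup.

```python
# Rough sketch of zero-shot term typing, one of the ontology learning tasks
# an LLM can be prompted for. Prompt wording and call_llm are illustrative
# assumptions, not the LLMs4OL setup itself.
PROMPT_TEMPLATE = (
    "In an ontology of general knowledge, what is the most specific type "
    "of the term '{term}'? Answer with a single type label."
)

def call_llm(prompt: str) -> str:
    # Placeholder for a call to a hosted or local language model.
    raise NotImplementedError("plug in your LLM client here")

def type_term(term: str) -> str:
    return call_llm(PROMPT_TEMPLATE.format(term=term)).strip()

# Example (requires a real call_llm implementation):
# print(type_term("aspirin"))   # e.g. "drug"
```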
[Background.] Empirical research in requirements engineering (RE) is a constantly evolving topic, with a growing number of publications. Several papers address this topic using literature reviews to provide a snapshot of its “current” state and evolution. However, these papers have never built on or updated earlier ones, resulting in overlap and re...
Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results,...
The Open Research Knowledge Graph (ORKG) is an Open Science digital infrastructure for the production, curation, publication, and reuse of machine-actionable scholarly knowledge. Built on top of the RDF data model and extensible ontologies, the ORKG provides a common vocabulary for researchers to describe their research contributions and data, impr...
The NFDI4Energy consortium aims to establish new services filling a variety of needs for the energy system research community, from making FAIR research data easily accessible to promoting collaboration among community entities. Seven Task Areas (TAs) have been defined to achieve the consortium’s objectives, each with a specific focus. Task Area 4...
The amount of research articles produced every day is overwhelming: scholarly knowledge is getting harder to communicate and easier to get lost. A possible solution is to represent the information in knowledge graphs: structures representing knowledge in networks of entities, their semantic types, and relationships between them. But this solution h...
Recent investigations have explored prompt-based training of transformer language models for new text genres in low-resource settings. This approach has proven effective in transferring pre-trained or fine-tuned models to resource-scarce environments. This work presents the first results on applying prompt-based training to transformers for scholar...
We propose the LLMs4OL approach, which utilizes Large Language Models (LLMs) for Ontology Learning (OL). LLMs have shown significant advancements in natural language processing, demonstrating their ability to capture complex language patterns in different knowledge domains. Our LLMs4OL paradigm investigates the following hypothesis: Can LLM...
[Background.] Empirical research in requirements engineering (RE) is a constantly evolving topic, with a growing number of publications. Several papers address this topic using literature reviews to provide a snapshot of its "current" state and evolution. However, these papers have never built on or updated earlier ones, resulting in overlap and re...
Complex research problems are increasingly addressed by interdisciplinary, collaborative research projects generating large amounts of heterogeneous data. The overarching processing, analysis and availability of data are critical success factors for these research efforts. Data repositories enable long-term availability of such data for th...
The reuse of research software is central to research efficiency and academic exchange. The application of software enables researchers with varied backgrounds to reproduce, validate, and expand upon study findings. Furthermore, the analysis of open source code aids in the comprehension, comparison, and integration of approaches. Often, however, no...
The purpose of this work is to describe the orkg-Leaderboard software designed to extract leaderboards defined as task–dataset–metric tuples automatically from large collections of empirical research papers in artificial intelligence (AI). The software can support both the main workflows of scholarly publishing, viz. as LaTeX files or as PDF files....
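To illustrate the target representation (not the tool's internal API), a leaderboard record as a task–dataset–metric tuple with a score could be modelled as follows; the field names and values are hypothetical.

```python
# Hypothetical representation of a leaderboard entry as a
# task-dataset-metric(-score) record, the target output of leaderboard
# extraction from empirical AI papers.
from dataclasses import dataclass

@dataclass(frozen=True)
class LeaderboardEntry:
    task: str
    dataset: str
    metric: str
    score: float
    source_paper: str  # e.g. a DOI or arXiv identifier

entry = LeaderboardEntry(
    task="Question Answering",
    dataset="SQuAD 1.1",
    metric="F1",
    score=93.2,
    source_paper="arXiv:0000.00000",  # placeholder identifier
)
print(f"{entry.task} / {entry.dataset} / {entry.metric} = {entry.score}")
```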
Recent developments in the context of semantic technologies have given rise to ontologies for modelling scientific information in various fields of science. Over the past years, we have been engaged in the development of the Science Knowledge Graph Ontologies (SKGO), a set of ontologies for modelling research findings in various fields of science....
The overall AI trend of creating neuro-symbolic systems is reflected in the Semantic Web community with an increased interest in the development of systems that rely on both Semantic Web resources and Machine Learning components (SWeMLS, for short). However, understanding trends and best practices in this rapidly growing field is hampered by a lack...
There have been many recent investigations into prompt-based training of transformer language models for new text genres in low-resource settings. The prompt-based training approach has been found to be effective in generalizing pre-trained or fine-tuned models for transfer to resource-scarce settings. This work, for the first time, reports results...
The purpose of this work is to describe the Orkg-Leaderboard software designed to extract leaderboards defined as Task-Dataset-Metric tuples automatically from large collections of empirical research papers in Artificial Intelligence (AI). The software can support both the main workflows of scholarly publishing, viz. as LaTeX files or as PDF files....
Knowledge graphs have gained increasing popularity in the last decade in science and technology. However, knowledge graphs are currently relatively simple to moderately complex semantic structures that are mainly a collection of factual statements. Question answering (QA) benchmarks and systems were so far mainly geared towards encyclopedic knowledge graphs...
The rapid growth of research publications has placed great demands on digital libraries (DL) for advanced information management technologies. To cater to these demands, techniques relying on knowledge-graph structures are being advocated. In such graph-based pipelines, inferring semantic relations between related scientific concepts is a crucial s...
Due to the growing number of scholarly publications, finding relevant articles becomes increasingly difficult. Scholarly knowledge graphs can be used to organize the scholarly knowledge presented within those publications and represent them in machine-readable formats. Natural language processing (NLP) provides scalable methods to automatically ext...
We present a large-scale empirical investigation of the zero-shot learning phenomena in a specific recognizing textual entailment (RTE) task category, i.e. the automated mining of leaderboards for Empirical AI Research. The prior reported state-of-the-art models for leaderboards extraction formulated as an RTE task, in a non-zero-shot setting, are...
In line with the general trend in artificial intelligence research to create intelligent systems that combine learning and symbolic components, a new sub-area has emerged that focuses on combining machine learning (ML) components with techniques developed by the Semantic Web (SW) community - Semantic Web Machine Learning (SWeML for short). Due to i...
Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might di...
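A naive baseline for the problem described above, assuming tables are loaded as pandas DataFrames, is to canonicalise column and row order before comparing, so that reordered copies of the same data compare equal; this is only an illustrative baseline, not the detection method of the paper.

```python
# Naive duplicate-table check: two tables count as duplicates if they contain
# the same data once column order and row order are canonicalised.
# This baseline ignores renamed columns and is only for illustration.
import pandas as pd

def canonicalise(df: pd.DataFrame) -> pd.DataFrame:
    df = df.reindex(sorted(df.columns), axis=1)                  # fix column order
    df = df.sort_values(by=list(df.columns), kind="mergesort")   # fix row order
    return df.reset_index(drop=True)

def is_duplicate(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    if set(a.columns) != set(b.columns) or len(a) != len(b):
        return False
    return canonicalise(a).equals(canonicalise(b))

t1 = pd.DataFrame({"city": ["Berlin", "Hannover"], "pop": [3_600_000, 545_000]})
t2 = pd.DataFrame({"pop": [545_000, 3_600_000], "city": ["Hannover", "Berlin"]})
print(is_duplicate(t1, t2))  # True: same data, different row/column order
```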
Recent events such as wars, sanctions, pandemics, and climate change have shown the importance of proper supply network management. A key step in managing supply networks is procurement. We present an approach for realizing a next-generation procurement workspace that aims to facilitate resilience and sustainability. To achieve this, the approach e...
The Open Research Knowledge Graph is an infrastructure for the production, curation, publication and use of FAIR scientific information. Its mission is to shape a future scholarly publishing and communication where the contents of scholarly articles are FAIR research data.
In the last decade, a large number of knowledge graph (KG) completion approaches were proposed. Albeit effective, these efforts are disjoint, and their collective strengths and weaknesses in effective KG completion have not been studied in the literature. We extend Plumber, a framework that brings together the research community’s disjoint efforts...