Michael Faerber

Technische Universität Dresden | TUD

Professor
Leading the research group "Scalable Software Architectures for Data Analytics" at ScaDS.AI | TU Dresden


133 Publications
20,828 Reads
1,340 Citations
Introduction
My research lies at the intersection of knowledge representation, machine learning, and natural language processing. Among other things, I pursue research on scholarly data mining (e.g., scientific impact quantification), scholarly recommender systems (e.g., citation recommendation), and scholarly knowledge graphs (e.g., modeling papers, methods, and datasets). Furthermore, I develop AI solutions for peace mediation.


Publications (133)
Conference Paper
Full-text available
Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact o...
Preprint
Full-text available
The growing use of Machine Learning and Artificial Intelligence (AI), particularly Large Language Models (LLMs) like OpenAI's GPT series, leads to disruptive changes across organizations. At the same time, there is a growing concern about how organizations handle personal data. Thus, privacy policies are essential for transparency in data processin...
Preprint
Full-text available
Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact o...
Conference Paper
Full-text available
Causal discovery aims to estimate causal structures among variables based on observational data. Large Language Models (LLMs) offer a fresh perspective to tackle the causal discovery problem by reasoning on the metadata associated with variables rather than their actual data values, an approach referred to as knowledge-based causal discovery. In th...
Conference Paper
Full-text available
In this paper, we introduce AutoRDF2GML, a framework designed to convert RDF data into data representations tailored for graph machine learning tasks. AutoRDF2GML enables, for the first time, the creation of both content-based features, i.e., features based on RDF datatype properties, and topology-based features, i.e., features based on RDF object pro...
Preprint
Full-text available
In this paper, we introduce AutoRDF2GML, a framework designed to convert RDF data into data representations tailored for graph machine learning tasks. AutoRDF2GML enables, for the first time, the creation of both content-based features -- i.e., features based on RDF datatype properties -- and topology-based features -- i.e., features based on RDF o...
Preprint
Full-text available
Causal discovery aims to estimate causal structures among variables based on observational data. Large Language Models (LLMs) offer a fresh perspective to tackle the causal discovery problem by reasoning on the metadata associated with variables rather than their actual data values, an approach referred to as knowledge-based causal discovery. In th...
Article
Full-text available
Recently, extensive research has been conducted to explore the utilization of machine learning (ML) algorithms in various direct-detected and (self)-coherent short-reach communication applications. These applications encompass a wide range of tasks, including bandwidth request prediction, signal quality monitoring, fault detection, traffic predicti...
Conference Paper
Full-text available
Pretrained Language Models (PLMs) benefit from external knowledge stored in graph structures for various downstream tasks. However, bridging the modality gap between graph structures and text remains a significant challenge. Traditional methods like linearizing graphs for PLMs lose vital graph connectivity, whereas Graph Neural Networks (GNNs) requ...
Preprint
Full-text available
We introduce ComplexTempQA, a large-scale dataset consisting of over 100 million question-answer pairs designed to tackle the challenges in temporal question answering. ComplexTempQA significantly surpasses existing benchmarks like HOTPOTQA, TORQUE, and TEQUILA in scale and scope. Utilizing data from Wikipedia and Wikidata, the dataset covers questi...
Preprint
Full-text available
In recent years, extensive research has been conducted to explore the utilization of machine learning algorithms in various direct-detected and self-coherent short-reach communication applications. These applications encompass a wide range of tasks, including bandwidth request prediction, signal quality monitoring, fault detection, traffic predicti...
Article
Full-text available
An ML-supported diagnostics concept is introduced and demonstrated to detect and classify events on OTDR traces for application on a PON optical distribution network. We can also associate events with ODN branches by using deployment data of the PON. We analyze an ensemble classifier and neural networks, the usage of synthetic OTDR-like traces, and...
Chapter
Full-text available
Automatic extraction of information from publications is key to making scientific knowledge machine-readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this pape...
Article
Full-text available
Co-authored publications can bring positive results for those who participate, such as gaining additional expertise, accessing more funding, or increasing the publication impact. China, the European Union, and the United States have been collaborating with each other over the years in the field of Computer Science. These collaborations vari...
Conference Paper
A frequency-calibrated SCINet (FC-SCINet) equalizer is proposed for downstream 100G PON with 28.7 dB path loss. At 5 km, FC-SCINet improves the BER by 88.87% compared to FFE and a 3-layer DNN with 10.57% lower complexity.
Conference Paper
Full-text available
Automatic extraction of information from publications is key to making scientific knowledge machine-readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this pape...
Conference Paper
Full-text available
Named entity recognition and disambiguation, often referred to as entity linking, is the task of automatically identifying knowledge graph entities in text documents. While a variety of entity linking systems based on very different approaches exist, these systems implicitly share certain processing steps in their pipeline. Despite t...
Article
Full-text available
Artificial Intelligence (AI) is one of the most momentous technologies of our time. Thus, it is of major importance to know which stakeholders influence AI research. Besides researchers at universities and colleges, researchers in companies have hardly been considered in this context. In this article, we consider how the influence of companies on A...
Conference Paper
Full-text available
In this paper, we introduce Linked Papers With Code (LPWC), an RDF knowledge graph that provides comprehensive, current information about almost 400,000 machine learning publications. This includes the tasks addressed, the datasets utilized, the methods implemented, and the evaluations conducted, along with their results. Compared to its non-RDF-ba...
Chapter
Full-text available
We present SemOpenAlex, an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF...
Preprint
Full-text available
Commit messages (CMs) are an essential part of version control. By providing important context in regard to what has changed and why, they strongly support software maintenance and evolution. But writing good CMs is difficult and often neglected by developers. So far, there is no tool suitable for practice that automatically assesses how well a CM...
Conference Paper
Full-text available
In this paper, we propose VOCAB-EXPANDER at https://vocab-expander.com, an on-line tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest rela...
Conference Paper
Full-text available
We present SemOpenAlex, an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF d...
Preprint
Full-text available
In this paper, we propose Vocab-Expander at https://vocab-expander.com, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest relat...
Preprint
Full-text available
Determining and measuring diversity in news articles is important for a number of reasons, including preventing filter bubbles and fueling public discourse, especially before elections. So far, the identification and analysis of diversity have been illuminated in a variety of ways, such as measuring the overlap of words or topics between news artic...
Preprint
Full-text available
We present SemOpenAlex, an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF d...
Preprint
Full-text available
Large language models (LLMs) have been widely employed for graph-to-text generation tasks. However, the process of finetuning LLMs requires significant training resources and annotation work. In this paper, we explore the capability of generative models to generate descriptive text from graph data in a zero-shot setting. Specifically, we evaluate G...
Conference Paper
Full-text available
At times when answers to user questions are readily and easily available (at essentially zero cost), it is important for humans to maintain their knowledge and strong reasoning capabilities. We believe that in many cases providing hints rather than final answers should be sufficient and beneficial for users as it requires thinking and stimulates le...
Conference Paper
Full-text available
Commit messages (CMs) are an essential part of version control. By providing important context in regard to what has changed and why, they strongly support software maintenance and evolution. But writing good CMs is difficult and often neglected by developers. So far, there is no tool suitable for practice that automatically assesses how well a CM...
Conference Paper
Full-text available
Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Especially data sets derived from publications' full text have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time cover...
Preprint
Full-text available
In the wake of information overload in academia, methodologies and systems for search, recommendation, and prediction to aid researchers in identifying relevant research are actively studied and developed. Existing work, however, is limited in terms of granularity, focusing only on the level of papers or a single type of artifact, such as data sets...
Preprint
Full-text available
Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Especially data sets derived from publications' full text have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time cover...
Article
Full-text available
With the remarkable increase in the number of scientific entities such as publications, researchers, and scientific topics, and the associated information overload in science, academic recommender systems have become increasingly important for millions of researchers and science enthusiasts. However, it is often overlooked that these systems are su...
Chapter
Full-text available
While several approaches have been proposed for estimating the readability of English texts, there is much less work for other languages. In this paper, we present an online service, available at https://readability-check.org/, that provides five well-established statistical methods and two machine learning models for measuring the readability of t...
Preprint
Full-text available
With the remarkable increase in the number of scientific entities such as publications, researchers, and scientific topics, and the associated information overload in science, academic recommender systems have become increasingly important for millions of researchers and science enthusiasts. However, it is often overlooked that these systems are su...
Preprint
Full-text available
Autonomous Driving (AD), the area of robotics with the greatest potential impact on society, has gained a lot of momentum in the last decade. As a result of this, the number of datasets in AD has increased rapidly. Creators and users of datasets can benefit from a better understanding of developments in the field. While scientometric analysis has b...
Preprint
Full-text available
Environmental, social and governance (ESG) engagement of companies has moved into the focus of public attention over recent years. With the requirements of compulsory reporting being implemented and investors incorporating sustainability in their investment decisions, the demand for transparent and reliable ESG ratings is increasing. However, automatic...
Conference Paper
Full-text available
Modeling AI systems' characteristics of energy consumption and their sustainability level as an extension of the FAIR data principles has been considered only rudimentarily. In this paper, we propose the Green AI Ontology for modeling the energy consumption and other environmental aspects of AI models. We evaluate our ontology based on competency q...
Conference Paper
Full-text available
Convolutional neural networks (CNNs) have achieved astonishing performance on various image classification tasks, but it is difficult for humans to understand how a classification comes about. Recent literature proposes methods to explain the classification process to humans. These focus mostly on visualizing feature maps and filter weights, which...
Conference Paper
Full-text available
Citing is an important aspect of scientific discourse and important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bi...
Article
Full-text available
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-...
Conference Paper
Full-text available
Analyses and applications based on bibliographic references are of ever increasing importance. However, reference linking methods described in the literature are only able to link around half of the references in papers. To improve the quality of reference linking in large scholarly data sets, we propose a blocking-based reference linking approach...
Preprint
Full-text available
Citing is an important aspect of scientific discourse and important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bi...
Preprint
Full-text available
We present FREDo, a few-shot document-level relation extraction (FSDLRE) benchmark. As opposed to existing benchmarks which are built on sentence-level relation extraction corpora, we argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions. Therefore, we propose a set of FSDLRE tasks and...
Conference Paper
Full-text available
The choice of databases containing publications' metadata (i.e., bibliographic databases) determines the available publication list of any author and, thus, their public appearance and evaluation. Having all publications listed in the various bibliographic databases is therefore important for researchers. However, the average number of publications...
Preprint
Full-text available
In this paper, we present an end-to-end joint entity and relation extraction approach based on transformer-based language models. We apply the model to the task of linking mathematical symbols to their descriptions in LaTeX documents. In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporat...
Article
Full-text available
Although several large knowledge graphs have been proposed in the scholarly field, such graphs are limited with respect to several data quality dimensions such as accuracy and coverage. In this article, we present methods for enhancing the Microsoft Academic Knowledge Graph (MAKG), a recently published large-scale knowledge graph containing metadat...
Conference Paper
Full-text available
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature...
Conference Paper
Full-text available
Considering the increasing rate of scientific papers published in recent years, for researchers throughout all disciplines it has become a challenge to keep track of which latest scientific methods are suitable for which applications. In particular, an unmanageable number of neural network architectures have been published. In this paper, we propose...
Conference Paper
Full-text available
When used in real-world environments, agents must meet high safety requirements as errors have direct consequences. Besides the safety aspect, the explainability of the systems is of particular importance. Therefore, not only should errors be avoided during the learning process, but also the decision process should be made transparent. Existing app...
Preprint
Full-text available
One of the main challenges of startups is to raise capital from investors. For startup founders, it is therefore crucial to know whether investors have a bias against women as startup founders and in which way startups face disadvantages due to gender bias. Existing works on gender studies have mainly analyzed the US market. In this paper, we aim t...
Preprint
Full-text available
Argument search aims at identifying arguments in natural language texts. In the past, this task has been addressed by a combination of keyword search and argument identification on the sentence- or document-level. However, existing frameworks often address only specific components of argument search and do not address the following aspects: (1) arg...
Preprint
Full-text available
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-...
Article
Full-text available
Several scholarly knowledge graphs have been proposed to model and analyze the academic landscape. However, although the number of data sets has increased remarkably in recent years, these knowledge graphs do not primarily focus on data sets but rather associated entities such as publications. Moreover, publicly available data set knowledge graphs...
Preprint
Full-text available
Convolutional neural networks (CNNs) have achieved astonishing performance on various image classification tasks, but it is difficult for humans to understand how a classification comes about. Recent literature proposes methods to explain the classification process to humans. These focus mostly on visualizing feature maps and filter weights, which...
Chapter
Full-text available
Neural networks are a popular tool in e-commerce, in particular for product recommendations. To build reliable recommender systems, it is crucial to understand how exactly recommendations come about. Unfortunately, neural networks work as black boxes that do not provide explanations of how the recommendations are made.
Chapter
Full-text available
In this chapter, we consider the theoretical foundations for representing knowledge in the Internet of Things context. Specifically, we consider (1) the model-theoretic semantics (i.e., extensional semantics), (2) the possible-world semantics (i.e., intensional semantics), (3) the situation semantics, and (4) the cognitive/distributional semanti...
Chapter
Full-text available
While the path in the field of Entity Linking (EL) has been long and brought forth a plethora of approaches over the years, many of these are exceedingly difficult to execute for purposes of detailed analysis. In many cases, implementations are available, but far from being a plug-and-play experience. We present Combining Linking Techniques (CLiT),...
Chapter
Full-text available
This paper deals with the question of how artificial intelligence can be used to detect media bias in the overarching topic of manipulation and mood-making. We show three fields of action that result from using machine learning to analyze media bias: the evaluation principles of media bias, the information presentation of media bias, and the trans...
Conference Paper
Full-text available
Finding suitable citations for scientific publications can be challenging and time-consuming. To this end, context-aware citation recommendation approaches that recommend publications as candidates for in-text citations have been developed. In this paper, we present C-Rex, a web-based demonstration system available at http://c-rex.org for context-a...
Chapter
Full-text available
Many archival collections have been recently digitized and made available to a wide public. The contained documents however tend to have limited attractiveness for ordinary users, since content may appear obsolete and uninteresting. Archival document collections can become more attractive for users if suitable content can be recommended to them. Th...
Chapter
Full-text available
The effectiveness of Convolutional Neural Networks (CNNs) in classifying image data has been thoroughly demonstrated. In order to explain the classification to humans, methods for visualizing classification evidence have been developed in recent years. These explanations reveal that sometimes images are classified correctly, but for the wrong reaso...
Conference Paper
Full-text available
With the development of autonomous vehicles, recent research focuses on semantically representing robotic proprioceptive and exteroceptive perceptions (i.e., perception of the own body and of an external world). Such semantic representation is queried by reasoning systems to achieve what we would refer to as machine awareness. This aligns with the...
Conference Paper
Full-text available
Although it has become common to assess publications and researchers by means of their citation count (e.g., using the h-index), measuring the impact of scientific methods and datasets (e.g., using an h-index for datasets) has been performed only to a limited extent. This is not surprising because the usage information of methods and datasets is ty...
Article
Full-text available
Temporal collections of news articles (or news archives) contain numerous accurate and time-aligned articles, which offer immense value to our society, helping users to know details of events that occurred at specific time points in the past. Currently, the access to such collections is rather difficult for average users due to their large sizes an...
Chapter
Full-text available
Research on neural networks has gained significant momentum over the past few years. Because training is a resource-intensive process and training data cannot always be made available to everyone, there has been a trend to reuse pre-trained neural networks. As such, neural networks themselves have become research data. In this paper, we first prese...
Article
Full-text available
Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recen...
Chapter
Full-text available
Citation data is an important source of insight into the scholarly discourse and the reception of publications. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of citation data. One particular shortcoming of scholarly data nowadays is language coverage. That is, no...
Preprint
Full-text available
Research on neural networks has gained significant momentum over the past few years. Because training is a resource-intensive process and training data cannot always be made available to everyone, there has been a trend to reuse pre-trained neural networks. As such, neural networks themselves have become research data. In this paper, we first prese...
Chapter
Full-text available
In this paper, we introduce Aware, a knowledge-enabled framework for robots’ situational awareness. It is designed to support autonomous logistics vehicles operating in automobile manufacturing plants. Aware comprises an ontology grounding robots’ observations, a knowledge reasoner, and a set of behavioral rules: The Aware ontology models data stre...