Michael FaerberTechnische Universität Dresden | TUD
Michael Faerber
Professor
Leading the research group "Scalable Software Architectures for Data Analytics" at ScaDS.AI | TU Dresden
About
133
Publications
20,828
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,340
Citations
Introduction
My research lies at the intersection of
knowledge representation,
machine learning, and
natural language processing.
Among other things, I pursue research on scholarly data mining (e.g., scientific impact quantification), scholarly recommender systems (e.g., citation recommendation), and scholarly knowledge graphs (e.g., modeling papers, methods, and datasets). Furthermore, I develop AI solutions for peace mediation.
Publications
Publications (133)
Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact o...
The growing use of Machine Learning and Artificial Intelligence (AI), particularly Large Language Models (LLMs) like OpenAI's GPT series, leads to disruptive changes across organizations. At the same time, there is a growing concern about how organizations handle personal data. Thus, privacy policies are essential for transparency in data processin...
Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact o...
Causal discovery aims to estimate causal structures among variables based on observational data. Large Language Models (LLMs) offer a fresh perspective to tackle the causal discovery problem by reasoning on the metadata associated with variables rather than their actual data values, an approach referred to as knowledge-based causal discovery. In th...
In this paper, we introduce AutoRDF2GML, a framework designed to convert RDF data into data representations tailored for graph machine learning tasks. AutoRDF2GML enables, for the first time, the creation of both content-based features-i.e., features based on RDF datatype properties-and topology-based features-i.e., features based on RDF object pro...
In this paper, we introduce AutoRDF2GML, a framework designed to convert RDF data into data representations tailored for graph machine learning tasks. AutoRDF2GML enables, for the first time, the creation of both content-based features -- i.e., features based on RDF datatype properties -- and topology-based features -- i.e., features based on RDF o...
Causal discovery aims to estimate causal structures among variables based on observational data. Large Language Models (LLMs) offer a fresh perspective to tackle the causal discovery problem by reasoning on the metadata associated with variables rather than their actual data values, an approach referred to as knowledge-based causal discovery. In th...
Recently, extensive research has been conducted to explore the utilization of machine learning (ML) algorithms in various direct-detected and (self)-coherent short-reach communication applications. These applications encompass a wide range of tasks, including bandwidth request prediction, signal quality monitoring, fault detection, traffic predicti...
Pretrained Language Models (PLMs) benefit from external knowledge stored in graph structures for various downstream tasks. However, bridging the modality gap between graph structures and text remains a significant challenge. Traditional methods like linearizing graphs for PLMs lose vital graph connectivity, whereas Graph Neural Networks (GNNs) requ...
We introduce ComplexTempQA,a large-scale dataset consisting of over 100 million question-answer pairs designed to tackle the challenges in temporal question answering. ComplexTempQA significantly surpasses existing benchmarks like HOTPOTQA, TORQUE, and TEQUILA in scale and scope. Utilizing data from Wikipedia and Wikidata, the dataset covers questi...
In recent years, extensive research has been conducted to explore the utilization of machine learning algorithms in various direct-detected and self-coherent short-reach communication applications. These applications encompass a wide range of tasks, including bandwidth request prediction, signal quality monitoring, fault detection, traffic predicti...
An ML-supported diagnostics concept is introduced and demonstrated to detect and classify events on OTDR traces for application on a PON optical distribution network. We can also associate events with ODN branches by using deployment data of the PON. We analyze an ensemble classifier and neural networks, the usage of synthetic OTDR-like traces, and...
Automatic extraction of information from publications is key to making scientific knowledge machine-readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this pape...
Co-authored publications can bring positive results for those who participate, such as gaining additional expertise, accessing more funding or increasing the publication impact. China, the European Union, and the United States have been collaborating between each other throughout the years in the field of Computer Science. These collaborations vari...
A frequency-calibrated SCINet (FC-SCINet) equalizer is proposed for downstream 100G PON with 28.7 dB path loss. At 5 km, FC-SCINet improves the BER by 88.87% compared to FFE and a 3-layer DNN with 10.57% lower complexity.
Automatic extraction of information from publications is key to making scientific knowledge machine-readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this pape...
Named entity recognition and disambiguation, often referred to as entity linking systems, refers to the task of automatically identifying knowledge graph entities in text documents. While a variety of entity linking systems based on very different approaches exist, these systems implicitly share certain processing steps in their pipeline. Despite t...
Artificial Intelligence (AI) is one of the most momentous technologies of our time. Thus, it is of major importance to know which stakeholders influence AI research. Besides researchers at universities and colleges, researchers in companies have hardly been considered in this context. In this article, we consider how the influence of companies on A...
In this paper, we introduce Linked Papers With Code (LPWC), an RDF knowledge graph that provides comprehensive, current information about almost 400,000 machine learning publications. This includes the tasks addressed, the datasets utilized, the methods implemented, and the evaluations conducted, along with their results. Compared to its non-RDF-ba...
We present SemOpenAlex , an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF...
Commit messages (CMs) are an essential part of version control. By providing important context in regard to what has changed and why, they strongly support software maintenance and evolution. But writing good CMs is difficult and often neglected by developers. So far, there is no tool suitable for practice that automatically assesses how well a CM...
In this paper, we propose VOCAB-EXPANDER at https://vocab-expander.com, an on-line tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest rela...
We present SemOpenAlex, an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF d...
In this paper, we propose Vocab-Expander at https://vocab-expander.com, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest relat...
Determining and measuring diversity in news articles is important for a number of reasons, including preventing filter bubbles and fueling public discourse, especially before elections. So far, the identification and analysis of diversity have been illuminated in a variety of ways, such as measuring the overlap of words or topics between news artic...
We present SemOpenAlex, an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF d...
Large language models (LLMs) have been widely employed for graph-to-text generation tasks. However, the process of finetuning LLMs requires significant training resources and annotation work. In this paper, we explore the capability of generative models to generate descriptive text from graph data in a zero-shot setting. Specifically, we evaluate G...
At times when answers to user questions are readily and easily available (at essentially zero cost), it is important for humans to maintain their knowledge and strong reasoning capabilities. We believe that in many cases providing hints rather than final answers should be sufficient and beneficial for users as it requires thinking and stimulates le...
Commit messages (CMs) are an essential part of version control. By providing important context in regard to what has changed and why, they strongly support software maintenance and evolution. But writing good CMs is difficult and often neglected by developers. So far, there is no tool suitable for practice that automatically assesses how well a CM...
Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Especially data sets derived from publication’s full-text have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time cover...
In the wake of information overload in academia, methodologies and systems for search, recommendation, and prediction to aid researchers in identifying relevant research are actively studied and developed. Existing work, however, is limited in terms of granularity, focusing only on the level of papers or a single type of artifact, such as data sets...
Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Especially data sets derived from publication's full-text have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time cover...
With the remarkable increase in the number of scientific entities such as publications, researchers, and scientific topics, and the associated information overload in science, academic recommender systems have become increasingly important for millions of researchers and science enthusiasts. However, it is often overlooked that these systems are su...
While several approaches have been proposed for estimating the readability of English texts, there is much less work for other languages. In this paper, we present an online service, available at https://readability-check.org/, that provides five well-established statistical methods and two machine learning models for measuring the readability of t...
With the remarkable increase in the number of scientific entities such as publications, researchers, and scientific topics, and the associated information overload in science, academic recommender systems have become increasingly important for millions of researchers and science enthusiasts. However, it is often overlooked that these systems are su...
Autonomous Driving (AD), the area of robotics with the greatest potential impact on society, has gained a lot of momentum in the last decade. As a result of this, the number of datasets in AD has increased rapidly. Creators and users of datasets can benefit from a better understanding of developments in the field. While scientometric analysis has b...
Environmental, social and governance (ESG) engagement of companies moved into the focus of public attention over recent years. With the requirements of compulsory reporting being implemented and investors incorporating sustainability in their investment decisions, the demand for transparent and reliable ESG ratings is increasing. However, automatic...
Modeling AI systems' characteristics of energy consumption and their sustainability level as an extension of the FAIR data principles has been considered only rudimentarily. In this paper, we propose the Green AI Ontology for modeling the energy consumption and other environmental aspects of AI models. We evaluate our ontology based on competency q...
Convolutional neural networks (CNNs) have achieved astonishing performance on various image classification tasks, but it is difficult for humans to understand how a classification comes about. Recent literature proposes methods to explain the classification process to humans. These focus mostly on visualizing feature maps and filter weights, which...
Citing is an important aspect of scientific discourse and important for quantifying the scientific impact quantification of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bi...
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-...
Analyses and applications based on bibliographic references are of ever increasing importance. However, reference linking methods described in the literature are only able to link around half of the references in papers. To improve the quality of reference linking in large scholarly data sets, we propose a blocking-based reference linking approach...
Citing is an important aspect of scientific discourse and important for quantifying the scientific impact quantification of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bi...
We present FREDo, a few-shot document-level relation extraction (FSDLRE) benchmark. As opposed to existing benchmarks which are built on sentence-level relation extraction corpora, we argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions. Therefore, we propose a set of FSDLRE tasks and...
The choice of databases containing publications' metadata (i.e., bibliographic databases) determines the available publication list of any author and, thus, their public appearance and evaluation. Having all publications listed in the various bibliographic databases is therefore important for researchers. However, the average number of publications...
In this paper, we present an end-to-end joint entity and relation extraction approach based on transformer-based language models. We apply the model to the task of linking mathematical symbols to their descriptions in LaTeX documents. In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporat...
Although several large knowledge graphs have been proposed in the scholarly field, such graphs are limited with respect to several data quality dimensions such as accuracy and coverage. In this article, we present methods for enhancing the Microsoft Academic Knowledge Graph (MAKG), a recently published large-scale knowledge graph containing metadat...
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches , tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature...
Considering the increasing rate of scientific papers published in recent years, for researchers throughout all disciplines it has become a challenge to keep track of which latest scientific methods are suitable for which applications. In particular, an unmanageable amount of neural network architectures has been published. In this paper, we propose...
When used in real-world environments, agents must meet high safety requirements as errors have direct consequences. Besides the safety aspect, the explainability of the systems is of particular importance. Therefore, not only should errors be avoided during the learning process, but also the decision process should be made transparent. Existing app...
One of the main challenges of startups is to raise capital from investors. For startup founders, it is therefore crucial to know whether investors have a bias against women as startup founders and in which way startups face disadvantages due to gender bias. Existing works on gender studies have mainly analyzed the US market. In this paper, we aim t...
Argument search aims at identifying arguments in natural language texts. In the past, this task has been addressed by a combination of keyword search and argument identification on the sentence- or document-level. However, existing frameworks often address only specific components of argument search and do not address the following aspects: (1) arg...
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-...
Several scholarly knowledge graphs have been proposed to model and analyze the academic landscape. However, although the number of data sets has increased remarkably in recent years, these knowledge graphs do not primarily focus on data sets but rather associated entities such as publications. Moreover, publicly available data set knowledge graphs...
Convolutional neural networks (CNNs) have achieved astonishing performance on various image classification tasks, but it is difficult for humans to understand how a classification comes about. Recent literature proposes methods to explain the classification process to humans. These focus mostly on visualizing feature maps and filter weights, which...
Neural networks are a popular tool in e-commerce, in particular for product recommendations. To build reliable recommender systems, it is crucial to understand how exactly recommendations come about. Unfortunately, neural networks work as black boxes that do not provide explanations of how the recommendations are made.
In this chapter, we consider the theoretical foundations for representing knowledge in the Internet of Things context. Specifically, we consider (1) the model-theoretic semantics (i.e., extensional semantics ), (2) the possible-world semantics (i.e., intensional semantics ), (3) the situation semantics , and (4) the cognitive/distributional semanti...
While the path in the field of Entity Linking (EL) has been long and brought forth a plethora of approaches over the years, many of these are exceedingly difficult to execute for purposes of detailed analysis. In many cases, implementations are available, but far from being a plug-and-play experience. We present Combining Linking Techniques (CLiT),...
This paper deals with the question of how artificial intelligence can be used to detect media bias in the overarching topic of manipulation and mood-making. We show three fields of actions that result from using machine learning to analyze media bias: the evaluation principles of media bias, the information presentation of media bias, and the trans...
Finding suitable citations for scientific publications can be challenging and time-consuming. To this end, context-aware citation recommendation approaches that recommend publications as candidates for in-text citations have been developed. In this paper, we present C-Rex, a web-based demonstration system available at http://c-rex.org for context-a...
Many archival collections have been recently digitized and made available to a wide public. The contained documents however tend to have limited attractiveness for ordinary users, since content may appear obsolete and uninteresting. Archival document collections can become more attractive for users if suitable content can be recommended to them. Th...
The effectiveness of Convolutional Neural Networks (CNNs) in classifying image data has been thoroughly demonstrated. In order to explain the classification to humans, methods for visualizing classification evidence have been developed in recent years. These explanations reveal that sometimes images are classified correctly, but for the wrong reaso...
With the development of autonomous vehicles, recent research focuses on semantically representing robotic propri-oceptive and exteroceptive perceptions (i.e., perception of the own body and of an external world). Such semantic representation is queried by reasoning systems to achieve what we would refer to as machine awareness. This aligns with the...
Although it has become common to assess publications and researchers by means of their citation count (e.g., using the h-index), measuring the impact of scientific methods and datasets (e.g., using an h-index for datasets) has been performed only to a limited extent. This is not surprising because the usage information of methods and datasets is ty...
Temporal collections of news articles (or news archives) contain numerous accurate and time-aligned articles, which offer immense value to our society, helping users to know details of events that occurred at specific time points in the past. Currently, the access to such collections is rather difficult for average users due to their large sizes an...
Research on neural networks has gained significant momentum over the past few years. Because training is a resource-intensive process and training data cannot always be made available to everyone, there has been a trend to reuse pre-trained neural networks. As such, neural networks themselves have become research data. In this paper, we first prese...
Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recen...
Citation data is an important source of insight into the scholarly discourse and the reception of publications. Outcomes of citation analyses and the applicability of citation based machine learning approaches heavily depend on the completeness of citation data. One particular shortcoming of scholarly data nowadays is language coverage. That is, no...
Research on neural networks has gained significant momentum over the past few years. Because training is a resource-intensive process and training data cannot always be made available to everyone, there has been a trend to reuse pre-trained neural networks. As such, neural networks themselves have become research data. In this paper, we first prese...
In this paper, we introduce Aware, a knowledge-enabled framework for robots’ situational awareness. It is designed to support autonomous logistics vehicles operating in automobile manufacturing plants. Aware comprises an ontology grounding robots’ observations, a knowledge reasoner, and a set of behavioral rules: The Aware ontology models data stre...