Michael Faerber

Karlsruhe Institute of Technology | KIT · Institute of Applied Informatics and Formal Description Methods

Deputy professor (W3)
Leading the research group "Web Science" at KIT. Interested in AI & data science (KR, NLP, ML)... and philosophy.

About

98 Publications
11,728 Reads
832 Citations
Citations since 2017
83 Research Items
808 Citations
Introduction
My research lies at the intersection of knowledge representation, machine learning, and natural language processing. Among other things, I pursue research on scholarly data mining (e.g., scientific impact quantification), scholarly recommender systems (e.g., citation recommendation), and scholarly knowledge graphs (e.g., modeling papers, methods, and datasets). Furthermore, I develop AI solutions for peace mediation.

Publications (98)
Preprint
Full-text available
In the wake of information overload in academia, methodologies and systems for search, recommendation, and prediction to aid researchers in identifying relevant research are actively studied and developed. Existing work, however, is limited in terms of granularity, focusing only on the level of papers or a single type of artifact, such as data sets...
Preprint
Full-text available
Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. In particular, data sets derived from publications' full texts have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time cover...
Article
Full-text available
With the remarkable increase in the number of scientific entities such as publications, researchers, and scientific topics, and the associated information overload in science, academic recommender systems have become increasingly important for millions of researchers and science enthusiasts. However, it is often overlooked that these systems are su...
Chapter
Full-text available
While several approaches have been proposed for estimating the readability of English texts, there is much less work for other languages. In this paper, we present an online service, available at https://readability-check.org/, that provides five well-established statistical methods and two machine learning models for measuring the readability of t...
Preprint
Full-text available
With the remarkable increase in the number of scientific entities such as publications, researchers, and scientific topics, and the associated information overload in science, academic recommender systems have become increasingly important for millions of researchers and science enthusiasts. However, it is often overlooked that these systems are su...
Preprint
Full-text available
Autonomous Driving (AD), the area of robotics with the greatest potential impact on society, has gained a lot of momentum in the last decade. As a result of this, the number of datasets in AD has increased rapidly. Creators and users of datasets can benefit from a better understanding of developments in the field. While scientometric analysis has b...
Preprint
Full-text available
Environmental, social and governance (ESG) engagement of companies has moved into the focus of public attention over recent years. With the requirements of compulsory reporting being implemented and investors incorporating sustainability in their investment decisions, the demand for transparent and reliable ESG ratings is increasing. However, automatic...
Conference Paper
Full-text available
Modeling AI systems' characteristics of energy consumption and their sustainability level as an extension of the FAIR data principles has been considered only rudimentarily. In this paper, we propose the Green AI Ontology for modeling the energy consumption and other environmental aspects of AI models. We evaluate our ontology based on competency q...
Conference Paper
Full-text available
Convolutional neural networks (CNNs) have achieved astonishing performance on various image classification tasks, but it is difficult for humans to understand how a classification comes about. Recent literature proposes methods to explain the classification process to humans. These focus mostly on visualizing feature maps and filter weights, which...
Conference Paper
Full-text available
Citing is an important aspect of scientific discourse and important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bi...
Article
Full-text available
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-...
Conference Paper
Full-text available
Analyses and applications based on bibliographic references are of ever increasing importance. However, reference linking methods described in the literature are only able to link around half of the references in papers. To improve the quality of reference linking in large scholarly data sets, we propose a blocking-based reference linking approach...
Preprint
Full-text available
Citing is an important aspect of scientific discourse and important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bi...
Preprint
Full-text available
We present FREDo, a few-shot document-level relation extraction (FSDLRE) benchmark. As opposed to existing benchmarks which are built on sentence-level relation extraction corpora, we argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions. Therefore, we propose a set of FSDLRE tasks and...
Conference Paper
Full-text available
The choice of databases containing publications' metadata (i.e., bibliographic databases) determines the available publication list of any author and, thus, their public appearance and evaluation. Having all publications listed in the various bibliographic databases is therefore important for researchers. However, the average number of publications...
Preprint
Full-text available
In this paper, we present an end-to-end joint entity and relation extraction approach based on transformer-based language models. We apply the model to the task of linking mathematical symbols to their descriptions in LaTeX documents. In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporat...
Article
Full-text available
Although several large knowledge graphs have been proposed in the scholarly field, such graphs are limited with respect to several data quality dimensions such as accuracy and coverage. In this article, we present methods for enhancing the Microsoft Academic Knowledge Graph (MAKG), a recently published large-scale knowledge graph containing metadat...
Conference Paper
Full-text available
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature...
Conference Paper
Full-text available
Considering the increasing rate of scientific papers published in recent years, it has become a challenge for researchers across all disciplines to keep track of which of the latest scientific methods are suitable for which applications. In particular, an unmanageable number of neural network architectures have been published. In this paper, we propose...
Conference Paper
Full-text available
When used in real-world environments, agents must meet high safety requirements as errors have direct consequences. Besides the safety aspect, the explainability of the systems is of particular importance. Therefore, not only should errors be avoided during the learning process, but also the decision process should be made transparent. Existing app...
Preprint
Full-text available
One of the main challenges of startups is to raise capital from investors. For startup founders, it is therefore crucial to know whether investors have a bias against women as startup founders and in which way startups face disadvantages due to gender bias. Existing works on gender studies have mainly analyzed the US market. In this paper, we aim t...
Preprint
Full-text available
Argument search aims at identifying arguments in natural language texts. In the past, this task has been addressed by a combination of keyword search and argument identification on the sentence- or document-level. However, existing frameworks often address only specific components of argument search and do not address the following aspects: (1) arg...
Preprint
Full-text available
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-...
Article
Full-text available
Several scholarly knowledge graphs have been proposed to model and analyze the academic landscape. However, although the number of data sets has increased remarkably in recent years, these knowledge graphs do not primarily focus on data sets but rather associated entities such as publications. Moreover, publicly available data set knowledge graphs...
Preprint
Full-text available
Convolutional neural networks (CNNs) have achieved astonishing performance on various image classification tasks, but it is difficult for humans to understand how a classification comes about. Recent literature proposes methods to explain the classification process to humans. These focus mostly on visualizing feature maps and filter weights, which...
Chapter
Full-text available
Neural networks are a popular tool in e-commerce, in particular for product recommendations. To build reliable recommender systems, it is crucial to understand how exactly recommendations come about. Unfortunately, neural networks work as black boxes that do not provide explanations of how the recommendations are made.
Chapter
Full-text available
In this chapter, we consider the theoretical foundations for representing knowledge in the Internet of Things context. Specifically, we consider (1) the model-theoretic semantics (i.e., extensional semantics), (2) the possible-world semantics (i.e., intensional semantics), (3) the situation semantics, and (4) the cognitive/distributional semanti...
Chapter
Full-text available
While the path in the field of Entity Linking (EL) has been long and brought forth a plethora of approaches over the years, many of these are exceedingly difficult to execute for purposes of detailed analysis. In many cases, implementations are available, but far from being a plug-and-play experience. We present Combining Linking Techniques (CLiT),...
Chapter
Full-text available
This paper deals with the question of how artificial intelligence can be used to detect media bias in the overarching topic of manipulation and mood-making. We show three fields of action that result from using machine learning to analyze media bias: the evaluation principles of media bias, the information presentation of media bias, and the trans...
Chapter
Full-text available
Many archival collections have recently been digitized and made available to a wide public. The contained documents, however, tend to have limited attractiveness for ordinary users, since content may appear obsolete and uninteresting. Archival document collections can become more attractive for users if suitable content can be recommended to them. Th...
Chapter
Full-text available
The effectiveness of Convolutional Neural Networks (CNNs) in classifying image data has been thoroughly demonstrated. In order to explain the classification to humans, methods for visualizing classification evidence have been developed in recent years. These explanations reveal that sometimes images are classified correctly, but for the wrong reaso...
Article
Full-text available
Temporal collections of news articles (or news archives) contain numerous accurate and time-aligned articles, which offer immense value to our society, helping users to know details of events that occurred at specific time points in the past. Currently, the access to such collections is rather difficult for average users due to their large sizes an...
Conference Paper
Full-text available
With the development of autonomous vehicles, recent research focuses on semantically representing robotic proprioceptive and exteroceptive perceptions (i.e., perception of one's own body and of an external world). Such semantic representation is queried by reasoning systems to achieve what we would refer to as machine awareness. This aligns with the...
Conference Paper
Full-text available
Although it has become common to assess publications and researchers by means of their citation count (e.g., using the h-index), measuring the impact of scientific methods and datasets (e.g., using an h-index for datasets) has been performed only to a limited extent. This is not surprising because the usage information of methods and datasets is ty...
Chapter
Full-text available
Research on neural networks has gained significant momentum over the past few years. Because training is a resource-intensive process and training data cannot always be made available to everyone, there has been a trend to reuse pre-trained neural networks. As such, neural networks themselves have become research data. In this paper, we first prese...
Article
Full-text available
Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recen...
Chapter
Full-text available
Citation data is an important source of insight into the scholarly discourse and the reception of publications. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of citation data. One particular shortcoming of scholarly data nowadays is language coverage. That is, no...
Preprint
Full-text available
Research on neural networks has gained significant momentum over the past few years. Because training is a resource-intensive process and training data cannot always be made available to everyone, there has been a trend to reuse pre-trained neural networks. As such, neural networks themselves have become research data. In this paper, we first prese...
Chapter
Full-text available
In this paper, we introduce Aware, a knowledge-enabled framework for robots’ situational awareness. It is designed to support autonomous logistics vehicles operating in automobile manufacturing plants. Aware comprises an ontology grounding robots’ observations, a knowledge reasoner, and a set of behavioral rules: The Aware ontology models data stre...
Preprint
Full-text available
The effectiveness of Convolutional Neural Networks (CNNs) in classifying image data has been thoroughly demonstrated. In order to explain the classification to humans, methods for visualizing classification evidence have been developed in recent years. These explanations reveal that sometimes images are classified correctly, but for the wrong reaso...
Chapter
Full-text available
Web hosting companies strive to provide customised customer services and want to know the commercial intent of a website. Whether a website is run by an individual person, a company, a non-profit organisation, or a public institution constitutes a great challenge in website classification as website content might be sparse. In this paper, we presen...
Conference Paper
Full-text available
Web hosting companies strive to provide customised customer services and want to know the commercial intent of a website. Whether a website is run by an individual person, a company, a non-profit organisation, or a public institution constitutes a great challenge in website classification as website content might be sparse. In this paper, we presen...
Conference Paper
Full-text available
The spread of biased news and its consumption by the readers has become a considerable issue. Researchers from multiple domains including social science and media studies have made efforts to mitigate this media bias issue. Specifically, various techniques ranging from natural language processing to machine learning have been used to help determine...
Chapter
Full-text available
New research is being published at a rate at which it is infeasible for many scholars to read and assess everything possibly relevant to their work. In pursuit of a remedy, efforts towards automated processing of publications, like semantic modelling of papers to facilitate their digital handling, and the development of information filtering syste...
Chapter
Full-text available
Long-term news article archives are valuable resources about our past, allowing people to know detailed information of events that occurred at specific time points. To make better use of such heritage collections, this work considers the task of large scale question answering on long-term news article archives. Questions on such archives are often...
Conference Paper
Full-text available
A major domain of research in natural language processing is named entity recognition and disambiguation (NERD). One of the main ways of attempting to achieve this goal is through the use of Semantic Web technologies and their structured data formats. Due to the nature of structured data, information can be extracted more easily, thereby allowing for t...
Article
Full-text available
In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios heavily depend on the used data set. However, existi...
Preprint
Full-text available
Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recen...
Preprint
Full-text available
Citation recommendation systems aim to recommend citations for either a complete paper or a small portion of text called a citation context. The process of recommending citations for citation contexts is called local citation recommendation and is the focus of this paper. In this paper, firstly, we develop citation recommendation approaches based o...
Conference Paper
Full-text available
Context-aware citation recommendation is used to overcome the process of manually searching for relevant citations by automatically recommending suitable papers as citations for a specified input text. In this paper, we examine the reproducibility of a state-of-the-art approach to context-aware citation recommendation, namely the neural citation ne...
Chapter
Full-text available
In this paper, we present the Microsoft Academic Knowledge Graph (MAKG), a large RDF data set with over eight billion triples with information about scientific publications and related entities, such as authors, institutions, journals, and fields of study. The data set is licensed under the Open Data Commons Attribution License (ODC-By). By providi...
Chapter
Full-text available
Citations have been classified based on their textual contexts w.r.t. their worthiness, function, polarity, and importance. To the best of our knowledge, so far citations have not automatically been classified by their grammatical role, that is, whether the citation (1) is grammatically integrated in the sentence, (2) is annotated directly after th...
Preprint
Full-text available
Crunchbase is an online platform collecting information about startups and technology companies, including attributes and relations of companies, people, and investments. Data contained in Crunchbase is, to a large extent, not available elsewhere, making Crunchbase a unique data source. In this paper, we present how to bring Crunchbase to the We...
Conference Paper
Full-text available
To benefit from mature database technology, RDF stores are built on top of relational databases, and SPARQL queries are mapped into SQL. Using a shared-nothing computer cluster is a way to achieve scalability by carrying out query processing on top of large RDF datasets in a distributed fashion. To this end, the current paper elaborates on the impa...
Chapter
Full-text available
In this paper, we present a system that allows researchers to search for papers and in-text citations in a novel way. Specifically, our system allows users to search for the textual contexts in which publications are cited (so-called citation contexts), given either the cited paper’s title or the cited paper’s author name. To better assess the cita...
Conference Paper
Full-text available
Science evolves very rapidly, and researchers have studied the evolution of coarse-grained research topics. However, to our knowledge, no analysis of the temporal trends of fine-grained scientific concepts has been performed based on papers' full texts. For this paper, we extract noun phrases as concepts from all computer science papers of arXiv.or...
Conference Paper
Full-text available
In recent years, several research paper-based tasks, such as paper recommendation, and citation-based tasks, such as citation recommendation and citation context-based document summarization, have been proposed. The evaluations of approaches to such tasks and their applicability in real-world scenarios heavily depend on the used data set. However,...
Conference Paper
Full-text available
In this paper, we present an approach for identifying Twitter bots based on their written tweets using a convolutional neural network. We experiment with various embedding methods (pretrained and trained on the training dataset) and convolutional neural network architectures and compare their performance. When evaluating our best performing approac...
Conference Paper
Full-text available
Citation data of scientific publications are essential for different purposes, such as evaluating research and building digital library collections. In this paper, we analyze to what extent citation data of publications are openly available, using the intersection of the Crossref metadata and unpaywall snapshot as the publication dataset and the COC...
Chapter
Full-text available
Recently, many archival news article collections have been made available to the wide public. However, such collections are typically large, making it difficult for users to find content they would be interested in. Furthermore, archived news articles tend to be perceived by ordinary users as having rather weak attractiveness and being obsolete or unin...
Preprint
Full-text available
In recent years, DBpedia, Freebase, OpenCyc, Wikidata, and YAGO have been published as noteworthy large, cross-domain, and freely available knowledge graphs. Although extensively in use, these knowledge graphs are hard to compare against each other in a given setting. Thus, it is a challenge for researchers and developers to pick the best knowledge...
Conference Paper
Full-text available
Analyzing and recommending citations within their specific citation contexts has recently received much attention due to the growing number of available publications. Although data sets such as CiteSeerX have been created for evaluating approaches for such tasks, those data sets exhibit striking defects. This is understandable when one considers th...
Conference Paper
Full-text available
Recommending citations for scientific texts and other texts such as news articles has recently attracted a considerable amount of attention. However, typically, the existing approaches for citation recommendation do not explicitly incorporate the question of whether a given context (e.g., a sentence), for which citations are to be recommended, actual...
Article
Full-text available
The rapidly growing size of RDF graphs in recent years necessitates distributed storage and parallel processing strategies. To obtain efficient query processing using computer clusters, a wide variety of different approaches have been proposed. Related to the approach presented in the current paper are systems built on top of Hadoop HDFS, for exampl...
Conference Paper
Full-text available
Freely available large knowledge graphs, such as DBpedia, Wikidata, and YAGO, generally provide a very solid representation of general knowledge, making them a good basis for text annotation. However, when it comes to annotating domain-specific text documents, these knowledge graphs need to be used with care. Moreover, publications describing rea...