Julian Risch

Julian Risch
deepset

Dr. rer. nat.
Senior Machine Learning Engineer @ deepset https://www.deepset.ai/papers

About

45
Publications
24,566
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
657
Citations
Citations since 2017
42 Research Items
654 Citations
2017201820192020202120222023050100150200
2017201820192020202120222023050100150200
2017201820192020202120222023050100150200
2017201820192020202120222023050100150200
Introduction
I finished my PhD studies in the field of information systems in December 2020 with a thesis on reader comments on online news platforms. My research focuses on natural language processing, information retrieval and deep learning.

Publications

Publications (45)
Conference Paper
Full-text available
We present the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments. This shared task comprises three binary classification subtasks with the goal to identify: toxic comments, engaging comments, and comments that include indications of a need for fact-checking, here referred to as fact-claiming comments. Bu...
Preprint
Full-text available
The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance...
Conference Paper
Full-text available
With the rise of research on toxic comment classification, more and more annotated datasets have been released. The wide variety of the task (different languages, different labeling processes and schemes) has led to a large amount of heterogeneous datasets that can be used for training and testing very specific settings. Despite recent efforts to c...
Article
Full-text available
Patent document collections are an immense source of knowledge for research and innovation communities worldwide. The rapid growth of the number of patent documents poses an enormous challenge for retrieving and analyzing information from this source in an effective manner. Based on deep learning methods for natural language processing, novel appro...
Thesis
Full-text available
Comment sections of online news platforms are an essential space to express opinions and discuss political topics. However, the misuse by spammers, haters, and trolls raises doubts about whether the benefits justify the costs of the time-consuming content moderation. As a consequence, many platforms limited or even shut down comment sections comple...
Preprint
Full-text available
Automatically estimating the complexity of texts for readers has a variety of applications, such as recommending texts with an appropriate complexity level to language learners or supporting the evaluation of text simplification approaches. In this paper, we present our submission to the Text Complexity DE Challenge 2022, a regression task where th...
Preprint
Full-text available
Open-domain extractive question answering works well on textual data by first retrieving candidate texts and then extracting the answer from those candidates. However, some questions cannot be answered by text alone but require information stored in tables. In this paper, we present an approach for retrieving both texts and tables relevant to a que...
Conference Paper
Full-text available
Patent examiners need to solve a complex information retrieval task when they assess the novelty and inventive step of claims made in a patent application. Given a claim, they search for prior art, which comprises all relevant publicly available information. This time-consuming task requires a deep understanding of the respective technical domain a...
Preprint
Full-text available
A major challenge of research on non-English machine reading for question answering (QA) is the lack of annotated datasets. In this paper, we present GermanQuAD, a dataset of 13,722 extractive question/answer pairs. To improve the reproducibility of the dataset creation approach and foster QA research on other languages, we summarize lessons learne...
Preprint
Full-text available
Patent examiners need to solve a complex information retrieval task when they assess the novelty and inventive step of claims made in a patent application. Given a claim, they search for prior art, which comprises all relevant publicly available information. This time-consuming task requires a deep understanding of the respective technical domain a...
Conference Paper
Full-text available
Many online news platforms provide comment sections for reader discussions below articles. While users of these platforms often read comments, only a minority of them regularly write comments. To encourage and foster more frequent engagement, we study the task of personalized recommendation of reader discussions to users. We present a neural networ...
Preprint
Full-text available
Many online news platforms provide comment sections for reader discussions below articles. While users of these platforms often read comments, only a minority of them regularly write comments. To encourage and foster more frequent engagement, we study the task of personalized recommendation of reader discussions to users. We present a neural networ...
Conference Paper
Full-text available
The comment sections of online news platforms are an important space to indulge in political conversations andto discuss opinions. Although primarily meant as forums where readers discuss amongst each other, they can also spark a dialog with the journalists who authored the article. A small but important fraction of comments address the journalists...
Article
Full-text available
Machine learning approaches have proven to be on or even above human-level accuracy for the task of offensive language detection. In contrast to human experts, however, they often lack the capability of giving explanations for their decisions. This article compares four different approaches to make offensive language detection explainable: an inter...
Conference Paper
Full-text available
Modern transformer-based models with hundreds of millions of parameters, such as BERT, achieve impressive results at text classification tasks. This also holds for aggression identification and offensive language detection, where they consistently outperform less complex models, such as decision trees. While the complex models fit training data wel...
Conference Paper
Full-text available
Hierarchical classification schemes are an effective and natural way to organize large document collections. However, complex schemes make the manual classification time-consuming and require domain experts. Current machine learning approaches for hierarchical classification do not exploit all the information contained in the hierarchical schemes....
Conference Paper
Full-text available
Many online discussion platforms use a content moderation process, where human moderators check user comments for offensive language and other rule violations. It is the moderator's decision which comments to remove from the platform because of violations and which ones to keep. Research so far focused on automating this decision process in the for...
Conference Paper
Full-text available
Comment sections below online news articles enjoy growing popularity among readers. However, the overwhelming number of comments makes it infeasible for the average news consumer to read all of them and hinders engaging discussions. Most platforms display comments in chronological order , which neglects that some of them are more relevant to users...
Preprint
Full-text available
Comment sections below online news articles enjoy growing popularity among readers. However, the overwhelming number of comments makes it infeasible for the average news consumer to read all of them and hinders engaging discussions. Most platforms display comments in chronological order, which neglects that some of them are more relevant to users a...
Chapter
Full-text available
Comment sections of online news platforms are an essential space to express opinions and discuss political topics. In contrast to other online posts, news discussions are related to particular news articles, comments refer to each other, and individual conversations emerge. However, the misuse by spammers, haters, and trolls makes costly content mo...
Preprint
Full-text available
Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such a...
Conference Paper
Full-text available
Pre-training language representations on large text corpora, for example, with BERT, has recently shown to achieve impressive performance at a variety of downstream NLP tasks. So far, applying BERT to offensive language identification for German-language texts failed due to the lack of pre-trained, German-language models. In this paper, we fine-tun...
Article
Full-text available
Accessible and reusable datasets are a necessity to accomplish repeatable research. This requirement poses a problem particularly for web science, since scraped data comes in various formats and can change due to the dynamic character of the web. Further, usage of web data is typically restricted by copyright-protection or privacy regulations, whic...
Article
Full-text available
Purpose Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. The purpose of this paper is to examine the novelty of an application it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial,...
Book
Design and Implementation of service-oriented architectures imposes a huge number of research questions from the fields of software engineering, system analysis and modeling, adaptability, and application integration. Component orientation and web services are two approaches for design and realization of complex web-based system. Both approaches al...
Conference Paper
Full-text available
Toxic comment classification has become an active research field with many recently proposed approaches. However, while these approaches address some of the task's challenges others still remain unsolved and directions for further research are needed. To this end, we compare different deep learning and shallow approaches on a new, large comment dat...
Preprint
Full-text available
Toxic comment classification has become an active research field with many recently proposed approaches. However, while these approaches address some of the task's challenges others still remain unsolved and directions for further research are needed. To this end, we compare different deep learning and shallow approaches on a new, large comment dat...
Conference Paper
Full-text available
Content-based recommendation of books and other media is usually based on semantic similarity measures. While metadata can be compared easily, measuring the semantic similarity of narrative literature is challenging. Keyword-based approaches are biased to retrieve books of the same series or do not retrieve any results at all in sparser libraries....
Conference Paper
Full-text available
Social media platforms receive massive amounts of user-generated content that may include offensive text messages. In the context of the GermEval task 2018, we propose an approach for fine-grained classification of offensive language. Our approach comprises a Naive Bayes classifier, a neural network, and a rule-based approach that categorize tweets...
Conference Paper
Full-text available
A patent examiner needs domain-specific knowledge to classify a patent application according to its field of invention. Standardized classification schemes help to compare a patent application to previously granted patents and thereby check its novelty. Due to the large volume of patents, automatic patent classification would be highly beneficial t...
Conference Paper
Full-text available
Comment sections of online news providers have enabled millions to share and discuss their opinions on news topics. Today, moderators ensure respectful and informative discussions by deleting not only insults, defamation, and hate speech, but also unverifiable facts. This process has to be transparent and comprehensive in order to keep the communit...
Conference Paper
Full-text available
Social media platforms allow users to share and discuss their opinions online. However, a minor- ity of user posts is aggressive, thereby hinders respectful discussion, and — at an extreme level — is liable to prosecution. The automatic identification of such harmful posts is important, be- cause it can support the costly manual moderation of onlin...
Conference Paper
Full-text available
The overwhelming success of the Web and mobile technologies has enabled millions to share their opinions publicly at any time. But the same success also endangers this freedom of speech due to closing down of participatory sites misused by individuals or interest groups. We propose to support manual moderation by proactively drawing the attention o...
Conference Paper
Full-text available
Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such a...
Conference Paper
Full-text available
On the Internet, criminal hackers frequently leak identity data on a massive scale. Subsequent criminal activities, such as identity theft and misuse, put Internet users at risk. Leak checker services enable users to check whether their personal data has been made public. However, automatic crawling and identification of leak data is error-prone fo...
Conference Paper
Full-text available
Massive Open Online Courses (MOOCs) have introduced a new form of education. With thousands of participants per course, lecturers are confronted with new challenges in the teaching process. In this paper , we describe how we conducted an introductory information retrieval course for participants from all ages and educational backgrounds. We analyze...
Conference Paper
Full-text available
Research results manifest in large corpora of patents and scientific papers. However, both corpora lack a consistent taxonomy and references across different document types are sparse. Therefore, and because of contrastive, domain-specific language, recommending similar papers for a given patent (or vice versa) is challenging.
Conference Paper
Full-text available
Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution space, so that even today's fastest FD discovery algorithms are limited to small datasets only, due...
Conference Paper
The main appeal of touch floors is that they are the only direct touch form factor that scales to arbitrary size, therefore allowing direct touch to scale to very large numbers of display objects. In this paper, however, we argue that the price for this benefit is bad physical ergonomics: prolonged standing, especially in combination with looking d...

Network

Cited By