Torsten Zesch's research while affiliated with FernUniversität in Hagen and other places

Publications (132)

Conference Paper
Full-text available
Automatically scoring student answers is an important task that is usually solved using instance-based supervised learning. Recently, similarity-based scoring has been proposed as an alternative approach yielding similar performance. It has hypothetical advantages such as a lower need for annotated training data and better zero-shot performance, bo...
Article
In this work, we describe the findings of the 'WisPerMed' team from their participation in Track 1 (Contextualized Medication Event Extraction) of the n2c2 2022 challenge. We tackle two tasks: (i) medication extraction, which involves extracting all mentions of medications from the clinical notes, and (ii) event classification, which involves class...
Preprint
Full-text available
Exploiting social media to spread hate has tremendously increased over the years. Lately, multi-modal hateful content such as memes has drawn relatively more traction than uni-modal content. Moreover, the availability of implicit content payloads makes them fairly challenging to be detected by existing hateful meme detection systems. In this paper,...
Article
Full-text available
In this article, we systematize the factors influencing performance and feasibility of automatic content scoring methods for short text responses. We argue that performance (i.e., how well an automatic system agrees with human judgments) mainly depends on the linguistic variance seen in the responses and that this variance is indirectly influenced...
Conference Paper
Full-text available
We propose a 'legal approach' to hate speech detection by operationalization of the decision as to whether a post is subject to criminal law into an NLP task. Comparing existing regulatory regimes for hate speech, we base our investigation on the European Union's framework as it provides a widely applicable legal minimum standard. Accurately decidi...
Conference Paper
Full-text available
Most existing spellcheckers have been developed for adults and it is yet understudied how well children’s texts can be automatically spellchecked, e.g. to build tools that assist them in spelling acquisition. This paper presents a detailed evaluation of six tools for automatic spelling correction on texts produced by German primary school children...
Conference Paper
Full-text available
While many methods for automatically scoring student writings have been proposed, few studies have inquired whether such scores constitute effective feedback improving learners’ writing quality. In this paper, we use an EFL email dataset annotated according to five analytic assessment criteria to train a classifier for each criterion, reaching huma...
Chapter
Even state-of-the-art neural approaches to handwriting recognition struggle when the handwriting is on ruled paper. We thus explore CNN-based methods to remove ruled lines and at the same time retain the parts of the writing overlapping with the ruled line. For that purpose, we devise a method to create a large synthetic dataset for training and ev...
Conference Paper
Full-text available
Spellchecking text written by language learners is especially challenging because errors made by learners differ both quantitatively and qualitatively from errors made by already proficient learners. We introduce LESPELL, a multilingual (English, German, Italian, and Czech) evaluation data set of spelling mistakes in context that we compiled from s...
Chapter
Personalization of handwriting recognition is still an understudied area due to the lack of a comprehensive dataset. We collect a dataset of 37,000 words handwritten by 40 writers that we make publicly available. We investigate the impact of personalization on recognition by training a baseline recognition model and retraining it using our dataset....
Preprint
Full-text available
When evaluating the performance of automatic speech recognition models, usually word error rate within a certain dataset is used. Special care must be taken in understanding the dataset in order to report realistic performance numbers. We argue that many performance numbers reported probably underestimate the expected error rate. We conduct experim...
Chapter
While in developed countries routine dental consultations are often covered by insurance, access to prophylactic dental examinations is often expensive in developing countries. Therefore, sufficient oral health prevention, particularly early caries detection, is not yet accessible to many people in these countries. This observation is, however, co...
Preprint
Full-text available
In this paper, we train Mozilla's DeepSpeech architecture on German and Swiss German speech datasets and compare the results of different training methods. We first train the models from scratch on both languages and then improve upon the results by using an English pretrained version of DeepSpeech for weight initialization and experiment with the...
Conference Paper
Full-text available
In this paper, we analyse the challenges of Chinese content scoring in comparison to English. As a review of prior work for Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers for five science-re...
Article
Full-text available
Semantic relatedness between words is a core concept in natural language processing. While countless approaches have been proposed, measuring which one works best is still a challenging task. Thus, in this article, we give a comprehensive overview of the evaluation protocols and datasets for semantic relatedness covering both intrinsic and extrinsi...
Article
Full-text available
To validly assess teachers’ pedagogical content knowledge (PCK), performance-based tasks with open-response formats are required. Automated scoring is considered an appropriate approach to reduce the resource-intensity of human scoring and to achieve more consistent scoring results than human raters. The focus is on the comparability of human and a...
Conference Paper
Full-text available
We describe our system participating in the SwissText/KONVENS shared task on low-resource speech-to-text (Plüss et al., 2020). We train an end-to-end neural model based on Mozilla DeepSpeech. We examine various methods to improve over the baseline results: transfer learning from standard German and English, data augmentation, and post-processing. O...
Preprint
Full-text available
We propose a 'legal approach' to hate speech detection by operationalization of the decision as to whether a post is subject to criminal law into an NLP task. Comparing existing regulatory regimes for hate speech, we base our investigation on the European Union's framework as it provides a widely applicable legal minimum standard. Accurately judgin...
Conference Paper
Full-text available
While automatic speech recognition is an important task, freely available models are rare, especially for languages other than English. In this paper, we describe the process of training German models based on the Mozilla DeepSpeech architecture using publicly available data. We compare the resulting models with other available speech recognition s...
Conference Paper
Full-text available
This paper describes LTL-UDE's systems for the SemEval 2019 Shared Task 6. We present results for Subtask A and C. In Subtask A, we experiment with an embedding representation of postings and use a Multi-Layer Perceptron and BERT to categorize postings. Our best result reaches the 10th place (out of 103) using BERT. In Subtask C, we applied a two-v...
Conference Paper
Full-text available
Advances in the automated detection of offensive Internet postings make this mechanism very attractive to social media companies, who are increasingly under pressure to monitor and action activity on their sites. However, these advances also have important implications as a threat to the fundamental right of free expression. In this article, we an...
Article
Full-text available
Automatic content scoring is an important application in the area of automatic educational assessment. Short texts written by learners are scored based on their content while spelling and grammar mistakes are usually ignored. The difficulty of automatically scoring such texts varies according to the variance within the learner answers. In this pape...
Conference Paper
Full-text available
We propose ESCRITO, a toolkit for scoring student writings using NLP techniques that addresses two main user groups: teachers and NLP researchers. Teachers can use a high-level API in the teacher mode to assemble scoring pipelines easily. NLP researchers can use the developer mode to access a low-level API, which not only makes available a number o...
Conference Paper
Full-text available
Understanding hate speech remains a significant challenge for both creating reliable datasets and automated hate speech detection. We hypothesize that being part of the targeted group or personally agreeing with an assertion substantially affects hate speech perception. To test these hypotheses, we create FEMHATE, a dataset containing 400 assertion...
Conference Paper
Full-text available
Stance on topics such as the death penalty is often expressed by discussing subordinated or related targets (e.g. costs of execution). This implicit way of communication is typically modelled by defining a set of explicit targets, which are related to the topic (e.g. death is irreversible). As these sets can be created in different ways, it remains...
Conference Paper
Full-text available
We present a corpus of political debates annotated with aspect-based sentiment and a corpus analysis. The source corpus consists of transcribed speeches taken from the two presidential debates of the 2016 US election. We annotate the corpus according to two different schemata and analyze their differences. We show that the choice of schema has a stron...
Chapter
Lexical recognition tests are frequently used to assess vocabulary knowledge. In such tests, learners need to differentiate between words and artificial nonwords that look much like real words. Our ultimate goal is to create high quality lexical recognition tests automatically which enables repetitive automated testing for different languages. This...
Conference Paper
Full-text available
Being able to predict whether people agree or disagree with an assertion (i.e. an explicit, self-contained statement) has several applications ranging from predicting how many people will like or dislike a social media post to classifying posts based on whether they are in accordance with a particular point of view. We formalize this as two NLP tas...
Conference Paper
Full-text available
Understanding public opinion on complex controversial issues such as 'Legalization of Marijuana' and 'Gun Rights' is of considerable importance for a number of objectives such as identifying the most divisive facets of the issue, developing a consensus, and making informed policy decisions. However, an individual's position on a controversial issue...
Chapter
Full-text available
We analyze whether implicitness affects human perception of hate speech. To do so, we use Tweets from an existing hate speech corpus and paraphrase them with rules to make the hate speech they contain more explicit. Comparing the judgment on the original and the paraphrased Tweets, our study indicates that implicitness is a factor in human and auto...
Article
Lexical recognition tests are widely used to assess vocabulary knowledge. We investigate the role that diacritics play in designing an Arabic lexical recognition test. We compare a non-diacritized and a diacritized test in a user study and find that they are largely comparable in their ability to assess vocabulary proficiency. However, we argue tha...
Conference Paper
Full-text available
Spelling errors occur frequently in educational settings, but their influence on automatic scoring is largely unknown. We therefore investigate the influence of spelling errors on content scoring performance using the example of the short answer data set of the Automated Student Assessment Prize (ASAP). We conduct an annotation study on the nature...
Conference Paper
Full-text available
Lexical recognition tests are widely used to assess vocabulary knowledge. We investigate the role that diacritics play in designing an Arabic lexical recognition test. We compare a non-diacritized and a diacritized test in a user study and find that they are largely comparable in their ability to assess vocabulary proficiency. However, we argue tha...
Conference Paper
Full-text available
We present our system LTL_UNI_DUE which participated in the shared task on automated stance detection in tweets on Catalan independence at IberEval 2017. In our system, we combine neural (LSTM) and non-neural (SVM) classifiers into a hybrid approach using a decision tree and heuristics.
Conference Paper
Full-text available
This paper describes the GermEval 2017 shared task on Aspect-Based Sentiment Analysis that consists of four subtasks: relevance, document-level sentiment polarity, aspect-level polarity and opinion target extraction. System performance is measured on two evaluation sets: one from the same time period as the training and development set, and a second...
Book
Full-text available
In the connected, modern world, customer feedback is a valuable source for insights on the quality of products or services. This feedback allows other customers to benefit from the experiences of others and enables businesses to react to requests, complaints or recommendations. However, the more people use a product or service, the more feedback is...
Conference Paper
Full-text available
A recent study by Plank et al. (2016) found that LSTM-based PoS taggers considerably improve over the current state-of-the-art when evaluated on the corpora of the Universal Dependencies project that use a coarse-grained tagset. We replicate this study using a fresh collection of 27 corpora of 21 languages that are annotated with fine-grained tagse...
Conference Paper
Full-text available
The lack of a sufficient amount of data tailored for a task is a well-recognized problem for many statistical NLP methods. In this paper, we explore whether data sparsity can be successfully tackled when classifying language proficiency levels in the domain of learner-written output texts. We aim at overcoming data sparsity by incorporating knowled...
Conference Paper
Full-text available
We propose a new approach to PoS tagging where in a first step, we assign a coarse-grained tag corresponding to the main syntactic category. Based on this high-precision decision, in the second step we utilize specially trained fine-grained models with heavily reduced decision complexity. By analyzing the system under oracle conditions, we show tha...
Conference Paper
Full-text available
We present a detailed description of our submission to the PoSTWITA shared task for PoS tagging of Italian social media text. We train a model based on FlexTag using only the provided training data and external resources like word clusters and a PoS dictionary which are built from publicly available Italian corpora. We find that this minim...
Conference Paper
Full-text available
Bundled gap filling exercises (Wojatzki et al., 2016) were recently introduced as a promising new exercise type to complement or even replace single gap-fill tasks. However, it is not yet confirmed that the applied creation method works properly and it is still to be investigated if bundled gap-fill tests are a suitable method for assessing languag...
Conference Paper
Full-text available
A major remaining challenge in argument mining is implicitness. We propose to model implicit argumentation using explicit stances and the overall stance of a debate. Our evaluation on a social media corpus shows that our model (i) can be reliably annotated even on noisy data and (ii) has the potential to improve the performance of automated argumen...
Conference Paper
Full-text available
We present FlexTag, a highly flexible PoS tagging framework. In contrast to monolithic implementations that can only be retrained but not adapted otherwise, FlexTag enables users to modify the feature space and the classification algorithm. We categorize existing PoS tagger implementations into one of three categories with regard to model-training...
Article
Full-text available
The Semantic Evaluation (SemEval) series of workshops focuses on the evaluation and comparison of systems that can analyze diverse semantic phenomena in text, with the aim of extending the current state of the art in semantic analysis and creating high-quality annotated datasets in a range of increasingly challenging problems in natural language se...
Conference Paper
Full-text available
We compare a comprehensive list of domain adaptation approaches for PoS tagging of social media data. We find that the most effective approach is based on clustering of unlabeled data. We also show that combining different approaches does not further improve performance. Thus, PoS tagging of social media data remains a challenging problem....
Conference Paper
Full-text available
Lexical recognition tests are frequently used for measuring language proficiency. In such tests, learners need to differentiate between words and artificial nonwords that look much like real words. Our goal is to automatically generate word-like nonwords which enables repeated automated testing. We compare different ranking strategies and find that o...
Conference Paper
Full-text available
We perform a comparison of 22 PoS tagger models for English and German offered by 9 different implementations. By evaluating on a mix of corpora from different domains, we simulate a black-box usage where researchers select a tagger (because of popularity, ease of use, etc.) and apply it to all sorts of text. We find the expected trade-off between...
Article
Full-text available
In this paper, we analyse the differences between L1 acquisition and L2 learning and identify four main aspects: input quality and quantity, mapping processes, cross-lingual influence, and reading experience. As a consequence of these differences, we conclude that L1 readability measures cannot be directly mapped to L2 readability. We propose to ca...
Conference Paper
Full-text available
In second language learning, cloze tests (also known as fill-in-the-blank tests) are frequently used for assessing the learning progress of students. While preparation effort for these tests is low, scoring needs to be done manually, as there usually is a huge number of correct solutions. In this paper, we examine whether the ambiguity of cloze ite...
Article
Full-text available
Language proficiency tests are used to evaluate and compare the progress of language learners. We present an approach for automatic difficulty prediction of C-tests that performs on par with human experts. On the basis of detailed analysis of newly collected data, we develop a model for C-test difficulty introducing four dimensions: solution diffic...
Conference Paper
In this paper, we investigate the difference between word and sense similarity measures and present means to convert a state-of-the-art word similarity measure into a sense similarity measure. In order to evaluate the new measure, we create a special sense similarity dataset and re-rate an existing word similarity dataset using two different sense...