
Alexander Mehler
Goethe-Universität Frankfurt am Main · Institut für Informatik
Full Professor
Current activity: formalizing and exploring multilayer, multiplex Twitter networks
About
266
Publications
39,964
Reads
1,854
Citations
Introduction
Additional affiliations
May 2010 - present
Publications (266)
Differential diagnosis aims at distinguishing between diseases causing similar symptoms. This is exemplified by epilepsies and dissociative disorders. Recently, it has been shown that linguistic features of physician-patient talks allow for differentiating between these two diseases. Since this method relies on trained linguists, it is not suitable...
We study the role of the second language in bilingual word embeddings in monolingual semantic evaluation tasks. We find strongly and weakly positive correlations between down-stream task performance and second language similarity to the target language. Additionally, we show how bilingual word embeddings can be employed for the task of semantic lan...
We consider two graph models of semantic change. The first is a time-series model that relates embedding vectors from one time period to embedding vectors of previous time periods. In the second, we construct one graph for each word: nodes in this graph correspond to time points and edge weights to the similarity of the word's meaning across two ti...
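The second graph model described above can be sketched in a few lines. The code below is a toy illustration under stated assumptions, not the authors' implementation: for a single word, it builds a complete graph whose nodes are time points and whose edge weights are the cosine similarity of the word's embedding vector at the two times. The vectors here are made up; real ones would come from embeddings trained per period and aligned across periods.

```python
# Toy sketch of the per-word graph model: nodes = time points,
# edge weights = cosine similarity of the word's embedding across two times.
from math import sqrt

def cosine(v, w):
    # cosine similarity of two equal-length vectors
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w)))

def word_time_graph(vectors):
    """vectors: dict mapping time point -> embedding vector (list of floats)."""
    times = sorted(vectors)
    return {(t1, t2): cosine(vectors[t1], vectors[t2])
            for i, t1 in enumerate(times) for t2 in times[i + 1:]}

# hypothetical embeddings of one word at three time points
vecs = {1990: [1.0, 0.0], 2000: [1.0, 1.0], 2010: [0.0, 1.0]}
edges = word_time_graph(vecs)
print(edges[(1990, 2010)])  # orthogonal vectors: similarity 0.0
```

A low weight on an edge between two time points would then indicate a meaning shift of the word between those periods.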
Transformer-based models are now predominant in NLP. They outperform approaches based on static models in many respects. This success has in turn prompted research that reveals a number of biases in the language models generated by transformers. In this paper we utilize this research on biases to investigate to what extent transformer-based languag...
Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a certain deficit of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliament Corpus (...
Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a certain deficit of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliamentary Corpu...
HeidelTime is one of the most widespread and successful tools for detecting temporal expressions in texts. Since HeidelTime's pattern matching system is based on regular expressions, it can be extended in a convenient way. We present such an extension for the German resources of HeidelTime: HeidelTimeExt. The extension has been brought about by mean...
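To illustrate regex-based detection of temporal expressions in the spirit of the abstract above: HeidelTime itself defines its patterns in its own resource-file format, not in Python, so the snippet below is merely an illustrative stand-in that matches a few German date expressions with a hand-written regular expression.

```python
# Illustrative only: a toy regular-expression matcher for German date
# expressions of the form "3. Oktober 1990". HeidelTime's actual German
# resources are pattern files, not Python code.
import re

MONTHS = (r"(Januar|Februar|März|April|Mai|Juni|Juli|August"
          r"|September|Oktober|November|Dezember)")
DATE = re.compile(r"\b(\d{1,2})\.\s*" + MONTHS + r"\s+(\d{4})\b")

text = "Die Sitzung fand am 3. Oktober 1990 statt."
match = DATE.search(text)
print(match.group(0))  # → '3. Oktober 1990'
```

Extending such a system means adding alternations or new patterns to the resource files, which is what makes the regex-based design convenient to extend.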
Although Critical Online Reasoning (COR) is often viewed as a general competency (e.g. Alexander et al. 2016), studies have found evidence supporting its domain-specificity (Toplak et al. 2002). To investigate this assumption, we focus on commonalities and differences in textual preferences in solving COR-related tasks between graduates/young pro...
The average geodesic distance L (Newman, 2003) and the compactness C_B (Botafogo, 1992) are important graph indices in applications of complex network theory to real-world problems. Here, for simple connected undirected graphs G of order n, we study the behavior of L(G) and C_B(G), subject to the condition that their order |V(G)| approache...
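The two indices named above can be computed directly from shortest-path lengths. The sketch below is a minimal stand-in, not the paper's code: it computes L(G) as the mean geodesic distance over ordered node pairs and a Botafogo-style compactness C_B(G), assuming the conversion constant K (the penalty for unreachable pairs) equals n, which is one common convention; the paper's exact normalization may differ.

```python
# Sketch: L(G) and a Botafogo-style C_B(G) for a simple connected
# undirected graph given as an adjacency list.
from collections import deque

def distances_from(graph, source):
    # breadth-first search: shortest-path lengths from source to all nodes
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def avg_geodesic_distance(graph):
    n = len(graph)
    total = sum(d for u in graph for d in distances_from(graph, u).values())
    return total / (n * (n - 1))  # mean over ordered pairs of distinct nodes

def compactness(graph):
    n = len(graph)
    K = n  # assumed conversion constant for unreachable pairs
    total = sum(d for u in graph for d in distances_from(graph, u).values())
    max_sum, min_sum = (n * n - n) * K, n * n - n
    return (max_sum - total) / (max_sum - min_sum)

path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # path graph 0-1-2-3
print(avg_geodesic_distance(path4))  # → 1.666... (= 5/3)
```

Compactness is 1 for a complete graph and approaches 0 as the graph becomes maximally sparse, which is what makes its limit behavior for growing n an interesting question.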
Circumstances surrounding the COVID-19 pandemic have serious implications for a multitude of areas of life. Alongside a deterioration in the health of a considerable number of people, this global crisis also shows that society – both civil and professional, regardless of the sector – is now facing new technological challenges. Furthermore, due...
As global trends are shifting towards data-driven industries, the demand for automated algorithms that can convert images of scanned documents into machine-readable information is rapidly growing. In addition to digitization, there is a push toward automating processes that used to require manual inspection of documents. Although optical chara...
SemioGraphs are multi-codal graphs whose vertices and edges are simultaneously mapped onto different systems (or codes) of types or labels. We provide a visualization technique that allows for interactively comparing word embeddings that are derived from different corpora and based on different techniques, and exemplify this technique in digital h...
Biodiversity information is contained in countless digitized and unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, there are still innumerable text sources that are poorly accessible and require a more advanced range of methods to extract relevant information. To improve the access to sema...
The aim of this paper is to evaluate the annotation tool used in the VoxML track of ISA-17, henceforth called the VoxML Annotation Environment. We describe our experiences with the modeling language VoxML, present our own annotation tool, called the TTLab VoxML Annotator, and work out qualitative criteria for evaluating both tools. In our view, the most importa...
We argue that mainly due to technical innovation in the landscape of annotation tools, a conceptual change in annotation models and processes is also on the horizon. It is diagnosed that these changes are bound up with multi-media and multi-perspective facilities of annotation tools, in particular when considering virtual reality (VR) and augmented...
As global trends are shifting towards data-driven industries, the demand for automated algorithms that can convert digital images of scanned documents into machine-readable information is rapidly growing. Besides the opportunity that data digitization offers for the application of data-analytic tools, there is also a massive improvement towards automation o...
The ongoing digitalization of educational resources and the use of the internet lead to a steady increase of potentially available learning media. However, many of the media which are used for educational purposes have not been designed specifically for teaching and learning. Usually, linguistic criteria of readability and comprehensibility as well...
We test the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted. Controlling for the size factor, we investigate this hypothesis for 25 subject areas. Since Wikipedia is a central part of the web-based information landscape, this indicates a language...
The storage of data in public repositories such as the Global Biodiversity Information Facility (GBIF) or the National Center for Biotechnology Information (NCBI) is nowadays stipulated in the policies of many publishers in order to facilitate data replication or proliferation. Species occurrence records contained in legacy printed literature are n...
In this article we present the Frankfurt Latin Lexicon (FLL), a lexical resource for Medieval Latin that is used both for the lemmatization of Latin texts and for the post-editing of lemmatizations. We describe recent advances in the development of lemmatizers and test them against the Capitularies corpus (comprising Frankish royal edicts, mid-6th...
Threshold concepts are key terms in domain-based knowledge acquisition. They are regarded as building blocks of the conceptual development of domain knowledge within particular learners. From a linguistic perspective, however, threshold concepts are instances of specialized vocabularies, exhibiting particular linguistic features. Threshold concepts...
The TextAnnotator is a tool for simultaneous and collaborative annotation of texts with visual annotation support, integration of knowledge bases and, by pipelining the TextImager, a rich variety of pre-processing and automatic annotation tools. It includes a variety of modules for the annotation of texts, which cover the annotation of argumenta...
The automatic generation of digital scenes from texts is a central task of computer science. This task requires a kind of text comprehension, the automation of which is tied to the availability of sufficiently large, diverse and deeply annotated data, which is freely available. This paper introduces Text2SceneVR, a system that addresses this bottlen...
In recent years, the usability of interfaces in the field of Virtual Realities (VR) has massively improved, so that theories and applications of multimodal data processing can now be tested more extensively. In this paper we present an extension of VAnnotatoR, which is a VR-based open hypermedia system that is used for annotating, visualizing and i...
Despite the great importance of the Latin language in the past, there are relatively few resources available today to develop modern NLP tools for this language. Therefore, the EvaLatin Shared Task for Lemmatization and Part-of-Speech (POS) tagging was organized as part of the LT4HALA workshop. In our work, we dealt with the second EvaLatin task, that is,...
The annotation of texts and other material in the field of digital humanities and Natural Language Processing (NLP) is a common task of research projects. At the same time, the annotation of corpora is certainly the most time- and cost-intensive component in research projects and often requires a high level of expertise according to the research in...
Are nearby places (e.g., cities) described by related words? In this article, we transfer this research question in the field of lexical encoding of geographic information onto the level of intertextuality. To this end, we explore Volunteered Geographic Information (VGI) to model texts addressing places at the level of cities or regions with the he...
Coreference resolution (CR) aims to find all spans of a text that refer to the same entity. The F1 scores on this task have been greatly improved by newly developed end-to-end approaches (Lee et al., 2017) and transformer networks (Joshi et al., 2019b). The inclusion of CR as a pre-processing step is expected to lead to improvements in downstream tasks...
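The task definition above can be made concrete with a toy example of what a CR system outputs: clusters of text spans (character offsets) that corefer. The spans below are hand-made for illustration, not produced by an end-to-end model.

```python
# Toy illustration of coreference-resolution output: each cluster groups
# (start, end) character spans of a text that refer to the same entity.
text = "Anna lost her keys. She found them later."
clusters = [
    [(0, 4), (10, 13), (20, 23)],   # Anna / her / She
    [(14, 18), (30, 34)],           # keys / them
]

def cluster_mentions(text, clusters):
    # resolve each span back to its surface string
    return [[text[s:e] for s, e in cluster] for cluster in clusters]

print(cluster_mentions(text, clusters))
# → [['Anna', 'her', 'She'], ['keys', 'them']]
```

A downstream component using CR as pre-processing would, for example, substitute each pronoun mention with the head mention of its cluster before further analysis.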
People's visual perception is highly developed, and it is therefore usually no problem for them to describe the space around them in words. Conversely, people also have no problem imagining a concept of a described space. In recent years, many efforts have been made to develop a linguistic scheme for spatial and spatial-temporal relations. However, t...
Current sentence boundary detectors, such as [2, 8], split documents into sequentially ordered sentences by detecting their beginnings and ends. Sentences, however, are more deeply structured even below the level of constituent and dependency structure: they can consist of a main clause and several subordinate clauses as well as further segments (e...
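The distinction drawn above can be illustrated with a deliberately crude two-stage splitter, which is not any of the cited systems: first detect sentence boundaries, then segment each sentence into sub-sentential parts. Real clause segmentation would of course need syntactic analysis rather than the comma heuristic used here.

```python
# Naive two-stage illustration: sentence boundary detection followed by a
# crude sub-sentence segmentation (comma-based, for demonstration only).
import re

def split_sentences(text):
    # split after sentence-final punctuation followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_segments(sentence):
    # crude segmentation at commas; a real system would use syntax
    return [seg.strip() for seg in sentence.split(",") if seg.strip()]

doc = "We left early, because it rained. The talk was good."
for sentence in split_sentences(doc):
    print(split_segments(sentence))
```

The point of the abstract is precisely that the second stage, segmenting below the sentence level, is underserved by current boundary detectors.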
About ten years ago, we were still able to justify the reservations of sociology with regard to digitisation as a healthy caution against the hype of the ‘virtual world’ and ‘cyberspace’; today, the situation looks different: beyond the usual rhetoric of media revolutions, new forms of practice, organisation and order have emerged around digital te...
The Specialized Information Service Biodiversity Research (BIOfid) has been launched to mobilize valuable biological data from printed literature that has been hidden in German libraries for the past 250 years. In this project, we annotate German texts converted by OCR from historical scientific literature on the biodiversity of plants, birds, moths and but...
We present VAnnotatoR, a framework for the multimodal reconstruction of historical events and spaces. The aim of VAnnotatoR, which is implemented in Unity3D, is to develop, implement and test a theory of processing, understanding (multimedia mining) and generating (text2scene) multimodal signs in a single environment that comprises Augmented Realit...
The recognition of pharmacological substances, compounds and proteins is an essential preliminary work for the recognition of relations between chemicals and other biomedically relevant units. In this paper, we describe an approach to Task 1 of the PharmaCoNER Challenge, which involves the recognition of mentions of chemicals and drugs in Spanish m...
Projects in the field of Natural Language Processing (NLP), the Digital Humanities (DH) and related disciplines dealing with machine learning of complex relationships between data objects need annotations to obtain sufficiently rich training and test sets. The visualization of such data sets and their underlying Human Computer Interaction (HCI) are...
We introduce a neural network-based system of Word Sense Disambiguation (WSD) for German that is based on SenseFitting, a novel method for optimizing WSD. We outperform knowledge-based WSD methods by up to 25% F1-score and produce a new state-of-the-art on the German sense-annotated dataset WebCAGe. Our method uses three feature vectors consisting...
The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then...
We describe current developments of Corpus2Wiki. Corpus2Wiki is a tool for generating so-called Wikiditions out of text corpora. It provides text analyses, annotations and their visualizations without requiring programming or advanced computer skills. By using TextImager as a back-end, Corpus2Wiki can automatically analyze input documents at differ...
We present the Frankfurt Latin Lexicon (FLL) as a lexical resource used by us in a number of NLP tasks of preprocessing Latin texts such as morphological tagging, lemmatization, and POS tagging. FLL was developed with the help of several source lexicons and taggers. First, a large number of so-called superlemmas were collected, then variants (lemma...
In recent years, the generation of linguistic networks from text corpora has attracted much attention in the historical sciences, social sciences, linguistics and philologies. These works aim, for example, to model social relations as a function of linguistic interactions or to discover thematic trends in larger communities. The presentation takes...
Modern annotation tools should meet at least the following general requirements: they can handle diverse data and annotation levels within one tool, and they support the annotation process with automatic (pre-)processing outcomes as much as possible. We developed a framework that meets these general requirements and that enables versatile and brows...
The investigation of the relationship structure in internet-based social spaces almost always yields similar relational patterns, namely centre-periphery structures. This pattern has considerable consequences for the possibilities of equal participation in discussions. Information is concentrated among a few participants. This leads to an in...
We introduce a method for computing classifier-based semantic spaces on top of text2ddc. To this end, we optimize text2ddc, a neural network-based classifier for the Dewey Decimal Classification (DDC). By using a wide range of linguistic features, including sense embeddings, we achieve an F-score of 87.4%. To show that our approach is language inde...
This poster is concerned with the thematic analysis and visualization of literary works by means of automated classification algorithms. To this end, a previously developed algorithm called text2ddc [3, 1] is used to identify the topic distributions of literary works. In addition, the contribution addresses how these distri...
TextImager as front end and back end for the distributed NLP of big Digital Humanities data
Background
Gene and protein related objects are an important class of entities in biomedical research, whose identification and extraction from scientific articles is attracting increasing interest. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of gene and protein related objects...
Background
Chemical and biomedical named entity recognition (NER) is an essential preprocessing task in natural language processing. The identification and extraction of named entities from scientific articles is also attracting increasing interest in many scientific disciplines. Locating chemical named entities in the literature is an essential st...
The genesis of the Cours is unique, for Saussure, as is well known, never wrote such a book. Rather, the Cours was compiled on the basis of students' lecture notes and notes from his estate. This genesis was and is the subject of criticism as well as of attempts to reconstruct the »authentic...
The Internet has become the main informational entity, i.e., a public source of information. The Internet offers many new benefits and opportunities for human learning, teaching, and research. However, by providing a vast amount of information from innumerable sources, it also enables the manipulation of information; there are countless examples of...
We develop a framework for modeling the context sensitivity of text interpretation. As a point of reference, we focus on the complexity of educational texts. To open up a broader basis for representing phenomena of context sensitivity, we integrate a learning theory (i.e., the Cognitive Load Theory) with a theory of discourse comprehension (i.e., t...
This study improves the performance of neural named entity recognition by a margin of up to 11% in terms of F-score on the example of a low-resource language like German, thereby outperforming existing baselines and establishing a new state-of-the-art on each single open-source dataset (CoNLL 2003, GermEval 2014 and Tübingen Treebank 2018). Rather...
In this paper, we study the limit of compactness, a graph index originally introduced for measuring structural characteristics of hypermedia. Applying compactness to large-scale small-world graphs, [1] observed its limit behaviour to be equal to 1. The striking question concerning this finding was whether this limit behaviour resulted from the...
In this paper, we present Corpus2Wiki, a tool which automatically creates a MediaWiki site for a given corpus of texts. The texts, along with automatically generated annotations and visualisations associated with them, are displayed on this MediaWiki site, locally hosted on the user's own machine. Several different software components are used to t...