Alexander Mehler
Goethe-Universität Frankfurt am Main · Institut für Informatik

Full Professor
Current activity: formalizing and exploring multilayer, multiplex Twitter networks

About

266 Publications
39,964 Reads
1,854 Citations
Additional affiliations
May 2010 - present: Goethe-Universität Frankfurt am Main, Professor (Full)

Publications (266)
Conference Paper
Full-text available
Differential diagnosis aims at distinguishing between diseases causing similar symptoms. This is exemplified by epilepsies and dissociative disorders. Recently, it has been shown that linguistic features of physician-patient talks allow for differentiating between these two diseases. Since this method relies on trained linguists, it is not suitable...
Conference Paper
Full-text available
We study the role of the second language in bilingual word embeddings in monolingual semantic evaluation tasks. We find strongly and weakly positive correlations between downstream task performance and second language similarity to the target language. Additionally, we show how bilingual word embeddings can be employed for the task of semantic lan...
Conference Paper
Full-text available
We consider two graph models of semantic change. The first is a time-series model that relates embedding vectors from one time period to embedding vectors of previous time periods. In the second, we construct one graph for each word: nodes in this graph correspond to time points and edge weights to the similarity of the word's meaning across two ti...
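A minimal sketch of the second, per-word graph model described above, assuming precomputed embedding vectors of the same word for several time periods (the names, the toy data and the use of cosine similarity as edge weight are illustrative assumptions, not the authors' exact construction):

import itertools
import numpy as np
import networkx as nx

def word_change_graph(period_vectors):
    # Nodes are time periods; edge weights are cosine similarities between
    # the word's embedding vectors in two periods.
    g = nx.Graph()
    g.add_nodes_from(period_vectors)
    for (p1, v1), (p2, v2) in itertools.combinations(period_vectors.items(), 2):
        sim = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        g.add_edge(p1, p2, weight=sim)
    return g

# Toy example with hypothetical two-dimensional vectors for three decades.
vectors = {"1990s": np.array([0.9, 0.1]),
           "2000s": np.array([0.7, 0.4]),
           "2010s": np.array([0.2, 0.9])}
print(word_change_graph(vectors).edges(data=True))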
Conference Paper
Transformer-based models are now predominant in NLP. They outperform approaches based on static models in many respects. This success has in turn prompted research that reveals a number of biases in the language models generated by transformers. In this paper we utilize this research on biases to investigate to what extent transformer-based languag...
Conference Paper
Full-text available
Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a certain deficit of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliament Corpus (...
Poster
Full-text available
Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a certain deficit of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliamentary Corpu...
Poster
Full-text available
HeidelTime is one of the most widespread and successful tools for detecting temporal expressions in texts. Since HeidelTime’s pattern matching system is based on regular expressions, it can be extended in a convenient way. We present such an extension for the German resources of HeidelTime: HeidelTimeExt. The extension has been brought about by mean...
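A toy illustration of the general idea of regex-based temporal tagging, assuming simplified German date patterns; this is not HeidelTime's actual rule or resource format, only a sketch of how regular expressions can capture such expressions:

import re

# Hypothetical, heavily simplified patterns for German date expressions.
MONTHS = r"(Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)"
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.\s*" + MONTHS + r"\s+(\d{4})\b")

def find_dates(text):
    # Return all substrings matched by the toy date pattern.
    return [m.group(0) for m in DATE_PATTERN.finditer(text)]

print(find_dates("Die Sitzung fand am 3. Oktober 1990 statt."))
# Expected output: ['3. Oktober 1990']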
Preprint
Full-text available
Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a certain deficit of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliament Corpus (...
Conference Paper
Full-text available
HeidelTime is one of the most widespread and successful tools for detecting temporal expressions in texts. Since HeidelTime's pattern matching system is based on regular expressions, it can be extended in a convenient way. We present such an extension for the German resources of HeidelTime: HeidelTimeExt. The extension has been brought about by mean...
Preprint
Full-text available
HeidelTime is one of the most widespread and successful tools for detecting temporal expressions in texts. Since HeidelTime's pattern matching system is based on regular expressions, it can be extended in a convenient way. We present such an extension for the German resources of HeidelTime: HeidelTime-EXT. The extension has been brought about by me...
Preprint
Full-text available
Transformer-based models are now predominant in NLP. They outperform approaches based on static models in many respects. This success has in turn prompted research that reveals a number of biases in the language models generated by transformers. In this paper we utilize this research on biases to investigate to what extent transformer-based languag...
Presentation
Full-text available
Although Critical Online Reasoning (COR) is often viewed as a general competency (e.g. Alexander et al. 2016), studies have found evidence supporting its domain-specificity (Toplak et al. 2002). To investigate this assumption, we focus on commonalities and differences in textual preferences in solving COR-related tasks between graduates/young pro...
Article
Full-text available
The average geodesic distance L (Newman 2003) and the compactness C_B (Botafogo 1992) are important graph indices in applications of complex network theory to real-world problems. Here, for simple connected undirected graphs G of order n, we study the behavior of L(G) and C_B(G), subject to the condition that their order |V(G)| approache...
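For orientation, one common formulation of these two indices for a connected graph G with n vertices and geodesic distances d_ij reads as follows (a sketch of the standard definitions attributed to Newman 2003 and Botafogo et al. 1992; the disconnection constant K is an assumption, often set to n):

L(G) = \frac{1}{n(n-1)} \sum_{i \neq j} d_{ij}, \qquad
C_B(G) = \frac{\mathrm{Max} - \sum_{i \neq j} d_{ij}}{\mathrm{Max} - \mathrm{Min}},
\quad \mathrm{Max} = K\,(n^2 - n), \quad \mathrm{Min} = n^2 - n.

Since \sum_{i \neq j} d_{ij} = n(n-1)\,L(G), this gives C_B(G) = (K - L(G)) / (K - 1); with K = n the index tends to 1 whenever L(G) grows much more slowly than n, which is the kind of limit behavior studied here.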
Chapter
Circumstances surrounding the COVID-19 pandemic have serious implications for a multitude of areas of life. Alongside a decrease in the state of health of a considerable number of people, this global crisis also shows that society – both civil and professional, regardless of the sector – is now facing new technological challenges. Furthermore, due...
Chapter
Full-text available
As global trends are shifting towards data-driven industries, the demand for automated algorithms that can convert images of scanned documents into machine readable information is rapidly growing. In addition to digitization, there is a move toward automating processes that used to require manual inspection of documents. Although optical chara...
Poster
Full-text available
SemioGraphs are multi-codal graphs whose vertices and edges are simultaneously mapped onto different systems (or codes) of types or labels. We provide a visualization technique that allows for interactively comparing word embeddings that are derived from different corpora and based on different techniques, and exemplify this technique in digital h...
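A minimal data-structure sketch of what multi-codal labeling can look like, assuming each vertex simultaneously carries labels from several independent coding systems (the node names, codes and values below are made up for illustration, not SemioGraph's actual schema):

import networkx as nx

# Each node is labeled under several codes at once, e.g. a part-of-speech
# code and a DDC-like topic code; edges can carry their own codes as well.
g = nx.Graph()
g.add_node("bank", pos="NOUN", topic="330 Economics")
g.add_node("river", pos="NOUN", topic="550 Earth sciences")
g.add_edge("bank", "river", weight=0.42, relation="co-occurrence")

# The graph can then be queried by one code while the others stay available.
nouns = [n for n, d in g.nodes(data=True) if d["pos"] == "NOUN"]
print(nouns)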
Article
Full-text available
Biodiversity information is contained in countless digitized and unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, there are still innumerable text sources that are poorly accessible and require a more advanced range of methods to extract relevant information. To improve the access to sema...
Conference Paper
Full-text available
The aim of this paper is to evaluate the annotation tool used in the VoxML Track of ISA-17, henceforth called the VoxML Annotation Environment. We describe our experiences with the modeling language VoxML, present our own annotation tool, called TTLab VoxML Annotator, and work out qualitative criteria for evaluating both tools. In our view, the most importa...
Conference Paper
Full-text available
We argue that mainly due to technical innovation in the landscape of annotation tools, a conceptual change in annotation models and processes is also on the horizon. It is diagnosed that these changes are bound up with multi-media and multi-perspective facilities of annotation tools, in particular when considering virtual reality (VR) and augmented...
Preprint
Full-text available
As global trends are shifting towards data-driven industries, the demand for automated algorithms that can convert digital images of scanned documents into machine readable information is rapidly growing. Besides the opportunity of data digitization for the application of data analytic tools, there is also a massive improvement towards automation o...
Article
Full-text available
The ongoing digitalization of educational resources and the use of the internet lead to a steady increase of potentially available learning media. However, many of the media which are used for educational purposes have not been designed specifically for teaching and learning. Usually, linguistic criteria of readability and comprehensibility as well...
Article
Full-text available
We test the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted. Controlling for the size factor, we investigate this hypothesis for 25 subject areas. Since Wikipedia is a central part of the web-based information landscape, this indicates a language...
Conference Paper
Full-text available
The storage of data in public repositories such as the Global Biodiversity Information Facility (GBIF) or the National Center for Biotechnology Information (NCBI) is nowadays stipulated in the policies of many publishers in order to facilitate data replication or proliferation. Species occurrence records contained in legacy printed literature are n...
Article
Full-text available
In this article we present the Frankfurt Latin Lexicon (FLL), a lexical resource for Medieval Latin that is used both for the lemmatization of Latin texts and for the post-editing of lemmatizations. We describe recent advances in the development of lemmatizers and test them against the Capitularies corpus (comprising Frankish royal edicts, mid-6th...
Preprint
Full-text available
Threshold concepts are key terms in domain-based knowledge acquisition. They are regarded as building blocks of the conceptual development of domain knowledge within particular learners. From a linguistic perspective, however, threshold concepts are instances of specialized vocabularies, exhibiting particular linguistic features. Threshold concepts...
Preprint
Full-text available
We test the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted. Controlling for the size factor, we investigate this hypothesis for 25 subject areas. Since Wikipedia is a central part of the web-based information landscape, this indicates a language...
Poster
Full-text available
The TextAnnotator is a tool for simultaneous and collaborative annotation of texts with visual annotation support, integration of knowledge bases and, by pipelining the TextImager, a rich variety of pre-processing and automatic annotation tools. It includes a variety of modules for the annotation of texts, which contains the annotation of argumenta...
Conference Paper
The automatic generation of digital scenes from texts is a central task of computer science. This task requires a kind of text comprehension, the automation of which is tied to the availability of sufficiently large, diverse and deeply annotated data, which is freely available. This paper introduces Text2SceneVR, a system that addresses this bottlen...
Conference Paper
In recent years, the usability of interfaces in the field of Virtual Realities (VR) has massively improved, so that theories and applications of multimodal data processing can now be tested more extensively. In this paper we present an extension of VAnnotatoR, which is a VR-based open hypermedia system that is used for annotating, visualizing and i...
Chapter
In recent years, the usability of interfaces in the field of Virtual Realities (VR) has massively improved, so that theories and applications of multimodal data processing can now be tested more extensively. In this paper we present an extension of VAnnotatoR, which is a VR-based open hypermedia system that is used for annotating, visualizing and i...
Preprint
Full-text available
In this article we present the Frankfurt Latin Lexicon (FLL), a lexical resource for Medieval Latin that is used both for the lemmatization of Latin texts and for the post-editing of lemmatizations. We describe recent advances in the development of lemmatizers and test them against the Capitularies corpus (comprising Frankish royal edicts, mid-6th...
Conference Paper
Full-text available
Despite the great importance of the Latin language in the past, there are relatively few resources available today to develop modern NLP tools for this language. Therefore, the EvaLatin Shared Task for Lemmatization and Part-of-Speech (POS) tagging was published in the LT4HALA workshop. In our work, we dealt with the second EvaLatin task, that is,...
Conference Paper
Full-text available
The annotation of texts and other material in the field of digital humanities and Natural Language Processing (NLP) is a common task of research projects. At the same time, the annotation of corpora is certainly the most time- and cost-intensive component in research projects and often requires a high level of expertise according to the research in...
Article
Full-text available
Are nearby places (e.g., cities) described by related words? In this article, we transfer this research question in the field of lexical encoding of geographic information onto the level of intertextuality. To this end, we explore Volunteered Geographic Information (VGI) to model texts addressing places at the level of cities or regions with the he...
Conference Paper
Full-text available
Coreference resolution (CR) aims to find all spans of a text that refer to the same entity. The F1 scores on this task have been greatly improved by newly developed end-to-end approaches (Lee et al., 2017) and transformer networks (Joshi et al., 2019b). The inclusion of CR as a pre-processing step is expected to lead to improvements in downstream tasks...
Conference Paper
Full-text available
People's visual perception is highly developed, and it is therefore usually no problem for them to describe the space around them in words. Conversely, people also have no problem imagining a concept of a described space. In recent years many efforts have been made to develop a linguistic scheme for spatial and spatial-temporal relations. However, t...
Conference Paper
Full-text available
Current sentence boundary detectors, such as [2, 8], split documents into sequentially ordered sentences by detecting their beginnings and ends. Sentences, however, are more deeply structured even below the level of constituent and dependency structure: they can consist of a main clause and several subordinate clauses as well as further segments (e....
Preprint
Full-text available
Are nearby places (e.g. cities) described by related words? In this article we transfer this research question in the field of lexical encoding of geographic information onto the level of intertextuality. To this end, we explore Volunteered Geographic Information (VGI) to model texts addressing places at the level of cities or regions with the help...
Chapter
Full-text available
About ten years ago, we were still able to justify the reservations of sociology with regard to digitisation as a healthy caution against the hype of the ‘virtual world’ and ‘cyberspace’; today, the situation looks different: beyond the usual rhetoric of media revolutions, new forms of practice, organisation and order have emerged around digital te...
Conference Paper
The Specialized Information Service Biodiversity Research (BIOfid) has been launched to mobilize valuable biological data from printed literature that has been hidden in German libraries over the past 250 years. In this project, we annotate German texts converted by OCR from historical scientific literature on the biodiversity of plants, birds, moths and but...
Poster
Full-text available
We present VAnnotatoR, a framework for the multimodal reconstruction of historical events and spaces. The aim of VAnnotatoR, which is implemented in Unity3D, is to develop, implement and test a theory of processing, understanding (multimedia mining) and generating (text2scene) multimodal signs in a single environment that comprises Augmented Realit...
Preprint
Full-text available
The recognition of pharmacological substances, compounds and proteins is an essential preliminary work for the recognition of relations between chemicals and other biomedically relevant units. In this paper, we describe an approach to Task 1 of the PharmaCoNER Challenge, which involves the recognition of mentions of chemicals and drugs in Spanish m...
Preprint
Full-text available
The recognition of pharmacological substances, compounds and proteins is an essential preliminary work for the recognition of relations between chemicals and other biomedically relevant units. In this paper, we describe an approach to Task 1 of the PharmaCoNER Challenge, which involves the recognition of mentions of chemicals and drugs in Spanish m...
Chapter
Projects in the field of Natural Language Processing (NLP), the Digital Humanities (DH) and related disciplines dealing with machine learning of complex relationships between data objects need annotations to obtain sufficiently rich training and test sets. The visualization of such data sets and their underlying Human Computer Interaction (HCI) are...
Preprint
Full-text available
We introduce a neural network-based system of Word Sense Disambiguation (WSD) for German that is based on SenseFitting, a novel method for optimizing WSD. We outperform knowledge-based WSD methods by up to 25% F1-score and produce a new state-of-the-art on the German sense-annotated dataset WebCAGe. Our method uses three feature vectors consisting...
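As a generic illustration of embedding-based word sense disambiguation (not the SenseFitting procedure itself), a target word can be assigned the sense whose vector is most similar to a vector representing its context; the names and vectors below are hypothetical:

import numpy as np

def disambiguate(context_vec, sense_vecs):
    # Pick the sense whose embedding has the highest cosine similarity
    # to the context embedding; sense_vecs maps sense ids to vectors.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(sense_vecs, key=lambda s: cos(context_vec, sense_vecs[s]))

# Toy example with hypothetical three-dimensional sense embeddings.
senses = {"bank_finance": np.array([0.9, 0.1, 0.0]),
          "bank_river":   np.array([0.1, 0.8, 0.3])}
print(disambiguate(np.array([0.8, 0.2, 0.1]), senses))  # -> bank_finance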
Article
Full-text available
The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then...
Conference Paper
Full-text available
We describe current developments of Corpus2Wiki. Corpus2Wiki is a tool for generating so-called Wikiditions out of text corpora. It provides text analyses, annotations and their visualizations without requiring programming or advanced computer skills. By using TextImager as a back-end, Corpus2Wiki can automatically analyze input documents at differ...
Article
Full-text available
The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then...
Presentation
Full-text available
We present the Frankfurt Latin Lexicon (FLL) as a lexical resource used by us in a number of NLP tasks of preprocessing Latin texts such as morphological tagging, lemmatization, and POS tagging. FLL was developed with the help of several source lexicons and taggers. First, a large number of so-called superlemmas were collected, then variants (lemma...
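A minimal sketch of the lookup idea behind such a lexicon, assuming a mapping from attested wordforms and spelling variants to superlemmas (the miniature lexicon and function below are hypothetical, not the FLL's actual data model):

# Hypothetical miniature lexicon: variant or wordform -> superlemma.
VARIANT_TO_SUPERLEMMA = {
    "aecclesia": "ecclesia",
    "ecclesiam": "ecclesia",
    "aecclesiam": "ecclesia",
    "regnum": "regnum",
    "regni": "regnum",
}

def lemmatize(token):
    # Map a (possibly spelling-variant) Latin token to its superlemma,
    # falling back to the lowercased token itself if it is unknown.
    return VARIANT_TO_SUPERLEMMA.get(token.lower(), token.lower())

print([lemmatize(t) for t in ["Aecclesiam", "regni", "et"]])
# -> ['ecclesia', 'regnum', 'et']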
Presentation
Full-text available
In recent years, the generation of linguistic networks from text corpora has attracted much attention in the historical sciences, social sciences, linguistics and philologies. These works aim, for example, to model social relations as a function of linguistic interactions or to discover thematic trends in larger communities. The presentation takes...
Conference Paper
Full-text available
Modern annotation tools should meet at least the following general requirements: they can handle diverse data and annotation levels within one tool, and they support the annotation process with automatic (pre-)processing outcomes as much as possible. We developed a framework that meets these general requirements and that enables versatile and brows...
Article
Full-text available
Investigating the relational structure of internet-based social spaces almost always reveals similar relational patterns, namely centre-periphery structures. This pattern has considerable consequences for the possibilities of equal participation in discussions. Information is concentrated among a few participants. This leads to an un...
Conference Paper
Full-text available
We introduce a method for computing classifier-based semantic spaces on top of text2ddc. To this end, we optimize text2ddc, a neural network-based classifier for the Dewey Decimal Classification (DDC). By using a wide range of linguistic features, including sense embeddings, we achieve an F-score of 87.4%. To show that our approach is language inde...
Poster
Full-text available
This poster deals with the thematic analysis and visualization of literary works by means of automated classification algorithms. For this purpose, a previously developed algorithm called text2ddc [3, 1] is used to identify the topic distributions of literary works. In addition, the contribution addresses how these distri...
Poster
Full-text available
TextImager as front end and back end for the distributed NLP of big Digital Humanities data
Article
Full-text available
Background: Gene and protein related objects are an important class of entities in biomedical research, whose identification and extraction from scientific articles is attracting increasing interest. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of gene and protein related objects...
Preprint
Full-text available
Investigating the relational structure of internet-based social spaces almost always reveals similar relational patterns, namely centre-periphery structures. This pattern has considerable consequences for the possibilities of equal participation in discussions. Information is concentrated among a few participants. This leads to an un...
Article
Full-text available
Background: Chemical and biomedical named entity recognition (NER) is an essential preprocessing task in natural language processing. The identification and extraction of named entities from scientific articles is also attracting increasing interest in many scientific disciplines. Locating chemical named entities in the literature is an essential st...
Chapter
Full-text available
The genesis of the Cours is unique, since Saussure, as is well known, never wrote such a book. Rather, the Cours emerged on the basis of students' lecture notes and notes from his estate. This genesis was and is the subject of criticism as well as of attempts to reconstruct the "authentic...
Chapter
The Internet has become the main informational entity, i.e., a public source of information. The Internet offers many new benefits and opportunities for human learning, teaching, and research. However, by providing a vast amount of information from innumerable sources, it also enables the manipulation of information; there are countless examples of...
Chapter
Full-text available
We develop a framework for modeling the context sensitivity of text interpretation. As a point of reference, we focus on the complexity of educational texts. To open up a broader basis for representing phenomena of context sensitivity, we integrate a learning theory (i.e., the Cognitive Load Theory) with a theory of discourse comprehension (i.e., t...
Conference Paper
Full-text available
This study improves the performance of neural named entity recognition by a margin of up to 11% in terms of F-score on the example of a low-resource language like German, thereby outperforming existing baselines and establishing a new state-of-the-art on each single open-source dataset (CoNLL 2003, GermEval 2014 and Tübingen Treebank 2018). Rather...
Article
Full-text available
In this paper, we study the limit of compactness, which is a graph index originally introduced for measuring structural characteristics of hypermedia. Applying compactness to large-scale small-world graphs, [1] observed its limit behaviour to be equal to 1. The striking question concerning this finding was whether this limit behaviour resulted from the...
Conference Paper
Full-text available
In this paper, we present Corpus2Wiki, a tool which automatically creates a MediaWiki site for a given corpus of texts. The texts, along with automatically generated annotations and visualisations associated with them, are displayed on this MediaWiki site, locally hosted on the user's own machine. Several different software components are used to t...