
Egon W. Stemle
- M.Sc. Cognitive Science
- Researcher at Eurac Research
About
74 Publications
8,442 Reads
538 Citations
Introduction
I study skills like perception, thinking, learning, and language by combining the humanistic and analytical methods of the arts and the formal sciences.
I work on ontologies, the technological feasibility of their construction, and the utilization of structured data in applications, as well as on tools for processing and annotating linguistic data.
My driving force is the question of why humans handle incomplete and often inconsistent concepts quite well, while computational processes often do not.
Publications (74)
Metadata is critical throughout the research process, from study design to corpus selection/compilation, result interpretability and cumulative research. To date, however, learner corpus research has not developed community standards or best practices for metadata collection and sharing. In this article, we present the results of a collaborative pr...
Social media is an essential part of people's lives, and communication has become increasingly participatory, interactive, and multimodal. The FAIR principles are essential for digital preservation and archiving, with Findability & Accessibility already well covered technologically. To ensure Reusability to future researchers and other stakeholders...
Part-of-speech (PoS) tagging constitutes a common task in Natural Language Processing (NLP) given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods for PoS-tagging have changed. The majority of Italian existing data for this task originate from standard texts, wh...
Up until today, research in various educational and linguistic domains such as learner corpus research, writing research, or second language acquisition has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions combined with domain-inherent obstacles in data sharing ha...
Cultural Ecosystem Services (CESs), such as aesthetic and recreational enjoyment, as well as sense of place and cultural heritage, play an outstanding role in the contribution of landscapes to human well-being. Scientists, however, still often struggle to understand how landscape characteristics contribute to deliver these intangible benefits, larg...
In this article, we examine the current situation of data dissemination and provision for CMC corpora. By that we aim to give a guiding grid for future projects that will improve the transparency and replicability of research results as well as the reusability of the created resources. Based on the FAIR guiding principles for research data manageme...
In recent years, research data management has also become an important topic in the less data-intensive areas of the Social Sciences and Humanities (SSH). Funding agencies as well as research communities demand that empirical data collected and used for scientific research is managed and preserved in a way that research results are reproducible. In...
Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, there is a need for language identification. This study compares the performance of human annotators with automatic ways of language identification on a multilingual (mainly German-Italian-Eng...
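As a rough illustration of the automatic side of such a comparison (a sketch, not the study's actual system), a minimal character n-gram classifier could look like the following; all data and labels here are toy stand-ins:

```python
# Minimal language-identification sketch: a character n-gram
# Naive Bayes classifier (toy data; illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets; a real system would be trained on
# labelled German/Italian/English social-media posts.
texts = ["ich bin heute sehr müde", "oggi sono molto stanco", "i am very tired today"]
labels = ["de", "it", "en"]

clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
clf.fit(texts, labels)
print(clf.predict(["sono abbastanza felice"]))  # expected: ['it']
```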
In this article we provide an overview of first-hand experiences and vantage points for best practices from projects in seven European countries dedicated to learner corpus research (LCR) and the creation of language learner corpora. The corpora and tools involved in LCR are becoming more and more important, as are careful preparation and easy retr...
The goal of the STyrLogism Project is to semi-automatically extract neologism candidates (new lexemes) for the German standard variety used in South Tyrol, and generally to create the basis for long-term monitoring of its development. We use automatic lexico-semantic analytics for the lexicographic processing, but instead of continuing to develop o...
The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In conformity with the multilingual situation of the territory, the main languages of the corpus are Ger...
Poster presented at the 3rd Italian Conference on Computational Linguistics, 5-6 December 2016, Università degli Studi di Napoli Federico II
English. This paper describes an extended version of the KoKo corpus (version KoKo4, Dec 2015), a corpus of written German L1 learner texts from three different German-speaking regions in three different countries. The KoKo corpus is richly annotated with learner language features on different linguistic levels such as errors or other linguistic ch...
Michael Beißwenger, Thierry Chanier, Isabella Chiari, Tomaž Erjavec, Darja Fišer, Axel Herold, Nikola Ljubešić, Harald Lüngen, Céline Poudat, Egon Stemle, Angelika Storrer and Ciara Wigham, 2016, "Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Ital...
At the beginning of the corpus building process is the selection of appropriate software tools and data formats for the acquisition and annotation of original linguistic data. This initial phase is characterised by challenging decisions, for the software needs to be flexible (to facilitate intuitive and speedy transcription), powerful (to meet anno...
Available at http://aclweb.org/anthology/W16-2614
The annual conference CLIC–it (''Italian Conference on Computational Linguistics'') is an initiative of the ''Italian Association of Computational Linguistics'' (AILC – www.ai-lc.it) which is intended to meet the need for a national and international forum for the promotion and dissemination of high-level original research in the field of Computati...
EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for the Italian language: since 2007 shared tasks have been proposed covering the analysis of both written and spoken language with the aim of enhancing the development and dissemination of resources and technologies for Italian. EVALITA is an initiative of the Itali...
This article focuses on automatic text classification which aims at identifying the first language (L1) background of learners of English. A particular question arising in the context of automated L1 identification is whether any features that are informative for a machine learning algorithm relate to L1-specific transfer...
In this paper, we present on-going experiments for correcting OCR errors on German newspapers in Fraktur font. Our approach borrows from techniques for spelling correction in context using a probabilistic edit-operation error model and lexical resources. We highlight conditions in which high error reduction rates can be obtained and where the appro...
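To illustrate the general family of techniques (a hedged sketch, not the paper's actual error model), the following picks the most frequent lexicon entry within a small edit distance of a misrecognized token:

```python
# Lexicon-based OCR post-correction sketch: candidates within a small
# Levenshtein distance are ranked by corpus frequency.  The lexicon and
# frequencies are toy values, not the paper's resources.
LEXICON = {"haus": 120, "maus": 40, "aus": 300}  # word -> frequency

def edit_distance(a: str, b: str) -> int:
    # Single-row dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[-1]

def correct(token: str, max_dist: int = 1) -> str:
    # Return the most frequent in-lexicon candidate, or the token itself.
    candidates = [(freq, word) for word, freq in LEXICON.items()
                  if edit_distance(token, word) <= max_dist]
    return max(candidates)[1] if candidates else token

print(correct("hnus"))  # a Fraktur-style misrecognition; expected: 'haus'
```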
Decisions at the outset of preparing a learner corpus are of crucial importance for how the corpus can be built and how it can be analysed later on. This paper presents a generic workflow to build learner corpora while taking into account the needs of the users. The workflow results from an extensive collaboration between linguists that annotate an...
In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After stating the particular challenges for building corpora of private, non-public computer-mediated communication, we will present our solution to these problems: a Facebook web application for the acquisit...
Talk at the pre-conference workshop “NLP 4 CMC: Natural Language Processing for Computer-Mediated Communication / Social Media” at the 12th edition of KONVENS, Hildesheim, Germany
English. In this paper, we present ongoing experiments for correcting OCR errors on German newspapers in Fraktur font. Our approach borrows from techniques for spelling correction in context using a probabilistic edit-operation error model and lexical resources. We highlight conditions in which high error reduction rates can be obtained and where...
Special Issue: Building and annotating corpora of computer-mediated discourse. Issues and Challenges at the Interface of Corpus and Computational Linguistics
We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the quality of the performed tran...
In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISÀ corpus of Italian web texts, interHist aims at facilitating the exploration of large result sets to linguistic corpus searches. This objective is approached by providin...
Talk in the Work in Progress Series at the Kompetenzzentrum Sprachen, Freie Universität Bozen
Talk at the 7th workshop of the DFG scientific network Empirikom ”Social Media Corpora for the eHumanities: Standards, Challenges, and Perspectives”, TU Dortmund University
PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.
Talk at the workshop of the "Arbeitsgruppe: Korpusbasierte Linguistik" at the 40. Österreichische Linguistiktagung, Nov 22-24, 2013, Universität Salzburg, Salzburg, Austria.
The talk presents the iterative workflow for creating a lemmatized, PoS-tagged learner corpus annotated for selected linguistic features and ...
In this article, we present the multi-faceted interface to the open PAISÀ corpus of Italian. Created within the project PAISÀ (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) [1], the corpus is designed to be freely available for non-commercial processing, usage and distribution by the public. Hence, the automatically annotated c...
Talk at the Workshop on "Verarbeitung und Annotation von Sprachdaten aus Genres internetbasierter Kommunikation" at the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013), Sep 23, 2013, TU Darmstadt.
The automatic processing of data from internet-based communication (IBK) challenges conventional methods in the field of...
In this paper, we report on an unsupervised greedy-style process for acquiring phrase translations from sentence-aligned parallel corpora. Thanks to innovative selection strategies, this process can acquire multiple translations without size criteria, i.e. phrases can have several translations, can be of any size, and their size is not considered w...
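A toy sketch of such a greedy acquisition loop is given below; the co-occurrence scoring is a stand-in for the paper's selection strategies, and unlike those strategies this toy keeps at most one translation per phrase:

```python
# Greedy phrase-pair acquisition sketch from a sentence-aligned bitext:
# count phrase co-occurrences, then greedily accept the strongest pairs,
# discarding candidates whose phrases are already taken.
from collections import Counter
from itertools import product

def ngrams(tokens, max_n=2):
    # All contiguous phrases up to max_n tokens long.
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def greedy_phrase_pairs(bitext, top_k=3):
    cooc = Counter()
    for src, tgt in bitext:
        for s, t in product(ngrams(src.split()), ngrams(tgt.split())):
            cooc[(s, t)] += 1
    pairs, used_src, used_tgt = [], set(), set()
    for (s, t), _ in cooc.most_common():
        if s not in used_src and t not in used_tgt:
            pairs.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
        if len(pairs) == top_k:
            break
    return pairs

bitext = [("the house", "das haus"), ("the mouse", "die maus")]
print(greedy_phrase_pairs(bitext))
```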
Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing. The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types. However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corp...
Talk at BootCaTters of the world unite! (BOTWU), A workshop (and a survey) on the BootCaT toolkit, Jun 24, 2013, Department of Interpreting and Translation, University of Bologna Forlì, Italy.
"Copyright issues remain a gray area in compiling and distributing Web corpora"[1]; and even though "If a Web corpus is infringing copyright, then it is mer...
Talk at the DGfS 2013 Workshop on Modellierung nicht-standardisierter Schriftlichkeit, the 35th Annual Conference of the German Linguistic Society (DGfS 2013)
Talk at the international workshop "Building Corpora of Computer-Mediated Communication: Issues, Challenges, and Perspectives", Department of German Language and Literature, Faculty of Culture Studies, TU Dortmund University
We report on on-going work to derive translations of phrases from parallel corpora. We describe an unsupervised and knowledge-free greedy-style process relying on innovative strategies for choosing and discarding candidate translations. This process manages to acquire multiple translations combining phrases of equal or different sizes. The prelimin...
Plenary Talk at Computer Applications in Linguistics: Student Research Workshop (CSRW2012), Jul 13, 2012, English Corpus Linguistics Group at the Institute of Linguistics and Literary Studies, Technische Universität Darmstadt.
Short presentation at the 3rd workshop of the academic network on "Internet Lexicography" - May 4, 2012.
Developing content extraction methods for Humanities domains raises a number of challenges, from the abundance of non-standard entity types to their complexity to the scarcity of data. Close collaboration with Humanities scholars is essential to address these challenges. We discuss an annotation schema for Archaeological texts developed in collabor...
Live Demo with Poster at the LiveMemories Final Event - Internet, Memoria e Futuro and The Semantic Way
The entities mentioned in collections of scholarly articles in the Humanities (and in other scholarly domains) belong to different types from those familiar from news corpora, hence new resources need to be annotated to create supervised taggers for tasks such as NE extraction. However, in such domains there is a great need for making the best use...
Most existing HLT pipelines assume the input is pure text or, at most, HTML and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital library and other types of applications, and show that it is relatively straightforward to extend existing pipelines to achieve ones in...
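As a minimal sketch of the structure-aware pipelines this entry argues for, the snippet below keeps the logical structure of a document instead of flattening it to plain text (it assumes BeautifulSoup is installed; the HTML is a toy stand-in for a digital-library document):

```python
# Sketch: preserve (logical) document structure for downstream HLT
# components instead of discarding it.
from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>Intro.</p><h2>Methods</h2><p>Details.</p>"
soup = BeautifulSoup(html, "html.parser")

structured = []
for el in soup.find_all(["h1", "h2", "p"]):
    kind = "heading" if el.name in ("h1", "h2") else "paragraph"
    structured.append({"type": kind, "tag": el.name, "text": el.get_text()})

for unit in structured:
    print(unit)  # downstream components can now respect section boundaries
```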
Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text and distinguishing it from related languages in the same region can be difficult. For example, less dominant variants...
The Live Memories corpus is an Italian corpus annotated for anaphoric relations. The corpus includes manually annotated information about morphosyntactic agreement, anaphoricity, and semantic class of the NPs. For the annotation of the anaphoric links the corpus takes into account specific phenomena of the Italian language like incorporated clitics a...
The KrdWrd System comprises an extensive system for automated Web cleaning tasks. For training the KrdWrd ML Engine, a substantial amount of hand-annotated data, viz. Web pages, is needed. In the following, we present the parts of the system that cover the acquisition of training data, i.e. the steps before training data can be fed into an ML Engine. Henc...
Algorithmic processing of Web content mostly works on textual contents, neglecting visual information. Annotation tools largely share this deficit as well. We specify requirements for an architecture to overcome both problems and propose an implementation, the KrdWrd system. It uses the Gecko rendering engine for both annotation and feature extract...
This thesis discusses the KrdWrd Project. The Project goals are to provide tools and infrastructure for acquisition, visual annotation, merging and storage of Web pages as parts of bigger corpora, and to develop a classification engine that learns to automatically annotate pages, operate on the visual rendering of pages, and provide visual tools fo...
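The toy heuristic below only illustrates the Web-cleaning task itself; KrdWrd learns the content/boilerplate decision from hand-annotated, visually rendered pages rather than using a fixed rule like this:

```python
# Toy heuristic in the spirit of Web cleaning: blocks dominated by link
# text are flagged as boilerplate.  Block sizes are invented examples.
def link_density(text_len: int, linked_len: int) -> float:
    return linked_len / text_len if text_len else 1.0

blocks = [
    {"name": "article body", "text_len": 400, "linked_len": 20},
    {"name": "navigation menu", "text_len": 60, "linked_len": 55},
]
for block in blocks:
    dens = link_density(block["text_len"], block["linked_len"])
    label = "boilerplate" if dens > 0.5 else "content"
    print(block["name"], "->", label)
```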
Several bootstrapping-based relation extraction algorithms working on large corpora or on the Web have been presented in the literature. A crucial issue for such algorithms is to avoid the introduction of too much noise into further iterations. Typically, this is achieved by applying appropriate pattern and tuple evaluation measures, henceforth c...
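A schematic bootstrapping loop, with the noise-control step reduced to a comment (the cited algorithms use proper pattern and tuple evaluation measures), might look like this; the corpus and seed are toy examples:

```python
# Schematic bootstrapping for relation extraction: seed tuples induce
# surface patterns, patterns extract new tuples, and so on.
import re

corpus = [
    "Rome is the capital of Italy.",
    "Paris is the capital of France.",
    "Madrid is the capital of Spain.",
]
seeds = {("Rome", "Italy")}

def bootstrap(seed_tuples, sentences, iterations=2):
    tuples_ = set(seed_tuples)
    for _ in range(iterations):
        # Induce surface patterns from sentences containing a known tuple.
        patterns = set()
        for x, y in tuples_:
            for sent in sentences:
                if x in sent and y in sent:
                    pat = re.escape(sent)
                    pat = pat.replace(re.escape(x), r"(\w+)", 1)
                    pat = pat.replace(re.escape(y), r"(\w+)", 1)
                    patterns.add(pat)
        # Apply the patterns to harvest new candidate tuples.  A real
        # system scores patterns and tuples here to keep noise out of
        # the next iteration; this sketch accepts everything.
        for pat in patterns:
            for sent in sentences:
                match = re.search(pat, sent)
                if match:
                    tuples_.add(match.groups())
    return tuples_

print(sorted(bootstrap(seeds, corpus)))
```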