Egon W. Stemle
  • M.Sc. Cognitive Science
  • Researcher at Eurac Research

About

74 Publications
8,442 Reads
538 Citations
Introduction
I study skills like perception, thinking, learning, and language by combining the humanistic and analytical methods of the arts and the formal sciences. I work on ontologies, the technological feasibility of their construction, the utilization of structured data in applications, and tools for processing and annotating linguistic data. My driving force is the question why humans handle incomplete and often inconsistent concepts quite well, while computational processes often do not.
Current institution
Eurac Research
Current position
  • Researcher
Additional affiliations
April 2009 - January 2012
University of Trento
Position
  • Researcher

Publications (74)
Article
Full-text available
Metadata is critical throughout the research process, from study design to corpus selection/compilation, result interpretability and cumulative research. To date, however, learner corpus research has not developed community standards or best practices for metadata collection and sharing. In this article, we present the results of a collaborative pr...
Conference Paper
Full-text available
Social media is an essential part of people's lives, and communication has become increasingly participatory, interactive, and multimodal. The FAIR principles are essential for digital preservation and archiving, with Findability & Accessibility already well covered technologically. To ensure Reusability to future researchers and other stakeholders...
Chapter
Part-of-speech (PoS) tagging constitutes a common task in Natural Language Processing (NLP) given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods for PoS-tagging have changed. The majority of Italian existing data for this task originate from standard texts, wh...
Article
Full-text available
Up until today, research in various educational and linguistic domains such as learner corpus research, writing research, or second language acquisition has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions combined with domain-inherent obstacles in data sharing ha...
Article
Full-text available
Cultural Ecosystem Services (CESs), such as aesthetic and recreational enjoyment, as well as sense of place and cultural heritage, play an outstanding role in the contribution of landscapes to human well‐being. Scientists, however, still often struggle to understand how landscape characteristics contribute to delivering these intangible benefits, larg...
Chapter
In this article, we examine the current situation of data dissemination and provision for CMC corpora. By that we aim to give a guiding grid for future projects that will improve the transparency and replicability of research results as well as the reusability of the created resources. Based on the FAIR guiding principles for research data manageme...
Conference Paper
In recent years, research data management has also become an important topic in the less data-intensive areas of the Social Sciences and Humanities (SSH). Funding agencies as well as research communities demand that empirical data collected and used for scientific research is managed and preserved in a way that research results are reproducible. In...
Chapter
Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, there is a need for language identification. This study compares the performance of human annotators with automatic ways of language identification on a multilingual (mainly German-Italian-Eng...
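The study above compares human annotators with automatic language identification on multilingual social media data. As a rough illustration of one classic automatic approach (character n-gram rank profiles in the style of Cavnar & Trenkle; an assumption for illustration, not the method evaluated in the chapter):

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams of a text."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(profile, reference, top=300):
    """Sum of rank differences; smaller means more similar."""
    return sum(abs(rank - reference.get(g, top)) for g, rank in profile.items())

def identify(text, references):
    """Pick the reference language whose profile is closest to the text's."""
    p = ngram_profile(text)
    return min(references, key=lambda lang: out_of_place(p, references[lang]))

# Toy reference profiles built from tiny seed texts; real systems train
# on large monolingual corpora per language.
refs = {
    "de": ngram_profile("das ist ein schöner tag und wir gehen in die stadt"),
    "it": ngram_profile("questa è una bella giornata e andiamo in città"),
    "en": ngram_profile("this is a nice day and we are going to the city"),
}
print(identify("wir gehen heute in die stadt", refs))
```

On short, code-switched CMC messages such profile methods degrade quickly, which is precisely why comparison against human annotation is informative.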
Conference Paper
Full-text available
In this article we provide an overview of first-hand experiences and vantage points for best practices from projects in seven European countries dedicated to learner corpus research (LCR) and the creation of language learner corpora. The corpora and tools involved in LCR are becoming more and more important, as are careful preparation and easy retr...
Conference Paper
The goal of the STyrLogism Project is to semi-automatically extract neologism candidates (new lexemes) for the German standard variety used in South Tyrol, and generally to create the basis for long-term monitoring of its development. We use automatic lexico-semantic analytics for the lexicographic processing, but instead of continuing to develop o...
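The abstract above describes semi-automatic extraction of neologism candidates. A minimal exclusion-list sketch of the general idea (words frequent in a monitor corpus but absent from a reference lexicon; the word lists and threshold are illustrative assumptions, not the project's actual pipeline):

```python
import re
from collections import Counter

def neologism_candidates(corpus_texts, reference_lexicon, min_freq=2):
    """Return words frequent in the corpus but missing from the lexicon.

    A deliberately simple heuristic; real pipelines add lemmatisation,
    named-entity filtering, and manual lexicographic review.
    """
    counts = Counter(
        w.lower()
        for text in corpus_texts
        for w in re.findall(r"[^\W\d_]+", text, re.UNICODE)
    )
    return sorted(
        w for w, c in counts.items()
        if c >= min_freq and w not in reference_lexicon
    )

lexicon = {"die", "der", "ist", "neu", "stadt"}
texts = ["Die Stadt ist neu", "törggelen ist neu", "wir törggelen gern"]
print(neologism_candidates(texts, lexicon))
```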
Conference Paper
Full-text available
The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In conformity with the multilingual situation of the territory, the main languages of the corpus are Ger...
Poster
Full-text available
Poster presented at the 3rd Italian Conference on Computational Linguistics, 5-6 December 2016, Università degli Studi di Napoli Federico II
Conference Paper
Full-text available
This paper describes an extended version of the KoKo corpus (version KoKo4, Dec 2015), a corpus of written German L1 learner texts from three different German-speaking regions in three different countries. The KoKo corpus is richly annotated with learner language features on different linguistic levels such as errors or other linguistic ch...
Conference Paper
Michael Beißwenger, Thierry Chanier, Isabella Chiari, Tomaž Erjavec, Darja Fišer, Axel Herold, Nikola Ljubešić, Harald Lüngen, Céline Poudat, Egon Stemle, Angelika Storrer and Ciara Wigham, 2016, “Integrating corpora of computer-mediated communication into the language resources landscape: Initiatives and best practices from French, German, Ital...
Article
Full-text available
At the beginning of the corpus building process is the selection of appropriate software tools and data formats for the acquisition and annotation of original linguistic data. This initial phase is characterised by challenging decisions, for the software needs to be flexible (to facilitate intuitive and speedy transcription), powerful (to meet anno...
Chapter
The annual conference CLiC-it ("Italian Conference on Computational Linguistics") is an initiative of the "Italian Association of Computational Linguistics" (AILC – www.ai-lc.it) which is intended to meet the need for a national and international forum for the promotion and dissemination of high-level original research in the field of Computati...
Chapter
Full-text available
The annual conference CLiC-it ("Italian Conference on Computational Linguistics") is an initiative of the "Italian Association of Computational Linguistics" (AILC – www.ai-lc.it) which is intended to meet the need for a national and international forum for the promotion and dissemination of high-level original research in the field of Computati...
Chapter
EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for the Italian language: since 2007 shared tasks have been proposed covering the analysis of both written and spoken language with the aim of enhancing the development and dissemination of resources and technologies for Italian. EVALITA is an initiative of the Itali...
Article
This article focuses on automatic text classification which aims at identifying the first language (L1) background of learners of English. A particular question arising in the context of automated L1 identification is whether any features that are informative for a machine learning algorithm relate to L1-specific transfer...
Conference Paper
Full-text available
In this paper, we present on-going experiments for correcting OCR errors on German newspapers in Fraktur font. Our approach borrows from techniques for spelling correction in context using a probabilistic edit-operation error model and lexical resources. We highlight conditions in which high error reduction rates can be obtained and where the appro...
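The paper above corrects OCR errors with a probabilistic edit-operation error model and lexical resources. As a rough illustration of the underlying idea only (a minimal Norvig-style sketch with unweighted edit operations and a toy frequency lexicon; the paper's actual model is more elaborate and context-sensitive):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyzäöüß"

def edits1(word):
    """All strings one character edit away (delete, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    replaces = {l + c + r[1:] for l, r in splits if r for c in ALPHABET}
    inserts = {l + c + r for l, r in splits for c in ALPHABET}
    return deletes | replaces | inserts

def correct(word, lexicon_freq):
    """Keep known words; otherwise pick the most frequent in-lexicon
    candidate within one edit operation (uniform error model)."""
    if word in lexicon_freq:
        return word
    candidates = [w for w in edits1(word) if w in lexicon_freq]
    return max(candidates, key=lexicon_freq.get) if candidates else word

# Toy lexicon; "ſein" mimics a Fraktur long-s misrecognition of "sein".
freq = Counter({"sein": 50, "fein": 10, "mein": 30})
print(correct("ſein", freq))
```

A probabilistic edit-operation model would replace the uniform candidate ranking with learned per-operation costs (e.g. ſ→s being far more likely than ſ→m).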
Article
Full-text available
Decisions at the outset of preparing a learner corpus are of crucial importance for how the corpus can be built and how it can be analysed later on. This paper presents a generic workflow to build learner corpora while taking into account the needs of the users. The workflow results from an extensive collaboration between linguists that annotate an...
Conference Paper
Full-text available
In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After stating the particular challenges for building corpora of private, non-public computer-mediated communication, we will present our solution to these problems: a Facebook web application for the acquisit...
Presentation
Full-text available
Talk at the pre-conference workshop “NLP 4 CMC: Natural Language Processing for Computer-Mediated Communication / Social Media” at the 12th edition of KONVENS, Hildesheim, Germany
Article
Full-text available
In this paper, we present ongoing experiments for correcting OCR errors on German newspapers in Fraktur font. Our approach borrows from techniques for spelling correction in context using a probabilistic edit-operation error model and lexical resources. We highlight conditions in which high error reduction rates can be obtained and where...
Article
Full-text available
Special Issue: Building and annotating corpora of computer-mediated discourse. Issues and Challenges at the Interface of Corpus and Computational Linguistics
Conference Paper
Full-text available
We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the quality of the performed tran...
Conference Paper
Full-text available
In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISÀ corpus of Italian web texts, interHist aims at facilitating the exploration of large results sets to linguistic corpus searches. This objective is approached by providin...
Article
Talk in the Work in Progress Series at the Kompetenzzentrum Sprachen, Freie Universität Bozen
Presentation
Full-text available
Talk at the 7th workshop of the DFG scientific network Empirikom, "Social Media Corpora for the eHumanities: Standards, Challenges, and Perspectives", TU Dortmund University
Conference Paper
Full-text available
PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.
Data
Talk at the Workshop of the "Arbeitsgruppe: Korpusbasierte Linguistik" at the 40. Österreichische Linguistiktagung, Nov 22-24, 2013, Universität Salzburg, Salzburg, Austria. The talk presents the iterative workflow for creating a lemmatised, POS-tagged learner corpus annotated for selected linguistic features, and g...
Conference Paper
Full-text available
In this article, we present the multi-faceted interface to the open PAISÀ corpus of Italian. Created within the project PAISÀ (Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati) [1], the corpus is designed to be freely available for non-commercial processing, usage and distribution by the public. Hence, the automatically annotated c...
Data
Talk at the Workshop on "Verarbeitung und Annotation von Sprachdaten aus Genres internetbasierter Kommunikation" at the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013), Sep 23, 2013, TU Darmstadt. The automatic processing of CMC data confronts conventional methods in the field of...
Conference Paper
Full-text available
In this paper, we report on an unsupervised greedy-style process for acquiring phrase translations from sentence-aligned parallel corpora. Thanks to innovative selection strategies, this process can acquire multiple translations without size criteria, i.e. phrases can have several translations, can be of any size, and their size is not considered w...
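The paper above acquires multiple, size-unrestricted phrase translations from sentence-aligned parallel corpora with a greedy selection process. A minimal sketch of the general setting, using the Dice coefficient over sentence-level co-occurrence as a stand-in scoring function (an assumption; the paper's own selection strategies are more sophisticated):

```python
from collections import Counter

def phrases(tokens, max_len=3):
    """All contiguous sub-sequences (phrases) up to max_len tokens."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    }

def dice_pairs(bitext, min_score=0.5):
    """Score source/target phrase pairs by the Dice coefficient of their
    sentence-level co-occurrence and return pairs above a threshold,
    in descending score order (a greedy acceptance pass)."""
    src_c, tgt_c, pair_c = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        sp, tp = phrases(src.split()), phrases(tgt.split())
        src_c.update(sp)
        tgt_c.update(tp)
        pair_c.update((s, t) for s in sp for t in tp)
    scored = {
        (s, t): 2 * c / (src_c[s] + tgt_c[t])
        for (s, t), c in pair_c.items()
    }
    return [
        pair for pair, score in sorted(scored.items(), key=lambda kv: -kv[1])
        if score >= min_score
    ]

bitext = [("the house", "das haus"),
          ("the car", "das auto"),
          ("a house", "ein haus")]
print(dice_pairs(bitext)[:5])
```

Note that nothing restricts a source phrase to a single translation or to a target phrase of equal length, which matches the paper's stated goal.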
Book
Web corpora and other Web-derived data have become a gold mine for corpus linguistics and natural language processing. The Web is an easy source of unprecedented amounts of linguistic data from a broad range of registers and text types. However, a collection of Web pages is not immediately suitable for exploration in the same way a traditional corp...
Data
Talk at BootCaTters of the world unite! (BOTWU), A workshop (and a survey) on the BootCaT toolkit, Jun 24, 2013, Department of Interpreting and Translation, University of Bologna Forlì, Italy. "Copyright issues remain a gray area in compiling and distributing Web corpora"[1]; and even though "If a Web corpus is infringing copyright, then it is mer...
Article
Talk at the DGfS 2013 Workshop on Modellierung nicht-standardisierter Schriftlichkeit, the 35th Annual Conference of the German Linguistic Society (DGfS 2013)
Article
Talk at the international workshop "Building Corpora of Computer-Mediated Communication: Issues, Challenges, and Perspectives", Department of German Language and Literature, Faculty of Culture Studies, TU Dortmund University
Conference Paper
Full-text available
We report on on-going work to derive translations of phrases from parallel corpora. We describe an unsupervised and knowledge-free greedy-style process relying on innovative strategies for choosing and discarding candidate translations. This process manages to acquire multiple translations combining phrases of equal or different sizes. The prelimin...
Data
Plenary Talk at Computer Applications in Linguistics: Student Research Workshop (CSRW2012), Jul 13, 2012, English Corpus Linguistics Group at the Institute of Linguistics and Literary Studies, Technische Universität Darmstadt.
Data
Short presentation at the 3rd workshop of the academic network on "Internet Lexicography" - May 4, 2012.
Conference Paper
Full-text available
Developing content extraction methods for Humanities domains raises a number of challenges, from the abundance of non-standard entity types to their complexity to the scarcity of data. Close collaboration with Humanities scholars is essential to address these challenges. We discuss an annotation schema for Archaeological texts developed in collabor...
Article
Full-text available
Live Demo with Poster at the LiveMemories Final Event - Internet, Memoria e Futuro and The Semantic Way
Article
Full-text available
The entities mentioned in collections of scholarly articles in the Humanities (and in other scholarly domains) belong to different types from those familiar from news corpora, hence new resources need to be annotated to create supervised taggers for tasks such as NE extraction. However, in such domains there is a great need for making the best use...
Conference Paper
Most existing HLT pipelines assume the input is pure text or, at most, HTML and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital library and other types of applications, and show that it is relatively straightforward to extend existing pipelines to achieve ones in...
Conference Paper
Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text and distinguishing it from related languages in the same region can be difficult. For example, less dominant variants...
Conference Paper
Full-text available
The Live Memories corpus is an Italian corpus annotated for anaphoric relations. The corpus includes manually annotated information about morphosyntactic agreement, anaphoricity, and semantic class of the NPs. For the annotation of the anaphoric links the corpus takes into account specific phenomena of the Italian language like incorporated clitics a...
Data
The KrdWrd System comprises an extensive system for automated Web cleaning tasks. For training the KrdWrd ML Engine, a substantial amount of hand-annotated data, viz. Web pages, is needed. In the following, we present the parts of the system that cover the acquisition of training data, i.e. the steps before training data can be fed into a ML Engine. Henc...
Conference Paper
Algorithmic processing of Web content mostly works on textual contents, neglecting visual information. Annotation tools largely share this deficit as well. We specify requirements for an architecture to overcome both problems and propose an implementation, the KrdWrd system. It uses the Gecko rendering engine for both annotation and feature extract...
Thesis
This thesis discusses the KrdWrd Project. The Project goals are to provide tools and infrastructure for acquisition, visual annotation, merging and storage of Web pages as parts of bigger corpora, and to develop a classification engine that learns to automatically annotate pages, operate on the visual rendering of pages, and provide visual tools fo...
Conference Paper
Several bootstrapping-based relation extraction algorithms working on large corpora or on the Web have been presented in the literature. A crucial issue for such algorithms is to avoid the introduction of too much noise into further iterations. Typically, this is achieved by applying appropriate pattern and tuple evaluation measures, henceforth c...
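The entry above concerns bootstrapping-based relation extraction and the evaluation measures that keep noise out of later iterations. A minimal sketch of one bootstrapping round (seed tuples → literal context patterns → new tuples; the corpus, seeds, and single-word argument matching are illustrative assumptions, and the crucial confidence scoring of patterns and tuples is only noted in comments):

```python
import re

def find_patterns(corpus, seeds):
    """Context strings that connect a known seed pair in the corpus."""
    patterns = set()
    for sentence in corpus:
        for x, y in seeds:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns

def apply_patterns(corpus, patterns):
    """Candidate tuples matched by any pattern (single-word arguments).
    Real systems would score each pattern and tuple here and keep only
    confident ones before seeding the next iteration."""
    tuples = set()
    for sentence in corpus:
        for p in patterns:
            for m in re.finditer(r"(\w+)" + re.escape(p) + r"(\w+)", sentence):
                tuples.add((m.group(1), m.group(2)))
    return tuples

corpus = [
    "Rome, capital of Italy",
    "Paris, capital of France",
    "Berlin, capital of Germany",
]
seeds = {("Rome", "Italy")}
patterns = find_patterns(corpus, seeds)
print(apply_patterns(corpus, patterns))
```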
