
Els Lefever
Ghent University | UGhent · Department of Translation, Interpreting and Communication
Dr. computer science
About
Publications: 100
Reads: 23,734
Citations: 1,591
Introduction
I started my career as a computational linguist in the R&D department of Lernout & Hauspie. I hold a PhD in computer science on Parallel Corpora for WSD (2012). I have strong expertise in machine learning of natural language and multilingual NLP, with a special interest in computational semantics, cross-lingual word sense disambiguation, multilingual terminology extraction and sentiment analysis. I teach Localization, Language Technology, Computer-assisted translation and Python.
Additional affiliations
January 2007 - February 2012
Publications (100)
This paper reports on a set of proof-of-concept experiments performed to evaluate and improve the alignment of monolingual embeddings for a specialised domain, viz. the medical use case of heart failure. The presented approach, which creates domain-specific dictionaries on-the-fly from cross-lingual Wikipedia links, achieves good results for cross-...
In this paper, we explore the feasibility of irony detection in Dutch social media. To this end, we investigate both transformer models with embedding representations, as well as traditional machine learning classifiers with extensive feature sets. Our feature-based methodology implements a variety of information sources including lexical, semantic...
This contribution presents version 1.5 of the Annotated Corpora for Term Extraction Research (ACTER) dataset. It includes domain-specific corpora in three languages (English, French, and Dutch) and four domains (corruption, dressage (equitation), heart failure, and wind energy). Manual annotations are available of terms and Named Entities for each...
This contribution presents D-Terminer: an open access, online demo for monolingual and multilingual automatic term extraction from parallel corpora. The monolingual term extraction is based on a recurrent neural network, with a supervised methodology that relies on pretrained embeddings. Candidate terms can be tagged in their original context and t...
Since the rise of social media, the authority of traditional professional literary critics has been supplemented – or undermined, depending on the point of view – by technological developments and the emergence of community-driven online layperson literary criticism. So far, relatively little research (Allington 2016, Kellermann et al. 2016, Kellerman...
We describe the creation of CLARIN Belgium (CLARIN-BE) and, associated with that, the plans of the CLARIN-VL consortium within the CLARIAH-VL infrastructure for which funding was secured for the period 2021-2025.
The detection of online cyberbullying has seen an increase in societal importance, popularity in research, and available open data. Nevertheless, while computational power and affordability of resources continue to increase, the access restrictions on high-quality data limit the applicability of state-of-the-art techniques. Consequently, much of th...
Automatic term extraction (ATE) is an important task within natural language processing, both separately, and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood a...
As with many tasks in natural language processing, automatic term extraction (ATE) is increasingly approached as a machine learning problem. So far, most machine learning approaches to ATE broadly follow the traditional hybrid methodology, by first extracting a list of unique candidate terms, and classifying these candidates based on the predicted...
This paper presents two different systems for the SemEval shared task 7 on Assessing Humor in Edited News Headlines, sub-task 1, where the aim was to estimate the intensity of humor generated in edited headlines. Our first system is a feature-based machine learning system that combines different types of information (e.g. word embeddings, string si...
This paper describes our contribution to the SemEval-2020 Task 9 on Sentiment Analysis for Code-mixed Social Media Text. We investigated two approaches to solve the task of Hinglish sentiment analysis. The first approach uses cross-lingual embeddings resulting from projecting Hinglish and pre-trained English FastText word embeddings in the same spa...
Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in lo...
The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with the same dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on whic...
Despite the rich history of research into medical translation, there is a notable lack of empirical studies on the best workflow for this task, especially in a modern translation setting involving post-editing of machine translation. This pilot study was conducted in preparation for a large translation project of medical guidelines for laypeople fr...
Talking about odors and flavors is difficult for most people, yet experts appear to be able to convey critical information about wines in their reviews. This seems to be a contradiction, and wine expert descriptions are frequently received with criticism. Here, we propose a method for probing the language of wine reviews, and thus offer a means to...
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technol...
Cross-disciplinary communication is often impeded by terminological ambiguity. Hence, cross-disciplinary teams would greatly benefit from using a language technology-based tool that allows for the (at least semi-) automated resolution of ambiguous terms. Although no such tool is readily available, an interesting theoretical outline of one does exis...
Tools that automatically extract terms and their equivalents in other languages from parallel corpora can contribute to multilingual professional communication in more than one way. By means of a use case with data from a medical web site with point of care evidence summaries (Ebpracticenet), we illustrate how hybrid multilingual automatic term ext...
Translation is an age-old multilingual activity whose growing relevance is reflected in the multidisciplinary character of today's translation studies. This contribution first sketches the linguistic product-oriented approach, focusing on texts in different languages (translations, their source texts and comparable texts) and i...
Keywords: terminology; automatic term extraction; ATR;
While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages and the information overlo...
Although common sense and connotative knowledge come naturally to most people, computers still struggle to perform well on tasks for which such extratextual information is required. Automatic approaches to sentiment analysis and irony detection have revealed that the lack of such world knowledge undermines classification performance. In this articl...
To push the state of the art in text mining applications, research in natural language processing has increasingly been investigating automatic irony detection, but manually annotated irony corpora are scarce. We present the construction of a manually annotated irony corpus based on a fine-grained annotation scheme for irony that makes it possible to identify...
Terms are notoriously difficult to identify, both automatically and manually. This complicates the evaluation of the already challenging task of automatic term extraction. With the advent of multilingual automatic term extraction from comparable corpora, accurate evaluation becomes increasingly difficult, since term linking must be evaluated as wel...
It is widely held that smells and flavors are impossible to put into words. In this paper we test this claim by seeking predictive patterns in wine reviews, which ostensibly aim to provide guides to perceptual content. Wine reviews have previously been critiqued as random and meaningless. We collected an English corpus of wine reviews with their st...
We present the highlights and key results of the now-finished four-year SCATE (Smart Computer Aided Translation Environment) project, which was completed in February 2018. The project investigated algorithms, user interfaces and methods that can contribute to the development of more efficient tools for translation work.
In the past decade, sentiment analysis research has thrived, especially on social media. While this data genre is suitable to extract opinions and sentiment, it is known to be noisy. Complex normalisation methods have been developed to transform noisy text into its standard form, but their effect on tasks like sentiment analysis remains underinvest...
This paper presents a corpus-driven, statistical method for the visualization of semantic structure, thereby tackling the under-researched issue of semantics in corpus-based Translation Studies. We aim to investigate the influence of translation on the structure of semantic fields and in particular the extent to which the structure of the semantic...
We present the SCATE prototype: A Smart Computer-Aided Translation Environment, developed in the SCATE research project. Its user interface displays translation suggestions coming from different resources, in an intelligible and interactive way. It contains carefully designed representations that show relevant context to clarify why certain suggest...
The Third International Conference on Human and Social Analytics (HUSO 2017), held on July 23-27, 2017 in Nice, France, continued the inaugural event bridging the concepts and the communities dealing with emotion-driven systems, sentiment analysis, personalized analytics, social human analytics, and social computing.
CLIN27 conference poster with intermediate results on cyberbullying detection in the AMiCA project.
This paper presents an integrated ABSA pipeline for Dutch that has been developed and tested on qualitative user feedback coming from three domains: retail, banking and human resources. The two latter domains provide service-oriented data, which has not been investigated before in ABSA. By performing in-domain and cross-domain experiments the valid...
Social media provide an increasingly used platform for crisis communication. Governments need to understand how publics consume and react to crisis information via social media. One way to do this is to apply emotion analysis. In this pilot study, we target the November 2015 terrorist attacks in Paris as a case study for emotion analysis and...
Creating domain ontologies is usually performed by teams of knowledge engineers and domain experts, and is considered to be a time-consuming and difficult task. As a result, scientists have started to develop automatic approaches to ontology learning and population. For the proposed research, we focus on the central subtask of ontology learning, be...
Handling figurative language like irony is currently a challenging task in natural language processing. Since irony is commonly used in user-generated content, its presence can significantly undermine accurate analysis of opinions and sentiment in such texts. Understanding irony is therefore important if we want to push the state-of-the-art in task...
This research presents experiments carried out to improve the precision and recall of Dutch hypernym detection. To do so, we applied a data-driven semantic relation finder that starts from a list of automatically extracted domain-specific terms from technical corpora, and generates a list of hypernym relations between these terms. As Dutch technica...
The recent development of social media poses new challenges to the research community in analyzing online interactions between people. Social networking sites offer great opportunities for connecting with others, but also increase the vulnerability of young people to undesirable phenomena, such as cybervictimization. Recent research reports that on...
In the current era of online interactions, both positive and negative experiences are abundant on the Web. As in real life, negative experiences can have a serious impact on youngsters. Recent studies have reported cybervictimization rates among teenagers that vary between 20% and 40%. In this paper, we focus on cyberbullying as a particular form o...
This paper aims to visualize the semantic field of inchoativity in Dutch, for both translated and non-translated language. Two methodological solutions, a context-based and a translation-based approach, will be assessed and consequently compared to each other. Such a comparison can possibly generate interesting insights into the accuracy of the res...
HypoTerm is a data-driven semantic relation finder that starts from a list of automatically extracted domain- and user-specific terms from technical corpora, and generates a list of relations between these terms. This research study focused on the detection of hypernym relations between relevant terms and named entities. In order to detect all rele...
We present a multilingual approach to Word Sense Disambiguation (WSD), which automatically assigns the contextually appropriate sense to a given word. Instead of using a predefined monolingual sense-inventory, we use a language-independent framework by deriving the senses of a given word from word alignments on a multilingual parallel corpus, which...
We present a new cross-lingual task for SemEval concerning the translation of L1 fragments in an L2 context. The task is at the boundary of Cross-Lingual Word Sense Disambiguation and Machine Translation. It finds its application in the field of computer-assisted translation, particularly in the context of second language learning. Translating L1 f...
This article gives an overview of the activities and projects in the field of terminology within the VTC department and its predecessors. It covers terminographic projects as well as language technology applications and term extraction.
This paper describes our contribution to the SemEval-2014 Task 9 on sentiment analysis in Twitter. We participated in both strands of the task, viz. classification at message-level (subtask B), and polarity disambiguation of particular text spans within a message (subtask A). Our experiments with a variety of lexical and syntactic features show tha...
This paper presents the LeTs Preprocess Toolkit, a suite of robust high-performance preprocessing modules including Part-of-Speech Taggers, Lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe...
We report on TExSIS, a flexible bilingual terminology extraction system that uses a sophisticated chunk-based alignment method for the generation of candidate terms, after which the specificity of the candidate terms is determined by combining several statistical filters. Although the set-up of the architecture is largely language-independent, we p...
This paper presents a multilingual classification-based approach to Word Sense Disambiguation that directly incorporates translational evidence from four other languages. The need for a large predefined monolingual sense inventory (such as WordNet) is avoided by taking a language-independent approach where the word senses are derived automatically f...
This paper proposes a two-step approach to find hypernym relations between pairs of noun phrases in Dutch text. We first apply a pattern-based approach that combines lexical and shallow syntactic information to extract a list of candidate hypernym pairs from the input text. In a second step, distributional similarity information is used to filter t...
This paper describes a phrase-based machine translation approach to normalize Dutch user-generated content (UGC). We compiled a corpus of three different social media genres (text messages, message board posts and tweets) to have a sample of this recent domain. We describe the various characteristics of this noisy text material and explain how it h...