Ilia Markov

Vrije Universiteit Amsterdam | VU · Language, Literature and Communication

PhD
Assistant Professor, Vrije Universiteit Amsterdam

About

59 Publications
18,339 Reads
573 Citations
Citations since 2016: 49 research items, 561 citations
[Chart: citations per year, 2016–2022; vertical axis 0–120]
Introduction
I am an Assistant Professor at the Vrije Universiteit Amsterdam. My research interests are natural language processing, text mining, and computational linguistics, specifically hate speech detection and authorship analysis tasks such as authorship attribution and author profiling.
Additional affiliations
June 2019 - December 2021
University of Antwerp
Position
  • PostDoc Position
June 2018 - May 2019
INRIA
Position
  • PostDoc Position
March 2017 - July 2017
Fondazione Bruno Kessler
Position
  • Research intern
Education
August 2014 - April 2018
Instituto Politécnico Nacional
Field of study
  • Computer Sciences
September 2012 - July 2014
Universidade do Algarve
Field of study
  • Language Sciences
September 2001 - June 2006
Kaliningrad State Technical University
Field of study
  • Computer Engineering

Publications

Publications (59)
Conference Paper
Full-text available
To determine author demographics of texts in social media such as Twitter, blogs, and reviews, we use doc2vec document embeddings to train a logistic regression classifier. We experimented with age and gender identification on the PAN author profiling 2014–2016 corpora under both single- and cross-genre conditions. We show that under certain settin...
Conference Paper
Full-text available
We present the CIC-FBK system, which took part in the Native Language Identification (NLI) Shared Task 2017. Our approach combines features commonly used in previous NLI research, i.e., word n-grams, lemma n-grams, part-of-speech n-grams, and function words, with recently introduced character n-grams from misspelled words, and features that are nov...
Conference Paper
Full-text available
The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in various independent Authorship Attribution (AA) studies. Moreover, it has been shown that some categories of character n-grams perform better than others both under single and cross-topic AA conditions. In this work, we presen...
Conference Paper
Full-text available
We explore the hypothesis that emotion is one of the dimensions of language that surfaces from the native language into a second language. To check the role of emotions in native language identification (NLI), we model emotion information through polarity and emotion load features, and use document representations using these features to classify t...
Conference Paper
Full-text available
In this paper, we describe experiments designed to explore and evaluate the impact of punctuation marks on the task of native language identification. Punctuation is specific to each language, and is part of the indicators that overtly represent the manner in which each language organizes and conveys information. Our experiments are organized in va...
Conference Paper
Full-text available
Online hate speech detection is an inherently challenging task that has recently received much attention from the natural language processing community. Despite a substantial increase in performance, considerable challenges remain and include encoding contextual information into automated hate speech detection systems. In this paper, we focus on de...
Chapter
Full-text available
Over the past years, the amount of online hate speech has been growing steadily. Among multiple approaches to automatically detect hateful content online, ensemble learning is considered one of the best strategies, as shown by several studies on English and other languages. In this paper, we evaluate state-of-the-art approaches for Dutch hate speec...
Article
Full-text available
We present an online tool, "Vaccinpraat", that monitors messages expressing skepticism towards COVID-19 vaccination on Dutch-language Twitter and Facebook. The tool provides live updates, statistics and qualitative insights into opinions about vaccines and arguments used to justify anti-vaccination opinions. An annotation task was set up to create tr...
Article
Full-text available
In this paper, we examine the importance of word category information for the age detection task (the task of identifying the age of a person based on their writing), both under in-domain and cross-domain conditions. We remove entire word classes and study the effect using both Support Vector Machines (SVM) and pre-trained contextual word embeddings (...
Conference Paper
Full-text available
Idiosyncrasies in human writing styles make it difficult to develop systems for authorship identification that scale well across individuals. In this year's edition of PAN, the authorship identification track focused on open-set authorship verification, so that systems are applied to unknown documents by previously unseen authors in a new domain. A...
Chapter
The paper gives a brief overview of the three shared tasks organized at the PAN 2021 lab on digital text forensics and stylometry hosted at the CLEF conference. The tasks include authorship verification across domains, author profiling for hate speech spreaders, and style change detection for multi-author documents. In part the tasks are new and in...
Conference Paper
Full-text available
We study the usefulness of hateful metaphors as features for the identification of the type and target of hate speech in Dutch Facebook comments. For this purpose, all hateful metaphors in the Dutch LiLaH corpus were annotated and interpreted in line with Conceptual Metaphor Theory and Critical Metaphor Analysis. We provide SVM and BERT/RoBERTa res...
Conference Paper
Full-text available
Hate speech detection is an actively growing field of research, with a variety of recently proposed approaches that have pushed state-of-the-art results. One of the challenges of such automated approaches – namely recent deep learning models – is the risk of false positives (i.e., false accusations), which may lead to over-blocking or removal...
Conference Paper
Full-text available
In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection: the task of classifying textual content into hate or non-hate speech classes. Our experiments are conducted for three languages (English, Slovene, and Dutch), both in in-domain and cross-domain setups, and aim to i...
Chapter
The paper gives a brief overview of the three shared tasks to be organized at the PAN 2021 lab on digital text forensics and stylometry hosted at the CLEF conference. The tasks include authorship verification across domains, author profiling for hate speech spreaders, and style change detection for multi-author documents. In part the tasks are new...
Conference Paper
Full-text available
In this paper, we present emotion lexicons of Croatian, Dutch and Slovene, based on manually corrected automatic translations of the English NRC Emotion lexicon. We evaluate the impact of the translation changes by measuring the change in supervised classification results of socially unacceptable utterances when lexicon information is used for feat...
Conference Paper
Full-text available
Native language identification (NLI) – identifying the native language (L1) of a person based on his/her writing in the second language (L2) – is useful for a variety of purposes, including marketing, security, and educational applications. From a traditional machine learning perspective, NLI is usually framed as a multi-class classification task,...
Article
Full-text available
Native language identification (NLI) — the task of automatically identifying the native language (L1) of persons based on their writings in the second language (L2) — is based on the hypothesis that characteristics of L1 will surface and interfere in the production of texts in L2 to the extent that L1 is identifiable. We present an in-depth investi...
Conference Paper
Full-text available
Authorship identification remains a highly topical research problem in computational text analysis with many relevant applications in contemporary society and industry. For this edition of PAN, we focused on authorship verification, where the task is to assess whether a pair of documents has been authored by the same individual. Like in previous e...
Chapter
We briefly report on the four shared tasks organized as part of the PAN 2020 evaluation lab on digital text forensics and authorship analysis. Each task is introduced and motivated, and the results obtained are presented. Altogether, the four tasks attracted 230 registrations, yielding 83 successful submissions. This, and the fact that we continue to...
Conference Paper
Full-text available
We present an ensemble approach for the detection of sarcasm in Reddit and Twitter responses in the context of The Second Workshop on Figurative Language Processing held in conjunction with ACL 2020. The ensemble is trained on the predicted sarcasm probabilities of four component models and on additional features, such as the sentiment of the comme...
Data
The lexicon is available here: https://clarin.si/repository/xmlui/handle/11356/1318
Conference Paper
Full-text available
In this paper, we present experiments that estimate the impact of specific lexical choices of people writing in a second language (L2). In particular, we look at misspelled words that indicate lexical uncertainty on the part of the author, and separate them into three categories: misspelled cognates, "L2-ed" (in our case, anglicized) words, and al...
Conference Paper
Full-text available
We present the INRIA approach to the suggestion mining task at SemEval 2019. The task consists of two subtasks: suggestion mining under single-domain (Subtask A) and cross-domain (Subtask B) settings. We used the Support Vector Machines algorithm trained on handcrafted features, function words, sentiment features, digits, and verbs for Subtask A, a...
Article
In spite of having been investigated for over fifty years, developing a robust spoken dialog management system remains an open research issue in robotics and natural language processing. In this paper, we present a language-independent spoken dialog management module integrated into a human-robot interaction system. We adopt an algorithmic approach...
Article
We present a method for gender and language variety identification using a convolutional neural network (CNN). We compare the performance of this method with a traditional machine learning algorithm – support vector machines (SVM) trained on character n-grams (n = 3–8) and lexical features (unigrams and bigrams of words), and their combinations. We...
Conference Paper
Full-text available
In this paper, we describe the CIC-IPN submissions to the shared task on Indian Native Language Identification (INLI 2018). We use the Support Vector Machines algorithm trained on numerous feature types: word, character, part-of-speech tag, and punctuation mark n-grams, as well as character n-grams from misspelled words and emotion-based features....
Thesis
Full-text available
Native language identification (NLI) is the task of identifying the native language (L1) of a person based on his or her texts written in a second language (L2). NLI is useful for a variety of purposes, including marketing, security, and educational applications. Identifying the native language relies on the assumption that the native language infl...
Conference Paper
Full-text available
We present the CIC-GIL approach to the author profiling (AP) task at MEX-A3T 2018. The task consists of two subtasks: identification of authors' location (6-way) and occupation (8-way) in a corpus of Mexican Spanish tweets. We used the logistic regression algorithm trained on typed character n-grams, function-word n-grams, and regionalisms for loca...
Chapter
The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in various independent Authorship Attribution (AA) studies. Moreover, it has been shown that some categories of character n-grams perform better than others both under single and cross-topic AA conditions. In this work, we presen...
Conference Paper
Full-text available
We present the CIC systems submitted to the 2017 PAN shared task on Cross-Genre Gender Identification in Russian texts (RUSProfiling). We submitted five systems. One of them was based on a statistical approach using only lexical features, and the other four on machine-learning techniques using some combinations of gender-specific Russian grammatical fe...
Article
Full-text available
For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and variou...
Conference Paper
Full-text available
We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news...
Conference Paper
Full-text available
We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature represen...
Conference Paper
Full-text available
This paper presents the CIC UALG's system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all la...
Article
Full-text available
We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of the latent semantic analysis in this task. We developed a corpus in order to eval...
Chapter
Full-text available
In this article, we systematically examine the (closed?) set of body-part nouns (heart, hand, head, etc.) and morphologically associated words, in particular adjectives, with reference to European Portuguese. We show that the general Harrisian framework can be used to map most of the syntactic structures...
Conference Paper
Full-text available
This paper presents our approach to the Author Profiling (AP) task at PAN 2016. The task aims at identifying the author's age and gender under cross-genre AP conditions in three languages: English, Spanish, and Dutch. Our pre-processing stage includes reducing non-textual features to their corresponding semantic classes. We exploit typed character...
Article
Full-text available
This article presents a method for computing the similarity between programs (source code). The task is useful, for example, for the thematic classification of programs or the detection of code reuse (say, in the case of plagiarism). For the experiments, we use the Karel programming language. To determine the similarity between programs and/or ideas...
Article
Full-text available
In this work, we present a lexical resource for preprocessing texts published on social media, developed for English, Spanish, Dutch, and Italian. The resource consists of dictionaries of slang words, abbreviations, contractions, and emoticons commonly used on social media. The dictionaries were used...
Article
Full-text available
We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. Th...
Article
Full-text available
In this article, we improve the extraction of semantic relations between textual elements as it is currently performed by STRING, a hybrid statistical and rule-based Natural Language Processing (NLP) chain for Portuguese, by targeting whole-part relation (meronymy), that is, a semantic relation between two entities of which one is perceived as a co...
Conference Paper
Full-text available
The paper describes our approach for the Authorship Identification task at the PAN CLEF 2015. We extract textual patterns based on features obtained from shortest path walks over Integrated Syntactic Graphs (ISG). Then we calculate a similarity between the unknown document and the known document with these patterns. The approach uses a predefined t...
Conference Paper
Full-text available
This paper describes our approach to the Author Profiling task at PAN 2015. Our method relies on syntactic features, such as syntactic-based n-grams of various types, to predict the age, gender, and personality traits of the author of a given text. In this paper, we describe the features used, the employed classification algorit...
Conference Paper
Full-text available
In this paper, we propose the application of the Tree Edit Distance (TED) for calculation of similarity between syntactic n-grams for further detection of soft similarity between texts. The computation of text similarity is the basic task for many natural language processing problems, and it is an open research field. Syntactic n-grams are text fea...
Conference Paper
Full-text available
In this paper, we focus on the most frequent errors that occurred during the implementation of a rule-based module for semantic relations extraction, which has been integrated in STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese. We focus on whole-part relations (meronymy), that is, a semantic relation bet...
Conference Paper
Full-text available
This paper describes the integration of verbal idioms into a Natural Language Processing (NLP) system, adopting a construction approach, which is based on the prior parsing stage, so that these Multi-Word Expressions (MWE) can be taken into account in subsequent tasks, such as semantic role labeling or whole-part relation extraction. Th...
Conference Paper
Full-text available
In this paper, we target the extraction of whole-part relations involving human entities and body-part noun occurrences in texts using STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese. Whole-part relation is a semantic relation between an entity that is perceived as a constituent part of an...
Thesis
Full-text available
In this work, we improve the extraction of semantic relations between textual elements as it is currently performed by STRING, a hybrid statistical and rule-based Natural Language Processing (NLP) chain for Portuguese, by targeting whole-part relations (meronymy), that is, a semantic relation between an entity that is perceived as a constituent par...
Conference Paper
Full-text available
In this paper, we improve the extraction of semantic relations between textual elements as it is currently performed by STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese, by targeting whole-part relations (meronymy), that is, a semantic relation between an entity that is perceived as a constituent part of...
Conference Paper
Full-text available
Dealing with idioms in Natural Language Processing systems is difficult, among other reasons, because their architecture must be conceived in such a way that it does not preclude the processing of both free word combinations and these more constrained expressions. On the other hand, many idioms do have syntactic structure, and can undergo severa...

Projects

Project (1)
Project
LiLaH (The Linguistic Landscape of Hate Speech in Social Media) is an FWO (Flemish NSF) and SSF (Slovenian Science Foundation) funded project focusing on building systems that can automatically recognize and analyse hate speech in social media texts. We are interested in the linguistic properties of the language that is being used to express hate in social media, specifically hate against migrants and LGBT people, and in automatically detecting it. The languages addressed are English, Dutch, Slovene, Croatian and French.