Fabrizio Sebastiani

Italian National Research Council (CNR) · Institute of Information Science and Technology "Alessandro Faedo" (ISTI)

About

290 Publications
91,341 Reads
21,504 Citations
Additional affiliations
March 2006 - present: Italian National Research Council, Senior Researcher

Publications (290)
Article
Fine-tuning transformer-based deep learning models is currently at the forefront of natural language processing (NLP) and information retrieval (IR) tasks. However, fine-tuning these transformers for specific tasks, especially when dealing with ever-expanding volumes of data, constant retraining requirements, and budget constraints, can be computat...
Article
Full-text available
Quantification, i.e., the task of predicting the class prevalence values in bags of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multi-class problems in which the classes are not ordered. Here, we study the ordinal case, i.e., t...
Article
While a substantial amount of work has recently been devoted to improving the accuracy of computational Authorship Identification (AId) systems for textual data, little to no attention has been paid to endowing AId systems with the ability to explain the reasons behind their predictions. This substantially hinders the practical application of AId m...
Article
The 3rd International Workshop on Learning to Quantify (LQ 2023) took place on September 18, 2023 in Torino, Italy, where it was organised as a satellite event of the 34th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2023). Like the main program of the conference, the workshop empl...
Article
Full-text available
Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data, and is of special interest when the labelled data on which the predictor has been trained and the unlabelled data are not IID, i.e., suffer from dataset shift. To date, quantification methods have mostly bee...
Article
In this paper we investigate the effects on authorship identification tasks (including authorship verification, closed-set authorship attribution, and closed-set and open-set same-author verification) of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In “classic” au...
Article
Quantification, variously called supervised prevalence estimation or learning to quantify, is the supervised learning task of generating predictors of the relative frequencies (a.k.a. prevalence values) of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and,...
Article
Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, suc...
Article
Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i....
Chapter
Full-text available
In this chapter we look at a number of “advanced” (or niche) topics in quantification, including quantification for ordinal data, “regression quantification” (the task that stands to regression as “standard” quantification stands to classification), cross-lingual quantification for textual data, quantification for networked data, and quantification...
Chapter
Full-text available
This chapter sets the stage for the rest of the book by introducing notions fundamental to quantification, such as class proportions, class distributions and their estimation, dataset shift, and the various subtypes of dataset shift which are relevant to the quantification endeavour. In this chapter we also argue why using classification techniques...
Chapter
Full-text available
This chapter is possibly the central chapter of the book, and looks at the various supervised learning methods for learning to quantify that have been proposed over the years. These methods belong to two main categories, depending on whether they have an aggregative nature (i.e., they require the classification of all individual unlabelled items as...
Chapter
Full-text available
In this chapter we discuss the experimental evaluation of quantification systems. We look at evaluation measures for the various types of quantification systems (binary, single-label multiclass, multi-label multiclass, ordinal), but also at evaluation protocols for quantification, that essentially consist in ways to extract multiple testing samples...
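As a concrete illustration of evaluation measures of this kind, below is a minimal sketch of two measures commonly used for quantification, absolute error and smoothed Kullback-Leibler divergence, computed between a true and a predicted class prevalence vector. The code is not taken from the chapter; the function names and the smoothing constant are illustrative choices.

```python
# Illustrative sketch of two quantification evaluation measures:
# absolute error (AE) and smoothed Kullback-Leibler divergence (KLD)
# between true and predicted class prevalence vectors.
import numpy as np

def absolute_error(true_prev, pred_prev):
    # Mean absolute difference across classes.
    return float(np.abs(true_prev - pred_prev).mean())

def kld(true_prev, pred_prev, eps=1e-8):
    # Additive smoothing avoids division by zero and log(0)
    # when a class has prevalence exactly 0.
    t = (true_prev + eps) / (true_prev + eps).sum()
    p = (pred_prev + eps) / (pred_prev + eps).sum()
    return float(np.sum(t * np.log(t / p)))
```

For instance, for true prevalences (0.8, 0.2) and predicted prevalences (0.7, 0.3), absolute error is 0.1, while the KLD of a prevalence vector with itself is (up to smoothing) 0.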
Chapter
Quantification, i.e., the task of training predictors of the class prevalence values in sets of unlabelled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. We here study the ordinal ca...
Chapter
Full-text available
This chapter provides the motivation for what is to come in the rest of the book by describing the applications that quantification has been put to, ranging from improving classification accuracy in domain adaptation, to measuring and improving the fairness of classification systems with respect to a sensitive attribute, to supporting research and...
Chapter
Full-text available
This chapter looks at other aspects of the “quantification landscape” that have not been covered in the previous chapters, and discusses the evolution of quantification research, from its beginnings to the most recent quantification-based “shared tasks”; the landscape of quantification-based, publicly available software libraries; visualization too...
Chapter
Full-text available
This chapter concludes the book, discussing possible future developments in the quantification arena.
Article
MINECORE is a recently proposed decision-theoretic algorithm for technology-assisted review that attempts to minimise the expected costs of review for responsiveness and privilege in e-discovery. In MINECORE, two probabilistic classifiers that classify documents by responsiveness and by privilege, respectively, generate posterior probabilities. The...
Article
This is a report on the thirteenth edition of the Conference and Labs of the Evaluation Forum (CLEF 2022), held on September 5--8, 2022, in Bologna, Italy. CLEF was a four-day hybrid event combining a conference and an evaluation forum. The conference featured keynotes by Benno Stein and Rita Cucchiara, and presentation of peer-reviewed research pa...
Preprint
Full-text available
We investigate the effects on authorship identification tasks of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In ``classic'' authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative freq...
Preprint
Full-text available
Quantification, variously called "supervised prevalence estimation" or "learning to quantify", is the supervised learning task of generating predictors of the relative frequencies (a.k.a. "prevalence values") of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems a...
Article
Full-text available
Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called “prevalence”) of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment c...
Preprint
Full-text available
Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i....
Chapter
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest Y = {y₁, ..., yₙ} in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents...
Article
Funnelling (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with on...
Article
The 1st International Workshop on Learning to Quantify (LQ 2021 - https://cikmlq2021.github.io/), organized as a satellite event of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), took place on two separate days, November 1 and 5, 2021. Like the main CIKM 2021 conference, the workshop was held entirely online, due to the CO...
Article
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of an epistolary nature, while MedLatinLit texts consist of literary comme...
Article
It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on so‐called syllabic quantity, that is, on the length of the involved syllables, and there is substantial evidence suggesting that certain authors h...
Chapter
LeQua 2022 is a new lab for the evaluation of methods for “learning to quantify” in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then countin...
Technical Report
Full-text available
The Artificial Intelligence for Media and Humanities laboratory (AIMH) has the mission to investigate and advance the state of the art in the Artificial Intelligence field, specifically addressing applications to digital media and digital humanities, while also taking into account issues related to scalability. This report summarizes the 2021 activiti...
Preprint
Full-text available
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then count...
Preprint
Full-text available
It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on so-called syllabic quantity, i.e., on the length of the involved syllables, and there is substantial evidence suggesting that certain authors had...
Preprint
Full-text available
Models trained by means of supervised learning are increasingly deployed in high-stakes domains, and, when their predictions inform decisions about people, they inevitably impact (positively or negatively) on their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people...
Preprint
Full-text available
Funnelling (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (wi...
Article
We study an approach to tweet classification based on distant supervision, whereby we automatically transfer labels from one social medium to another. In particular, we apply classes assigned to YouTube videos to tweets linking to these videos. This provides for free a virtually unlimited number of labelled instances that can be used as training da...
Article
Obtaining high-quality labelled data for training a classifier in a new application domain is often costly. Transfer Learning (a.k.a. “Inductive Transfer”) tries to alleviate these costs by transferring, to the “target” domain of interest, knowledge available from a different “source” domain. In transfer learning the lack of labelled information fr...
Preprint
Full-text available
QuaPy is an open-source framework for performing quantification (a.k.a. supervised prevalence estimation), written in Python. Quantification is the task of training quantifiers via supervised learning, where a quantifier is a predictor that estimates the relative frequencies (a.k.a. prevalence values) of the classes of interest in a sample of unlab...
Article
The 43rd European Conference on Information Retrieval (ECIR 2021), organized under the auspices of the Information Retrieval Specialist Group of the British Computer Society (BCS IRSG), took place between March 28 and April 1, 2021. As sadly customary in these dark times, the conference was held entirely online, due to the COVID-19 pandemic. Accord...
Article
Full-text available
Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appeali...
Chapter
Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that “Classify and Count” (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification...
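The bias of "Classify and Count" mentioned in this abstract is easy to reproduce with a small synthetic experiment. In the sketch below (the data distribution, the choice of classifier, and the prevalence values are illustrative assumptions, not taken from the chapter), a classifier is trained on balanced data and CC is then applied to a test sample whose true positive prevalence is 0.9:

```python
# Sketch: "Classify and Count" (CC) under prior-probability shift.
# Synthetic data; LogisticRegression is a stand-in for any classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n_pos, n_neg):
    # Two overlapping 1-D Gaussian classes.
    X = np.concatenate([rng.normal(1.0, 1.0, n_pos),
                        rng.normal(-1.0, 1.0, n_neg)]).reshape(-1, 1)
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return X, y

# Train with balanced class priors ...
X_tr, y_tr = sample(500, 500)
clf = LogisticRegression().fit(X_tr, y_tr)

# ... then quantify a test set whose true positive prevalence is 0.9.
X_te, _ = sample(900, 100)
cc_estimate = clf.predict(X_te).mean()
print(f"true prevalence: 0.90, CC estimate: {cc_estimate:.2f}")
```

Because some positives are misclassified (and the decision threshold reflects the balanced training priors), CC systematically underestimates the true prevalence here, which is precisely the bias that motivates dedicated quantification methods.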
Book
This two-volume set LNCS 12656 and 12657 constitutes the refereed proceedings of the 43rd European Conference on IR Research, ECIR 2021, held virtually in March/April 2021, due to the COVID-19 pandemic. The 50 full papers presented together with 11 reproducibility papers, 39 short papers, 15 demonstration papers, 12 CLEF lab descriptions papers, 5...
Article
We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., difference in the distribution of the priors between the training and the unlabelled documents...
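For readers unfamiliar with it, the EM iteration at the heart of the SLD algorithm can be sketched in a few lines of numpy. This is an illustrative reimplementation from the algorithm's published description, not code from the paper; the variable names and the stopping criterion are our own.

```python
# Illustrative numpy sketch of the Saerens-Latinne-Decaestecker (SLD)
# EM procedure: repeatedly rescale the classifier's posteriors by the
# ratio of current to training priors, renormalize, and re-estimate
# the priors as the mean adjusted posterior.
import numpy as np

def sld(posteriors, train_priors, n_iter=1000, tol=1e-6):
    # posteriors: (n_items, n_classes) matrix of P(y|x) on unlabelled data
    # train_priors: class frequencies observed on the training set
    priors = train_priors.astype(float).copy()
    for _ in range(n_iter):
        adjusted = posteriors * (priors / train_priors)  # rescale (E-step)
        adjusted /= adjusted.sum(axis=1, keepdims=True)  # renormalize rows
        new_priors = adjusted.mean(axis=0)               # re-estimate (M-step)
        if np.abs(new_priors - priors).max() < tol:
            priors = new_priors
            break
        priors = new_priors
    return priors, adjusted

# Toy check: with near-certain posteriors, the fixed point recovers the
# empirical prevalence (about 0.8 here) regardless of the training priors.
P = np.array([[0.99, 0.01]] * 80 + [[0.01, 0.99]] * 20)
priors, _ = sld(P, np.array([0.5, 0.5]))
print(priors)  # approximately [0.8, 0.2]
```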
Technical Report
Full-text available
The Artificial Intelligence for Media and Humanities laboratory (AIMH) has the mission to investigate and advance the state of the art in the Artificial Intelligence field, specifically addressing applications to digital media and digital humanities, while also taking into account issues related to scalability. This report summarizes the 2020 activiti...
Preprint
Full-text available
Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that "Classify and Count" (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantificatio...
Preprint
Sentiment quantification is the task of estimating the relative frequency (or "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts; this is especially important when these texts are tweets, since most sentiment classification endeavours carried out on Twitter data actually have quantificat...
Preprint
Full-text available
We present and make available MedLatin1 and MedLatin2, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatin1 and MedLatin2 consist of 294 and 30 curated texts, respectively, labelled by author, with MedLatin1 texts being of an epistolary nature and MedLatin2 texts consisting of literary comments...
Article
Almost all of the important literature on Information Retrieval (IR) is published in subscription-based journals and digital libraries. We argue that the lack of open access publishing in IR is seriously hampering progress and inclusiveness of the field. We propose that the IR community starts working on a road map for transitioning the IR literatu...
Article
We report on the organization and activities of the 2nd ACM SIGIR/SIGKDD Africa School on Machine Learning for Data Mining and Search, which took place at the University of Cape Town in South Africa on January 27--31, 2020.
Article
Sentiment Quantification is the task of estimating the relative frequency of sentiment-related classes, such as Positive and Negative, in a set of unlabeled documents. It is an important topic in sentiment analysis, as the study of sentiment-related quantities and trends across a population is often of higher interest...
Preprint
Full-text available
This paper discusses the fourth year of the "Sentiment Analysis in Twitter Task". SemEval-2016 Task 4 comprises five subtasks, three of which represent a significant departure from previous editions. The first two subtasks are reruns from prior years and ask to predict the overall sentiment, and the sentiment towards a topic in a tweet. The three...
Preprint
Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appeali...
Article
Full-text available
Quantification is the task of estimating, given a set σ of unlabelled items and a set of classes C = {c₁, ..., c_|C|}, the prevalence (or "relative frequency") in σ of each class cᵢ ∈ C. While quantification may in principle be solved by classifying each item in σ...
Chapter
Quantification is the task of estimating, given a set σ of unlabelled items and a set of classes C, the relative frequency (or "prevalence") p(cᵢ) of each class cᵢ ∈ C. Quantification is important in many disciplines (such as market research, political science, the social sciences, and epidem...
Chapter
The Epistle to Cangrande is one of the most controversial among the works of Italian poet Dante Alighieri. For more than a hundred years now, scholars have been debating its true authorship, i.e., whether it should be considered a genuine work by Dante or a forgery by an unnamed author. In this work we address this philological problem through the...
Conference Paper
Quantification (also known as "supervised prevalence estimation" [2], or "class prior estimation" [7]) is the task of estimating, given a set σ of unlabelled items and a set of classes C = {c₁, ..., c_|C|}, the relative frequency (or "prevalence") p(cᵢ) of each class cᵢ ∈ C, i.e., the fraction of items in σ that belong to cᵢ. When each item belon...
Conference Paper
In recent years, the proliferation of smart mobile devices has led to the gradual integration of search functionality within mobile platforms. This has created an incentive to move away from the "ten blue links" metaphor, as mobile users are less likely to click on them, expecting to get the answer directly from the snippets. In turn, this has rev...
Article
Full-text available
Cross-lingual Text Classification (CLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when “naïvely” classifying each document via its corresponding language-specific classifier. To obtain an increase in the classification accur...
Preprint
Full-text available
In recent years, the proliferation of smart mobile devices has led to the gradual integration of search functionality within mobile platforms. This has created an incentive to move away from the "ten blue links" metaphor, as mobile users are less likely to click on them, expecting to get the answer directly from the snippets. In turn, this has re...
Preprint
We discuss Cross-Lingual Text Quantification (CLTQ), the task of performing text quantification (i.e., estimating the relative frequency p_c(D) of all classes c ∈ C in a set D of unlabelled documents) when training documents are available for a source language S but not for the target language T for...
Preprint
In information retrieval (IR) and related tasks, term weighting approaches typically consider the frequency of the term in the document and in the collection in order to compute a score reflecting the importance of the term for the document. In tasks characterized by the presence of training data (such as text classification) it seems logical that...
Preprint
Software systems trained via machine learning to automatically classify open-ended answers (a.k.a. verbatims) are by now a reality. Still, their adoption in the survey coding industry has been less widespread than it might have been. Among the factors that have hindered wider adoption of this technology are the effort involved in manually co...
Preprint
Polylingual Text Classification (PLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each document via its corresponding language-specific classifier. In order to obtain an increase in the classification...
Article
Software systems trained via machine learning to automatically classify open-ended answers (a.k.a. verbatims) are by now a reality. Still, their adoption in the survey coding industry has been less widespread than it might have been. Among the factors that have hindered wider adoption of this technology are the effort involved in manually co...
Article
In information retrieval (IR) and related tasks, term weighting approaches typically consider the frequency of the term in the document and in the collection in order to compute a score reflecting the importance of the term for the document. In tasks characterized by the presence of training data (such as text classification) it seems logical to de...
Preprint
This paper introduces PyDCI, a new implementation of Distributional Correspondence Indexing (DCI) written in Python. DCI is a transfer learning method for cross-domain and cross-lingual text classification for which we had provided an implementation (here called JaDCI) built on top of JaTeCS, a Java framework for text classification. PyDCI is a sta...
Preprint
Quantification is the task of estimating, given a set σ of unlabelled items and a set of classes C = {c₁, ..., c_|C|}, the prevalence (or 'relative frequency') in σ of each class cᵢ ∈ C. While quantification may in principle be solved by classifying each item in σ and counting how m...
Preprint
Full-text available
Quantification is a supervised learning task that consists in predicting, given a set of classes C and a set D of unlabelled items, the prevalence (or relative frequency) p(c|D) of each class c in C. Quantification can in principle be solved by classifying all the unlabelled items and counting how many of them have been attributed to each class. Ho...
Article
Full-text available
We present a class of algorithms capable of directly training deep neural networks with respect to large families of task-specific performance measures such as the F-measure and the Kullback-Leibler divergence that are structured and non-decomposable. This presents a departure from standard deep learning techniques that typically use squared or cro...
Conference Paper
Full-text available
Polylingual Text Classification (PLC) is a supervised learning task that consists of assigning class labels to documents written in different languages, assuming that a representative set of training documents is available for each language. This scenario is increasingly frequent, given the large number of multilingual platforms and communities
Conference Paper
Domain Adaptation (DA) techniques aim at enabling machine learning methods learn effective classifiers for a “target” domain when the only available training data belongs to a different “source” domain. In this extended abstract, we briefly describe our new DA method called Distributional Correspondence Indexing (DCI) for sentiment classification....
Preprint
We present a class of algorithms capable of directly training deep neural networks with respect to large families of task-specific performance measures such as the F-measure and the Kullback-Leibler divergence that are structured and non-decomposable. This presents a departure from standard deep learning techniques that typically use squared or cro...
Chapter
During the last 35 years, data management principles such as physical and logical independence, declarative querying, and cost-based optimization have made relational databases pervasive in organizations of every kind. More importantly, these technical advances have enabled the first round of business intelligence applications and...
Article
Full-text available
Social media platforms provide continuous access to user-generated content that enables real-time monitoring of user behavior and of events. The geographical dimension of such user behavior and events has recently attracted considerable attention in several domains, such as mobility, humanitarian response, and infrastructure. While resolving the location of a user can be...
Preprint
Social media platforms provide continuous access to user-generated content that enables real-time monitoring of user behavior and of events. The geographical dimension of such user behavior and events has recently attracted considerable attention in several domains, such as mobility, humanitarian response, and infrastructure. While resolving the location of a user can be...
Article
Full-text available
Multilingual Text Classification (MLTC) is a text classification task in which documents are written each in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Po...