Sara Tonelli

Fondazione Bruno Kessler | FBK · Digital Humanities

About

163
Publications
20,521
Reads
2,016
Citations

Publications (163)
Preprint
Full-text available
In the current era of social media and generative AI, an ability to automatically assess the credibility of online social media content is of tremendous importance. Credibility assessment is fundamentally based on aggregating credibility signals, which refer to small units of information, such as content factuality, bias, or a presence of persuasio...
Preprint
Hate speech is one of the main threats posed by the widespread use of social networks, despite efforts to limit it. Although attention has been devoted to this issue, the lack of datasets and case studies centered around scarcely represented phenomena, such as ableism or ageism, can lead to hate speech detection systems that do not perform well on...
Article
In this work, we explore the use of digital technologies and statistical analysis to monitor how Italian secondary school students’ writing changes over time and how comparisons can be made across different high school types. We analyzed more than 2,000 exam essays written by Italian high school students over 13 years and in five different school t...
Conference Paper
Full-text available
The Hate Speech Detection (HaSpeeDe3) task is the third edition of a shared task on the detection of hateful content in Italian tweets. It differs from the previous editions while maintaining continuity in analysing and contrasting hate speech (HS) on social media. While HaSpeeDe and HaSpeeDe2 were focused on HS against immigrants, Muslims and Roms...
Article
Full-text available
In this paper we present a novel treebank developed to analyse marked constructions in Italian called MarkIT. The resource contains almost 1,300 sentences manually annotated with dependency relations following the Universal Dependencies paradigm. The sentences have been extracted from essays written by high-school students along several years, whic...
Article
Recent studies have highlighted that private instant messaging platforms and channels are major media of cyber aggression, especially among teens. Due to the private nature of the verbal exchanges on these media, few studies have addressed the task of hate speech detection in this context. Moreover, the recent release of resources mimicking online...
Article
Abusive language in online social media is a pervasive and harmful phenomenon which calls for automatic computational approaches to be successfully contained. Previous studies have introduced corpora and natural language processing approaches for specific kinds of online abuse, mainly focusing on misogyny and racism. A current underexplored area in...
Conference Paper
Full-text available
In this work we present an analysis of abusive language annotations collected through a 3D video game. With this approach, we are able to involve in the annotation teenagers, i.e. typical targets of cyberbullying, whose data are usually not available for research purposes. Using the game in the framework of educational activities to empower teenage...
Conference Paper
Full-text available
Corpus-based studies on acceptability judgements have always stimulated the interest of researchers, both in theoretical and computational fields. Some approaches focused on spontaneous judgements collected through different types of tasks, others on data annotated through crowd-sourcing platforms, still others relied on expert annotated data avail...
Article
Metadata allows access to a wide variety of cultural heritage resources made available through repositories, digital libraries, and catalogues. Usually taking the form of a structured set of descriptive elements, metadata assist in the identification, location, processing, tracking, preserving, sharing, and retrieval of information, while facilitat...
Article
Full-text available
Gamification has been recently growing in popularity among researchers investigating Information and Communication Technologies. Scholars have been trying to take advantage of this approach in the field of natural language processing (NLP), developing Games With A Purpose (GWAPs) for corpus annotation that have obtained encouraging results both in...
Preprint
Since state-of-the-art approaches to offensive language detection rely on supervised learning, it is crucial to quickly adapt them to the continuously evolving scenario of social media. While several approaches have been proposed to tackle the problem from an algorithmic perspective, so as to reduce the need for annotated data, less attention has been...
Preprint
Full-text available
The development of automated approaches to linguistic acceptability has been greatly fostered by the availability of the English CoLA corpus, which has also been included in the widely used GLUE benchmark. However, this kind of research for languages other than English, as well as the analysis of cross-lingual approaches, has been hindered by the l...
Article
Digital games have been used in the context of a cultural experience for several reasons, from learning to socialising and having fun. As a positive side effect, using digital games in a GLAM environment contributes to increasing the visitors’ engagement and making the collections more popular. Along this line, we present in this article an online...
Article
Full-text available
Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. In order to be effectively exploited, metadata information has to meet some quality standards, typically defined in the collection usage guidelines. As manually checking the quality of metadata in a rep...
Conference Paper
Full-text available
In this paper we discuss several challenges related to the development of a 3D game, whose goal is to raise awareness on cyberbullying while collecting linguistic annotation on offensive language. The game is meant to be used by teenagers, thus raising a number of issues that need to be tackled during development. For example, the game aesthetics s...
Preprint
Full-text available
The datasets most widely used for abusive language detection contain lists of messages, usually tweets, that have been manually judged as abusive or not by one or more annotators, with the annotation performed at message level. In this paper, we investigate what happens when the hateful content of a message is judged also based on the context, give...
Article
Full-text available
Massive open online courses (MOOCs) provide hundreds of students with teaching materials, assessment tools, and collaborative instruments. The assessment activity, in particular, is demanding in terms of both time and effort; thus, the use of artificial intelligence can be useful to address and reduce the time and effort required. This paper report...
Conference Paper
Full-text available
In this paper we introduce the DaDoEval shared task at EVALITA 2020, aimed at automatically assigning temporal information to documents written in Italian. The evaluation exercise comprises three levels of temporal granularity, from coarse-grained to year-based, and includes two types of test sets, either having the same genre of the training set,...
Article
Metadata make it possible to access a vast number of resources made available through digital archives and libraries. Metadata are normally structured according to a standardized schema and ensure the interoperability and identification of a digital object, facilitating access to specific types of resources...
Conference Paper
Full-text available
Gamification has been applied to many linguistic annotation tasks, as an alternative to crowdsourcing platforms to collect annotated data in an inexpensive way. However, we think that still much has to be explored. Games with a Purpose (GWAPs) tend to lack important elements that we commonly see in commercial games, such as 2D and 3D worlds or a st...
Preprint
In order to study online hate speech, the availability of datasets containing the linguistic phenomena of interest is of crucial importance. However, when it comes to specific target groups, for example teenagers, collecting such data may be problematic due to issues with consent and privacy restrictions. Furthermore, while text-only datasets of t...
Chapter
Full-text available
Recognizing the linguistic traits of interest in a corpus of almost three thousand essays, and annotating them consistently, required the development of several software tools. These tools fall into two types: on the one hand, a number of text analysis modules were developed that automatically recognize...
Article
The increasing popularity of social media platforms such as Twitter and Facebook has led to a rise in the presence of hate and aggressive speech on these platforms. Despite the number of approaches recently proposed in the Natural Language Processing research area for detecting these forms of abusive language, the issue of identifying hate speech a...
Poster
Full-text available
We present an ongoing project aimed at creating the National Edition of Alcide De Gasperi’s letters in digital format. Our main goal is to systematically collect and transcribe a large number of private and public letters, present in different archives, written or received by De Gasperi throughout his life, and to shed light on all the critical s...
Conference Paper
Full-text available
We present an ongoing project aimed at creating the National Edition of Alcide De Gasperi's letters in digital format. Our main goal is to systematically collect and transcribe a large number of private and public letters, present in different archives, written or received by De Gasperi throughout his life, and to shed light on all the critical s...
Chapter
We describe in this paper the system submitted by the DH-FBK team to the HaSpeeDe evaluation task, and dealing with Italian hate speech detection (Task A). While we adopt a standard approach for fine-tuning AlBERTo, the Italian BERT model trained on tweets, we propose to improve the final classification performance by two additional steps, i.e. sel...
Chapter
While text-only datasets are widely produced and used for research purposes, limitations set by image-based social media platforms like Instagram make it difficult for researchers to experiment with multimodal data. We therefore developed CREENDER, an annotation tool to create multimodal datasets with images associated with semantic tags and commen...
Chapter
In this paper, we present a novel dataset composed of images and comments in Italian, created with teenagers in classes using a simulated scenario to raise awareness on cyberbullying phenomena. Potentially offensive comments have been collected for more than 1,000 images and manually assigned to a semantic category. Our analysis shows that the pres...
Preprint
Full-text available
The steady growth of digitized historical information is continuously stimulating new different approaches to the fields of Digital Humanities and Computational Social Science. In this work, we use Natural Language Processing techniques to retrieve large amounts of historical information from Wikipedia. In particular, the pages of a set of historic...
Article
Full-text available
The steady growth of digitized historical information is continuously stimulating new different approaches to the fields of Digital Humanities and Computational Social Science. In this work we use Natural Language Processing techniques to retrieve large amounts of historical information from Wikipedia. In particular, the pages of a set of historica...
Article
Full-text available
Almost eight years after his untimely death, the scientific contribution of Emanuele Pianta still appears significant to us, in particular for the variety of the topics he dealt with and for his capacity to move cross-disciplinarily between different areas of computational linguistics. Today, retracing the steps of Emanuele’s scientific career has...
Chapter
Students learning Health Informatics in the degree course of Medicine and Surgery of the University of L’Aquila (Italy) are required – to pass the exam – to submit solutions to assignments concerning the execution and interpretation of statistical analyses. The paper presents a tool for the automated grading of such a kind of solutions, where the s...
Poster
Full-text available
In this paper we present a multigenre corpus spanning 50 years of European history. It contains a comprehensive collection of Alcide De Gasperi’s public documents, 2,762 in total, written or transcribed between 1901 and 1954. The corpus comprises different types of texts, including newspaper articles, propaganda documents, official letters and parl...
Presentation
Full-text available
Research communication at CLiC-it 2019, presenting the following paper: Sprugnoli, R., & Tonelli, S. (2019). Novel Event Detection and Classification for Historical Texts. Computational Linguistics, 45(2), 229-265.
Conference Paper
Full-text available
In this paper we present a multi-genre corpus spanning 50 years of European history. It contains a comprehensive collection of Alcide De Gasperi's public documents, 2,762 in total, written or transcribed between 1901 and 1954. The corpus comprises different types of texts, including newspaper articles, propaganda documents, official letters and par...
Article
Full-text available
Event processing is an active area of research in the Natural Language Processing community, but resources and automatic systems developed so far have mainly addressed contemporary texts. However, the recognition and elaboration of events is a crucial step when dealing with historical texts. Particularly in the current era of massive digitization of...
Article
Full-text available
In this work, we present LOD Navigator, a data visualisation and exploration tool to track the lives and trajectories of Italian Shoah Victims. We take advantage of the work done at the Contemporary Jewish Documentation Center in Milan (CDEC), leading to the publication of a database of Linked Open Data (LOD) containing information about the life a...
Presentation
Full-text available
In this proposal we describe the results of a project aiming at tracing the movements of Trentino people that were deported to the 3rd Reich camps during World War II. More specifically, we performed the semantic annotation, georeferencing and visualization of data collected by expert historians. This work aims to shed light on the stories of peop...
Conference Paper
Full-text available
We present a project aimed at studying the evolution of students' writing skills in a temporal span of 15 years (from 2001 to 2016), analysing in particular the impact of neo-standard Italian. More than 2,500 essays have been transcribed and annotated by teachers according to 28 different linguistic traits. We present here the annotation process to...
Conference Paper
Full-text available
This paper reports on the systems the InriaFBK Team submitted to the EVALITA 2018-Shared Task on Hate Speech Detection in Italian Twitter and Facebook posts (HaSpeeDe). Our submissions were based on three separate classes of models: a model using a recurrent layer, an ngram-based neural network and a LinearSVC. For the Facebook task and the two cro...
Conference Paper
Full-text available
Although WhatsApp is used by teenagers as one major channel of cyberbullying, such interactions remain invisible due to the app's privacy policies that do not allow ex-post data collection. Indeed, most of the information on these phenomena relies on surveys regarding self-reported data. In order to overcome this limitation, we describe in this paper t...
Conference Paper
Full-text available
In this paper, we describe two systems for predicting message-level offensive language in German tweets: one discriminates between offensive and not offensive messages, and the second performs a fine-grained classification by recognizing also classes of offense. Both systems are based on the same approach, which builds upon Recurrent Neural Network...
Article
In this work, we apply argumentation mining techniques, in particular relation prediction, to study political speeches in monological form, where there is no direct interaction between opponents. We argue that this kind of technique can effectively support researchers in history, social and political sciences, who must deal with an increasing amo...
Poster
Full-text available
The digitization of epistolaries is extremely important for the preservation and study of the cultural and historical patrimony of literary correspondence. In recent years, several small and large-scale projects have been carried out. Many of these initiatives are based on collaborative work adopting a crowdsourcing approach and using web-based tra...
Conference Paper
Full-text available
We present an overview and the results of a shared-task hackathon that took place as part of a research seminar bringing together a variety of experts and young researchers from the fields of political science, natural language processing and computational social science. The task looked at ways to develop novel methods for political text scaling t...
Chapter
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference seri...
Chapter
We present an approach to improve the selection of complex words for automatic text simplification, addressing the need of L2 learners to take into account their native language during simplification. In particular, we develop a methodology that automatically identifies ‘difficult’ terms (i.e. false friends) for L2 learners in order to simplify the...
Chapter
Full-text available
EVALITA is a periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language. The general objective of EVALITA is to promote the development of language and speech technologies for the Italian language, providing a shared framework where different systems and approaches can be evaluated in a consistent ma...
Conference Paper
Full-text available
Code-mixing is the alternation between two or more languages in the same text. This phenomenon is very relevant in the travel domain, since it can provide new insight into the way foreign cultures are perceived and described to the readers. In this paper, we analyse English-Italian code-mixing in historical English travel writings about Italy. We ret...
Article
Full-text available
In this work, we describe a methodology to interpret large persons’ networks extracted from text by classifying cliques using the DBpedia ontology. The approach relies on a combination of NLP, Semantic web technologies, and network analysis. The classification methodology that first starts from single nodes and then generalizes to cliques is effect...
Article
The increasing demand of technological facilities for galleries, museums, and archives has led to the need for designing practical and effective solutions for managing the digital life cycle of cultural heritage collections. These facilities have to support users in addressing several challenges directly related to the creation, management, preserv...
Article
This work focuses on the extraction and integration of automatically aligned bilingual terminology into a Statistical Machine Translation (SMT) system in a Computer Aided Translation scenario. We evaluate the proposed framework that, taking as input a small set of parallel documents, gathers domain-specific bilingual terms and injects them into an...
Poster
Full-text available
We present RAMBLE ON, an application integrating a pipeline for frame-based information extraction and an interface to track and display movement trajectories. The code of the extraction pipeline and a navigator are freely available; moreover we display in a demonstrator the outcome of a case study carried out on trajectories of notable persons of...
Poster
Full-text available
This paper presents a new resource, called Content Types Dataset, to promote the analysis of texts as a composition of units with specific semantic and functional roles. By developing this dataset we also introduce a new NLP task for the automatic classification of Content Types. The annotation scheme and the dataset, available online, are describe...
Chapter
Automated lexical simplification has been performed so far focusing only on the replacement of single tokens with single tokens, and this choice has affected both the development of systems and the creation of benchmarks. In this paper, we argue that lexical simplification in real settings should deal both with single and multi-token terms, and pre...
Chapter
Full-text available
Code-mixing is the alternation between two or more languages in the same text. This phenomenon is very relevant in the travel domain, since it can provide new insight into the way foreign cultures are perceived and described to the readers. In this paper, we analyse English-Italian code-mixing in historical English travel writings about Italy. We ret...
Article
Full-text available
In 2013 a collaboration was started between the Digital Humanities research unit and the Italian-German Historical institute at Fondazione Bruno Kessler, whose goal was to develop tools and strategies to give new insight into the public documents written by Alcide De Gasperi. Through the analysis of textual occurrences, semantic structures, and tem...
Conference Paper
Full-text available
This paper presents L-KD, a tool that relies on available linguistic and knowledge resources to perform keyphrase clustering and labelling. The aim of L-KD is to help finding and tracing themes in English and Italian text data, represented by groups of keyphrases and associated domains. We perform an evaluation of the top-ranked domains using the 2...