Sara Tonelli

Fondazione Bruno Kessler | FBK · Digital Humanities

About

149
Publications
13,626
Reads
1,293
Citations

Publications (149)
Conference Paper
Full-text available
In this work we present an analysis of abusive language annotations collected through a 3D video game. With this approach, we are able to involve in the annotation process teenagers, i.e. typical targets of cyberbullying, whose data are usually not available for research purposes. Using the game in the framework of educational activities to empower teenage...
Conference Paper
Full-text available
Corpus-based studies on acceptability judgements have always stimulated the interest of researchers, both in theoretical and computational fields. Some approaches focused on spontaneous judgements collected through different types of tasks, others on data annotated through crowd-sourcing platforms, still others relied on expert annotated data avail...
Article
Metadata allows access to a wide variety of cultural heritage resources made available through repositories, digital libraries, and catalogues. Usually taking the form of a structured set of descriptive elements, metadata assist in the identification, location, processing, tracking, preserving, sharing, and retrieval of information, while facilitat...
Article
Full-text available
Gamification has been recently growing in popularity among researchers investigating Information and Communication Technologies. Scholars have been trying to take advantage of this approach in the field of natural language processing (NLP), developing Games With A Purpose (GWAPs) for corpus annotation that have obtained encouraging results both in...
Preprint
Since state-of-the-art approaches to offensive language detection rely on supervised learning, it is crucial to quickly adapt them to the continuously evolving scenario of social media. While several approaches have been proposed to tackle the problem from an algorithmic perspective, so as to reduce the need for annotated data, less attention has been...
Preprint
Full-text available
The development of automated approaches to linguistic acceptability has been greatly fostered by the availability of the English CoLA corpus, which has also been included in the widely used GLUE benchmark. However, this kind of research for languages other than English, as well as the analysis of cross-lingual approaches, has been hindered by the l...
Article
Digital games have been used in the context of a cultural experience for several reasons, from learning to socialising and having fun. As a positive side effect, using digital games in a GLAM environment contributes to increasing the visitors’ engagement and making the collections more popular. Along this line, we present in this article an online...
Article
Full-text available
Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. In order to be effectively exploited, metadata information has to meet some quality standards, typically defined in the collection usage guidelines. As manually checking the quality of metadata in a rep...
Conference Paper
Full-text available
In this paper we discuss several challenges related to the development of a 3D game, whose goal is to raise awareness on cyberbullying while collecting linguistic annotation on offensive language. The game is meant to be used by teenagers, thus raising a number of issues that need to be tackled during development. For example, the game aesthetics s...
Preprint
Full-text available
The datasets most widely used for abusive language detection contain lists of messages, usually tweets, that have been manually judged as abusive or not by one or more annotators, with the annotation performed at message level. In this paper, we investigate what happens when the hateful content of a message is judged also based on the context, give...
Article
Full-text available
Massive open online courses (MOOCs) provide hundreds of students with teaching materials, assessment tools, and collaborative instruments. The assessment activity, in particular, is demanding in terms of both time and effort; thus, the use of artificial intelligence can be useful to address and reduce the time and effort required. This paper report...
Conference Paper
Full-text available
In this paper we introduce the DaDoEval shared task at EVALITA 2020, aimed at automatically assigning temporal information to documents written in Italian. The evaluation exercise comprises three levels of temporal granularity, from coarse-grained to year-based, and includes two types of test sets, either having the same genre as the training set,...
Article
Thanks to the use of metadata, it is possible to access a vast number of resources made available through digital archives and libraries. Metadata are normally structured according to a standardized schema and ensure the interoperability and identification of a digital object, facilitating access to certain types of resources...
Conference Paper
Full-text available
Gamification has been applied to many linguistic annotation tasks, as an alternative to crowdsourcing platforms to collect annotated data in an inexpensive way. However, we think that still much has to be explored. Games with a Purpose (GWAPs) tend to lack important elements that we commonly see in commercial games, such as 2D and 3D worlds or a st...
Preprint
In order to study online hate speech, the availability of datasets containing the linguistic phenomena of interest is of crucial importance. However, when it comes to specific target groups, for example teenagers, collecting such data may be problematic due to issues with consent and privacy restrictions. Furthermore, while text-only datasets of t...
Chapter
Full-text available
To recognize the linguistic traits of interest in a corpus of almost three thousand essays and to annotate them consistently, it was necessary to develop several software tools. These tools fall into two types: on the one hand, modules for text analysis were developed, which automatically recognize some...
Article
The increasing popularity of social media platforms such as Twitter and Facebook has led to a rise in the presence of hate and aggressive speech on these platforms. Despite the number of approaches recently proposed in the Natural Language Processing research area for detecting these forms of abusive language, the issue of identifying hate speech a...
Poster
Full-text available
We present an ongoing project aimed at creating the National Edition of Alcide De Gasperi’s letters in digital format. Our main goal is to systematically collect and transcribe a large number of private and public letters, present in different archives, written or received by De Gasperi throughout his life, and to shed light on all the critical s...
Conference Paper
Full-text available
We present an ongoing project aimed at creating the National Edition of Alcide De Gasperi's letters in digital format. Our main goal is to systematically collect and transcribe a large number of private and public letters, present in different archives, written or received by De Gasperi throughout his life, and to shed light on all the critical s...
Chapter
We describe in this paper the system submitted by the DH-FBK team to the HaSpeeDe evaluation task, and dealing with Italian hate speech detection (Task A). While we adopt a standard approach for fine-tuning AlBERTo, the Italian BERT model trained on tweets, we propose to improve the final classification performance by two additional steps, i.e. sel...
Chapter
While text-only datasets are widely produced and used for research purposes, limitations set by image-based social media platforms like Instagram make it difficult for researchers to experiment with multimodal data. We therefore developed CREENDER, an annotation tool to create multimodal datasets with images associated with semantic tags and commen...
Chapter
In this paper, we present a novel dataset composed of images and comments in Italian, created with teenagers in classes using a simulated scenario to raise awareness on cyberbullying phenomena. Potentially offensive comments have been collected for more than 1,000 images and manually assigned to a semantic category. Our analysis shows that the pres...
Preprint
Full-text available
The steady growth of digitized historical information is continuously stimulating new different approaches to the fields of Digital Humanities and Computational Social Science. In this work, we use Natural Language Processing techniques to retrieve large amounts of historical information from Wikipedia. In particular, the pages of a set of historic...
Article
Full-text available
The steady growth of digitized historical information is continuously stimulating new different approaches to the fields of Digital Humanities and Computational Social Science. In this work we use Natural Language Processing techniques to retrieve large amounts of historical information from Wikipedia. In particular, the pages of a set of historica...
Article
Full-text available
Almost eight years after his untimely death, the scientific contribution of Emanuele Pianta still appears significant to us, in particular for the variety of the topics he dealt with and for his capacity to move cross-disciplinarily between different areas of computational linguistics. Today, retracing the steps of Emanuele’s scientific career has...
Chapter
Students learning Health Informatics in the degree course of Medicine and Surgery of the University of L’Aquila (Italy) are required – to pass the exam – to submit solutions to assignments concerning the execution and interpretation of statistical analyses. The paper presents a tool for the automated grading of such solutions, where the s...
Poster
Full-text available
In this paper we present a multigenre corpus spanning 50 years of European history. It contains a comprehensive collection of Alcide De Gasperi’s public documents, 2,762 in total, written or transcribed between 1901 and 1954. The corpus comprises different types of texts, including newspaper articles, propaganda documents, official letters and parl...
Presentation
Full-text available
Research communication at CLiC-it 2019, presenting the following paper: Sprugnoli, R., & Tonelli, S. (2019). Novel Event Detection and Classification for Historical Texts. Computational Linguistics, 45(2), 229-265.
Conference Paper
Full-text available
In this paper we present a multi-genre corpus spanning 50 years of European history. It contains a comprehensive collection of Alcide De Gasperi's public documents, 2,762 in total, written or transcribed between 1901 and 1954. The corpus comprises different types of texts, including newspaper articles, propaganda documents, official letters and par...
Article
Full-text available
Event processing is an active area of research in the Natural Language Processing community, but resources and automatic systems developed so far have mainly addressed contemporary texts. However, the recognition and elaboration of events is a crucial step when dealing with historical texts. Particularly in the current era of massive digitization of...
Article
Full-text available
In this work, we present LOD Navigator, a data visualisation and exploration tool to track the lives and trajectories of Italian Shoah Victims. We take advantage of the work done at the Contemporary Jewish Documentation Center in Milan (CDEC), leading to the publication of a database of Linked Open Data (LOD) containing information about the life a...
Presentation
Full-text available
In this proposal we describe the results of a project aiming at tracing the movements of Trentino people who were deported to the Third Reich camps during World War II. More specifically, we performed the semantic annotation, georeferencing and visualization of data collected by expert historians. This work aims to shed light on the stories of peop...
Conference Paper
Full-text available
We present a project aimed at studying the evolution of students' writing skills in a temporal span of 15 years (from 2001 to 2016), analysing in particular the impact of neo-standard Italian. More than 2,500 essays have been transcribed and annotated by teachers according to 28 different linguistic traits. We present here the annotation process to...
Conference Paper
Full-text available
This paper reports on the systems the InriaFBK Team submitted to the EVALITA 2018-Shared Task on Hate Speech Detection in Italian Twitter and Facebook posts (HaSpeeDe). Our submissions were based on three separate classes of models: a model using a recurrent layer, an ngram-based neural network and a LinearSVC. For the Facebook task and the two cro...
Conference Paper
Full-text available
Although WhatsApp is used by teenagers as one major channel of cyberbullying, such interactions remain invisible due to the app's privacy policies, which do not allow ex-post data collection. Indeed, most of the information on these phenomena relies on surveys regarding self-reported data. In order to overcome this limitation, we describe in this paper t...
Conference Paper
Full-text available
In this paper, we describe two systems for predicting message-level offensive language in German tweets: one discriminates between offensive and not offensive messages, and the second performs a fine-grained classification by recognizing also classes of offense. Both systems are based on the same approach, which builds upon Recurrent Neural Network...
Article
In this work, we apply argumentation mining techniques, in particular relation prediction, to study political speeches in monological form, where there is no direct interaction between opponents. We argue that this kind of technique can effectively support researchers in history, social and political sciences, which must deal with an increasing amo...
Poster
Full-text available
The digitization of epistolaries is extremely important for the preservation and study of the cultural and historical patrimony of literary correspondence. In recent years, several small and large-scale projects have been carried out. Many of these initiatives are based on collaborative work adopting a crowdsourcing approach and using web-based tra...
Conference Paper
Full-text available
We present an overview and the results of a shared-task hackathon that took place as part of a research seminar bringing together a variety of experts and young researchers from the fields of political science, natural language processing and computational social science. The task looked at ways to develop novel methods for political text scaling t...
Conference Paper
Full-text available
Code-mixing is the alternation between two or more languages in the same text. This phenomenon is very relevant in the travel domain, since it can provide new insight into the way foreign cultures are perceived and described to the readers. In this paper, we analyse English-Italian code-mixing in historical English travel writings about Italy. We ret...
Article
Full-text available
In this work, we describe a methodology to interpret large persons’ networks extracted from text by classifying cliques using the DBpedia ontology. The approach relies on a combination of NLP, Semantic web technologies, and network analysis. The classification methodology that first starts from single nodes and then generalizes to cliques is effect...
Article
The increasing demand for technological facilities for galleries, museums, and archives has led to the need for designing practical and effective solutions for managing the digital life cycle of cultural heritage collections. These facilities have to support users in addressing several challenges directly related to the creation, management, preserv...
Article
This work focuses on the extraction and integration of automatically aligned bilingual terminology into a Statistical Machine Translation (SMT) system in a Computer Aided Translation scenario. We evaluate the proposed framework that, taking as input a small set of parallel documents, gathers domain-specific bilingual terms and injects them into an...
Poster
Full-text available
We present RAMBLE ON, an application integrating a pipeline for frame-based information extraction and an interface to track and display movement trajectories. The code of the extraction pipeline and a navigator are freely available; moreover we display in a demonstrator the outcome of a case study carried out on trajectories of notable persons of...
Poster
Full-text available
This paper presents a new resource, called Content Types Dataset, to promote the analysis of texts as a composition of units with specific semantic and functional roles. By developing this dataset we also introduce a new NLP task for the automatic classification of Content Types. The annotation scheme and the dataset, available online, are describe...
Article
Full-text available
In 2013 a collaboration was started between the Digital Humanities research unit and the Italian-German Historical Institute at Fondazione Bruno Kessler, whose goal was to develop tools and strategies to give new insight into the public documents written by Alcide De Gasperi. Through the analysis of textual occurrences, semantic structures, and tem...
Conference Paper
Full-text available
This paper presents L-KD, a tool that relies on available linguistic and knowledge resources to perform keyphrase clustering and labelling. The aim of L-KD is to help finding and tracing themes in English and Italian text data, represented by groups of keyphrases and associated domains. We perform an evaluation of the top-ranked domains using the 2...
Article
We present an overview of event definition and processing spanning 25 years of research in NLP. We first provide linguistic background to the notion of event, and then present past attempts to formalize this concept in annotation standards to foster the development of benchmarks for event extraction systems. This ranges from MUC-3 in 1991 to the Ti...
Article
The application of research practices and methodologies from the Information and Communication Technologies to Humanities studies is having a great impact on the way humanities research is being conducted. However, although many applications have been developed to automatically analyse document collections from the historical or the literary domain...