About
87
Publications
21,072
Reads
536
Citations
Introduction
Johanna Monti is Associate Professor of Modern Languages Teaching at the "L'Orientale" University of Naples. She was the Computational Linguistics Research Manager of the Thamus Consortium (Italy). She received her PhD in Computational Linguistics from the University of Salerno, Italy. Her research activities are in the field of hybrid approaches to Machine Translation and NLP applications.
Additional affiliations
May 2016 - May 2020
Publications (87)
The investigation of phraseology through corpus-based and computational approaches holds significant relevance for various professionals, including translators, interpreters, terminologists, lexicographers, language instructors, and learners. Computational Phraseology, and in particular the computational analysis of multiword expressions (also know...
Terminology translation plays a significant role in domain-specific machine translation. However, some knowledge domains and languages still suffer from the lack of high-quality machine translation results due to the mistranslation of terminology. This is the case in the legal domain and the Arabic language. Most machine translation systems fail in...
The lack of annotated datasets affects the development of Natural Language Processing applications and heavily impacts the access to textual data, in particular for specific domains and specific languages. In this paper, we propose a methodology to annotate texts concerning domain-specific knowledge, to provide a reliable source of data for the tas...
Learning idiomatic expressions is seen as one of the most challenging stages in second-language learning because of their unpredictable meaning. A similar situation holds for their identification within natural language processing applications such as machine translation and parsing. The lack of high-quality usage samples exacerbates this challenge...
Languages differ in terms of the absence or presence of gender features, the number of gender classes and whether and where gender features are explicitly marked. These cross-linguistic differences can lead to ambiguities that are difficult to resolve, especially for sentence-level MT systems. The identification of ambiguity and its subsequent reso...
Inspired by the historical models of artificial and auxiliary languages, Emojitaliano is the result of a social and crowdsourcing experiment which was conducted by a group of seventeen translators, followers of the “Scritture brevi” blog, and led to the creation of an international language based on emojis. The experiment was carried out during 201...
This report presents an analysis of #hashtags used by Italian Cultural Heritage institutions to promote and communicate cultural content during the COVID-19 lockdown period in Italy. Several activities to support and engage users have been proposed using social media. Most of these activities present one or more #hashtags which help to aggregate...
Computational Stylometry develops techniques that allow scholars to find out information about authors of texts by means of an automatic stylistic analysis. Indeed, each author’s style is unique, and no two authors are characterized by the same set of stylistic features. Several scholars focus on the analysis of different stylistic features and spe...
Terminological resources, invaluable tools for language experts, translators, learners, among others, are widely employed in many applicative scenarios from Machine Translation (MT) to Natural Language Processing (NLP). Automatic terminology extraction from unstructured texts represents a useful, yet non-trivial task, in order to create terminologi...
https://uniornlp.carto.com/builder/04f2cca9-08cd-4b9f-90cd-79fc0d93af42/embed
A first map with preliminary results on alerts concerning illegal fires posted on Twitter between 2013 and 2020. This map was built by our research group on a model that is able to discriminate between alert and no-alert tweets on the basis of an annotated subsection of UNI...
In this paper, we describe the UniOr ExpSys team's participation in the TRAC-2 (Trolling, Aggression and Cyberbullying) shared task, held at a workshop organized as part of LREC 2020. The TRAC-2 shared task is organized in two sub-tasks: Aggression Identification (a 3-way classification between "Overtly Aggressive", "Covertly Aggressive" and "Non-aggressive" text data)...
In this paper, we present a web service platform for disinformation detection in hotel reviews written in English. The platform relies on a hybrid approach of computational stylometry techniques, machine learning and linguistic rules written using COGITO, Expert System Corp.'s semantic intelligence software thanks to which it is possible to analyze...
In this paper, we describe a Telegram bot, Mago della Ghigliottina (Ghigliottina Wizard), able to solve La Ghigliottina game (The Guillotine), the final game of the Italian TV quiz show L'Eredità. Our system relies on linguistic resources and artificial intelligence and achieves better results than human players (and competitors of L'Eredità too)....
This paper describes Il mago della Ghigliottina, a bot which took part in the Ghigliottin-AI task of the Evalita 2020 evaluation campaign. The aim is to build a system able to solve the TV game “La Ghigliottina”. Our system has already participated in the Evalita 2018 task NLP4FUN. Compared to that occasion, it improved its accuracy from 61% to 68....
Evaluating Artificial Players for the Language Game “La Ghigliottina” (Ghigliottin-AI) task is one of the tasks organized in the context of the 2020 EVALITA edition, a periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language. Ghigliottin-AI participants are asked to build an artificial player able...
This paper presents the results of an evaluation of Google Translate, DeepL and Bing Microsoft Translator with reference to natural gender translation and provides statistics about the frequency of female, male and neutral forms in the translations of a list of personality adjectives, and nouns referring to professions and bigender nouns. The evalu...
“Spotted” posts represent one of the most popular forms of Computer-mediated Communication (CMC) among university students in Italy, and as such, they represent a privileged context to analyze the Italian language used by students on the Web. These informal communication channels are especially active on Instagram and provide relevant insig...
In this paper, we present the Archaeo-Term Project, along with one of its first efforts in enhancing multilingual access to Archaeological data, making available a resource of Archaeological terms within the framework of YourTerm CULT project. In order to enhance and promote the use of a terminological common ground across different languages the A...
This paper presents the results of research carried out on the UNIOR Eye corpus, a corpus which has been built by downloading tweets related to environmental crimes. The corpus is made up of 228,412 tweets organized into four different subsections, each one concerning a specific environmental crime. For the current study we focused on the subsectio...
In this paper, we show the results of a stylometric analysis conducted on Paul McCartney's interview transcriptions using three different approaches in order to detect differences and similarities in his speeches before and after 9th November 1966, the date of his supposed death. Our research is based on the Let IT Corpus, a corpus of Paul McCartne...
The paper describes the PARSEME-It corpus, developed within the PARSEME-It project which aims at the development of methods, tools and resources for multiword expressions (MWE) processing for the Italian language. The project is a spin-off of a larger multilingual project for more than 20 languages from several language families, namely the PARSEME...
The MUMTTT workshop will be held on the last day of the Europhras'2019 conference, namely on 27th September 2019. It will provide a forum for researchers and practitioners in the fields of (Computational) Linguistics, (Computational) Phraseology, Translation Studies and Translation Technology to discuss recent advances in the area of multi-word uni...
The aim of this paper is to show the importance of Computational Stylometry (CS) and Machine Learning (ML) support in detecting an author's gender and age in cyberbullying texts. We developed a cyberbullying detection platform and we report its performance in terms of Precision, Recall and F-Measure for gender and age detection in cyberbully...
In this contribution, we show how computational stylometry can be useful for identifying the gender and age of the author of a text containing cyberbullying, thanks to its synergy with artificial intelligence and a rule-based approach. We report the results of an experiment carried out during the Futuro... event
The paper describes UNIOR4NLP, a system developed to solve the game "La Ghigliottina", which took part in the NLP4FUN task of the Evalita 2018 evaluation campaign. The system is the best performing one in the competition and achieves better results than human players.
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference seri...
The paper describes the UNIOR4NLP system, developed to solve the game “La Ghigliottina”, which took part in the NLP4FUN challenge of the Evalita 2018 evaluation campaign. The system is the best in the competition and performs better than humans.
Multiword expressions (MWEs) are known as a “pain in the neck” due to their idiosyncratic behaviour. While some categories of MWEs have been largely studied, verbal MWEs (VMWEs) such as to take a walk, to break one’s heart or to turn off have been relatively rarely modelled. We describe an initiative meant to bring about substantial progress in und...
This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multi-word expressions. We present the annotation methodology, focusing on changes from last year's shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation...
EVALITA is a periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language. The general objective of EVALITA is to promote the development of language and speech technologies for the Italian language, providing a shared framework where different systems and approaches can be evaluated in a consistent ma...
This paper describes a new language resource annotated with verbal multiword expressions (VMWEs) in Italian. The paper discusses the state of the art in VMWE identification and annotation in Italian, the methodology adopted, the various VMWE categories annotated, the corpus and the annotation process. Finally, the paper ends with results, conclusio...
Topics of Interest: The MUMTTT 2017 workshop invites the submission of papers reporting on original and unpublished research on topics related to MWU processing in machine translation and translation technology, including: lexical, syntactic, semantic and translational aspects in MWU representation; theoretical approaches to MWUs (e.g., collostru...
Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial...
The series publishes the proceedings of the annual Italian Conference on Computational Linguistics (CLiC-it), which aims to provide a reference forum for discussion in the field of computational linguistics research. The proceedings include contributions on natural language processing, covering theoretical and methodological reflections on the topic...
This volume documents the proceedings of the 2nd Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2015), held on 1-2 July 2015 as part of the EUROPHRAS 2015 conference: "Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives" (Málaga, 29 June – 1 July 2015). The works...
This paper summarizes the preliminary results of an ongoing survey on multiword resources carried out within the IC1207 Cost Action PARSEME (PARSing and Multi-word Expressions). Despite the availability of language resource catalogs and the inventory of multi-word datasets on the SIGLEX-MWE website, multiword resources are scattered and difficult t...
The annual conference CLiC-it ("Italian Conference on Computational Linguistics") is an initiative of the "Italian Association of Computational Linguistics" (AILC – www.ai-lc.it) which is intended to meet the need for a national and international forum for the promotion and dissemination of high-level original research in the field of Computati...
The translation of multiword expressions (MWEs) by Machine Translation (MT) represents a big challenge, and although MT has considerably improved in recent years, MWE mistranslations still occur very frequently. There is a need to develop large data sets, mainly parallel corpora, annotated with MWEs, since they are useful both for SMT tra...
Recent studies have highlighted that the translation of Multiword Units (MWUs) by Machine Translation (MT) is still an open challenge, whatever approach is adopted (statistical, rule-based or example-based). The difficulties in automatically translating this recurrent, complex and varied lexical phenomenon originate from its lexical, syntactic, semanti...
Following the success of the MT SUMMIT 2013 Workshop on Multi-word Units in Machine Translation and Translation Technology, we are announcing the 2015 edition to be held in conjunction with the Europhras conference on Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives (Malaga, Spain, 29 June – 1 July...
NooJ is a linguistic development environment that provides tools for linguists to construct linguistic resources that formalise a large gamut of linguistic phenomena: typography, orthography, lexicons for simple words, multiword units and discontinuous expressions, inflectional and derivational morphology, local, structural and transformational syn...
In the history of translation, since the classical age, the munus interpretum (the task of the translator) proposed by Cicero has to be considered as the departure point of the definition of the notion of equivalence. According to this point of view, the translator does not have to translate from a linguistic system to another, but he needs to refo...
CLiC-it 2015 is held in Trento on December 3-4, 2015, hosted and locally organized by Fondazione Bruno Kessler (FBK), one of the most important Italian research centers in CL. The organization of the conference is the result of a fruitful joint effort of different research groups (Università di Torino, Università di Roma Tor Vergata a...
This paper presents a systematic human evaluation of translations of English support verb constructions produced by a rule-based machine translation (RBMT) system (OpenLogos) and a statistical machine translation (SMT) system (Google Translate) for five languages: French, German, Italian, Portuguese and Spanish. We classify support verb constructio...
This paper aims to outline the current trends that are contributing to a rapid development of translation technologies by promoting their wide dissemination among translation professionals and internauts, i.e. cloud computing technologies that offer ubiquitous access to digital content and multi-language translation tools within online collaborativ...
Two emerging phenomena of the internet, crowdsourcing (the exploitation of a community or group of people to perform tasks normally performed by employees) and cloud computing (which allows users ubiquitous access to services and online tools for translation and multilingual digital content), have been widely adopted in the field of Machine Translation...
Crowdsourcing and cloud computing have been widely adopted in the field of Machine Translation and Computer Aided Translation in the last fifteen years. They are also more and more used for the development and maintenance of lexical and terminological resources. This paper aims to outline the state of the art of these two emerging phenomena of the...
In recent years, important initiatives, such as the development of the European Library and Europeana, have aimed to increase the availability of cultural content from various types of providers and institutions. Access to these resources requires the development of environments which make it possible both to manage multilingual complexity and to preserve t...
This paper describes a computational linguistics-based approach for providing interoperability between multi-lingual systems in order to overcome crucial issues like cross-language and cross-collection retrieval. Our proposal is a system which improves capabilities of language-technology-based information extraction. In the last few years various t...
This paper addresses the impact of multiword translation errors in machine translation (MT). We have analysed translations of multiwords in the OpenLogos rule-based system (RBMT) and in the Google Translate statistical system (SMT) for the English-French, English-Italian, and English-Portuguese language pairs. Our study shows that, for distinct rea...
Machine Translation (MT) has evolved along with different types of computer-assisted translation tools, and notable progress has been achieved in improving the quality of translations. However, in spite of the recent positive developments in translation technologies, not all problems have been solved and in particular the identification, interpret...
Extracting relevant information in a multilingual context from massive amounts of unstructured, structured and semi-structured data is a challenging task. Various theories have been developed and applied to ease the access to multicultural and multilingual resources. This paper describes a methodology for the development of an ontology-based Cross-L...
One of the most relevant problems with Information Retrieval (IR) software is the correct processing of complex lexical units, today also known as multiword units. The shortcomings are mainly due to the fact that such units are often considered as extemporaneous combinations of words retrievable by means of statistical routines. On the contrary, s...
This paper discusses the qualitative comparative evaluation performed on the results of two machine translation systems with different approaches to the processing of multi-word units. It proposes a solution for overcoming the difficulties multi-word units present to machine translation by adopting a methodology that combines the lexicon grammar ap...
With the rapid evolution of the Internet, translation has become part of the daily life of ordinary users, not only of professional translators. Machine translation has evolved along with different types of computer-assisted translation tools. Qualitative progress has been made in the field of machine translation, but not all problems have been sol...
Although a vast amount of content and knowledge has been made available in electronic format and on the web in recent years, translators still do not have user-friendly and targeted tools at their disposal for the various aspects of a translation process, i.e., the analysis phase, automatic creation and management of the linguistic resources needed and...