Toni Badia

University Pompeu Fabra | UPF · Department of Translation and Language Sciences

PhD

About

109
Publications
11,009
Reads
482
Citations
Citations since 2017
18 Research Items
203 Citations
Publications (109)
Preprint
Divergent thinking (DT) is a fundamental part of creative ideation. Understanding its role in cognition and its attainment through language technology can provide the scaffolding to enhance creative endeavors. This study is a proof of concept on the automatic generation of keyword responses as found on the AUT (Alternative Uses Task), a test common...
Article
Full-text available
Deep neural networks as an end-to-end approach lack robustness from an application point of view, as it is very difficult to fix an obvious problem without retraining the model, for example, when a model consistently predicts positive when seeing the word “terrible.” Meanwhile, it is less stressed that the commonly used attention mechanism is lik...
Article
Divergent thinking (DT) is a fundamental part of creative ideation. Understanding its role in cognition and its attainment through language technology can provide the scaffolding to enhance creative endeavors. This study is a proof of concept on the automatic generation of keyword responses as found on the AUT (Alternative Uses Task), a test common...
Article
Full-text available
The recent improvements in neural MT (NMT) have driven a shift from statistical MT (SMT) to NMT. However, to assess the usefulness of MT models for post-editing (PE) and gain detailed insight into the output they produce, we need to analyse the most frequent errors and how they affect the task. We present a pilot study of a fine-grained analysis of...
Article
Full-text available
Post-editing of machine translation (MT) is now considered standard practice in the translation workflow, mainly because of the good quality obtained with neural machine translation (NMT). This is linked to the efforts made by language service providers and clients to reduce...
Conference Paper
Full-text available
Emotion intensity prediction determines the degree or intensity of an emotion that the author expresses in a text, extending previous categorical approaches to emotion detection. While most previous work on this topic has concentrated on English texts, other languages would also benefit from fine-grained emotion classification, preferably without h...
Conference Paper
Full-text available
The recent improvements in machine translation (MT) have boosted the use of post-editing (PE) in the translation industry. A new MT paradigm, neural MT (NMT), is displacing its corpus-based predecessor, statistical machine translation (SMT), in the translation workflows currently implemented because it usually increases the fluency and accuracy of...
Conference Paper
Full-text available
Post-editing of machine translation (PEMT) is now widely used in the translation industry. This is due to the increase in the demand for translation and to the significant improvements in quality achieved in recent years. PEMT has been included as part of the translation workflow because it increases translators' productivity and...
Preprint
Full-text available
Emotion intensity prediction determines the degree or intensity of an emotion that the author expresses in a text, extending previous categorical approaches to emotion detection. While most previous work on this topic has concentrated on English texts, other languages would also benefit from fine-grained emotion classification, preferably without h...
Poster
Full-text available
Catalan and Spanish are closely related languages derived from Latin. Rule-based and statistical systems yield good results in MT. Post-editing of machine translation (PEMT) has been a regular practice for these languages because it increases productivity and reduces costs. In recent years, neural MT has gained popularity because of the good...
Conference Paper
Full-text available
In recent years, we have witnessed an increase in the use of post-editing of machine translation (PEMT) in the translation industry. It has been included as part of the translation workflow because it increases translators' productivity. Currently, many Language Service Providers offer PEMT as a service. For many years now, (closely) related l...
Preprint
Full-text available
Current state-of-the-art models for sentiment analysis make use of word order either explicitly by pre-training on a language modeling objective or implicitly by using recurrent neural networks (RNNs) or convolutional networks (CNNs). This is a problem for cross-lingual models that use bilingual embeddings as features, as the difference in word ord...
Conference Paper
Full-text available
Current state-of-the-art models for sentiment analysis make use of word order either explicitly by pre-training on a language modeling objective or implicitly by using recurrent neural networks (RNNs) or convolutional networks (CNNs). This is a problem for cross-lingual models that use bilingual embeddings as features, as the difference in word ord...
Article
Full-text available
While sentiment analysis has become an established field in the NLP community, research into languages other than English has been hindered by the lack of resources. Although much research in multi-lingual and cross-lingual sentiment analysis has focused on unsupervised or semi-supervised approaches, these still require a large number of resources...
Conference Paper
Full-text available
Cross-lingual sentiment classification (CLSC) seeks to use resources from a source language in order to detect sentiment and classify text in a target language. Almost all research into CLSC has been carried out at sentence and document level, although this level of granularity is often less useful. This paper explores methods for performing aspect...
Article
Full-text available
This paper presents a methodology for the design and implementation of user-centred language checking applications. The methodology is based on the separation of three critical aspects in this kind of application: functional purpose (educational or corrective goal), types of warning messages, and linguistic resources and computational techniques us...
Article
In this study we show how complex creative relations can arise from fairly frequent semantic relations observed in everyday language. By doing this, we reflect on some key cognitive aspects of linguistic and general creativity. In our experimentation, we automated the process of solving a battery of Remote Associates Test tasks. By applying Statist...
Conference Paper
Full-text available
We present the NewSoMe (News and Social Media) Corpus, a set of subcorpora with annotations on opinion expressions across genres (news reports, blogs, product reviews and tweets) and covering multiple languages (English, Spanish, Catalan and Portuguese). NewSoMe is the result of an effort to increase the opinion corpus resources available in langu...
Article
This paper describes the automatic process of building a dependency annotated corpus based on Ancora constituent structures. The Ancora corpus already has a dependency structure information layer, but the new annotated data applies a purely syntactic orientation and offers in this way a new resource to the linguistic research community. The paper d...
Conference Paper
While collaborative filtering often yields very good recommendation results, in many real-world recommendation scenarios cold-start and data sparseness remain important problems. This paper presents a hybrid recommender system that integrates user demographics and item characteristics, around a collaborative filtering core based on user-item intera...
Article
Full-text available
This paper aims at enlarging the semantic treatment standardly assumed in HPSG in order to deal with several issues still not adequately solved, such as: optional verbal and nominal complements, the implication of participants and events that take part in the denotation of lexical items but are not syntactically expressible, the selection restricti...
Chapter
This paper aims at enriching the semantic treatment standardly assumed in HPSG in order to deal with several issues not adequately solved, concerning the representation of: verbal and nominal complement optionality, non-intersective uses of adjectives, selection restrictions imposed by predicates on their arguments, and the implication of syntactic...
Article
Full-text available
We present a study on the automatic acquisition of semantic classes for Catalan adjectives from distributional and morphological information, with particular emphasis on polysemous adjectives. The aim is to distinguish and characterize broad classes, such as qualitative (gran ‘big’) and relational (pulmonar ‘pulmonary’) adjectives, as well as to id...
Article
Full-text available
In this demo we present IAC (Interfaz de Acceso a Corpus, Corpus Access Interface), an on-line tool developed by Barcelona Media-Centro de Innovación and Universitat Pompeu Fabra that makes it possible to create dynamic interfaces for searching corpora...
Article
Full-text available
The aim of this research is to establish the role of linguistic information in data-scarce statistical machine translation for sign languages using freely available tools. The main challenge in statistical machine translation is the scarcity of suitable data, and this problem becomes more pronounced in sign languages. The available corpora are smal...
Conference Paper
Full-text available
This paper presents the second version of ESEDA, a speech emotion recognition tool. The current version of the tool has a number of novel capabilities as compared to the first version ESEDA.0 and other speech emotion recognition systems: firstly, it incorporates a novel classification method TGI+, and secondly, it includes a module for high level f...
Conference Paper
Full-text available
This demo paper presents a speech emotion recognition tool, based on standard supervised machine learning methods and enhanced with an additional block of classification error analysis and fixing. The fixing part incorporates two optimisations: classification decomposition and treatment of the minority class problem. Experimental results demonstrat...
Article
This article reports on a large-scale experiment for gathering human judgements with respect to a semantic classification of Catalan adjectives. The goal of our experiment was to classify 210 Catalan adjectives as basic, event-related, or object-related adjectives, allowing for multiple class assignments to account for polysemy. The experiment was...
Article
Full-text available
METIS-II was an EU-FET MT project running from October 2004 to September 2007, which aimed at translating free text input without resorting to parallel corpora. The idea was to use “basic” linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. The METIS-II project has four par...
Conference Paper
Full-text available
In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required and in which no full parser or extensive rule sets are needed. We describe the evaluation on a developme...
Article
Full-text available
In this paper we present a prototype translation system that uses only a source-language (SL) tagger, a bilingual dictionary and a lemmatised target-language (TL) corpus. In our approach, the TL corpus is innovatively exploited both for lexical selection (selecting among the different translations proposed by the dictionary) and for structure b...
Conference Paper
This paper assesses the role of multi-label classification in modelling polysemy for language acquisition tasks. We focus on the acquisition of semantic classes for Catalan adjectives, and show that polysemy acquisition naturally suits architectures used for multi-label classification. Furthermore, we explore the performance of information...
Article
Full-text available
We present an experimental Machine Translation prototype system that is able to translate between Spanish and English, using very basic linguistic resources. In our approach, no structural transfer rules are used to deal with structural divergences between the two languages: the target corpus is the basis both for lexical selection and for struct...
Article
Full-text available
This article explains the training possibilities in media accessibility in Spain, specifically in the fields of audio description, subtitling for the deaf and hard-of-hearing and sign language. A list of courses offered both at university and in companies is presented and a new proposal is put forward: an official master in media accessibility by t...
Article
The article begins by briefly analysing what we understand by MT. It then observes that the way of working to which many translators have become accustomed with translation memories can be reproduced perfectly well with an MT system, provided that the system has certain characteristics, which are analysed. Finally, it mentions two aspects...
Article
Full-text available
This paper describes a methodology aimed at grouping Catalan verbs according to their syntactic behavior. Our goal is to acquire a small number of basic classes with a high level of accuracy, using minimal resources. Information on syntactic class, expensive and slow to compile by hand, is useful for any NLP task requiring specific lexical informat...
Conference Paper
Full-text available
This paper presents the Multilingual Translation Service of eTITLE, a European eContent project, which has produced tools to assist in the multilingual subtitling of audiovisual material through the web. The eTITLE Translation Service combines state-of-the-art machine translation and translation memories, which may be tailored to the customer needs...
Article
Full-text available
In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the r...
Conference Paper
Full-text available
This paper presents CUCWeb, a 166 million word corpus for Catalan built by crawling the Web. The corpus has been annotated with NLP tools and made available to language users through a flexible web interface. The developed architecture is quite general, so that it can be used to create corpora for other languages.
Article
Full-text available
This paper presents the strategies used and tested during the life of the ALLES project (IST-2001-34246) in order to achieve quality feedback in distance language learning. The purpose of the ALLES project was to show the feasibility of creating more intelligent and individualised automatic correction modules on the basis of state-of-the-art...
Conference Paper
Full-text available
This paper presents CUCWeb, a 166 million word corpus for Catalan built by crawling the Web. The corpus has been annotated with NLP tools and made available to language users through a flexible web interface. The developed architecture is quite general, so that it can be used to create corpora for other languages.
Article
This paper describes an experiment devised to group Catalan verbs according to their syntactic behavior. Our goal is to acquire a small number of basic classes with a high level of accuracy, from relatively knowledge-poor resources. This information, expensive and slow to compile by hand, is useful for any NLP task requiring specific lexical info...
Conference Paper
Full-text available
In this paper we present an approach to Statistical Machine Translation that uses a bilingual dictionary and a target language model based on n-grams extracted from a monolingual corpus. This approach is still in an experimental stage and is being developed in the context of Metis-II, an EU project that aims at constructing free text translations by...
Conference Paper
Full-text available
This paper discusses the role of morphological and syntactic information in the automatic acquisition of semantic classes for Catalan adjectives, using decision trees as a tool for exploratory data analysis. We show that a simple mapping from the derivational type to the semantic class achieves 70.1% accuracy; syntactic function reaches a sligh...
Article
Full-text available
In this article we present an experimental statistical machine translation system based on n-grams. The system uses a parallel corpus and was initially conceived as an extension of a computer-assisted translation (CAT) system. The good results obtained for the Catalan-Spanish language pair have encouraged us to explore...
Conference Paper
The purpose of this paper is to present a strategy for the provision of language courses to learners of Spanish for specific purposes with intelligent feedback. As a byproduct, students can be recommended to proceed through a slightly different learning path in order to overcome their shortcomings. The paper is structured in 5 sections: section 1 i...
Article
In this paper, we present a clustering experiment directed at the acquisition of semantic classes for adjectives in Catalan, using only shallow distributional features. We define a broad-coverage classification for adjectives based on Ontological Semantics. We classify along two parameters (number of arguments and ontological kind of denotation),...
Conference Paper
Full-text available
This paper describes how mature NLP that has been successfully applied in the area of controlled language checking can be used to deliver intelligent CALL applications. It describes how an autonomous, long-distance second-language learning system for advanced learners can be created. The architecture of the system consists of a multimodal user in...
Conference Paper
Full-text available
This paper describes how mature NLP that has been successfully applied in the area of controlled language checking can be used to deliver intelligent CALL applications. It describes how an autonomous, long-distance second-language learning system for advanced learners can be created. The architecture of the system consists of a web-based multimodal u...
Article
Full-text available
This paper describes how mature NLP that has been successfully applied in the area of controlled language checking can be used to deliver intelligent CALL applications. It describes how an autonomous, long-distance second-language learning system for advanced learners can be created. The architecture of the system consists of a web-based multimodal u...
Article
This paper presents the Corpus d'Ús del Català a la Web (CuCWeb), a 208 million word (125,000 documents) corpus automatically compiled from the Web. This corpus has been automatically processed so that additional linguistic information is available (apart from the word forms). A very flexible search interface has been implemented, which allows for...
Article
We address the representation of nouns having complex argument structures like deverbal nominalisations. In particular we address the semantic representation of syntactically unexpressed arguments. We put forward a treatment of this kind of optional complements in a framework that combines HPSG syntax and the semantic approach in GL (Pustejovsky, 19...
Article
...formalism (CEC, 1994). 2.1 The TLR module. The main characteristic of the formalism is that it allows the linguist to express both the morphographemic and morphotactical contexts, thus constraining the application of TLRs. Thus a rule in CATMORF may make use of the following data structures: the Surface Left and Right morphographemic contexts; the Lex...
Article
Full-text available
CATCG is a shallow morphosyntactic analysis system for Catalan, based on the Constraint Grammar formalism, which contains three basic tools: a morphological analyser, a morphological tagger and a shallow syntactic parser.
Article
Full-text available
This paper describes how mature NLP that has been successfully applied in the area of controlled language checking can be used to deliver intelligent CALL applications. It describes how an autonomous, long-distance second-language learning system for advanced learners can be created. The architecture of the system consists of a multimodal user in...
Article
Full-text available
In this paper we discuss the limits of the usual treatments of syntax in NLP and propose integrating lexical semantics into them. To this end, we choose two models that are compatible both theoretically and formally: HPSG for syntactic representation and GL for the representation of lexical semantics. Finally, we show...
Article
Full-text available
Within two-level morphology, morphotactics is modelled either in continuation classes or in unification word grammars. Though for the first approach very efficient systems exist, the latter provides more elegant morphosyntactic parsing. If the latter approach is taken, the present two-level formalisms have some problems in dealing with morpholo...
Article
Full-text available
This article contains a proposal for the representation of predicative structures (that is, predicates with their arguments) in formalisms based on typed feature structures. The article begins with a discussion of the objectives and the level of description of the proposed representation, and then focuses on an exempl...
Article
Full-text available
In this paper we show the problems that the present two-level morphological formalisms have in dealing with several morphological phenomena involving interaction between two-level rules (TLR's) and the word grammar (WG). On the one hand, we demonstrate the non-modularity of their WG, since it cannot be designed according to linguistic criteria only...
Article
Summary of the doctoral thesis presented in 1992 within the doctoral programme Formalització del lenguatge natural of the ICE at the Universitat Politècnica de Catalunya.
Article
Full-text available
Substantial formal grammatical and lexical resources exist in various NLP systems and in the form of textbook specifications. In the present paper we report on experimental results obtained in manual, semi-automatic and automatic migration of entire computational or textbook descriptions (as opposed to a more informal reuse of ideas or the design o...
Article
Full-text available
This paper presents the characteristics of the Spanish grammar produced within the European Community project Eurotra. Building a grammar for Spanish has been one of the central tasks for the Madrid and Barcelona groups taking part in this project, and our paper therefore presents the results of...
Article
This paper focuses on the language processing tool being developed at our centre and briefly describes two of its applications. CATCG, our morphosyntactic analyser, is designed to deal with general written Catalan text. In CATCG the whole processing task has been divided into specific subtasks and for each one of them we try to apply the best strat...
Article
Full-text available
We present here a general-purpose spell and grammar error detection architecture for Catalan unrestricted text. This architecture is based on a pre-existing shallow morphosyntactic parser, which had to be adapted in order to successfully handle ill-formed input. The goal of this research is to obtain an architecture that can be used for develo...
Article
The goal of BancTrad is to offer the possibility to access and search through (parallel) annotated corpora via the Internet. This paper presents the design of the whole process, from text compilation and processing to actually performing queries via the web, and also describes its technical architecture. The languages we work with are Cat...
Article
This paper describes the free text processing strategy that is being set up in our institute. The system is designed to deal with general, written Catalan texts, as they appear in, say, daily newspapers. Our strategy has been to divide the whole processing into specific subtasks, applying to each of them the best strategy available. The main adv...
Article
Full-text available
We address the representation of nouns having complex argument structures like deverbal nominalisations. In particular we address the semantic representation of syntactically unexpressed arguments. We put forward a treatment of this kind of optional complements in a framework that combines HPSG syntax and the semantic approach in GL (Pustejo...