
Antoni Oliver · Universitat Oberta de Catalunya | UOC · Arts and Humanities Studies
PhD in Linguistics, Slavonic Philology, Telecommunication Engineering, Master in Free Software
113 publications
37,562 reads
446 citations
Introduction
I'm a lecturer at the Universitat Oberta de Catalunya (UOC) in Barcelona. I'm the director of the master's degree in Translation and Technologies. I teach subjects related to translation technologies and natural language processing. I hold a PhD in Linguistics, a BA in Slavonic Philology, a BS in Telecommunication Engineering and an MS in Free Software.
I'm involved in several research projects related to Computational Linguistics applied to Translation and Language Learning.
Additional affiliations
October 2006 - present
Education
September 1999 - July 2004
Publications (113)
In this article we present the compilation process of the new version of the Catalan-Spanish parallel corpus created from the texts of the Diari Oficial de la Generalitat de Catalunya (DOGC). The processes of downloading, conversion to text, segmentation and automatic alignment are described. All the programs developed to carry out these...
The recent improvements in neural MT (NMT) have driven a shift from statistical MT (SMT) to NMT. However, to assess the usefulness of MT models for post-editing (PE) and gain detailed insight into the output they produce, we need to analyse the most frequent errors and how they affect the task. We present a pilot study of a fine-grained analysis of...
Workshop on post-editing research methods and concepts at IATIS 2021 Congress.
The present study is based on the theory of the relationship between the direct object (CD) and the prepositional object (CR) in Spanish, namely the possibility of their co-occurrence in the same predicate (Alarcos, 1966; Bosque, 1983; Rojo, 1983) and the predicates in which these two complements can alternate, in order to analyse the differences in the translation of these...
This article presents Cadlaws, a new English–French corpus built from Canadian legal documents, and describes the corpus construction process and preliminary statistics obtained from it. The corpus contains over 16 million words in each language and includes unique features since it is composed of documents that are legally equivalent in both langu...
In this paper we describe the building, manual annotation and analysis of a balanced corpus to assess conceptual metaphors on mental illness as used in Spanish blogger writing by patients and mental health professionals. The corpus was structured as eight subgroups: four patient subgroups (composed of persons who declared having been diagnosed with...
Currently, post-editing of machine translation (MT) is considered a common practice in the translation workflow, mainly because of the good quality obtained with neural machine translation (NMT). This is associated with the efforts that language service providers and clients have made to reduce the...
In this chapter we build a machine translation (MT) system tailored to the literary domain, specifically to novels, based on the state-of-the-art architecture in neural MT (NMT), the Transformer (Vaswani et al., 2017), for the translation direction English-to-Catalan. Subsequently, we assess to what extent such a system can be useful by evaluating...
INMIGRA3 is a three-year project that builds on the work of two previous initiatives: INMIGRA2-CM and CRISIS-MT. Together, they address the specific needs of NGOs in multilingual settings with a particular interest in migratory contexts. Work on INMIGRA3 concentrates on the analysis of how to use NMT for the purposes of translating NGOs documen...
In this paper we present a novel resource-inexpensive architecture for metaphor detection based on a residual bidirectional long short-term memory and conditional random fields. Current approaches to this task rely on deep neural networks to identify metaphorical words, using additional linguistic features or word embeddings. We evaluate our propos...
The recent improvements in machine translation (MT) have boosted the use of post-editing (PE) in the translation industry. A new MT paradigm, neural MT (NMT), is displacing its corpus-based predecessor, statistical machine translation (SMT), in the translation workflows currently implemented because it usually increases the fluency and accuracy of...
Post-editing of machine translation (PEMT) is currently in widespread use in the translation industry. This is due to the increase in the demand for translation and to the significant improvements in quality achieved in recent years. PEMT has been included as part of the translation workflow because it increases translators' productivity and...
In this paper, a tool specifically designed to allow for complex searches in large parallel corpora is presented. The formalism for the queries is very powerful as it uses standard regular expressions that allow for complex queries combining word forms, lemmata and POS-tags. As queries are performed over POS-tags, at least one of the languages in t...
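As a rough illustration of how a standard regular expression can combine word forms, lemmata and POS-tags in one query, here is a minimal sketch. It is not the tool's actual implementation: the `form|lemma|POS` serialisation, the function names `encode` and `search`, the Penn-style tags and the toy English-Catalan corpus are all assumptions made for this example.

```python
import re

def encode(tokens):
    """Serialise (form, lemma, pos) triples as 'form|lemma|POS' items separated by spaces."""
    return " ".join(f"{form}|{lemma}|{pos}" for form, lemma, pos in tokens)

def search(pattern, corpus):
    """Return the indices of aligned pairs whose tagged source side matches the pattern."""
    regex = re.compile(pattern)
    return [i for i, (src_tokens, _target) in enumerate(corpus)
            if regex.search(encode(src_tokens))]

# Hypothetical aligned corpus: tagged source tokens plus a plain target segment.
corpus = [
    ([("The", "the", "DT"), ("big", "big", "JJ"), ("houses", "house", "NNS")],
     "Les cases grans"),
    ([("She", "she", "PRP"), ("houses", "house", "VBZ"), ("refugees", "refugee", "NNS")],
     "Ella allotja refugiats"),
]

FIELD = r"[^|\s]+"  # one field (form, lemma or tag) with no separator inside
# Query: any adjective followed by any noun form of the lemma "house".
ADJ_NOUN_HOUSE = rf"{FIELD}\|{FIELD}\|JJ {FIELD}\|house\|NN\S*"

hits = search(ADJ_NOUN_HOUSE, corpus)
```

Only the first pair matches, because in the second pair "houses" is tagged as a verb. Since the query runs over the tagged side, that side of the corpus has to be POS-tagged beforehand.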
In this paper we propose a neural network approach to detect the metaphoricity of Adjective-Noun pairs using pre-trained word embeddings and word similarity using dot product. We found that metaphorical word pairs tend to have a lower dot product score while literal pairs a higher score. On this basis, we compared seven optimizers and two activatio...
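The dot-product score mentioned in this abstract can be sketched as follows. This is only an illustration of the scoring step, not the authors' network: the three-dimensional vectors are invented toy values (real experiments would load pre-trained embeddings), and `dot`, `pair_score` and `EMBEDDINGS` are hypothetical names.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy 3-dimensional vectors, invented for illustration only (not real embeddings).
EMBEDDINGS = {
    "cold": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.1],
    "heart": [0.1, 0.3, 0.9],
}

def pair_score(adjective, noun, embeddings=EMBEDDINGS):
    """Score an adjective-noun pair by the dot product of its two word vectors."""
    return dot(embeddings[adjective], embeddings[noun])

literal = pair_score("cold", "water")        # pair intended as literal
metaphorical = pair_score("cold", "heart")   # pair intended as metaphorical
```

With these made-up vectors the literal pair scores higher than the metaphorical one, which mirrors the tendency the abstract reports. In a real setting such scores would presumably be one input to a trained classifier rather than being compared directly.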
Catalan and Spanish are closely-related languages derived from Latin. Rule-based and statistical-based systems yield good results in MT. Post-editing of machine translation (PEMT) has been a regular practice for these languages because it increases productivity and reduces costs. In recent years, neural MT has gained popularity because of the good...
In recent years, we have witnessed an increase in the use of post-editing of machine translation (PEMT) in the translation industry. It has been included as part of the translation workflow because it increases the productivity of translators. Currently, many Language Service Providers offer PEMT as a service. For many years now, (closely) related l...
The more language service companies (LSCs) include machine translation post-editing (MTPE) in their workflows, the more important it is to know how the PE task is performed, who the post-editors are, and what skills they should have. This research is designed to address such questions. It aims to deepen our knowledge of current practices to later c...
Linguistic resources available in the form of open data are an essential source of information for creating e-dictionaries, but access to these linguistic resources is still limited. This paper presents a method for maximising use of open access linguistic resources and integrating them into specialised e-dictionaries. The method combines automatic...
This article is based on the theory about the relationship between the direct object (spa. complemento directo – CD) and the prepositional object (spa. complemento de régimen – CR) in Spanish, mainly the possibility of their co-occurrence in the same predicate (Alarcos, 1966; Bosque, 1983; Rojo, 1983), as well as the predicates in which these two c...
The MOMENT project aims to contribute to a better understanding of severe mental disorders by analyzing the discourse of the two main groups involved, affected people and mental health professionals, in the light of the Conceptual Metaphor Theory and Corpus Linguistics methodology. In this framework, a corpus of first-person accounts from both grou...
The identification of reliable terms from domain-specific corpora using computational methods is a task that has to be validated manually by specialists, which is a highly time-consuming activity. To reduce this effort and improve term candidate selection, we implemented the Token Slot Recognition method, a filtering method based on terminological...
This article presents a system for the automatic creation of bilingual books with aligned texts. The system makes it possible to create e-books in which each sentence in the source language is linked to the corresponding sentence in the target language. Users can read in the original language and see the corresponding sentence in the...
In the field of specialised translation, Wikipedia is not considered a reliable specialised information resource because any user, whether or not a specialist in the subject, can write an article. This article aims to determine the degree of reliability of Wikipedia as a specialised terminological resource for translators...
In this paper we present a system for automatic terminology extraction and automatic detection of the equivalent terms in the target language to be used alongside a computer assisted translation (CAT) tool that provides term candidates and their translations in an automatic way each time the translator goes from one segment to the next one. The sys...
http://www.editorialuoc.com/herramientas-tecnologicas-para-traductores
This book presents a clear and in-depth overview of the technologies currently applied in the world of translation: computer-assisted translation tools, machine translation, and terminology extraction and management. The work presents both the principles...
In this paper we present an extension of the dictionary-based strategy for wordnet construction implemented in the WN-Toolkit. This strategy allows the extraction of information for polysemous English words if definitions and/or semantic relations are present in the dictionary. The WN-Toolkit is a freely available set of programs for the creation...
In this article we present TMX (Translation Memory eXchange), the standard format for exchanging translation memories. We review the concept of translation memory and the uses that make it one of the translator's main resources. We look at strategies for quickly retrieving the segments most similar to...
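To make the TMX format concrete, the sketch below parses a tiny hand-written TMX 1.4 sample with Python's standard library and retrieves the stored segment closest to a new sentence. The sample content, the function names `load_tmx` and `best_match`, and the use of `difflib` as the similarity measure are assumptions for illustration; the article's own retrieval strategies are not reproduced here.

```python
import difflib
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit translation memory (Spanish to English).
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0.1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="es" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="es"><seg>Abra el archivo.</seg></tuv>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="es"><seg>Guarde el archivo.</seg></tuv>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def load_tmx(text, src="es", tgt="en"):
    """Read translation units into a {source segment: target segment} dict."""
    memory = {}
    for tu in ET.fromstring(text).iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            memory[segs[src]] = segs[tgt]
    return memory

def best_match(query, memory):
    """Return the (source, target) pair whose source is most similar to the query."""
    candidates = difflib.get_close_matches(query, list(memory), n=1, cutoff=0.0)
    if not candidates:
        return None
    return candidates[0], memory[candidates[0]]

memory = load_tmx(TMX_SAMPLE)
match = best_match("Abra el fichero.", memory)
```

Here `match` holds the unit for "Abra el archivo.", the closest stored source segment. A linear scan with `difflib` is fine for a toy memory, but it would be slow on a large one, where an index would be needed for fast retrieval.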
The manual identification of terminology from specialized corpora is a complex task that needs to be addressed by flexible tools, in order to facilitate the construction of multilingual terminologies which are the main resources for computer-assisted translation tools, machine translation or ontologies. The automatic terminology extraction tools de...
In this paper an automatic morphology learning system for complex and agglutinative languages is presented. We process complex agglutinative morphology of Indian languages using Adaptor Grammars and linguistic rules of morphology. Adaptor Grammars are a compositional Bayesian framework for grammatical inference, where we define a morphological gram...
Wordnet is a standard semantic resource for several Natural Language Processing tasks and it is available for an increasing number of languages. The Croatian Wordnet (CroWN) was a relatively small resource with 10,026 synsets and 31,367 synset-variant pairs covering only 45.91% of the so-called Core WordNet. Comparing these figures with the size of...
Presentation at the Computational Lexicology & Terminology Lab (Vrije Universiteit, Amsterdam, The Netherlands). http://www.cltl.nl/publications/presentations/antoni-oliver-gonzalez/
In this paper the methodology and a detailed evaluation of the results of the expansion of the Galician WordNet using the WN-Toolkit are presented. This toolkit allows the creation and expansion of wordnets using the expand model. In our experiments we have used methodologies based on dictionaries and parallel corpora. The evaluation of the results...
The InLéctor project aims to promote reading in the original language, offering an interactive scenario which facilitates foreign language teaching and self-learning, as well as the study of literature. In order to achieve this aim, the project develops computational techniques which provide automatic generation of bilingual e-books, incorporating d...
In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling...
This paper presents a set of methodologies and algorithms to create WordNets following the expand model. We explore dictionary and BabelNet based strategies, as well as methodologies based on the use of parallel corpora. Evaluation results for six languages are presented: Catalan, Spanish, French, German, Italian and Portuguese. Along with the meth...
At times it is difficult to automatically identify the most representative terms in a specialized corpus and to validate them as correct due to the similarity of words and terms. In order to identify the most representative terms in a corpus that can be easily adapted to any language or terminology extraction tool, we explore the combination of toke...
In this article we present the InLéctor project for the creation of interactive bilingual e-books. The aim of the project is to develop a series of applications for the automatic creation of bilingual e-books. These books allow the reader to switch from the original text to the translation with a single click and are published in the...
In this article we present a set of programs that facilitate the creation of WordNets from bilingual dictionaries using the expand strategy. The programs are written in Python and are therefore cross-platform. Although they have no graphical user interface, they are very easy to use. These programs have been used...
This article offers a review of methods for building WordNets following the expand strategy, that is, by translating the English variants of the Princeton WordNet. Free resources available on the Internet were used in the construction process. The article also presents the results of the evaluation of...
This project aims to develop a system that generates interactive bilingual books with audio. The system will offer several output formats for reading and listening to the books on different devices, such as e-book readers, tablets and computers. It will also offer the possibility of obtaining printed parallel books. Keywords...
This paper presents a review of methods for building WordNets following the expand model, that is, by translating the English variants of the Princeton WordNet. Only free resources available online have been used. The paper also presents the evaluation of the techniques applied in the construction of Spanish and Catalan WordNets 3.0. These techniqu...
The aim of this project is the development of a system for the generation of interactive bilingual electronic books with audio support. The system will offer several output formats for reading and listening to the books on different devices such as electronic books, tablets and computers. It will also allow printing a parallel bilingual book. © 2012 S...
In this article we present the state of the art in the use of Wikipedia for tasks related to natural language processing, and three applications that we have created to enrich a large-scale linguistic resource: WordNet version 3.0 for Catalan and Spanish. Researchers in this area have been seeking ways for years so that...
In this paper we present a methodology for WordNet construction based on the exploitation of parallel corpora with semantic annotation of the English source text. We are using this methodology for the enlargement of the Spanish and Catalan versions of WordNet 3.0, but the methodology can also be used for other languages. As big parallel corpora wit...
This paper describes a methodology for the construction of WordNets based on machine translation of an English sense-tagged corpus. For the construction of such a corpus we use two freely available resources: the Semcor Corpus and the Princeton WordNet Gloss Corpus. This methodology is being used for the construction of Spanish and Catalan WordNe...
This document describes the criteria that determine the set of multiwords collected in the electronic resource WordNet 3.0 (http://adimen.si.ehu.es/cgi-bin/wei/public/wei.consult.perl). Since this is an applied proposal restricted to one specific resource, it does not set out to be a theoretical proposal...
Spelling and grammar checking has become a daily activity for almost all word processor users. Usually these tools offer limited information about the misspelling or the grammar error and in certain cases suggest one or more possible alternatives. Sometimes users make the same mistakes day after day because they don't know the real reason...
In this demonstration we present a first prototype of an assistant for improving writing in Catalan. The system goes beyond a simple grammar checker, since it suggests links to grammars and exercises that allow users to practise the aspects in which they are weakest. The system also works as a level evaluator and...
This article describes a methodology for building WordNets based on machine translation of a sense-disambiguated English corpus. The corpus we use consists of the semantically tagged glosses of WN 3.0 themselves and the Semcor corpus. The precision results are comparable to those obtained by...
This paper describes a methodology for the construction of WordNets based on machine translation of an English sense tagged corpus. We use the Semcor corpus and the WordNet 3.0 sense tagged glosses as a corpus. Precision results are comparable to those obtained by methods based on bilingual dictionaries for the same languages. This methodology is b...
In this demo we present a first prototype of an assistant for the improvement of writing skills in Catalan. The system is more than a grammar checker as it proposes links to grammatical explanations and exercises, allowing the user to practice specific aspects. The program also works as a level evaluator and makes it possible to track the user's improveme...
The automatic detection of lexical units of a specialised nature in a given area of knowledge is one of the key challenges in the organisation and retrieval of information. This communication addresses the use of different statistical strategies with a view to automatically extracting terminological units from a specialist area to retrieve...
This refresher course is designed as a general introduction to the concepts and tools needed to manage translation projects: determining human and IT resources, calculating volume and cost, formats, quality control, workflow control, etc. ...
The Linguamón-UOC Chair in Multilingualism of the Universitat Oberta de Catalunya has developed a project consisting of the automatic elaboration of linguistic resources, including ones for Catalan. For this, we have created an automatic extractor of terminology, which is freely distributed, multi-platform and adaptable to users' needs. One of its mo...
Multilingualism is a reality in the 21st century, and new technologies have emerged as a powerful new way to cope with its main issues and the challenges its treatment implies. In this sense, a great amount of work has been carried out over the last twenty years in the field of Language Engineering and Applied Linguistics. A big effort has been made to de...
The Linguamón-UOC Chair in Multilingualism of the Universitat Oberta de Catalunya has developed an automatic extractor of terminology, which is freely distributed, multi-platform and adaptable to users' needs. One of its most useful applications is the elaboration of glossaries, both monolingual and multilingual, based on a set of...
Multilingualism is a reality of the 21st century, and the development of new technologies in recent decades has proved to be an effective way of dealing with it. Today, communities are increasingly multicultural and society needs to equip itself with tools capable of managing the multilingualism that results from this condition. The U...
This paper presents the complete and consistent annotation of the nominal part of the EuroWordNet (EWN). The annotation has been carried out using the semantic features defined in the EWN Top Concept Ontology. Up to now only an initial core set of 1024 synsets, the so-called Base Concepts, were ontologized in such a way.
This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and made available to the NLP community. Up to now only an initial core set of 1,024 synsets, the so-called Base Concepts, was ontolo...
The UOC, within the framework of the Linguamón-UOC Chair in Multilingualism, has developed a virtual learning environment with an integrated machine translation system. Thanks to this project, which works with free software applications Moodle and Apertium, a multilingual learning environment can be provided in Catalan, English, French and Spanish....
In this paper we will present a set of terminology extraction tools that are distributed under a Free Software License, so that users can freely download, use, distribute and modify them to meet their needs. The tools are mainly programmed in Perl and they will work under different platforms, such as Windows or Linux. These terminology extraction...
In this article we present LexTerm, a free and open-source tool for automatic terminology extraction. This tool makes it easier to select the most relevant terms that must have a consistent translation equivalent. Many translators and some translation agencies still perform this task...
This chapter presents an in-depth linguistic evaluation of a corpus of messages posted in several bilingual newsgroups in Catalonia (Spain). The social context is a situation of bilingualism and language contact where Spanish seems to be progressively overtaking Catalan as the language of daily use. The decline of Catalan might be prevented by inte...
In this paper we present a prototype translation system that uses only a source-language (SL) tagger, a bilingual dictionary and a lemmatised target-language (TL) corpus. In our approach, the TL corpus is innovatively exploited both for lexical selection (selecting among the different translations proposed by the dictionary) and for structure b...
Translation into Spanish by Antoni Oliver in Editorial UOC ISBN: 978-84-9788-740-3
Many machine translation products and services can be found on the Internet. In this article we present a functional classification of these products and services, with special emphasis on those that may be most useful to a professional translator. We do not present an exhaustive guide to machine translation products or services but rather...
This paper presents the Multilingual Translation Service of eTITLE, a European eContent project, which has produced tools to assist in the multilingual subtitling of audiovisual material through the web. The eTITLE Translation Service combines state-of-the-art Machine translation and Translation memories, which may be tailored to the customer needs...
This paper presents the INTERLINGUA project: its design and the work carried out so far. The aim of INTERLINGUA is to implement a fully automatic translation environment (with no pre-editing or post-editing) for e-mail messages on the Virtual Campus of the Universitat Oberta de Catalunya (UOC). It describes the strategy...
As the creation of computational (verb) lexicons is a hugely time-consuming task, tagged corpora appear to be a very useful resource for inducing verb knowledge. In this paper we present a multilingual verb lexicon with syntactic and semantic information for four languages. For three of them (Catalan, Basque and Spanish) this lexicon is induced from...
In this paper we present an approach to Statistical Machine Translation that uses a bilingual dictionary and a target language model based on n-grams extracted from a monolingual corpus. This approach is still in an experimental stage and is being developed in the context of Metis-II, an EU project that aims at constructing free text translations by...
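The general idea of combining a bilingual dictionary with target-language n-grams can be sketched as below. This is a deliberately crude stand-in, not the Metis-II system: it scores word-by-word candidates by summing raw bigram counts instead of using a proper probabilistic model, it does no reordering, and the dictionary, the monolingual corpus and the names `bigram_counts`, `score` and `translate` are all hypothetical.

```python
from collections import Counter
from itertools import product

def bigram_counts(sentences):
    """Count bigrams (with sentence boundary markers) in a monolingual target corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts

def score(tokens, counts):
    """Crude fluency score: sum of the corpus counts of the candidate's bigrams."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return sum(counts[bigram] for bigram in zip(padded, padded[1:]))

def translate(source_tokens, dictionary, counts):
    """Pick, among all word-by-word dictionary translations, the best-scoring one."""
    options = [dictionary.get(token, [token]) for token in source_tokens]
    return max(product(*options), key=lambda candidate: score(candidate, counts))

# Hypothetical Spanish-English dictionary and English monolingual corpus.
dictionary = {
    "el": ["the"],
    "banco": ["bench", "bank"],
    "está": ["is"],
    "cerrado": ["closed"],
}
target_corpus = [
    "the bank is closed today",
    "the bank is open",
    "the bench is green",
]

counts = bigram_counts(target_corpus)
best = translate(["el", "banco", "está", "cerrado"], dictionary, counts)
```

With this toy data the candidate containing "bank" outscores the one containing "bench", because "the bank" and "bank is" are more frequent in the monolingual corpus. Enumerating every combination grows exponentially with sentence length, so anything beyond a demonstration would need a beam or dynamic-programming search.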
The Open University of Catalonia (UOC) has set up a programme for the integration of automatised translation techniques and assisted translation in order to process the large amount of Catalan and Spanish teaching documents that its virtual courses produce. After revising the problem and various experiences with the application of these linguistic...
In this article we present an experimental statistical machine translation system based on n-grams. The system uses a parallel corpus and was initially conceived as an extension of a computer-assisted translation (CAT) system. The good results obtained for the Catalan-Spanish language pair have encouraged us to explore...
We present the METIS-II project, aimed at creating a Statistical Machine Translation system which uses only a monolingual corpus of the target language and a bilingual dictionary, thus eliminating the need for parallel corpora to train the system.
This thesis presents several methodologies for the automatic acquisition of lexical and morphosyntactic information and for the unsupervised learning of morphology from unannotated corpora. The methodologies we present have been tested on two Slavic languages, Russian and Croatian, which are characterised by a very rich morphology...
In this paper we present an English grammar and style checker for non-native English speakers. The main characteristic of this checker is the use of an Internet search engine. As the number of web pages written in English is immense, the system hypothesizes that a piece of text not found on the Web is probably badly written. The system also hypothe...
This article presents a series of basic concepts of quality applied to translation. To achieve good levels of quality in our translations we must take the utmost care over the process, which means designing clear workflows and using the right tools, and we must apply sound strategies for control...
This paper presents a linguistic analysis of a corpus of messages written in Catalan and Spanish, which come from several informal newsgroups on the Universitat Oberta de Catalunya (Open University of Catalonia; henceforth, UOC) Virtual Campus. The surrounding environment is one of extensive bilingualism and contact between Spanish and Catalan. The...
This paper presents a methodology for automatic acquisition of lexical resources from raw corpora. This methodology has proved to be efficient for those languages that, like Russian, present a rich and mainly concatenative morphology. This method can be applied for the creation of new resources, as well as in the enrichment of existing ones. We als...
In this article we present a methodology for the automatic acquisition of lexical and morphosyntactic information from unannotated corpora. The system uses information about the inflectional morphology of the language to be processed, as well as lexical and morphosyntactic information about words belonging to non-inflecting classes and about the...
In this paper, we present the INTERLINGUA Project: its design and current work. The goal of the project is to achieve fully automatic (no pre-editing, no post-editing) translation of e-mails in the Virtual Campus of the UOC. The problem of unsupervised machine translation of e-mails is discussed. Then we describe the strategy designed to build the s...
This paper presents a methodology for the automatic acquisition of lexical and morpho-syntactic information from raw corpora. The system uses information about the inflectional morphology declared by rules and is based on the co-occurrence of different forms of the same paradigm in the corpus. A direct application of this methodology gives very poo...