
Victoria Arranz
- PhD in Language Engineering
- Head of R&D at Evaluations and Language resources Distribution Agency, Paris
45 Publications
3,937 Reads
270 Citations
Current institution
Evaluations and Language resources Distribution Agency, Paris
Current position
- Head of R&D
Publications (45)
This deep dive on data, knowledge graphs (KGs) and language resources (LRs) is the last of the four technology deep dives, since data and the related models are the basis for technologies and solutions in the area of Language Technology (LT) for European digital language equality (DLE). This chapter focuses on the data and LRs required to achieve...
This chapter provides an overview of what is available in ELG in terms of datasets, corpora and other language resources (LRs) and how this has been achieved. We look at the procedures and steps that have been followed to complete the full resource ingestion cycle, which goes from repository and LR identification to metadata description and ingesti...
The European MAPA (Multilingual Anonymisation for Public Administrations) project aims at developing an open-source solution for automatic de-identification of medical and legal documents. We introduce here the context, partners and aims of the project, and report on preliminary results.
With 24 official EU and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT b...
The current scientific and technological landscape is characterised by the increasing availability of data resources and processing tools and services. In this setting, metadata have emerged as a key factor facilitating management, sharing and usage of such digital assets. In this paper we present ELG-SHARE, a rich metadata schema catering for the...
Machine translation (MT) has become increasingly important and popular in the past decade, leading to the development of MT evaluation metrics that aim to assess MT output automatically. Most of these metrics compare system output against reference translations; therefore, they should not only detect MT errors but also be able to identify correct...
The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform
that automates the stages involved in the acquisition, production, updating and
maintenance of the large language resources required by, among others, MT
systems. The development of a Corpus Acquisition Component (CAC) for extracting
monolingual and bilingual data from the...
This report elaborates on the exploitation of the PANACEA project assets. These assets have
been clustered into a few items: (a) the PANACEA Factory/Platform, (b) the web services
integrated within the platform, (c) the associated workflows to manage the sequencing of web
services, (d) the tools developed during the project and, last but not least, (e)...
This paper presents a metadata model for the description of language resources proposed in the framework of the META-SHARE infrastructure, aiming to cover both datasets and tools/technologies used for their processing. It places the model in the overall framework of metadata models, describes the basic principles and features of the model, elaborat...
This paper presents the metadata schema for describing language resources (LRs) currently under development for the needs of META-SHARE, an open distributed facility for the exchange and sharing of LRs. An essential ingredient in its setup is the existence of formal and standardized LR descriptions, cornerstone of the interoperability layer of an...
This paper describes the joint submission of Universitat Politècnica de Catalunya and Universitat de Barcelona to the Metrics MaTr 2010 evaluation challenge, in collaboration with ELDA/ELRA. Our work is aimed at widening the scope of current automatic evaluation measures from sentence to document level. Preliminary experiments, based on an extensio...
Abstract: This paper presents the quantitative and qualitative evaluation of a set of constituency and dependency parsers with the aim of using them in the development of a knowledge-based automatic metric for evaluating the output of machine translation systems. First, the methodology followed in both...
15 years have gone by and ELRA continues embracing the needs of the HLT community to design its services and to implement them through its operational body, ELDA. The needs of the community have become much more ambitious... Larger language resources (LRs), better quality ones (how do we reach a compromise between price – maybe free – and quality?),...
This paper presents the end-to-end evaluation of an automatic simultaneous translation system, built with state-of-the-art components. It shows whether, and for which situations, such a system might be advantageous when compared to a human interpreter. Using speeches in English translated into Spanish, we present the evaluation procedure and we dis...
The project described in this paper is funded by the French Ministry of Research. It aims at providing producers of Language Resources, and HLT players in general, with a guide which offers technical, legal and strategic recommendations/guidelines for the reuse of their Language Resources. The guide is dedicated in particular to academic laborat...
This paper describes the latest developments in ELRA's services within the field of Language Resources (LR). These developments focus on 4 main groups of activities: the identification and distribution of Language Resources; the production of LRs; the evaluation of Human Language Technology (HLT), and the dissemination of information in the field....
This paper describes the final evaluation of the FAME interlingua-based speech-to-speech translation system for Catalan, English and Spanish. It is an extension of the already existing NESPOLE! System that translates between English, French, German and Italian. However, the FAME modules have now been integrated in an Open Agent Architecture platfor...
This chapter provides an overview of available language resources, from both U.S. and European perspectives. Multilingual data repositories as well as large ongoing and planned collection efforts are introduced, along with a description of the major challenges of collection efforts, such as transcription issues due to inconsistent writing standards...
In 2008 the Olympic Games will be held in Beijing. For this purpose the city government of Beijing has launched the Special Programme for Construction of Digital Olympics. One of the objectives of the programme is the use of artificial intelligence technology to overcome language barriers during the games. In order to demonstrate the contributio...
This paper describes the FAME Interlingua-based Speech-to-Speech Translation System for Catalan, English and Spanish. This is an extension of the already existing NESPOLE! system, which translates between English, French, German and Italian, but all modules have now been integrated in an Open Agent Architecture. This article describes the system architecture...
This paper studies the impact of multiword expressions on Word Sense Disambiguation (WSD). Several identification strategies
of the multiwords in WordNet2.0 are tested in a real Senseval-3 task: the disambiguation of WordNet glosses. Although we have
focused on Word Sense Disambiguation, the same techniques could be applied in more complex tasks, s...
This paper describes the “FAME” multi-modal demonstrator, which integrates multiple communication modes – vision, speech and object manipulation – by combining the physical and virtual worlds to provide support for multi-cultural or multi-lingual communication and problem solving.
The major challenges are automatic perception of human actions and u...
This paper describes the evaluation of the FAME interlingua-based speech-to-speech translation system for Catalan, English and Spanish. This system is an extension of the already existing NESPOLE! that translates between English, French, German and Italian. This article begins with a brief introduction followed by a description of the system archit...
Creation of lexica and corpora for Catalan, Spanish and US-English is described. A lexicon is being created for speech recognition and synthesis including relevant information. The lexicon contains 50K common words selected to achieve a wide coverage on the chosen domains, and 50K additional entries including special application words, and proper...
This paper focuses on the strategies adopted to tackle problematic input and ease communication between modules in a Spanish
railway information dialogue system for spontaneous speech. The paper describes the design and tuning considerations followed
by the understanding module, both from a language processing and semantic information extraction po...
This paper describes on-going work on the development of two complementary resources: WordMed® and Scriptum®. The former is a lexico-conceptual knowledge base (KB) comprising information from four medical sub-domains (diagnostics, procedures, tumors and medicines). This resource is only accessible for the language and domain expert in charge of sup...
This paper focuses on the increasing need for a more natural and sophisticated human-machine interaction (HMI). The research here presented shows work on the development of a restricted-domain spontaneous speech dialogue system in Spanish.
This human-machine interface is oriented towards a semantically restricted domain: Spanish railway information...
This paper focuses on the general problem of the lexical bottleneck and, in particular, on the issues of semantic clustering and disambiguation by means of word usage cues obtained from sublanguage-specific corpora. Our approach combines the use of numerical techniques with some symbolic modules. Our numerical tool Dynamic Context Matching is sup...
This paper describes the design and development of a trilingual spontaneous speech corpus for statistical speech-to-speech translation. The languages considered are Catalan, Spanish and US-English. This corpus has been built bearing in mind the strong need for multi-lingual collections of on-line data within the area of statistical translation, as...
This paper describes the creation of linguistically enriched aligned corpora for Catalan, Spanish and US-English for Speech-to-Speech Translation. These corpora are obtained from two different sources: US-English transcribed speech data and transcriptions of conversations recorded in Catalan and Spanish. After human translation, a large trilingual...
Machine translation evaluation campaigns require the production of reference corpora to automatically measure system output. This paper describes recent efforts to create such data with the objective of measuring the quality of the systems participating in the Quaero evaluations. In particular, we focus on the protocols behind such production as...
In the last decades, a wide range of automatic metrics that use linguistic knowledge has been developed. Some of them are based on lexical information, such as METEOR; others rely on the use of syntax, either using constituent or dependency analysis; and others use semantic information, such as Named Entities and semantic roles. All these metrics w...
In this document, we propose a new unique and universal identification schema for Language Resources, giving each resource a unique name under a standardized nomenclature. This will also ensure that Language Resources can be identified, and consequently recognized with proper references, in activities within Human Language Technologies a...
This paper emphasises the need to develop efficient lexical knowledge acquisition techniques in order to tackle problems related to the so-called lexical bottleneck. Bearing this in mind, a semi-automatic technique for semantic clustering and word sense disambiguation is proposed. The main principles behind this method are the extraction of knowled...