Marta Villegas

Barcelona Supercomputing Center

About

62
Publications
7,512
Reads
596
Citations

Publications (62)
Chapter
Natural language processing (NLP) is increasingly applied to a broad range of sensitive tasks, such as human resources, biomedicine, and healthcare. Accordingly, a growing body of research is investigating the impact of sex and gender bias in the models and the data on which such models are trained. As NLP systems become more pervasive in our socie...
Preprint
The Catalan Language Understanding Benchmark (CLUB) encompasses various datasets representative of different NLU tasks that enable accurate evaluations of language models, following the General Language Understanding Evaluation (GLUE) example. It is part of AINA and PlanTL, two public funding initiatives to empower the Catalan language in the Artif...
Preprint
Full-text available
Many language models exist for English, reflecting its worldwide relevance. For Spanish, however, even though it is a widely spoken language, there are very few language models, and those that exist turn out to be small and too general. Legal slang could be thought of as a Spanish variant of its own, as it is very complicated in vocabula...
Preprint
Full-text available
We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawl of 3000 Spanish domains executed in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical...
Preprint
Full-text available
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices, such as masking at word and subword level, varying the vocabulary size and testing with domain data, looking for better language representations. Interestingly, in the absence of enough clinical data to train a model from scra...
Preprint
Full-text available
Multilingual language models have been a crucial breakthrough as they considerably reduce the need for data for under-resourced languages. Nevertheless, the superiority of language-specific models has already been proven for languages having access to large amounts of data. In this work, we focus on Catalan with the aim to explore to what extent a m...
Preprint
Full-text available
This paper presents the Spanish RoBERTa-base and RoBERTa-large models, as well as the corresponding performance evaluations. Both models were pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of S...
Preprint
Full-text available
In this paper, we present an overview of the eighth edition of the BioASQ challenge, which ran as a lab in the Conference and Labs of the Evaluation Forum (CLEF) 2020. BioASQ is a series of challenges aiming at the promotion of systems and methodologies for large-scale biomedical semantic indexing and question answering. To this end, shared tasks a...
Preprint
Full-text available
The training of neural networks is usually monitored with a validation (holdout) set to estimate the generalization of the model. This is done instead of measuring intrinsic properties of the model to determine whether it is learning appropriately. In this work, we suggest studying the training of neural networks with Algebraic Topology, specifical...
Preprint
Full-text available
We computed both word and sub-word embeddings using FastText. For the sub-word embeddings, we selected the Byte Pair Encoding (BPE) algorithm to represent the sub-words. We evaluated the biomedical word embeddings and obtained better results than previous versions, which suggests that more data yields better representations.
Preprint
Full-text available
Artificial Neural Networks (ANNs) are widely used for approximating complex functions. The process that is usually followed to define the most appropriate architecture for an ANN given a specific function is mostly empirical. Once this architecture has been defined, weights are usually optimized according to the error function. On the other hand, w...
Preprint
Full-text available
Researchers who work for the same institution use email as their main communication tool. Email can be one of the most fruitful attack vectors against research institutions, as it also provides access to all accounts and thus to all private information. We propose an approach for analyzing in terms of security research institutions' communication n...
Preprint
Full-text available
Background: The propagation of COVID-19 in Spain prompted the declaration of the state of alarm on March 14, 2020. On 2 December 2020, the infection had been confirmed in 1,665,775 patients and caused 45,784 deaths. This unprecedented health crisis challenged the ingenuity of all professionals involved. Decision support systems in clinica...
Chapter
In this paper, we present an overview of the eighth edition of the BioASQ challenge, which ran as a lab in the Conference and Labs of the Evaluation Forum (CLEF) 2020. BioASQ is a series of challenges aiming at the promotion of systems and methodologies for large-scale biomedical semantic indexing and question answering. To this end, shared tasks a...
Chapter
This paper describes the eighth edition of the BioASQ Challenge, which will run as an evaluation Lab in the context of CLEF2020. The aim of BioASQ is the promotion of systems and methods for highly precise biomedical information access. This is done through the organization of a series of challenges (shared tasks) on large-scale biomedical semantic...
Article
Full-text available
Clinical and biomedical text mining research efforts have so far focused mainly on documents written in English. These efforts benefited significantly from the availability, not only of domain-specific components such as tokenizers or Part-of-Speech taggers, but particularly from the access to very large training corpora and terminological resour...
Article
Full-text available
Currently, most of the clinical data produced by healthcare professionals still consists of unstructured data in the form of clinical narrative texts [1]. In order to enable a better exploitation of information contained in electronic health records (EHRs) by precision medicine and data mining approaches, it is key to extract relevant information b...
Article
Full-text available
Healthcare professionals are generating a substantial volume of clinical data in narrative form. As healthcare providers are confronted with serious time constraints, they frequently use telegraphic phrases, domain-specific abbreviations and shorthand notes. Efficient clinical text processing tools need to cope with the recognition and resolution o...
Article
Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared translation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats and accessible through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make tra...
Conference Paper
Full-text available
The experiments presented here exploit the properties of the Apertium RDF Graph, principally cycle density and nodes' degree, to automatically generate new translation relations between words, and therefore to enrich existing bilingual dictionaries with new entries. Currently, the Apertium RDF Graph includes data from 22 Apertium bilingual dictiona...
Presentation
Full-text available
https://ddd.uab.cat/record/147282
Conference Paper
META-SHARE is an infrastructure for sharing Language Resources (LRs) where significant effort has been made into providing carefully curated metadata about LRs. However, in the face of the flood of data that is used in computational linguistics, a manual approach cannot suffice. We present the development of the META-SHARE ontology, which transform...
Article
Full-text available
The PAROLE/SIMPLE 'lemon' Ontology and Lexicon are the OWL/RDF version of the PAROLE/SIMPLE lexicons (defined during the PAROLE (LE2-4017) and SIMPLE (LE4-8346) IV FP EU projects) once mapped onto lemon model and LexInfo ontology. Original PAROLE/SIMPLE lexicons contain morphological, syntactic and semantic information, organized according to a com...
Data
The Catalan Parole/Simple 'lemon' Lexicon is the OWL version of the Spanish Parole & Simple lexicons (defined during the PAROLE LE2-4017 and SIMPLE LE4-8346 projects) once mapped to Lexinfo Model (http://lexinfo.net/). This data set has been published as Linked Open Data in the Data Hub (http://thedatahub.org/en/dataset/parole-simple-ont). The goal...
Data
SOURCES: https://github.com/martavillegas/metadata More info: See "Metadata as Linked Open Data: mapping disparate XML metadata registries into one RDF/OWL registry" Villegas, Melero & Bel. published in LREC-2014
Conference Paper
Full-text available
The proliferation of different metadata schemas and models pose serious problems of interoperability. Maintaining isolated repositories with overlapping data is costly in terms of time and effort. In this paper, we describe how we have achieved a Linked Open Data version of metadata descriptions coming from heterogeneous sources, originally encoded...
Article
Full-text available
The PAROLE/SIMPLE 'lemon' Ontology and Lexicon are the OWL/RDF version of the PAROLE/SIMPLE lexicons (defined during the PAROLE (LE2-4017) and SIMPLE (LE4-8346) IV FP EU projects) once mapped onto lemon model and LexInfo ontology. Original PAROLE/SIMPLE lexicons contain morphological, syntactic and semantic information, organized according to a com...
Chapter
This chapter deals with how the lexical markup framework (LMF), intended to be a common model for lexical resources, supports the exchange of data and enables the merging of individual resources to form larger resources. It reports, on the one hand, on experiments on the automatic merging of resources. In such experiments, the availabili...
Article
Full-text available
This paper describes on-going work for the construction of a new treebank for Spanish, The IULA Treebank. This new resource will contain about 60,000 richly annotated sentences as an extension of the already existing IULA Technical Corpus which is only PoS tagged. In this paper we have focused on describing the work done for defining the annotation...
Article
Full-text available
The ISLE project is a continuation of the long standing EAGLES initiative, carried out under the Human Language Technology (HLT) programme in collaboration between American and European groups in the framework of the EU-US International Research Co-operation, supported by NSF and EC. In this paper we concentrate on the current position of the ISLE...
Article
Full-text available
This article presents an initial approach to open access publication of research data from the humanities. It is a case study of the creation of two data collections at the institutional repository of the Universitat Pompeu Fabra, one containing the annexes of doctoral theses and the other, language tools and resources. Finally, we analyze the expe...
Article
Full-text available
This article presents a first approach to the open access publication of data resulting from research in the humanities. It describes the case study implemented in the institutional repository of the Universitat Pompeu Fabra, based on the creation of two data collections, one for the annexes included in the theses...
Conference Paper
Full-text available
The research reported in this paper is part of the activities carried out within the CLARIN (Common Language Resources and Technology Infrastructure) project. CLARIN is a large-scale pan-European project to create, coordinate and make language resources and technology available and readily useable. CLARIN is devoted to the creation of a persisten...
Conference Paper
Full-text available
This paper reports our experience when integrating different resources and services into a grid environment. The use case we address implies the deployment of several NLP applications as web services. The ultimate objective of this task was to create a scenario where researchers have access to a variety of services they can operate. These services sho...
Conference Paper
Full-text available
The research reported in this paper is part of the activities carried out within the CLARIN (common language resources and technology infrastructure) project, a large-scale pan-European project to create, coordinate and make language resources and technologies (LRT) available and readily useable. CLARIN is devoted to the creation of a persistent an...
Article
This article presents the CLARIN (Common Language Resources and Technologies) project, a large-scale pan-European collaborative project that aims to promote the use of technological tools in research in the fields of the humanities and social sciences. CLARIN is one of the 35 projects selected by ESFRI (European Strategy Forum on Research Infrastruc...
Article
Full-text available
In this article we present CLARIN (Common Language Resources and Technologies), a large-scale European collaborative project whose aim is to promote the use of technological tools in research in the fields of the humanities and social sciences. CLARIN is one of the thirty-five projects selected by the ESFRI Committee (European Stra...
Conference Paper
Full-text available
Despite the importance of lexical resources for a number of NLP applications (Machine Translation, Information Extraction, Event Detection and Tracking, Question Answering, among others), there has been a traditional lack of generic tools for the creation, maintenance and management of computational lexica. The most direct obstacle for the devel...
Article
Full-text available
Abstract: In this article we present the CLARIN (Common Language Resources and Technologies) project, a large-scale European collaborative project whose aim is to promote the use of technologies in research in the Humanities and Social Sciences. CLARIN aims to create the infrastructure needed to bring to these disciplines the benefi...
Conference Paper
Full-text available
This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both in case a sufficient number of training examples is...
Article
This document is concerned with the acquisition of LMT's from the texts PEKING is to deal with and its storage for later cross-lingual linking: In section 2, a first definition of the lexical database will be given. In section 3 a motivation for the use of LMT's will be supplied as to justify the need of linguistic processing. Sections 4 and 5 repo...
Article
Full-text available
The project LE-SIMPLE is an innovative attempt at building harmonized syntactic-semantic lexicons for twelve European languages, intended for use in different Human Language Technology applications. SIMPLE provides a general design model for the encoding of a large amount of semantic information, ranging from ontological typing, to argument structur...
Article
Full-text available
This paper describes a procedure to convert the PAROLE-SIMPLE monolingual lexicons into bilingual interrelated lexicons where each word sense of a given language is linked to the pertinent sense of the right words in one or more target lexicons. Nowadays, SIMPLE lexicons are monolingual although the ultimate goal of these harmonised monolingual lex...
Article
U: Collective Animal or Human = (Collective + O); V: Plant or Animal = (P + A); W: Inanimate Concrete or Abstract = (T + I); X: Abstract or Human = (T + H); Y: Abstract or Animate = (T + H); Z: Unmarked; 1: Human or Solid = (H + S); 2: Abstract or Solid = (T + S); 4: Abstract Physical; 5: Organic Material; 6: Liquid or Abstract...
Article
Full-text available
This article addresses the question of how to deal with text categorization when the set of documents to be classified belong to different languages. The figures we provide demonstrate that cross-lingual classification where a classifier is trained using one language and tested against another is possible and feasible provided we translate a small...
Article
Full-text available
The ISLE Computational Lexicon Working Group is committed to the consensual definition of a standardized infrastructure to develop multilingual resources for HLT applications. In particular, the ISLE-CLWG pursues this goal by designing MILE (Multilingual ISLE Lexical Entry), a general schema for the encoding of multilingual lexical information. Thi...
Article
Full-text available
Although most Language Technology applications need to develop and maintain large lexica, there has been a lack of generic tools for their creation, maintenance, and management which are independent of particular applications, and are well equipped for supporting lexicographic work. The most important obstacle to such generic tools was the p...

Projects (2)
Project
The general aim of the Plan de Impulso de las Tecnologías del Lenguaje (Plan for the Advancement of Language Technologies) is to develop the natural language processing and machine translation industry in Spain, especially for Spanish and the co-official languages, through the following specific objectives: Increase the number, quality, and availability of linguistic infrastructures in Spanish and the co-official languages. Boost the language industry by fostering knowledge transfer between the research sector and industry. Support the internationalization of the companies and institutions that make up the sector. Foster cooperation with the Ibero-American community to lead the deployment of natural language processing and machine translation technologies in Spanish. Facilitate the launch of pan-European digital services to guarantee a mechanism for coordinating information and access to linguistic resources for the languages of Europe. Improve the quality and capacity of public services by incorporating natural language processing and machine translation technologies, while also acting as a driver of demand. Support the generation, standardization, and dissemination of linguistic resources created in the course of the public administration's own management activity. Facilitate collaboration between companies and public research bodies through initiatives that strengthen mutual knowledge of capabilities and needs.