About
102 Publications
45,144 Reads
641 Citations
Introduction
Jesús Vilares graduated in Computer Science Engineering from the University of A Coruña (Spain) in 2000. After a short period as a lecturer at the University of Vigo (Spain), he obtained a Ph.D. grant from the Spanish Ministry of Education (FPU Grant) at the University of A Coruña, where he obtained his Ph.D. in Computer Science in 2005. He was a member of the founding committee of the Spanish Society for Information Retrieval. He is currently an Associate Professor at the University of A Coruña, where he is a member of the Language in the Information Society (LYS) Research Group on Natural Language Processing and Computational Linguistics, and a member of the Scientific Board of the Centre for Information and Communications Technology Research (CITIC).
Additional affiliations
January 2012 - present
June 2017 - present
Publications (102)
Non-active adaptive sampling is a way of building machine learning models from a training database that is intended to dynamically and automatically derive a guaranteed sample size. In this context, and regardless of the strategy used for both the scheduling and the generation of weak predictors, a proposal for calculating absolute convergence and error thr...
In recent years, we have witnessed a rise in fake news, i.e., provably false pieces of information created with the intention of deception. The dissemination of this type of news poses a serious threat to cohesion and social well-being, since it fosters political polarization and the distrust of people with respect to their leaders. The huge amount...
Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple do...
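As a quick, hedged illustration of the NER task referred to above (not the approach evaluated in the publication itself), a pretrained spaCy pipeline can mark the entities appearing in a sentence; the model name and example sentence below are assumptions for illustration only:

```python
# Minimal NER sketch with spaCy; requires:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jesús Vilares works at the University of A Coruña in Spain.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # entity span and its predicted type (e.g. ORG, GPE)
```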
Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, wh...
The prominent graphic component of video games greatly limits the accessibility of this type of entertainment to visually impaired users. We present here an overview of the first games developed within an initiative to build roguelike games adapted to visually impaired players by using Natural Language Processing techniques. Our approach...
We introduce an adaptive scheduling for adaptive sampling as a novel way of machine learning in the construction of part-of-speech taggers. The goal is to speed up the training on large data sets, without significant loss of performance with regard to an optimal configuration. In contrast to previous methods using a random, fixed or regularly risin...
User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, thus making their processing by traditional intelligent systems very difficult. As an answer, microtext normalization consists in transfor...
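To make the normalization idea concrete, here is a toy, dictionary-based sketch; the lexicon and example tweet are invented and much simpler than the system the abstract describes:

```python
# Toy lexical normalization: out-of-vocabulary tokens are replaced using a small
# hand-made substitution lexicon (illustrative only, not the published system).
NORMALIZATION_LEXICON = {
    "u": "you", "r": "are", "gr8": "great", "2day": "today", "pls": "please",
}

def normalize(tweet: str) -> str:
    tokens = tweet.lower().split()
    return " ".join(NORMALIZATION_LEXICON.get(tok, tok) for tok in tokens)

print(normalize("u r gr8 pls call me 2day"))  # -> "you are great please call me today"
```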
In social media platforms, special tokens abound, such as hashtags and mentions, in which multiple words are written together without spacing between them; e.g. #leapyear or @ryanreynoldsnet. Due to the way this kind of text is written, this word assembly phenomenon can appear together with its opposite, word segmentation, affecting any token of the text an...
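A minimal sketch of the word segmentation side of the problem (splitting a hashtag body into words) is shown below; the vocabulary is a toy stand-in for the lexicon or language model a real system would use:

```python
# Dynamic-programming segmentation of a concatenated token into known words.
VOCAB = {"leap", "year", "happy", "new", "day"}

def segment(token):
    """Return a list of vocabulary words covering the token, or None if impossible."""
    n = len(token)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and token[j:i] in VOCAB:
                best[i] = best[j] + [token[j:i]]
                break
    return best[n]

print(segment("leapyear"))     # -> ['leap', 'year']
print(segment("happynewday"))  # -> ['happy', 'new', 'day']
```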
In contrast with their monolingual counterparts, little attention has been paid to the effects that misspelled queries have on the performance of Cross-Language Information Retrieval (CLIR) systems. The present work makes a first attempt to fill this gap by extending our previous work on monolingual retrieval in order to study the impact that the p...
Digital Heritage deals with the use of computing and information technologies for the preservation and study of the human cultural legacy. Within this context, we present here a Text Retrieval system developed specifically to work with Egyptian hieroglyphic texts for its use by Egyptologists and Linguists in the study and preservation of Ancient Eg...
The field of Cross-Language Information Retrieval relates techniques close to both the Machine Translation and Information Retrieval fields, although in a context involving characteristics of its own. The present study looks to widen our knowledge about the effectiveness and applicability to that field of non-classical translation mechanisms that w...
Opinion Mining is the discipline that addresses the automatic processing of the opinions contained in a text. It makes it possible, for example, to determine whether or not a text expresses an opinion, or whether the polarity or sentiment expressed in it is positive, negative or mixed. It also allows the automatic extraction of features, which...
We present Ask Classora!, a natural language interface for the Classora Knowledge Base, a commercial knowledge base fed from web sources. The interface allows users to interact with this knowledge base in their own language, so that their queries are interpreted and translated into the formal query language used to interrogate...
This work describes, from the point of view of architecture and design, the normalization system developed by our group for the preprocessing of tweets in Text Mining tasks. Our basic premises during its development have been flexibility, scalability and maintainability. We also present its practical application...
This paper describes our participation at RepLab 2014, a competitive evaluation for reputation monitoring on Twitter. The following tasks were addressed: (1) categorisation of tweets with respect to standard reputation dimensions and (2) characterisation of Twitter profiles, which includes: (2.1) identifying the type of those profiles, such as jou...
We describe here our participation in Tweet LID. After having studied the problem of language identification and the resources available, and designed a text conflation approach for this kind of task, we joined the competition with two systems: the first one was based on the langdetect guesser, re-trained and adapted in order to work with conflated tex...
This work describes the system for the normalization of tweets in Spanish designed by the Language in the Information Society (LYS) Group of the University of A Coruña for Tweet-Norm 2013. It is a conceptually simple and flexible system which uses few resources and faces the problem from a lexical point of view.
In this work we present the conclusions drawn after evaluating the academic results obtained by students in the Programming II course, in the first year of the Degree in Computer Science Engineering at the University of A Coruña. The data, corresponding to the second year of implementation of the course under the guidelines of the European Higher Education Area,...
Abstract: Continuing our research on the use of character n-grams as translation unit in Multilingual IR systems, this article analyses the behaviour of our solution in the reverse translation directions, based on parallel experiments with English queries over Spanish texts and vice versa. The positive...
Introduction: Natural Language Processing (NLP), also sometimes referred to as Computational Linguistics [Jurafsky and Martin, 2009; Mitkov, 2005], is the discipline in charge of the design and implementation of the software components needed for the computational processing of natural language, understood as any human language...
Our work concerns the design of robust information retrieval environments that can successfully handle queries containing misspelled words. Our aim is to perform a comparative analysis of the efficacy of two possible strategies that can be adopted. A first strategy involves those approaches based on correcting the misspelled query, thus requiring th...
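The intuition behind the second, correction-free strategy (character n-grams as indexing units) can be sketched in a few lines; the terms and the Dice coefficient used below are only an illustration of why a misspelled query term still matches its intended document term:

```python
# A misspelled term still shares most of its character n-grams with the correct one,
# so n-gram indexing degrades gracefully under typos (illustrative sketch only).
def char_ngrams(word, n=3):
    padded = f"_{word}_"  # boundary markers
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

query_term, doc_term = "retreival", "retrieval"  # misspelled vs. intended
print(dice(char_ngrams(query_term), char_ngrams(doc_term)))  # substantial overlap despite the typo
```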
Abstract: We present the experience of a group of Computer Science lecturers in teaching two markedly interdisciplinary courses that bring together two fields as different as computer science and the humanities: one taught in the second cycle of the Computer Science Engineering degree, and another taught in a university master's...
Abstract: Due to its markedly algebraic character, the subject of Automata Theory and Formal Languages may be perceived as excessively theoretical, with the consequent risk of rejection by students. We share here our experience in this course, which we have managed to give a much more practical character through a...
In this article we present the work that the LYS Group (Lengua y Sociedad de la Información) has been carrying out recently in the areas of error-tolerant information retrieval and cross-language information retrieval. The common link between both lines of research is the use of character n-grams as...
Abstract: A large proportion of teachers agree that exams do not allow an adequate assessment of the knowledge, competences and skills acquired by students, although this seems to be accepted without great concern. We review here a decade of work within an elective second-cycle course of the Computer Science Engineering degree where...
The process of adapting Spanish university degrees to the European Higher Education Area has favoured the emergence of degrees with a markedly interdisciplinary character. The case at hand is that of an official master's degree that seeks to provide philologists with computing competences for their incorporation into the emerging labour...
With increasingly higher numbers of non-English language web searchers the problems of efficient handling of non-English Web documents and user queries are becoming major issues for search engines. The main aim of this review paper is to make researchers aware of the existing problems in monolingual non-English Web retrieval by providing an overvie...
We present a compiler which can be used to automatically obtain efficient Java implementations of parsing algorithms from formal specifications expressed as parsing schemata. The system performs an analysis of the inference rules in the input schemata in order to determine the best data structures and indexes to use, and ensure that the generated i...
The First iNEWS'07 Workshop took place in Amsterdam (The Netherlands) in conjunction with the 30th Annual International ACM SIGIR Conference (SIGIR'07). The workshop aims at bringing together researchers interested in the issues surrounding non-English web searching. Nowadays, over 60% of Internet users are non-English speakers and the number of no...
Our work relies on the design and evaluation of experimental information retrieval systems able to cope with textual misspellings in queries. In contrast to previous proposals, commonly based on the consideration of spelling correction strategies and a word language model, we also report on the use of character n-grams as indexing support.
The performance of information retrieval systems is limited by the linguistic variation present in natural language texts. Word-level natural language processing techniques have been shown to be useful in reducing this variation. In this article, we summarize our work on the extension of these techniques for dealing with phrase-level variation in E...
In this paper, we propose and evaluate two different alternatives to deal with degraded queries on Spanish IR applications. The first one is an n-gram-based strategy which has no dependence on the degree of available linguistic knowledge. On the other hand, we propose two spelling correction techniques, one of which has a strong dependence...
We study how the use of syntactic information can improve the performance of Information Retrieval systems based on single-word terms. We consider two different approaches. The first one identifies the syntactic structure of the text by means of a shallow parser in order to extract the head-modifier pairs of the most relevant syntactic dependencies...
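As a rough sketch of the first approach (head-modifier pairs from syntactic dependencies), a dependency parser such as spaCy can be used as a stand-in for the shallow parser mentioned in the abstract; the dependency labels and example sentence are assumptions for illustration:

```python
# Extract (head, modifier) pairs from a dependency parse as candidate index terms.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Robust information retrieval systems handle misspelled user queries.")

pairs = [(tok.head.lemma_, tok.lemma_)
         for tok in doc
         if tok.dep_ in ("amod", "compound", "nmod")]  # adjectival / nominal modifiers
print(pairs)  # head-modifier pairs such as ('system', 'retrieval'), depending on the parse
```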
Abstract: In this work we present a review of the state of the art of the family of probabilistic information retrieval models. Starting from the basic principles underlying these models, we study different specific models: the binary independence model (the most basic one), the now classic BM25 and, finally, the model...
This workshop attempted to promote the discussion and the research on non-English Web searching. Most search engines were first built for English. They do not take full account of inflectional semantics nor, for example, diacritics or the use of capitals. Our main aim was to discuss the additional problems faced in non-English Web queries and to su...
This paper describes an extension of our work presented in the robust English-to-French bilingual task of the CLEF 2007 workshop, a knowledge-light approach for query translation in Cross-Language Information Retrieval systems. Our work is based on the direct translation of character n-grams, avoiding the need for word normalization during indexing...
This is our third participation in CLEF, this time in the Spanish monolingual Question Answering track. We have continued applying Natural Language Processing techniques for single word conflation. Our approach for Question Answering is based on complex pattern matching either over forms, part-of-speech tags or lemmas of the words involved.
This paper describes the technique for translation of character n-grams we developed for our participation in CLEF 2006. This solution avoids the need for word normalization during indexing or translation, and it can also deal with out-of-vocabulary words. Since it does not rely on language-specific processing, it can be applied to very different l...
The parsing schemata formalism allows us to describe parsing algorithms in a simple, declarative way by capturing their fundamental semantics while abstracting low-level detail. In this work, we present a compilation technique allowing the automatic transformation of parsing schemata to efficient executable implementations of their corresponding a...
The First Workshop on Improving Non English Web Searching (iNEWS'07) took place on July 27 in Amsterdam (The Netherlands) in conjunction with the 30th Annual International ACM SIGIR Conference (SIGIR'07), aiming at bringing together researchers interested in non-English web searching. Nowadays, over 60% of the online population are non-English speak...
This paper describes a new technique for the direct translation of character n-grams for use in Cross-Language Information Retrieval systems. This solution avoids the need for word normalization during indexing or translation, and it can also deal with out-of-vocabulary words. This knowledge-light approach does not rely on language-specific process...
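A toy sketch of translating at the character n-gram level (rather than at the word level) is given below; the n-gram association table is invented purely for illustration, whereas the real technique learns such associations from aligned corpora:

```python
# Query translation by mapping source-language character n-grams to associated
# target-language n-grams; matching is then done against an n-gram-indexed collection.
NGRAM_TABLE = {  # toy Spanish -> English trigram associations (invented)
    "bib": {"lib"}, "ibl": {"ibr"}, "lio": {"rar"}, "tec": {"ary"},
}

def translate_ngrams(word, n=3):
    source = {word[i:i + n] for i in range(len(word) - n + 1)}
    target = set()
    for gram in source:
        target |= NGRAM_TABLE.get(gram, set())
    return target

print(translate_ngrams("biblioteca"))  # target n-grams to match against the index
```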
We present a technique for the construction of efficient prototypes for natural language parsing based on the compilation of parsing schemata to executable implementations of their corresponding algorithms. Taking a simple description of a schema as input, Java code for the corresponding parsing algorithm is generated, including schema-specific ind...
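The flavour of compiling a parsing schema into executable code can be suggested with a hand-written deductive chart parser; this tiny recognizer applies a single completion rule to closure, and the grammar and lexicon are toy assumptions (the actual system generates Java, not Python):

```python
# Deductive chart parsing for a binary CFG: items are (symbol, start, end) and the
# completion rule (A -> B C) is applied until no new items can be derived.
GRAMMAR = [("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "NP")]
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "saw": "V"}

def recognize(tokens):
    chart = {(LEXICON[w], i, i + 1) for i, w in enumerate(tokens)}  # lexical items
    changed = True
    while changed:
        changed = False
        for lhs, rhs1, rhs2 in GRAMMAR:
            for sym1, i, k in list(chart):
                if sym1 != rhs1:
                    continue
                for sym2, k2, j in list(chart):
                    if sym2 == rhs2 and k2 == k and (lhs, i, j) not in chart:
                        chart.add((lhs, i, j))
                        changed = True
    return chart

tokens = "the dog saw the cat".split()
print(("S", 0, len(tokens)) in recognize(tokens))  # True: the sentence is recognized
```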
This work is an extension of our proposal originally presented in CLEF 2006, which, unfortunately, could not be ready on time for the workshop. We describe here a knowledge-light approach for query translation in Cross-Language Information Retrieval systems. This proposal itself can be considered as an extension of the previous work of the Johns H...
STAIRS 2006 is the third European Starting AI Researcher Symposium, an international meeting aimed at AI researchers, from all countries, at the beginning of their career: PhD students or people holding a PhD for less than one year. A total of 59 papers ...
In this our first joint participation as the CoLesIR group, our team has participated in the Portuguese monolingual ad-hoc task and in all robust ad-hoc tasks: all monolingual tasks, the English-to-German bilingual task, and the multilingual task. We have developed an n-gram model inspired by the previous work of the Johns Hopkins University Appli...
To date, attempts at applying syntactic information within the dominant document-based retrieval model have led to little practical improvement, mainly due to the problems associated with integrating this kind of information into the model. In this article we propose the use of a locality-based retrieval model for reranking, which deals with sy...
Abstract: We present a compiler capable of generating parsers from parsing schemata. Such schemata are representations of parsers in the form of deductive systems, which abstract away implementation details and make it easy to define and compare different algorithms. Our compiler is capable...
The paper introduces a robust spelling correction technique to deal with ill-formed input strings, including unknown parts of unknown length. In contrast to previous works, we benefit from a finer dynamic programming construction, which takes advantage of the underlying grammatical structure, leading to improved computational behavior and...
PhD Thesis in Computer Science written by Jesús Vilares Ferro under the supervision of Dr. Miguel Ángel Alonso Pardo and Dr. José Luis Freire Nistal (Universidade da Coruña, Spain). The author was examined on 20th May, 2005 by the committee formed by Dr. Gabriel Pereira Lopes (Universidade Nova de Lisboa, Portugal), Dr. John Irving Tait (University...
In Information Retrieval (IR) systems, the correct representation of a document through an accurate set of index terms is the basis for obtaining a good performance. If we are not able to both extract and weight appropriately the terms which capture the semantics of the text, this shortcoming will have an effect on all the subsequent processing.
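Since the paragraph above concerns the extraction and weighting of index terms, a classic tf-idf computation is sketched as a minimal baseline; the tiny document collection is invented, and the article itself deals with richer, linguistically motivated terms:

```python
# Minimal tf-idf weighting over a toy collection (a baseline for index-term weighting).
import math
from collections import Counter

docs = [
    "natural language processing for information retrieval",
    "character n-grams for cross language information retrieval",
    "robust parsing of noisy text",
]
tokenized = [d.split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
N = len(docs)

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(tokenized[0]))  # terms occurring in every document get weight 0
```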
Information Retrieval systems are limited by the linguistic variation of language. The use of Natural Language Processing techniques to manage this problem has been studied for a long time, but mainly focusing on English. In this paper we deal with European languages, taking Spanish as a case in point. Two different sources of syntactic information...
The parsing schemata formalism allows us to describe parsing algorithms in a simple way by capturing their fundamental semantics while abstracting low-level detail. In this work, we present a compilation technique allowing automatic transformation of parsing schemata to executable implementations of their corresponding algorithms. Taking a simple d...
This paper is a report on our third participation in CLEF. More precisely, this year we have participated in the Spanish Monolingual Question Answering Track for the first time. As a result we have developed a prototype of a QA system. Our prototype continues to apply the Natural Language Processing techniques we had already developed for single wo...
This article describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level. Several alternatives for selecting the index terms among the syntactic dependencies detected by the parser are evaluated...
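A hedged illustration of the two levels of conflation discussed above is given below, using spaCy lemmas for word-level terms and noun chunks as a rough stand-in for the parser-detected dependencies of the article (which targets Spanish with its own tools):

```python
# Word-level terms (lemmas) vs. phrase-level candidates (noun chunks) with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The retrieved documents were ranked using weighted syntactic dependencies.")

print([tok.lemma_ for tok in doc if tok.is_alpha])  # word-level index terms
print([chunk.text for chunk in doc.noun_chunks])    # phrase-level candidates
```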
This work intends to capture the concept of similarity between phrases. The algorithm is based on a dynamic programming approach integrating both the edit distance between parse trees and single-term similarity. Our work stresses the use of the underlying grammatical structure, which serves as a guide in the computation of semantic similarity betwe...
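The dynamic-programming core of such a phrase similarity measure can be sketched over flat token sequences (the publication additionally works over parse trees); the single-term similarity below is a placeholder:

```python
# Token-level edit distance with a graded substitution cost driven by term similarity.
def term_sim(a, b):
    return 1.0 if a == b else 0.0  # placeholder single-term similarity

def phrase_distance(p, q):
    m, n = len(p), len(q)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 1.0 - term_sim(p[i - 1], q[j - 1])   # cheap substitution for similar terms
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + sub)       # substitution
    return d[m][n]

print(phrase_distance("information retrieval system".split(),
                      "retrieval system".split()))     # -> 1.0
```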
A robust parser for context-free grammars, based on a dynamic programming architecture, is described. We integrate a regional error repair algorithm and a strategy to deal with incomplete sentences including unknown parts of unknown length. Experimental tests prove the validity of the approach, illustrating the perspectives for its application in r...
We describe a context-free parsing algorithm to deal with incomplete sentences, including unknown parts of unknown length. It produces a finite shared-forest compiling all parses, often infinite in number, that could account for both the error and the missing parts. In contrast to previous works, we benefit from a finer dynamic programming co...
The employment of Natural Language Processing techniques for Information Retrieval has been studied many times, but such works have mainly focused on English. In this article we describe the evolution of the research developed by our group for the case of Spanish: from our initial experiments, characterized by the lack of standard resources for eva...
The extraction of the keywords that characterize each document in a given collection is one of the most important components of an Information Retrieval system. In this article, we propose to apply shallow parsing, implemented by means of cascades of finite-state transducers, to extract complex index terms based on an approximate grammar of Spa...
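The cascade idea can be hinted at with regular expressions over a part-of-speech-tagged string; the tag patterns and the tagged example below are invented and far simpler than the approximate grammar the article uses:

```python
# A rough regex "cascade" over POS-tagged text extracting complex index terms.
import re

tagged = "sistema/N robusto/Adj de/Prep recuperación/N de/Prep información/N"

NOUN_ADJ = re.compile(r"(\w+)/N (\w+)/Adj")               # noun + adjective
NOUN_PREP_NOUN = re.compile(r"(\w+)/N \w+/Prep (\w+)/N")  # noun + preposition + noun

terms = [f"{n} {a}" for n, a in NOUN_ADJ.findall(tagged)]
terms += [f"{n1} {n2}" for n1, n2 in NOUN_PREP_NOUN.findall(tagged)]
print(terms)  # -> ['sistema robusto', 'recuperación información']
```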
The performance of Information Retrieval systems is limited by the linguistic variation phenomena present in texts. Word-level Natural Language Processing techniques have proven useful in reducing such variation. In this article we propose to extend this approach to variation at the level of...
In this our second participation in the CLEF Spanish monolingual track, we have continued applying Natural Language Processing techniques for single word and multi-word term conflation. Two different conflation approaches have been tested. The first approach is based on the lemmatization of the text in order to avoid inflectional variation. Our s...
A robust parser for context-free grammars, based on a dynamic programming architecture, is described. We integrate a regional error repair algorithm and a strategy to deal with incomplete sentences including unknown parts of unknown length. Experimental tests prove the validity of the approach, illustrating the perspectives for its application in r...
The extraction of the keywords that characterize a document in a given collection is one of the most important components of an Information Retrieval system. In this article, we propose to apply shallow parsing, implemented by means of cascades of finite-state transducers, to extract complex index terms based on an approximated grammar of Spanish....
We consider a set of natural language processing techniques based on finite-state technology that can be used to analyze huge amounts of texts. These techniques include an advanced tokenizer, a part-of-speech tagger that can manage ambiguous streams of words, a system for conflating words by means of derivational mechanisms, and a shallow parser to ext...
Abstract: In this our first participation in CLEF, we have applied Natural Language Processing techniques for single word and multi-word term conflation. We have tested several approaches at different levels of text processing in our experiments: firstly, we have lemmatized the text to avoid inflectional variation; secondly, we have expanded the queri...
In this paper we consider a set of natural language processing techniques that can be used to analyze large amounts of texts, focusing on the advanced tokenizer which accounts for a number of complex linguistic phenomena, as well as for pre-tagging tasks such as proper noun recognition. We also show the results of several experiments performed in o...
We consider a set of natural language processing techniques based on finite-state technology that can be used to analyze huge amounts of texts. These techniques include an advanced tokenizer, a part-of-speech tagger that can manage ambiguous streams of words, a system for conflating words by means of derivational mechanisms, and a shallow parser to...
In recent years, there has been a considerable amount of interest in using Natural Language Processing in Information Retrieval research, with specific implementations varying from the word-level morphological analysis to syntactic parsing to conceptual-level semantic analysis. In particular, different degrees of phrase-level syntactic information...
With the aim of removing the residual errors made by pure stochastic disambiguation models, we put forward a hybrid system in which linguist users introduce high-level contextual rules to be applied in combination with a tagger based on a Hidden Markov Model. The design of these rules is inspired by the Constraint Grammar formalism. In the presen...
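The hybrid idea (stochastic tagging post-corrected by hand-written contextual rules) can be sketched as follows; the tagged sentence and the single rule are invented for illustration and are not from the publication:

```python
# Constraint-Grammar-style contextual rule applied on top of a stochastic tagger's output.
tagged = [("la", "Det"), ("vieja", "Adj"), ("guarda", "V"), ("la", "Det"), ("entrada", "N")]

def apply_rules(sentence):
    out = list(sentence)
    for i, (word, tag) in enumerate(out):
        # Rule: an adjective right after a determiner and right before a verb is really a noun.
        if (tag == "Adj" and i > 0 and out[i - 1][1] == "Det"
                and i + 1 < len(out) and out[i + 1][1] == "V"):
            out[i] = (word, "N")
    return out

print(apply_rules(tagged))  # 'vieja' re-tagged as N ("the old woman guards the entrance")
```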
One of the most important prior tasks for robust part-of-speech tagging is the correct tokenization or segmentation of the texts. This task can involve processes which are much more complex than the simple identification of the different sentences in the text and each of their individual components, but it is often overlooked in many current applicati...
This article presents two new approaches for term indexing which are particularly appropriate for languages with a rich lexis and morphology, such as Spanish, and need few resources to be applied. At word level, productive derivational morphology is used to conflate semantically related words. At sentence level, an approximate grammar is used to co...
The last years have seen a renewal of interest in applying dynamic programming to natural language processing. The main advantage is the compactness of the representations, which is turning this paradigm into a common way of dealing with highly redundant computations related to phenomena such as non-determinism.