Miguel Ángel Alonso Pardo
Universidade da Coruña | UDC · Department of Computer Science
PhD in Computer Science
About
164 Publications
35,103 Reads
1,404 Citations
Introduction
I am with LYS, the research group on Natural Language Processing (NLP) at the University of A Coruña, Spain. Currently, my fields of interest are: Multilingual text processing; Opinion Mining and Sentiment Analysis; Information Retrieval applying NLP techniques; Parsing.
Education
October 1993 - September 2000
Publications (164)
We present a novel unsupervised approach for multilingual sentiment analysis driven by compositional syntax-based rules. On the one hand, we exploit some of the main advantages of unsupervised algorithms: (1) the interpretability of their output, in contrast with most supervised models, which behave as a black box and (2) their robustness across di...
This article tackles the problem of performing multilingual polarity classification on Twitter, comparing three techniques: (1) a multilingual model trained on a multilingual dataset, obtained by fusing existing monolingual resources, that does not need any language recognition step, (2) a dual monolingual model with perfect language detection on m...
Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple do...
In recent years, we have witnessed a rise in fake news, i.e., provably false pieces of information created with the intention of deception. The dissemination of this type of news poses a serious threat to cohesion and social well-being, since it fosters political polarization and the distrust of people with respect to their leaders. The huge amount...
The COVID-19 pandemic has affected many aspects of human life. The pandemic not only caused millions of fatalities and problems but also changed public sentiment and behavior. Owing to the magnitude of this pandemic, governments worldwide adopted full lockdown measures that attracted much discussion on social media platforms. To investigate the eff...
To our knowledge, the majority of human language processing technologies for low-resource languages don’t have well-established linguistic resources for the development of sentiment analysis applications. Therefore, it is in dire need of such tools and resources to overcome the NLP barriers, so that, low-resource languages can deliver more benefits...
Making natural language processing technologies available for low-resource languages is an important goal to improve the access to technology in their communities of speakers. To our knowledge, there are no well-established linguistic resources for the development of sentiment analysis applications for the Uzbek language. In this paper, we fill tha...
We describe four systems to generate automatically bilingual dictionaries based on existing ones: three transitive systems differing only in the pivot language used, and a system based on a different approach which only needs monolingual corpora in both the source and target languages. All four methods make use of cross-lingual word embeddings trai...
This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, analyzing its performance when compared to the use of manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the...
Lexicon-based methods using syntactic rules for polarity classification rely on parsers that are dependent on the language and on treebank guidelines. Thus, rules are also dependent and require adaptation, especially in multilingual scenarios. We tackle this challenge in the context of the Iberian Peninsula, releasing the first symbolic syntax-base...
In this work we present a new strategy for creating treebanks for low-resource languages for syntactic parsing. The method consists of adapting and combining different treebanks, annotated with universal dependencies, from closely related linguistic varieties, with the aim of training a syntactic parser for the language...
This paper presents a novel strategy for creating a Universal Dependencies (UD) treebank of a low-resource language. The method consists of adapting and combining different UD treebanks from related varieties in order to train a parser for the target language. More precisely, the paper explores the influence of three different levels for the select...
We introduce an approach to train lexicalized parsers using bilingual corpora obtained by merging harmonized treebanks of different languages, producing parsers that can analyze sentences in either of the learned languages, or even sentences that mix both. We test the approach on the Universal Dependency Treebanks, training with MaltParser and Malt...
In contrast with their monolingual counterparts, little attention has been paid to the effects that misspelled queries have on the performance of Cross-Language Information Retrieval (CLIR) systems. The present work makes a first attempt to fill this gap by extending our previous work on monolingual retrieval in order to study the impact that the p...
In this paper we describe our deep learning approach for solving both two-, three- and five-class tweet polarity classification, and two- and five-class quantification. We first trained a convolutional neural network using pretrained Twitter word embeddings, so that we could extract the hidden activation values from the hidden layers once some input h...
Code-switching texts are those that contain terms in two or more different languages, and they appear increasingly often in social media. The aim of this paper is to provide a resource to the research community to evaluate the performance of sentiment classification techniques on this complex multilingual environment, proposing an English-Spanish c...
The field of Cross-Language Information Retrieval relates techniques close to both the Machine Translation and Information Retrieval fields, although in a context involving characteristics of its own. The present study looks to widen our knowledge about the effectiveness and applicability to that field of non-classical translation mechanisms that w...
In democratic countries, forecasting the voting intentions of citizens and knowing their opinions on major political parties and leaders is of great interest to the parties themselves, to the media, and to the general public. Traditionally, expensive polls based on personal interviews have been used for this purpose. The rise of social networks, pa...
Twitter is an important platform for sharing opinions about politicians, parties and political decisions. These opinions can be exploited as a source of information to monitor the impact of politics on society. This article analyses the sentiment of 2,704,523 tweets referring to Spanish politicians and parties from a month in 2014-15. The article m...
Opinion Mining is the discipline that deals with the automatic processing of the opinions contained in a text. It makes it possible, for example, to determine whether or not a text expresses an opinion, or whether the polarity or sentiment expressed in it is positive, negative or mixed. It also enables the automatic extraction of features, which...
This paper describes the participation of the LyS group at TASS 2015. In this year's edition, we used a long short-term memory neural network to address the two proposed challenges: (1) sentiment analysis at a global level and (2) aspect-based sentiment analysis on football and political tweets. The performance of this deep learning approach is com...
We introduce an approach to train parsers using bilingual corpora obtained by merging harmonized treebanks of different languages, producing parsers that effectively analyze sentences in any of the learned languages, or even sentences that mix both languages. We test the approach on the Universal Dependency Treebanks, training with MaltParser and M...
Millions of micro texts are published every day on Twitter. Identifying the sentiment present in them can be helpful for measuring the frame of mind of the public, their satisfaction with respect to a product or their support of a social event. In this context, polarity classification is a subfield of sentiment analysis focussed on determining whet...
The vast amount of opinions and reviews provided in Twitter is helpful in order to make interesting findings about a given industry, but given the huge number of messages published every day it is important to detect the relevant ones. In this respect, the Twitter search functionality is not a practical tool when we want to poll messages dealing wi...
We address the problem of performing polarity classification on Twitter over different languages, focusing on English and Spanish, comparing three techniques: (1) a monolingual model which knows the language in which the opinion is written, (2) a monolingual model that acts based on the decision provided by a language identification tool and (3) a...
This paper describes our participation at the third edition of the workshop on Sentiment Analysis focused on Spanish tweets, TASS 2014. This year's evaluation campaign includes four challenges: (1) global sentiment analysis, (2) topic classification, (3) aspect-extraction and (4) aspect-based sentiment analysis. Tasks 1 and 2 are addressed from a...
Abstract: Companies and organizations are beginning to take an interest in monitoring what users say about them on Twitter, since tweets are a good source of information on how society perceives their line of business. To do so, it is first necessary to filter out unrelated opinions, given the...
We describe an opinion mining system which classifies the polarity of Spanish texts. We propose an NLP approach that undertakes pre-processing, tokenisation and POS tagging of texts to then obtain the syntactic structure of sentences by means of a dependency parser. This structure is then used to address three of the most significant linguistic con...
This paper describes our participation at RepLab 2014, a competitive evaluation for reputation monitoring on Twitter. The following tasks were addressed: (1) categorisation of tweets with respect to standard reputation dimensions and (2) characterisation of Twitter profiles, which includes: (2.1) identifying the type of those profiles, such as jou...
This paper proposes an approach to solve message- and phrase-level polarity classification in Twitter, derived from an existing system designed for Spanish. As a first step, an ad-hoc preprocessing is performed. We then identify lexical, psychological and semantic features in order to capture different dimensions of the human language which are hel...
This article describes the approach developed by our group to address the tasks of global-level sentiment analysis, topic identification and political tendency classification on Spanish tweets, proposed at the Workshop on Sentiment Analysis at SEPLN (TASS 2013). As a preliminary step, we carry out an ad-hoc preprocessing in order to norm...
This work describes the system for the normalization of tweets in Spanish designed by the Language in the Information Society (LYS) Group of the University of A Coruña for Tweet-Norm 2013. It is a conceptually simple and flexible system, which uses few resources and that faces the problem from a lexical point of view.
We describe a system that classifies the polarity of Spanish tweets. We adopt a hybrid approach, which combines machine learning and linguistic knowledge acquired by means of NLP. We use part-of-speech tags, syntactic dependencies and semantic knowledge as features for a supervised classifier. Lexical particularities of the language used in Twitter...
This article describes a system that classifies the polarity of Spanish tweets. We adopt a hybrid approach, which combines linguistic knowledge acquired by means of NLP with machine learning techniques. We carry out a preprocessing of the tweets as an initial step to address some characteristics of the language used in Twitter. Then, we apply part-...
This work presents the conclusions drawn from evaluating the academic results obtained by students in the Programming II course, in the first year of the Degree in Computer Engineering at the University of A Coruña. The data, from the second year of the course's implementation under the European Higher Education Area (EHEA) guidelines,...
This article describes an opinion mining system that classifies the polarity of Spanish texts. We propose an NLP-based approach which performs segmentation, tokenization and POS tagging of texts to then obtain the syntactic structure of sentences by means of a dependency parser. The syntactic structure is then used to address three of the most signi...
This article describes an opinion mining system that classifies the polarity of Spanish texts. We propose an NLP-based approach that performs segmentation, tokenization and tagging of the texts and then obtains the syntactic structure of the sentences by means of dependency parsing algorithms...
This article describes a system for classifying the polarity of tweets written in Spanish. We adopt a hybrid approach that combines linguistic knowledge obtained through NLP with machine learning techniques. As a preliminary step, a first preprocessing stage is carried out to handle certain characteristics...
This work describes the system for the normalization of tweets in Spanish designed by the Language in the Information Society (LYS) Group of the University of A Coruña for Tweet-Norm 2013. It is a conceptually simple and flexible system, which uses few resources and that faces the problem from a lexical point of view.
Introduction: Natural Language Processing (NLP), sometimes also referred to as Computational Linguistics [Jurafsky and Martin, 2009; Mitkov, 2005], is the discipline concerned with the design and implementation of the software components needed for the computational treatment of natural language, understood as any human language,...
Abstract: We present the experience of a group of Computer Science lecturers in teaching two markedly interdisciplinary courses that bring together two fields as different as computing and the humanities: one taught in the second cycle of the Computer Engineering degree, and another taught in a university master's...
In this article we present the work that the LYS Group (Lengua y Sociedad de la Información, Language in the Information Society) has recently been carrying out in the areas of error-tolerant information retrieval and multilingual information retrieval. The common link between both lines of research is the use of character n-grams as...
Robustness, the ability to analyze any input regardless of its grammaticality, is a desirable property for any system dealing with unrestricted natural language text. Error-repair parsing approaches achieve robustness by considering ungrammatical sentences as corrupted versions of valid sentences. In this article we present a deductive formalism, b...
Abstract: Most teachers agree that exams do not allow an adequate assessment of the knowledge, competences and skills acquired by students, although this seems to be accepted without much concern. Here we review a decade of work within an elective second-cycle Computer Engineering course where...
The process of adapting Spanish university degrees to the European Higher Education Area has led to the emergence of degrees with a markedly interdisciplinary character. The case at hand is an official master's degree that seeks to equip philologists with computing skills for their entry into the emerging job market...
In order to produce efficient Natural Language Processing (NLP) tools, reliable linguistic resources are a preliminary requirement. When available for a given language, the resources are generally far below the expectations in terms of quality, coverage or usability. This paper presents a project whose ambition is to enhance the production capaciti...
The efficiency of Natural Language Processing (NLP) tools depends directly on the quality and coverage of the linguistic resources on which they are based. We present a project whose objective is to improve the production capacity of linguistic resources.
We present a compiler which can be used to automatically obtain efficient Java implementations of parsing algorithms from formal specifications expressed as parsing schemata. The system performs an analysis of the inference rules in the input schemata in order to determine the best data structures and indexes to use, and ensure that the generated i...
A desirable property for any system dealing with unrestricted natural language text is robustness, the ability to analyze any input regardless of its grammaticality. In this paper we present a novel, general transformation technique to automatically obtain robust, error-repair parsers from standard non-robust parsers. The resulting error-repair par...
The performance of information retrieval systems is limited by the linguistic variation present in natural language texts. Word-level natural language processing techniques have been shown to be useful in reducing this variation. In this article, we summarize our work on the extension of these techniques for dealing with phrase-level variation in E...
Logic programs share with context-free grammars a strong reliance on well-formedness conditions. Their proof procedures can be viewed as a generalization of context-free parsing. In particular, definite clause grammars can be interpreted as an extension of the classic context-free formalism where the notion of finite set of non-terminal symbols is...
We study how the use of syntactic information can improve the performance of Information Retrieval systems based on single-word terms. We consider two different approaches. The first one identifies the syntactic structure of the text by means of a shallow parser in order to extract the head-modifier pairs of the most relevant syntactic dependencies...
The parsing schemata formalism allows us to describe parsing algorithms in a simple, declarative way by capturing their fundamental semantics while abstracting low-level detail. In this work, we present a compilation technique allowing the automatic transformation of parsing schemata to efficient executable implementations of their corresponding a...
We present error-repair parsing schemata, which allow error-repair parsing algorithms to be defined in an abstract and declarative way. This formalism can be used to describe such algorithms simply and uniformly, and provides a formal basis for proving their correctness...
We present a technique for the construction of efficient prototypes for natural language parsing based on the compilation of parsing schemata to executable implementations of their corresponding algorithms. Taking a simple description of a schema as input, Java code for the corresponding parsing algorithm is generated, including schema-specific ind...
We present a system allowing the automatic transformation of parsing schemata to efficient executable implementations of their corresponding algorithms. This system can be used to easily prototype, test and compare different parsing algorithms. In this work, it has been used to generate several different parsers for Context Free Grammars and Tree...
Abstract: This work studies the behaviour of the parsing algorithms most commonly used for Tree Adjoining Grammars (TAG). To this end, we apply a compilation technique that allows the automatic transformation of parsing schemata into efficient implementations of the...
Tree Adjoining Grammars (TAG) and Linear Indexed Grammars (LIG) are extensions of Context Free Grammars that generate the class of Tree Adjoining Languages. Taking advantage of this property, and providing a method for translating a TAG into a LIG, we define several parsing algorithms for TAG on the basis of their equivalent LIG parsers. We also ex...
In this paper, a generic system that generates parsers from parsing schemata is applied to the particular case of the XTAG English grammar. In order to be able to generate XTAG parsers, some transformations are made to the grammar, and TAG parsing schemata are extended with feature structure unification support and a simple tree filtering mechanism...
STAIRS 2006 is the third European Starting AI Researcher Symposium, an international meeting aimed at AI researchers, from all countries, at the beginning of their career: PhD students or people holding a PhD for less than one year. A total of 59 papers ...
To date, attempts to apply syntactic information in the dominant document-based retrieval model have led to little practical improvement, mainly due to the problems associated with the integration of this kind of information into the model. In this article we propose the use of a locality-based retrieval model for reranking, which deals with sy...
Abstract: We present a compiler able to generate parsers from parsing schemata. These schemata are representations of parsers as deductive systems, which abstract away implementation details and make it easy to define and compare different algorithms. Our compiler is able...
In Information Retrieval (IR) systems, the correct representation of a document through an accurate set of index terms is the basis for obtaining a good performance. If we are not able to both extract and weight appropriately the terms which capture the semantics of the text, this shortcoming will have an effect on all the subsequent processing.
Information Retrieval systems are limited by the linguistic variation of language. The use of Natural Language Processing techniques to manage this problem has been studied for a long time, but mainly focusing on English. In this paper we deal with European languages, taking Spanish as a case in point. Two different sources of syntactic information...
The parsing schemata formalism allows us to describe parsing algorithms in a simple way by capturing their fundamental semantics while abstracting low-level detail. In this work, we present a compilation technique allowing automatic transformation of parsing schemata to executable implementations of their corresponding algorithms. Taking a simple d...
This article describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level. Several alternatives for selecting the index terms among the syntactic dependencies detected by the parser are evaluated....
Tree Adjoining Grammar (TAG) is a useful formalism for describing the syntactic structure of natural languages. In practice, a large part of wide coverage TAGs is formed by trees that satisfy the restrictions imposed by Tree Insertion Grammar (TIG), a simpler formalism. This characteristic can be used to reduce the practical complexity of TAG parsi...
Our goal is to study a practical approach to deal with nontermination in definite clause grammars. We focus on two problems, loop and cyclic structure detection and representation, maintaining a tight balance between practical efficiency and operational completeness.
The employment of Natural Language Processing techniques for Information Retrieval has been studied many times, but such works have mainly focused on English. In this article we describe the evolution of the research developed by our group for the case of Spanish: from our initial experiments, characterized by the lack of standard resources for eva...
In this our second participation in the CLEF Spanish monolingual track, we have continued applying Natural Language Processing techniques for single word and multi-word term conflation. Two different conflation approaches have been tested. The first approach is based on the lemmatization of the text in order to avoid inflectional variation. Our second appr...
The extraction of the keywords that characterize each document in a given collection is one of the most important components of an Information Retrieval system. In this article, we propose to apply shallow parsing, implemented by means of cascades of finite-state transducers, to extract complex index terms based on an approximate grammar of Spanish....
The extraction of the keywords that characterize each document in a given collection is one of the most important components of an Information Retrieval system. In this article, we propose to apply shallow parsing, implemented by means of cascades of finite-state transducers, to extract complex index terms based on an approximate grammar of Spa...
The performance of Information Retrieval systems is limited by the linguistic variation phenomena present in texts. Word-level Natural Language Processing techniques have proven useful for reducing this variation. In this article we propose extending this approach to variation at the level of...
In this our second participation in the CLEF Spanish monolingual track, we have continued applying Natural Language Processing techniques for single word and multi-word term conflation. Two different conflation approaches have been tested. The first approach is based on the lemmatization of the text in order to avoid inflectional variation. Our s...
We work in the domain of a regional least-cost strategy with dynamic validation in order to avoid cascaded errors [3], extending the theoretical model to illustrate its asymptotic equivalence with global repair algorithms. This is an objective criterion to measure the quality of an error repair algorithm, since the point of reference is a technique...
We define a new model of automata for the description of bidirectional parsing strategies for context-free grammars and a tabulation mechanism that allows them to be executed in polynomial time. This new model of automata provides a modular way of defining bidirectional parsers, separating the description of a strategy from its execution.
A large part of wide coverage Tree Adjoining Grammars (TAG) is formed by trees that satisfy the restrictions imposed by Tree Insertion Grammars (TIG). This characteristic can be used to reduce the practical complexity of TAG parsing, applying the standard adjunction operation only in those cases in which the simpler cubic-time TIG adjunction cannot...
An incremental development environment for unrestricted context-free languages is described and tested. Our proposal includes a parse generator, an incremental facility to make the overall parsing efficient in the context of program development; and a graphical interface that provides a complete set of customization and trace facilities. The tool,...
Tree Adjoining Grammar (TAG) is a formalism that has become very popular for the description of natural languages. However, the parsers for TAG that have been defined on the basis of Earley's algorithm entail important computational costs. In this article, we propose to extend the left corner relation from Context Free Grammar (CFG) to TAG in o...
We define a tabular parser for tree adjoining grammars (TAG) with a bottom-up parsing strategy and bidirectional traversal of the input string. This parser results from merging the bidirectional bottom-up parser already defined for TAG with the new parser for tree insertion grammars (TIG) that we pre...
The extraction of the keywords that characterize a document in a given collection is one of the most important components of an Information Retrieval system. In this article, we propose to apply shallow parsing, implemented by means of cascades of finite-state transducers, to extract complex index terms based on an approximated grammar of Spanish....