Fco. Mario Barcala Rodríguez

NLPgo

PhD Computer Science

About

32
Publications
5,091
Reads
159
Citations

Publications (32)
Chapter
Full-text available
This article provides an overview of the design and composition of the corpus ESLORA and shows its usefulness in analysing social and situational variation. The corpus also contributes to the study of the processes of change related to the geographical variation of Spanish, since it records its use in a region with its own distinctive language. Thi...
Article
Full-text available
The design of an oral corpus and the processes of registering, codifying and treating the materials in order to build a useful resource for linguistic analysis prompt numerous decisions regarding theory and methodology. This article is focused on those stages of corpus construction which are more clearly conditioned by the computational processing...
Article
Full-text available
ESLORA is a corpus of Spanish made up of semi-directed interviews and spontaneous conversations recorded in Galicia between 2007 and 2015. The design and construction of the corpus meets three objectives: to register the use of a variety of Spanish which to date has been scarcely documented, to gain additional insight into the methods for the const...
Article
In this work we evaluate, from a linguistic point of view, a statistical automatic tagger developed jointly by the Centro Ramón Piñeiro para a Investigación en Humanidades and the COLE Group of the Universities of Vigo and A Coruña, intended to tag the documents of the Corpus de Referencia do Galego Actual in order to provide resour...
Conference Paper
Full-text available
Sentence word segmentation is an important task in robust part-of-speech (POS) tagging systems. In some cases this is relatively simple, since each textual word (or token) corresponds to one linguistic component. However, there are many others where segmentation can be very hard, such as those of contractions, verbal forms with enclitic pronouns, e...
Article
Full-text available
In this paper, we describe the compilation and structure of two linguistic resources, a corpus and a dictionary of terms in the field of economy, developed for Galician. In addition to this, we describe the use of these resources for the automatic extraction of multi-word terms by means of a combination of linguistic and statistical techniques. Wh...
Conference Paper
Sentence word segmentation and Part-Of-Speech (POS) tagging are common pre-processing tasks for many Natural Language Processing (NLP) applications. This paper presents a practical application for POS tagging and segmentation disambiguation using an extension of the one-pass Viterbi algorithm called Viterbi-N. We introduce the internals of the deve...
Article
Full-text available
Description of the Gari-Coter project for the development of the linguistic resources in Galician needed for a multilingual query re-elaborator.
Conference Paper
Full-text available
We describe a proposal for spelling correction intended for Galician, a Romance language. Our aim is to demonstrate the flexibility of a novel technique that provides quality equivalent to global strategies, but at a significantly lower computational cost. To do so, we take advantage of the grammatical background present in...
Conference Paper
Full-text available
We consider a set of natural language processing techniques based on finite-state technology that can be used to analyze huge amounts of texts. These techniques include an advanced tokenizer, a part-of-speech tagger that can manage ambiguous streams of words, a system for conflating words by means of derivational mechanisms, and a shallow parser to ext...
Article
Full-text available
In this paper we consider a set of natural language processing techniques that can be used to analyze large amounts of texts, focusing on the advanced tokenizer which accounts for a number of complex linguistic phenomena, as well as for pre-tagging tasks such as proper noun recognition. We also show the results of several experiments performed in o...
Article
Full-text available
We consider a set of natural language processing techniques based on finite-state technology that can be used to analyze huge amounts of texts. These techniques include an advanced tokenizer, a part-of-speech tagger that can manage ambiguous streams of words, a system for conflating words by means of derivational mechanisms, and a shallow parser to...
Conference Paper
Full-text available
One of the most important prior tasks for robust part-of-speech tagging is the correct tokenization or segmentation of the texts. This task can involve processes which are much more complex than the simple identification of the different sentences in the text and each of their individual components, but it is often obviated in many current applicati...
Conference Paper
Full-text available
This article presents two new approaches for term indexing which are particularly appropriate for languages with a rich lexis and morphology, such as Spanish, and need few resources to be applied. At word level, productive derivational morphology is used to conflate semantically related words. At sentence level, an approximate grammar is used to co...
Article
We present a reflection on the evolution of the different methods for constructing minimal deterministic acyclic finite-state automata from a finite set of words. We outline the most important methods, including the traditional ones (which consist of the combination of two phases: insertion of words and minimization of the partial automaton) and th...
Article
Full-text available
In this article we present a set of Natural Language Processing techniques applied to term normalization in Text Information Retrieval. The aim of these techniques is to handle the phenomena of morphological and lexical linguistic variation. Specifically, it explores the use of lemmatization, its use...
Conference Paper
Full-text available
We present a reflection on the evolution of the different methods for constructing minimal deterministic acyclic finite-state automata from a finite set of words. We outline the most important methods, including the traditional ones (which consist of the combination of two phases: insertion of words and minimization of the partial automaton) and th...
Conference Paper
CYK-like parsing algorithms are inherently parallel: many cells in the chart can be calculated simultaneously. In this work, we present a study on the appropriate techniques of parallelism to obtain an optimal performance of the extended CYK algorithm, a stochastic parsing algorithm that preserves the same level of expressiveness...
Article
Full-text available
This article presents two new techniques for indexing texts written in Spanish. At word level, we propose using derivational morphology to obtain sets of semantically related words. At sentence level, this technique is combined with the use of an approximate grammar, which will allow us to...
Article
CYK-like parsing algorithms are inherently parallel: many cells in the chart can be calculated simultaneously. In this work, we present a study on the appropriate techniques of parallelism to obtain an optimal performance of the extended CYK algorithm, a stochastic parsing algorithm that preserves the same level of expressiveness...
Article
Full-text available
One of the most important prior tasks for robust natural language tagging is the correct segmentation or preprocessing of the texts. This phase, which can involve processes much more complex than the simple identification of the different sentences in the text and each of their individual components, is often overlooked in...
Article
CYK-like parsing algorithms are inherently parallel: many cells of the parse table can be computed simultaneously. In this work we study which parallelization technique is appropriate to obtain optimal performance from the extended CYK algorithm, an alg...
Conference Paper
Full-text available
Conventional Information Retrieval Systems (IRSs), also called text indexers, deal with plain text documents or ones with a very elementary structure. These kinds of system are able to solve queries in a very efficient way, but they cannot take into account tags which mark different sections, or at best this capability is very limited. In contrast...
Article
In this article we discuss the most important issues in the definition of a methodology for the development of structured text corpora based on XML.
Article
In this paper we evaluate the main technologies for developing Information Retrieval Systems based on large structured text corpora: Oracle (Oracle Corporation, 8/3/2005) and Tamino (Software AG Company, 8/3/2005).
Article
One of the most important prior tasks for robust part-of-speech tagging is the correct tokenization or segmentation of the texts. This task can involve processes which are much more complex than the simple identification of the different sentences in the text and each of their individual components, but it is often obviated in many current applicati...
Article
In this paper we evaluate the main technologies for developing Information Retrieval Systems based on large structured corpora: Oracle (Oracle Corporation, 8/3/2005) and Tamino (Software AG Company, 8/3/2005).
