Rogelio Nazar’s research while affiliated with Pontifical Catholic University of Valparaíso and other places


Publications (38)

Figure/table captions: Text extension in tokens; Text extension in paragraphs; Extension of paragraphs in tokens; Distribution of connectives; Distribution of structurers


Statistical Modeling of Discourse Genres: The Case of the Opinion Column in Spanish
  • Article

October 2024


8 Reads


1 Citation

SN Computer Science

Rogelio Nazar

This paper presents a statistical model of the opinion column, a discourse genre typical of the press. The model was derived from a relatively small sample (ca. 4000 texts), and it takes into account discourse variables such as text and paragraph length, discourse markers, deixis and modalization. To test the accuracy of the model, it was evaluated against a different corpus of mixed column and non-column documents. The idea was to test whether the model is able to identify those texts pertaining to the target genre. Results show that it is indeed accurate, with precision between 77% and 85% and recall between 40% and 61%, depending on how restrictive the application of the model is.


A Lightweight Statistical Method for Terminology Extraction

December 2023


36 Reads

Journal of Computer-Assisted Linguistic Research

We propose a method for the task of automatic terminology extraction in the context of a larger project devoted to the automation of part of the tasks involved in the production of terminological databases. Terminology extraction is the key to drafting the macrostructure of a terminological resource (i.e., the list of entries), to which grammatical or semantic information can later be added at the microstructural level. To this end, we developed a statistical method that is conceptually simple compared to modern neural network approaches. It is a lightweight method because it is based on term dispersion and co-occurrence statistics that can be computed with basic hardware. For the evaluation, we experimented with corpora of lexicography and linguistics in English and Spanish of ca. 66 million tokens. The results improve on the baselines by almost 20%.

Estilector: un sistema de evaluación automática de la escritura académica en castellano

April 2023


141 Reads


3 Citations

Perspectiva Educacional

This article describes Estilector, a system that provides automatic feedback on academic writing in Spanish. The software is aimed mainly at detecting textual problems, although it also incorporates, to a lesser extent, grammar and spelling corrections. It is intended to support the learning of writing by providing formative feedback, and it offers bibliographic references and revision suggestions that are more extensive than those of word processors. The program currently runs online and is free to use. Over its years of existence it has had a large number and variety of users, and it has gradually adapted to an audience that needs basic, cross-cutting support when writing. In this article we describe the current state of the project, including its methods for detecting and reviewing errors, and offer some final reflections on its limitations and possibilities for future development.

A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora

February 2023


20 Reads


4 Citations

International Journal of Corpus Linguistics

We propose a method for the automatic induction of categories of Spanish discourse markers using parallel corpora, based on a quantitative and empirical approach that minimises explicit linguistic knowledge. We conducted the analysis using a large Spanish-English parallel corpus. First, we used this corpus to obtain a list of parenthetical discourse markers in each language. Then, we used it as a “semantic mirror”, inspecting the English equivalences and assessing which Spanish discourse markers fulfil a similar function in discourse and vice versa. The result of this procedure is an emerging categorisation of discourse markers. The main contribution is to offer empirical evidence for the adequacy of existing manually-compiled taxonomies and the potential for discovery of new, unaccounted categories. In this article we focus on units pertaining to the Spanish language but, since the method is purely quantitative, it is possible to apply it to different languages as well.

Correlación entre la metáfora orientacional bueno es arriba / malo es abajo y polaridad positiva/negativa en verbos del español: un estudio con estadística de corpus

January 2023


17 Reads

Digital Humanities, Corpus and Language Technology: A look from diverse case studies is an outstanding collection of research contributions that explores the intersection of technology and the humanities. The authors provide a comprehensive overview of how these technologies can enhance research across various disciplines, from literature to history to anthropology. This book is a must-read for anyone interested in future research in the humanities. Digital Humanities, Corpus, and Language Technologies are rapidly growing fields that have the potential to revolutionize research across various disciplines. New technologies have opened up new perspectives for research, allowing scientists to analyze data in previously impossible ways. The interdisciplinary approach and practical applications make it an invaluable resource for researchers, students, and anyone interested in the intersection of technology and the humanities.


September 2022


52 Reads


Starting from the view that the media are spaces conducive to the standardisation of the linguistic norm, this study analyses the degree of acceptance of some rules that appeared in the Ortografía de la lengua española (RAE and ASALE 2010): a) the accent mark in guión/guion and similar words, b) sólo/solo and the demonstrative pronouns, and c) some cases of adapted foreign words and Latinisms (such as whisky/wiski). Two newspapers from five Spanish-speaking countries (Argentina, Chile, Colombia, Spain and Mexico) were analysed. GoogleApi was used to collect a total of 179,238 contexts of occurrence, divided by year and newspaper, over the first ten years in which the work was in force (2010-2019). The results indicate that, at the end of the period studied, numerous cases of forms with the old spelling still remain, overall, in all three groups.

Figure/table captions: Extended List of Features Selected by Introspection; Experimental Results Using the Short Deductive Setting; Experimental Results Using the Long Deductive Setting; Experimental Results Using the Short Inductive Setting; Experimental Results Using the Long Inductive Setting
Corpus-Based Methods for Recognizing the Gender of Anthroponyms

August 2021


124 Reads


5 Citations

Names: A Journal of Onomastics

Rogelio Nazar



Nicolas Acosta




Sofía Zamora

This paper presents a series of methods for automatically determining the gender of proper names, based on their co-occurrence with words and grammatical features in a large corpus. Although the results obtained were for Spanish given names, the method presented here can be easily replicated and used for names in other languages. Most methods reported in the literature use pre-existing lists of first names that require costly manual processing and tend to become quickly outdated. Instead, we propose using corpora. Doing so offers the possibility of obtaining real and up-to-date name-gender links. To test the effectiveness of our method, we explored various machine-learning methods as well as another method based on simple frequency of co-occurrence. The latter produced the best results: 93% precision and 88% recall on a database of ca. 10,000 mixed names. Our method can be applied to a variety of natural language processing tasks such as information extraction, machine translation, anaphora resolution or large-scale delivery or email correspondence, among others.

New verbs and dictionaries: A method for the automatic detection of neology in Spanish verbs

June 2021


17 Reads

International Journal of Lexicography

The appearance of new verbs can be observed regularly, but verbs are not frequently investigated in neology, and they are difficult to detect automatically. In this study, a corpus-based method is proposed to detect Spanish verbs with a series of algorithms that analyse the morphology of regular verbs. The vocabulary was drawn from a large corpus and contrasted with a major dictionary of Spanish. Then, a series of filters were applied to distinguish between valid neologism candidates and spelling mistakes. Around 88% of the neologisms proposed by the method were correct and we estimate that the system detected 76% of the neologisms present in the corpus. This procedure can be included in the workflow of a lexicographic project as a regular part of the task, as a systematic way of collecting new verbs from the data and avoiding under-representation or bias.

Pruning and repopulating a lexical taxonomy: experiments in Spanish, English and French

December 2020


92 Reads


5 Citations

In this paper we present the problem of a noisy lexical taxonomy and suggest two tasks as potential remedies. The first task is to identify and eliminate incorrect hypernymy links, and the second is to repopulate the taxonomy with new relations. The first task consists of revising the entire taxonomy and returning a Boolean for each assertion of hypernymy between two nouns (e.g. brie is a kind of cheese). The second task consists of recursively producing a chain of hypernyms for a given noun, until the most general node in the taxonomy is reached (e.g. brie → cheese → food → etc.). In order to achieve these goals, we implemented a hybrid hypernym-detection algorithm that incorporates various intuitions, such as syntagmatic, paradigmatic and morphological association measures as well as lexical patterns. We evaluate these algorithms individually and collectively and report findings in Spanish, English and French.

Corpus-Based Methods for Recognizing the Gender of Anthroponyms

November 2020


51 Reads

Names: A Journal of Onomastics

This paper presents a series of methods for automatically determining the gender of proper names, based on their co-occurrence with words and grammatical features in a large corpus. Although the results obtained were for Spanish given names, the method presented here can be easily replicated and used for names in other languages as well. Most methods reported in the literature use pre-existing lists of first names that require costly manual processing and tend to become quickly outdated. Instead, we propose using corpora. Doing so offers the possibility of obtaining real and up-to-date name-gender links. To test the effectiveness of our method, we explored various machine learning methods as well as another method based on simple frequency of co-occurrence. The latter produced the best results: 93% precision and 88% recall on a database of ca. 10,000 mixed names. Our method can be applied to a variety of natural language processing tasks such as information extraction, machine translation, anaphora resolution or large-scale delivery or email correspondence, among others.

Citations (30)

... Utterance length measures (Nazar, 2024) are applied to the analysis of the length of sentences, paragraphs or texts by counting their words, letters or morphemes, and they have been used in various contexts. One of the first applications found was as a measure of syntactic complexity (Flesch, 1949; Brown, 1973), which is applicable, in turn, to the study of language development in children, among other possibilities. ...


Statistical Modeling of Discourse Genres: The Case of the Opinion Column in Spanish

SN Computer Science

... We are currently witnessing the emergence of a marked interest in research on assessment and feedback through digital tools, as evidenced, in our region, by the recent publication of a special issue coordinated by Mateo-Girona et al. (2023). The most recent approaches address the potential of automatic writing evaluation programs (Nazar and Renau, 2023), as well as the impact of innovative pedagogical strategies that involve digital resources for assessment and commentary by teachers and among peers (García-Yeste, 2013; Martin, 2020). ...

Estilector: un sistema de evaluación automática de la escritura académica en castellano

Perspectiva Educacional

... Discourse markers (DMs) are linguistic units that serve to link propositions, to structure the information in a text, or to direct communicative exchanges (Robledo, 2021; Robledo & Nazar, 2023). DMs are invariable and do not fulfil a syntactic function in the sentence. ...

A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora
  • Citing Article
  • February 2023

International Journal of Corpus Linguistics

... it comes to gender (e.g., Nazar et al., 2021; Stahlberg et al., 2007), which has been proven in the experiments carried out by Wasserman and Weseley (2009), where students completing a survey of sexist attitudes in English expressed less sexist attitudes than those taking the same survey in Spanish or French. Keeping this idea in mind, we chose to focus on Spanish writers to unveil whether or not they might be influenced by their mother tongue, which is definitely gender-based. ...

Corpus-Based Methods for Recognizing the Gender of Anthroponyms

Names: A Journal of Onomastics

... Thus, some DMs make it possible to project an interlocutor's attitude towards an assumption to be communicated. This assumption ... Several studies have shown that novice students do not always recognise these functions correctly (Asenjo and Nazar, 2019; Landone, 2010). This affects the appropriateness of DM use in academic writing, which in turn affects the clarity, coherence and cohesion of texts. ...


RLA Revista de lingüística teórica y aplicada

... New words and meanings emerge everywhere and every time. Collective important events are linked to the surge of new words and meanings, e.g., economic crisis (Galanes-Santos, 2019; Renau, Nazar & Lecaros, 2020), environment (Sanmartín Sáez, 2016; Guslyakova, Valeeva & Vatkova, 2020), and, of course, the Covid-19 pandemic; cf. Klosa-Kückelhaus & Kernerman (2022) for a compilation of studies of neology during that period. ...

La evolución de las marcas ortográficas y tipográficas en los procesos de lexicalización de neologismos: Un estudio en el vocabulario de la crisis económica en prensa española
  • Citing Article
  • August 2020

Revista Española de Lingüística Aplicada/Spanish Journal of Applied Linguistics

... As source of the data, we used a sub-set of Verbario, an online database of Spanish verbs [29]. The Verbario structure is similar to that of the dictionaries, with a list of verbs and, for each verb, a list of meanings that are analyzed with a technique called Corpus Pattern Analysis, CPA [30]. ...

Verbo y contexto de uso: Un análisis basado en corpus con métodos cualitativos y cuantitativos

Revista signos

... Further work will be, indeed, a Discourse Marker Tagger that classifies DMs into their types and subtypes (following the work begun by Hernán and Nazar (2018) and Nazar (2021), section 1.3) because this would provide us more information about the financial text and its factual inferences. We will look to study derived from usage data, with less reliance on language knowledge, using the methodology proposed by these authors. ...

Clasificación automatizada de marcadores discursivos. Automatic categorization of discourse markers

Procesamiento de Lenguaje Natural

... The analysis of these different uses motivated the present research. In previous research (Robledo, Nazar, and Renau, 2017) we developed a method for the automatic induction of DM taxonomies from parallel corpora, using a clustering technique. For example, one of these clusters contains elements that we call counter-argumentative connectives, such as sin embargo, por el contrario or en vez de ello (see section 2.2). ...

Un enfoque inductivo y de corpus para la categorización de los marcadores del discurso en español