Helena Gomez Adorno

Universidad Nacional Autónoma de México | UNAM · Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas

PhD in Computer Science

About

58
Publications
35,562
Reads
935
Citations
Introduction
My current research interests lie in the fields of natural language processing, computational linguistics, and information retrieval. I am particularly interested in automatic machine reading and semantic similarity, two related subtasks of natural language processing. My current work focuses primarily on developing new text representation structures that will facilitate automatic question answering over texts written in a natural language.
Additional affiliations
August 2016 - August 2016
Universidad Nacional de Asunción
Position
  • Postgraduate Course Teacher
Description
  • Advanced database course: Spatial databases using SQL Server technology (20 hours).
May 2015 - May 2015
University of the Aegean
Position
  • Research intern
January 2014 - present
Instituto Politécnico Nacional
Position
  • PhD Student
Education
August 2011 - October 2013
Benemérita Universidad Autónoma de Puebla
Field of study
  • Computer Science
January 2001 - December 2004
Universidad Nacional de Asunción
Field of study
  • Information Systems

Publications (58)
Article
Full-text available
Recently, document embedding methods have been proposed that aim to capture hidden properties of texts. These methods allow documents to be represented as fixed-length, continuous, and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vec...
Article
Full-text available
We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be le...
Article
Full-text available
Distributed word representation in a vector space (word embeddings) is a novel technique that allows words to be represented in terms of the elements in their neighborhood. Distributed representations can be extended to larger language structures like phrases, sentences, paragraphs and documents. The capability to encode semantic information of texts and...
Article
Full-text available
We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation allows integrating different levels of language description into a single structure. We extract textual patterns based on features obtained from shortest path walks over integrated syntactic graphs an...
Preprint
Full-text available
Given the current social distance restrictions across the world, most individuals now use social media as their major medium of communication. Millions of people suffering from mental diseases have been isolated due to this, and they are unable to get help in person. They have become more reliant on online venues to express themselves and seek advi...
Preprint
Full-text available
Both policy and research benefit from a better understanding of individuals' jobs. However, as large-scale administrative records are increasingly employed to represent labor market activity, new automatic methods to classify jobs will become necessary. We developed an automatic job offers classifier using a dataset collected from the largest job b...
Article
There is a large body of work aimed at classifying various types of documents, from literary texts to informal interactions on social networks such as Twitter, according to the sentiments they are intended to evoke. Very different classifications can be made depending on the sentiments the author considers. The o...
Article
Full-text available
In this work, we propose a novel approach to solve the authorship identification task on a cross-topic and open-set scenario. Authorship verification is the task of determining whether or not two texts were written by the same author. We model the documents in a graph representation and then a graph neural network extracts relevant features from th...
Article
Full-text available
The multi-label emotion classification task aims to identify all possible emotions in a written text that best represent the author’s mental state. In recent years, multi-label emotion classification attracted the attention of researchers due to its potential applications in e-learning, health care, marketing, etc. There is a need for standard benc...
Article
In this paper, we present authorship attribution methods applied to ¡El Mondrigo! (1968), a controversial text supposedly created by order of the Mexican Government to defame a student strike. Up to now, although the authorship of the book has been attributed to several journalists and writers, it could not be demonstrated and remains an open probl...
Chapter
Most of the datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded for their use in supervised learning algorithms. Although there are several encoding techniques, the most commonly used ones do not necessarily preserve possible patterns embedded in the data when they a...
Article
Context: Bug fixing is a frequent and important task in Open Source Software (OSS) development and involves the communication of messages, which can serve multiple purposes and affect the efficiency and effectiveness of corrective software activities. Objective: This work is aimed at studying the communication functions of bug comments and their...
Chapter
In this paper, we present a similarity-based approach towards paraphrase detection in Spanish. We evaluate various models for semantic similarity computation using a gold-standard paraphrase corpus. It contains one original document and paraphrased documents on different levels (low and high), and reference documents on the same topic or same vocab...
Chapter
Full-text available
Social networks have modified the way we communicate. It is now possible to talk to a large number of people we have never met. Knowing the traits of a person from what he/she writes has become a new area of computational linguistics called Author Profiling. In this paper, we introduce a method for applying transfer learning to address the gender i...
Article
Full-text available
The paper presents a new corpus for fake news detection in the Urdu language along with the baseline classification and its evaluation. With the escalating use of the Internet worldwide and substantially increasing impact produced by the availability of ambiguous information, the challenge to quickly identify fake news in digital media in various l...
Preprint
Software development tasks must be performed successfully to achieve software quality and customer satisfaction. Knowing whether software tasks are likely to fail is essential to ensure the success of software projects. Issue Tracking Systems store information of software tasks (issues) and comments, which can be useful to predict issue success; ho...
Article
Software development tasks must be performed successfully to achieve software quality and customer satisfaction. Knowing whether software tasks are likely to fail is essential to ensure the success of software projects. Issue Tracking Systems store information of software tasks (issues) and comments, which can be useful to predict issue success; ho...
Article
Word embeddings are powerful for many tasks in natural language processing. In this work, we learn word embeddings using weighted graphs from word association norms (WAN) with the node2vec algorithm. Although building WAN is a difficult and time-consuming task, training the vectors from these resources is a fast and efficient process. This allows u...
Article
We present a method for gender and language variety identification using a convolutional neural network (CNN). We compare the performance of this method with a traditional machine learning algorithm – support vector machines (SVM) trained on character n-grams (n = 3–8) and lexical features (unigrams and bigrams of words), and their combinations. We...
Article
Full-text available
This work introduces a lexical search model based on a type of knowledge graph, namely word association norms. The aim of the search is to retrieve a target word, given the description of a concept, i.e., the query. This differs from traditional information retrieval models, where complete documents related to the query are retrieved. Our algorithm...
Conference Paper
Full-text available
We present the CIC-GIL approach to the author profiling (AP) task at MEX-A3T 2018. The task consists of two subtasks: identification of authors' location (6-way) and occupation (8-way) in a corpus of Mexican Spanish tweets. We used the logistic regression algorithm trained on typed character n-grams, function-word n-grams, and regionalisms for loca...
Article
Full-text available
It is well-known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time you hit a page, so the HTML document will be different and hashing of the content will be no...
Article
Full-text available
In this paper, we present an approach to identify changes in the writing style of 7 authors of novels written in English. We defined 3 stages of writing for each author; each stage contains 3 novels with a maximum of 3 years between each publication. We propose several stylometric features to represent the novels in a vector space model. We use sup...
Conference Paper
Full-text available
We present the CIC systems submitted to the 2017 PAN shared task on Cross-Genre Gender Identification in Russian texts (RUSProfiling). We submitted five systems. One of them was based on a statistical approach using only lexical features, and the other four on machine-learning techniques using some combinations of gender-specific Russian grammatical fe...
Article
Full-text available
In this paper, we introduce an algorithm for obtaining the subtrees (continuous and non-continuous syntactic n-grams) from a dependency parse tree of a sentence. Our algorithm traverses the dependency tree of the sentences within a text document and extracts all its subtrees (syntactic n-grams). Syntactic n-grams are being successfully used in the...
Conference Paper
Full-text available
We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news...
Conference Paper
Full-text available
We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature represen...
Conference Paper
Full-text available
This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of different document features: typed and untyped character n-grams, and word n-grams. We experimented with two feature representation methods, log-entropy model, and tf-idf; while tuning minimum frequency threshold values to...
Conference Paper
Full-text available
To determine author demographics of texts in social media such as Twitter, blogs, and reviews, we use doc2vec document embeddings to train a logistic regression classifier. We experimented with age and gender identification on the PAN author profiling 2014–2016 corpora under both single- and cross-genre conditions. We show that under certain settin...
Article
Full-text available
A crucial step in plagiarism detection is text alignment. This task consists in finding similar text fragments between two given documents. We introduce an optimization methodology based on genetic algorithms to improve the performance of a plagiarism detection model by optimizing its input parameters. The implementation of the genetic algorithm is...
Conference Paper
Full-text available
This paper presents the CIC UALG's system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all la...
Conference Paper
Full-text available
This paper presents our approach to the Author Profiling (AP) task at PAN 2016. The task aims at identifying the author's age and gender under cross-genre AP conditions in three languages: English, Spanish, and Dutch. Our pre-processing stage includes reducing non-textual features to their corresponding semantic classes. We exploit typed character...
Conference Paper
Full-text available
This paper presents our approach for SemEval 2016 task 4: Sentiment Analysis in Twitter. We participated in Subtask A: Message Polarity Classification. The aim is to classify Twitter messages into positive, neutral, and negative polarity. We used a lexical resource for pre-processing of social media data and trained a neural network model for feature...
Article
Full-text available
In this work we present a lexical resource for preprocessing texts published on social networks, developed for the following languages: English, Spanish, Dutch, and Italian. The resource consists of dictionaries of slang words, abbreviations, contractions, and emoticons commonly used on social networks. The dictionaries were used...
Article
Full-text available
We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. Th...
Conference Paper
Full-text available
The paper describes our approach for the Authorship Identification task at the PAN CLEF 2015. We extract textual patterns based on features obtained from shortest path walks over Integrated Syntactic Graphs (ISG). Then we calculate a similarity between the unknown document and the known document with these patterns. The approach uses a predefined t...
Conference Paper
Full-text available
This paper describes our approach to the Author Profiling task at PAN 2015. Our method relies on syntactic features, such as syntactic-based n-grams of various types, to predict the age, gender, and personality traits of the author of a given text. In this paper, we describe the features used, the employed classification algorit...
Conference Paper
Full-text available
In this paper, we propose the application of the Tree Edit Distance (TED) for calculation of similarity between syntactic n-grams for further detection of soft similarity between texts. The computation of text similarity is the basic task for many natural language processing problems, and it is an open research field. Syntactic n-grams are text fea...
Conference Paper
Full-text available
The aim of this paper is to evaluate the use of content and style features in automatic classification of intentions of Tweets. For this we propose different style features and evaluate them using a machine learning approach. We found that although the style features by themselves are useful for the identification of the intentions of tweets, it is...
Conference Paper
Full-text available
This paper describes the approach used in the system for the Question Answering Task based on Entrance Exams, which was presented at the CLEF 2014. The task aims to evaluate methods of text understanding with reading comprehension tests. The system should read a given document and answer multiple-choice questions about it. Our approach transform...
Conference Paper
Full-text available
This paper presents a graph-based model of text representation. The model is applied to the case study of authorship attribution. The experiments were performed using a corpus of 500 documents written by 10 different authors (50 documents per author). The obtained results highlight the benefit of using text fea-...
Conference Paper
Full-text available
This paper presents a methodology for tackling the problem of question answering for reading comprehension tests. The implemented system accepts a document as input and answers multiple-choice questions about it. It uses the Lucene information retrieval engine for carrying out information extraction employing additional automated lingu...
Article
Full-text available
This paper presents a methodology for tackling the problem of answer validation in question answering for reading comprehension tests. The implemented system accepts a document as input and answers multiple-choice questions about it based on semantic similarity measures. It uses the Lucene information retrieval engine for carrying out...
Conference Paper
Full-text available
In this paper we explore models based on decision trees and neural networks for predicting ozone levels. We worked with a data set from the Atmospheric Monitoring System of Mexico City (SIMAT), which includes hourly measurements from 2010 to 2011. The data come from three meteorological stations: Pedregal, Tlalnepantla and Xalo...
Conference Paper
Full-text available
The growing use of information technologies such as mobile devices has had a major social and technological impact, such as the growing use of Short Message Services (SMS), a communication system broadly used by cellular phone users. In 2011, it was estimated that over 5.6 billion mobile phones were sending between 30 and 40 SMS per month. Hence the great i...
Conference Paper
Full-text available
The growing use of information technologies such as mobile devices has had a major social and technological impact, such as the growing use of Short Message Services (SMS), a communication system broadly used by cellular phone users. Hence the great importance of analyzing representation and normalization techniques for this kind of text. In this p...
