Grigori Sidorov

Instituto Politécnico Nacional | IPN · Centro de Investigación en Computación (CIC)

PhD

239
Publications
56,225
Reads
2,489
Citations

Publications (239)
Conference Paper
Full-text available
Reorganizing the words in a passage using synonyms and different wording, without changing the main message of the original sentence, is called paraphrasing. It is used for simplification, clarification, quoting, etc. In this paper, we address a Paraphrase Identification model for Mexican Spanish text pairs. A data augmentation step was done using Google T...
Conference Paper
Full-text available
The article explains the model submission by the team CIC for "Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO)" at PAN 2022. Irony profiling can help in identifying stereotype spreaders and can enhance the understanding of author behaviours. We proposed a methodology focusing on feature engineering to classify irony for long texts b...
Conference Paper
Full-text available
The number of acronyms in texts is growing as the number of scientific articles increases, and this is not limited to English texts. The Acronym Extraction (AE) task aims at automatically identifying and extracting the acronyms and their long forms in the given text. To tackle the challenge of AE in different languages, this paper describe...
Conference Paper
Full-text available
Analyzing sentiments or opinions in code-mixed languages is gaining importance due to the increase in the use of social media and online platforms, especially during the Covid-19 pandemic. In a multilingual society like India, code-mixing and script mixing are quite common, as people, especially the younger generation, are quite familiar with using more than...
Conference Paper
Full-text available
Offensive Language Identification (OLI) in code-mixed under-resourced Dravidian languages is a challenging task due to the complex characteristics of code-mixed text and scarcity of digital resources and tools to process these languages. This paper describes the strategy proposed by our team MUCIC for the 'Dravidian-CodeMix-HASOC2021' shared task w...
Conference Paper
Full-text available
Social media usually consists of various forms of toxic contents such as Hate Speech (HS) and contents in offensive and abusive languages, in addition to useful and relevant ones. The offensive contents on social media may target a religion, community, individual or group of people, with specific thoughts and beliefs. A category of offensive conten...
Conference Paper
Full-text available
Identifying fake news shared on social media is a vital task because of its immense negative effects on society, a community, an individual, or whoever is the target. Manually controlling and managing the fake news shared on social media is impractical due to the increasing number of social media users, increasing volume of fake news...
Conference Paper
Full-text available
Spreading positive vibes or hope content on social media may help many people to get motivated in their life. To address Hope Speech detection in YouTube comments, this paper describes the models submitted by our team, MUCIC, to the Hope Speech Detection for Equality, Diversity, and Inclusion (HopeEDI) shared task at Association fo...
Conference Paper
Full-text available
Abusive language content such as hate speech, profanity, and cyberbullying, which is common on online platforms, is creating a lot of problems for users as well as policy makers. Hence, detection of such abusive language in user-generated online content has become increasingly important over the past few years. Online platforms strive hard to...
Conference Paper
Full-text available
Hope is an inherent part of human life and essential for improving the quality of life. Hope increases happiness and reduces stress and feelings of helplessness. Hope speech is the desired outcome for better and can be studied using text from various online sources where people express their desires and outcomes. In this paper, we address a deep-le...
Article
Full-text available
Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to detect emotions from Urdu. The morphological and...
Conference Paper
Full-text available
Profane or abusive speech with the intention of humiliating and targeting individuals, a specific community or groups of people is called Hate Speech (HS). Identifying and blocking HS contents is only a temporary solution. Instead, developing systems that are able to detect and profile the content polluters who share HS will be a better option. In...
Conference Paper
Full-text available
Social media analytics are being widely explored by researchers for various applications. Prominent among them is identifying and blocking abusive content, especially content targeting individuals and communities, for various reasons. The increase in abusive content and in the number of users on social media demands automated tools to detect and fi...
Article
Full-text available
Recently, a wide range of small devices, such as Wi-Fi Internet of Things development boards, which are a kind of microcontroller unit on a general-purpose board, have become interconnected throughout the planet. In addition, certain microcontroller units interact inside our homes when turning lights on or detecting movements, measuring various paramete...
Article
In spite of having been investigated for over fifty years, developing a robust spoken dialog management system remains an open research issue in robotics and natural language processing. In this paper, we present a language-independent spoken dialog management module integrated into a human-robot interaction system. We adopt an algorithmic approach...
Article
We present a method for gender and language variety identification using a convolutional neural network (CNN). We compare the performance of this method with a traditional machine learning algorithm – support vector machines (SVM) trained on character n-grams (n = 3–8) and lexical features (unigrams and bigrams of words), and their combinations. We...
Article
Full-text available
Wireless sensor networks (WSNs) consist of a large number of small devices or nodes, called micro controller units (MCUs) and located in homes and/or offices, to be operated through the internet from anywhere, making these devices smarter and more efficient. Quality of service routing is one of the critical challenges in WSNs, especially in surveil...
Chapter
We have conducted various experiments [93] in order to test the usefulness of the concept of syntactic n-grams. Essentially, we consider the task of authorship attribution, i.e., there are texts for which the authors are known and a text for which we have to determine the author (among the considered authors only). In our case, we use a corpus comp...
Chapter
Computational linguistics is an important area within the field of linguistics. Computational methods used in computational linguistics originate from computer science, or, to be more specific, from artificial intelligence. In fact, a large part of modern computational linguistics consists in the application of machine learning methods to large textual d...
Chapter
In this and the following chapters, we present two ideas related to the non-linear construction of n-grams. Recall that the non-linear construction consists in taking the elements which form n-grams in a different order than the surface (textual) representation, i.e., in a different way than words (lemmas, POS tags, etc.) appear in a text. Here we...
Chapter
So, we have already learned how to obtain syntactic n-grams (although, at the moment, we are considering only continuous syntactic n-grams). Now let’s discuss what types of syntactic n-grams exist depending on the elements they are formed of, i.e., what kind of elements (components) can be parts of syntactic n-grams. In fact, the considerations to...
Chapter
In this chapter, we discuss the features that are used for text representation while comparing them in vector space model, such as words or n-grams. We also present the possible values of these features: tf, idf, and tf-idf. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2019.
Chapter
Another idea related to the non-linear construction of n-grams, i.e., using distinct elements or distinct order of their appearance in a text, is the idea of replacing words by their synonyms or by the generalized concepts that correspond to the words according to a certain ontology.
Chapter
The vector space model is a widely used model in computer science. Its wide use is due to the simplicity of the model and its very clear conceptual basis that corresponds to the human intuition in processing information and data. The idea behind the model is very simple, and it is an answer to the question, how can we compare objects in a formal wa...
Chapter
After building the vector space model, we can represent and compare any type of objects of our study. Now we can discuss the question whether we can improve the vector space we have built. The importance of this question is related to the fact that the vector space model can have thousands of features, and possibly many of these features are redund...
Chapter
In the previous chapters, we introduced the new concept of syntactic n-grams, i.e., n-grams obtained following paths in syntax trees. The discussion that follows in this chapter addresses the comparison of continuous syntactic n-grams with non-continuous syntactic n-grams (i.e., n-grams with bifurcations (ramifications) and n-grams without them).
Chapter
Full-text available
As we have already mentioned, the main idea of the formal features applicable in computational linguistics is related to the vector space model and the use of n-grams as features in this space, which also includes unigrams, i.e., words. The words are considered in their contexts. Usually, the context consists of neighboring words. But some words that have syntactic relat...
Chapter
As we mentioned earlier in the book, in the automatic analysis of natural language (natural language processing, NLP) and in computational linguistics, machine learning methods are becoming more and more popular. Applying these methods increasingly gives better results. In this chapter, we describe the design of experiments in computational lingusi...
Chapter
As we described in the previous chapters, the mainstream of modern computational linguistics is based on the application of machine learning methods. We represent our task as a classification task, represent our objects formally using features and their values (constructing a vector space model), and then apply well-known classification algorithms. In th...
Chapter
In this section, we provide examples of the continuous and non-continuous syntactic n-gram construction for Spanish. We analyze the sample sentence provided in the previous chapter.
Chapter
The question arises, how to represent non-continuous syntactic n-grams without resorting to their graphic form? Recall that continuous syntactic n-grams are simply sequences of words (obtained by following paths in a syntactic tree), but the case of the non-continuous syntactic n-gram is rather different.
Chapter
In recent years, a novel paradigm appeared related to the application of neural networks to any task related to artificial intelligence [59], in particular, in natural language processing [39]. It became extremely popular in the NLP area after the works of Mikolov et al. starting in 2013 [74, 75]. The main idea of this paradigm is to apply neural networks for...
Article
This book is about a new approach in the field of computational linguistics related to the idea of constructing n-grams in a non-linear manner, while the traditional approach consists in using the data from the surface structure of texts, i.e., the linear structure. In this book, we propose and systematize the concept of syntactic n-grams, which allo...
Conference Paper
Full-text available
In this paper, we describe the CIC-IPN submissions to the shared task on Indian Native Language Identification (INLI 2018). We use the Support Vector Machines algorithm trained on numerous feature types: word, character, part-of-speech tag, and punctuation mark n-grams, as well as character n-grams from misspelled words and emotion-based features....
Article
The present study deals with the detection of negative emotions in informal short texts (tweets). Our work takes advantage of several features of social networks, particularly their availability and confidence they offer users in terms of reflecting their emotions. The corpus of tweets was manually marked with emotions. The corpus was balanced beca...
Conference Paper
Full-text available
We explore the hypothesis that emotion is one of the dimensions of language that surfaces from the native language into a second language. To check the role of emotions in native language identification (NLI), we model emotion information through polarity and emotion load features, and use document representations using these features to classify t...
Chapter
Distance and similarity measures are essential to solve many pattern recognition problems such as classification, information retrieval and clustering, where the use of a specific distance could lead to a better performance than others. A weighted cosine distance is proposed considering a variation in the weights of exclusive attributes of the input...
Conference Paper
Full-text available
We present the CIC-GIL approach to the author profiling (AP) task at MEX-A3T 2018. The task consists of two subtasks: identification of authors' location (6-way) and occupation (8-way) in a corpus of Mexican Spanish tweets. We used the logistic regression algorithm trained on typed character n-grams, function-word n-grams, and regionalisms for loca...
Article
Full-text available
Recently, document embedding methods have been proposed aiming at capturing hidden properties of the texts. These methods allow documents to be represented in terms of fixed-length, continuous and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vec...
Article
Full-text available
It is well-known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time you hit a page, so the HTML document will be different and hashing of the content will be no...
Article
Recently, the extraction of clinical events from unstructured medical texts has attracted much attention of the research community. Machine learning approaches are popular for this task, due to their ability to solve the problem of sequence tagging effectively. It has been suggested previously that simple features, such as word unigrams, part-of-sp...
Article
Full-text available
In this paper, we present an approach to identify changes in the writing style of 7 authors of novels written in English. We defined 3 stages of writing for each author, each stage contains 3 novels with a maximum of 3 years between each publication. We propose several stylometric features to represent the novels in a vector space model. We use sup...
Article
Biological effects of hormones in both plants and animals are based on high-affinity interaction with cognate receptors resulting in their activation. The signal of cytokinins, classical plant hormones, is perceived in Arabidopsis by three homologous membrane receptors: AHK2, AHK3, and CRE1/AHK4. To study the cytokinin-receptor interaction, we used...
Article
Full-text available
For the Authorship Attribution (AA) task, some categories of character n-grams are more predictive than others, both under single- and cross-topic AA conditions. Taking into account the good performance of character n-grams, in this paper, we examine different features: various types of syllable n-grams as features (for single- and cross-topic AA in...
Chapter
The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in various independent Authorship Attribution (AA) studies. Moreover, it has been shown that some categories of character n-grams perform better than others both under single and cross-topic AA conditions. In this work, we presen...
Conference Paper
Full-text available
We present the CIC systems submitted to the 2017 PAN shared task on Cross-Genre Gender Identification in Russian texts (RUSProfiling). We submitted five systems. One of them was based on a statistical approach using only lexical features, and the other four on machine-learning techniques using some combinations of gender-specific Russian grammatical fe...
Article
Full-text available
In this paper, we introduce an algorithm for obtaining the subtrees (continuous and non-continuous syntactic n-grams) from a dependency parse tree of a sentence. Our algorithm traverses the dependency tree of the sentences within a text document and extracts all its subtrees (syntactic n-grams). Syntactic n-grams are being successfully used in the...
Article
Full-text available
Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to complexities of semantic interactions among words. In this paper, we present an embedding m...
Conference Paper
Good pedagogical actions are key components in all learning-teaching schemes. Automating them is an important objective of Intelligent Tutoring Systems. We propose applying a Partially Observable Markov Decision Process (POMDP) in order to obtain automatic and optimal recommended pedagogical action patterns for the benefit of human students, in the context of Int...
Conference Paper
Full-text available
We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news...
Conference Paper
Full-text available
We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature represen...
Conference Paper
Full-text available
This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of different document features: typed and untyped character n-grams, and word n-grams. We experimented with two feature representation methods, log-entropy model, and tf-idf; while tuning minimum frequency threshold values to...
Conference Paper
Full-text available
We present the CIC-FBK system, which took part in the Native Language Identification (NLI) Shared Task 2017. Our approach combines features commonly used in previous NLI research, i.e., word n-grams, lemma n-grams, part-of-speech n-grams, and function words, with recently introduced character n-grams from misspelled words, and features that are nov...
Article
Partially observable Markov decision processes (POMDPs) are mathematical models for the planning of action sequences under conditions of uncertainty. Uncertainty in POMDPs is manifested in two ways: uncertainty in the perception of model states and uncertainty in the effects of actions on states. The diagnosis and treatment of cerebral vascular dis...
Conference Paper
Named entities (NE) are words that refer to names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. Sometimes, NE are used in authorship attribution studies as a stylometric feature. Th...
Conference Paper
Full-text available
To determine author demographics of texts in social media such as Twitter, blogs, and reviews, we use doc2vec document embeddings to train a logistic regression classifier. We experimented with age and gender identification on the PAN author profiling 2014–2016 corpora under both single- and cross-genre conditions. We show that under certain settin...
Article
Full-text available
A crucial step in plagiarism detection is text alignment. This task consists in finding similar text fragments between two given documents. We introduce an optimization methodology based on genetic algorithms to improve the performance of a plagiarism detection model by optimizing its input parameters. The implementation of the genetic algorithm is...
Conference Paper
Full-text available
This paper presents the CIC UALG's system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all la...
Conference Paper
Full-text available
The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in various independent Authorship Attribution (AA) studies. Moreover, it has been shown that some categories of character n-grams perform better than others both under single and cross-topic AA conditions. In this work, we presen...
Article
Full-text available
Distributed word representation in a vector space (word embeddings) is a novel technique that allows words to be represented in terms of the elements in their neighborhood. Distributed representations can be extended to larger language structures like phrases, sentences, paragraphs and documents. The capability to encode semantic information of texts and...