Article

A convolutional neural network approach for gender and language variety identification


Abstract

We present a method for gender and language variety identification using a convolutional neural network (CNN). We compare the performance of this method with a traditional machine learning algorithm, support vector machines (SVM), trained on character n-grams (n = 3–8) and lexical features (unigrams and bigrams of words), and their combinations. We use a single multi-labeled corpus composed of news articles in different varieties of Spanish, developed specifically for these tasks. We present a convolutional neural network architecture trained on word- and sentence-level embeddings that can be successfully applied to gender and language variety identification on a relatively small corpus (fewer than 10,000 documents). Our experiments show that the deep learning approach outperforms the traditional machine learning approach on both tasks when named entities are present in the corpus. However, when all named entities are reduced to a single symbol “NE” to avoid topic-dependent features, the drop in accuracy is larger for the deep learning approach.
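
For readers who want a concrete picture of the SVM baseline described in the abstract, the following scikit-learn sketch combines character n-grams (n = 3-8) with word unigrams/bigrams and a linear SVM. The weighting scheme, toy documents, and labels are illustrative assumptions, not the authors' exact configuration.

# Hedged sketch of the SVM baseline: character n-grams (3-8) plus word unigrams/bigrams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 8))),
        ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    ("svm", LinearSVC()),
])

# Hypothetical toy data: news articles labeled with a language variety (or gender).
docs = ["El equipo presentó su informe ayer.", "La jornada terminó con un empate."]
labels = ["mexico", "argentina"]
pipeline.fit(docs, labels)
print(pipeline.predict(["El informe se presentó esta mañana."]))
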


... Although some machine learning methods can achieve good results on certain tasks, due to the complexity of feature engineering [8], their effectiveness depends heavily on the feature representation, and it is difficult to achieve acceptable classification results [9,10]. With the popularity of deep learning, many deep learning methods have been applied to sentiment classification tasks [11,12,13]. Compared with machine learning methods, deep learning does not require manual feature extraction. ...
... The purpose of the sentiment classification task is to obtain the sentiment polarity of a sentence; since individual words carry only part of a sentence's information, a sentence encoder is needed to extract sentence-level features and generate a vector representation of the sentence. Convolutional neural networks (CNNs) were originally applied in the field of image processing, but in recent years they have also been widely applied in NLP [11,25] and have achieved good results. Their sparsely connected structure and weight sharing reduce the complexity of the model. ...
... Finally, the sentence representation obtained from the outputs of the shared encoder and the private encoder is passed through a number of fully connected layers (set to 3 in the experiment) for dimensionality reduction and then fed to the softmax classifier. We define the loss of the classifier as Equation (11), and the final loss function is shown in Equation (12). ...
Article
Full-text available
Sentiment classification is an interesting and crucial research topic in the field of natural language processing (NLP). Data-driven methods, including machine learning and deep learning techniques, provide one direct and effective solution to solve the sentiment classification problem. However, the classification performance declines when the input includes review comments for multiple tasks. The most appropriate way of constructing a sentiment classification model under multi-tasking circumstances remains questionable in the related field. In this study, aiming at the multi-tasking sentiment classification problem, we propose a multi-task learning model based on a multi-scale convolutional neural network (CNN) and long short term memory (LSTM) for multi-task multi-scale sentiment classification (MTL-MSCNN-LSTM). The model comprehensively utilizes and properly handles global features and local features of different scales of text to model and represent sentences. The multi-task learning framework improves the encoder quality, simultaneously improving the results of emotion classification. Six different types of commodity review datasets were employed in the experiment. Using accuracy and F1-score as the metrics to evaluate the performance of the proposed model, comparing with methods such as single-task learning and LSTM encoder, the proposed MTL-MSCNN-LSTM model outperforms most of the existing methods.
... The considered models were evaluated on the HispaBlogs and PAN17 corpora, where accuracies of up to 73.6% and 80.3% were reported, respectively. Other studies have considered the use of embedding layers within neural networks to automatically learn the text representations of the documents [13,14]. The advantage of these models is that the extracted embeddings are specifically designed for each corpus. ...
... The main objective of the present study is to discriminate between two Spanish dialects from different regions in Colombia, "Antioqueño" and "Bogotano", using both speech and language information. This can be a more challenging problem than the one addressed in related studies [6,14], considering that dialect differences within the same country can be more subtle than those observed between different countries that share the same native language. To address this objective, we compare different models created using speech signals and their manually extracted transliterations. ...
... CNNs have been shown to be effective in author profiling tasks such as personality and author identification [23] and [24]. In [25], the authors used a methodology based on word- and sentence-level embeddings with CNNs for gender and geographic identification. Word2Vec and FastText, a character-level variation of Word2Vec, were employed. ...
... They used a Word2Vec model as input for their deep learning architecture and reported accuracies of up to 72.2% and 91.4% for gender and LV recognition, respectively. Other studies have considered the use of embedding layers within neural networks to automatically learn text representations of the documents [25], [26]. The advantage of these models is that the extracted embeddings are specifically designed for each corpus. ...
Article
Full-text available
The interest in author profiling tasks has increased in the research community because computer applications have shown success in different sectors such as security, marketing, healthcare, and others. Recognition and identification of traits such as gender, age, or location based on text data can help to improve different marketing strategies. This type of technology has been widely discussed for documents taken from social media. However, its methods have been poorly studied on data with a more formal structure, where there is no access to emoticons, mentions, and other linguistic phenomena that are only present in social media. This paper proposes the use of recurrent and convolutional neural networks and a transfer learning strategy to recognize two demographic traits, i.e., gender and language variety, in documents written in informal and formal language. The models were tested on two different databases consisting of tweets (informal) and call-center conversations (formal). Accuracies of up to 75% and 68% were achieved in the recognition of gender in documents with informal and formal language, respectively. Moreover, regarding language variety recognition, accuracies of 92% and 72% were obtained in informal and formal text scenarios, respectively. The results indicate that, for the traits considered in this paper, it is possible to transfer the knowledge from a system trained on a specific type of expression to another one where the structure is completely different and data are scarcer.
... With the fast development of e-commerce, automated sentiment classification (ASC) methods for reviews of various products are in demand in the field of natural language processing (NLP) [1]. ASC methods classify reviews into positive/negative sentiment classes with satisfactory efficiency and accuracy [2]. ...
... Inspired by the human ability to handle multiple tasks simultaneously, the multi-task learning neural network (MTL-NN) has been proposed, extending the NN with a more sophisticated internal structure. The MTL-NN is a hierarchical neural network that performs sentiment analysis while receiving data from multiple tasks as input [1]. For example, an online shopping website contains review comments associated with various products, such as books, televisions, mobile phones, etc. A traditional single-task learning (STL) NN has difficulty analyzing text that mixes different product types. ...
Article
Full-text available
In the era of big data, multi-task learning has become one of the crucial technologies for sentiment analysis and classification. Most of the existing multi-task learning models for sentiment analysis are developed based on the soft-sharing mechanism that has less interference between different tasks than the hard-sharing mechanism. However, there are also fewer essential features that the model can extract with the soft-sharing method, resulting in unsatisfactory classification performance. In this paper, we propose a multi-task learning framework based on a hard-sharing mechanism for sentiment analysis in various fields. The hard-sharing mechanism is achieved by a shared layer to build the interrelationship among multiple tasks. Then, we design a task recognition mechanism to reduce the interference of the hard-shared feature space and also to enhance the correlation between multiple tasks. Experiments on two real-world sentiment classification datasets show that our approach achieves the best results and improves the classification accuracy over the existing methods significantly. The task recognition training process enables a unique representation of the features of different tasks in the shared feature space, providing a new solution reducing interference in the shared feature space for sentiment analysis.
... DL is able to perform tasks such as regression and classification. There are many different DL architectures available in the literature such as convolutional neural networks (CNN), 5 Boltzmann machine, 6 long short-term memory (LSTM) networks, 7 feedforward deep networks, 8,9 deep belief networks, 10 recurrent neural networks (RNN), 11,12 and gated recurrent units 13,14 (GRU). ...
... RNN uses the backpropagation through time batch algorithm for updating weights. 5 This is based on the following formula: ...
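
The formula itself is cut off in the excerpt; for general reference only (not necessarily the equation the cited work gives), the standard backpropagation-through-time gradient of the loss with respect to a recurrent weight matrix W in a simple RNN with hidden states h_t is:

\frac{\partial \mathcal{L}}{\partial W}
  = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W}
  = \sum_{t=1}^{T} \sum_{k=1}^{t}
    \frac{\partial \mathcal{L}_t}{\partial h_t}
    \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
    \frac{\partial h_k}{\partial W}
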
Article
In this study, we tackle the problem of author profiling. The aim of the proposed approach is to determine the author's age and gender. Once the user connects to the company website, this company collects the available data about him (which is usually very limited). Then, the user receives a service recommendation according to his gender and age. Thus, a context‐specific decision‐making system based on these limited data is required to produce an efficient classification. Such a decision system allows companies to promote their marketing. To obtain the best categorization, machine learning (ML) and deep learning (DL) techniques have been applied in the literature. In this article, we apply both classical ML techniques and recently developed DL techniques. More precisely, we adopt the gated recurrent unit model. Our experiments show that our findings are positively comparable with the best state‐of‐the‐art methods.
... The Social Media Mining for Health Applications (SMM4H) Shared Task involves natural language processing (NLP) challenges of using social media data for health research, including informal, colloquial expressions and misspellings of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts (Gasco et al., 2022). As computational analysis opens up new opportunities for researching complex topics using social media data, models are being developed to automatically detect demographic information such as users' age (Klein et al., 2021; Tonja et al., 2022), language (Sarkar et al., 2016; Aroyehun and Gelbukh, 2020), gender (Markov et al., 2017; Gómez-Adorno et al., 2019), medical history (Lee et al., 2021), and so on. ...
Conference Paper
This paper describes our submissions for the Social Media Mining for Health (SMM4H) 2022 shared tasks. We participated in 2 tasks: a) Task 4: Classification of tweets self-reporting exact age and b) Task 9: Classification of Reddit posts self-reporting exact age. We evaluated two transformer-based models (BERT and RoBERTa) for both tasks. RoBERTa-Large achieved an F1 score of 0.846 on the Task 4 test set, and BERT-Large achieved an F1 score of 0.865 on the Task 9 test set.
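
A rough sketch of fine-tuning such a transformer for this kind of binary classification with the Hugging Face transformers library; the checkpoint, toy examples, and hyperparameters below are placeholders rather than the team's actual setup.

# Hedged sketch: fine-tune RoBERTa for classifying posts that self-report exact age.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

checkpoint = "roberta-base"  # placeholder; the paper used large checkpoints
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["I just turned 25 today!", "Nice weather we are having."]  # toy examples
labels = [1, 0]
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class TinyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        return {k: v[i] for k, v in enc.items()} | {"labels": torch.tensor(labels[i])}

args = TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=TinyDataset()).train()
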
... On the other hand, unsupervised systems are more versatile across various types of texts and domains, and can be implemented more easily once the lexical and semantic resources have been created. Previous work uses various machine learning algorithms [5,9,15] and neural models [1,7] to predict text, including convolutional models [13,6] and attention models [2,4]. ...
Chapter
Full-text available
Deriving intelligence from text is important as it can provide valuable information on how events influence public opinion. In this work, a classification task was carried out to obtain the sentiment polarity of economic texts using machine learning and deep learning methods. We analyzed the texts for keywords that can be categorized into positive, negative, and neutral reviews and extracted further insights. In the final three-class classification (positive, negative, and neutral), the models were generally unable to reach 80% accuracy; only one variant achieved 80% as the best result on the test dataset.
Article
Full-text available
In this work, a study was carried out using n-grams to classify sentiments with different machine learning and deep learning methods. We used this approach, which combines existing techniques, on the problem of predicting sequence tags to understand the advantages and problems of using unigrams, bigrams, and trigrams to analyse economic texts. Our study aims to fill this gap by evaluating the performance of these n-gram features on different texts in the economic domain using nine sentiment analysis techniques. We show that by comparing the performance of these features on different datasets and using multiple learning techniques, useful intelligence can be extracted. The evaluation involves assessing the precision, recall, F1-score, and accuracy of the outputs of the several machine learning algorithms proposed. The methods were tested using the Amazon, IMDB, Reuters, and Yelp economic review datasets, and our comprehensive experiments show the effectiveness of n-grams in the analysis of sentiments.
Article
Full-text available
It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with non-relevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time you hit a page, so the HTML document will be different and hashing of the content will not be possible. Therefore, we need to filter out the non-relevant text of documents. The automatic extraction of relevant text from online documents (news articles, etc.) is not a trivial task. There are many algorithms for this purpose described in the literature. One of the most popular is Boilerpipe, and its performance is among the best. In this paper, we present a method that improves the precision of the Boilerpipe algorithm by using the HTML tree for selection of the relevant content. Our filter greatly increases precision (at least 15%), at the cost of some recall, resulting in an overall F1-measure improvement (around 5%). We ran the experiments on news articles using our own corpus of 2,400 news articles in Spanish and 1,000 in English.
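
As a rough illustration of filtering non-relevant content via the HTML tree (not the authors' Boilerpipe-based method), one can discard subtrees that rarely contain article text before extracting it:

# Hedged sketch: drop nodes that typically carry ads, navigation, and related links.
from bs4 import BeautifulSoup

def extract_main_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove subtrees that rarely contain article text.
    for tag in soup(["script", "style", "nav", "aside", "footer", "header", "form"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

html = "<html><body><nav>Home | News</nav><p>Main article text.</p><aside>Ad</aside></body></html>"
print(extract_main_text(html))  # e.g. "Main article text."
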
Article
Full-text available
Recently, document embedding methods have been proposed aiming at capturing hidden properties of texts. These methods make it possible to represent documents as fixed-length, continuous, and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words. We use the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of Cross-Topic Authorship Attribution and conducted experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
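
A minimal gensim sketch of the core idea, training Paragraph Vector (Doc2Vec) over character 3-gram "tokens" instead of words; the corpus and hyperparameters are toy values, not the paper's settings.

# Hedged sketch: Doc2Vec over character n-gram tokens.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

corpus = ["the quick brown fox", "a slow brown dog"]
tagged = [TaggedDocument(words=char_ngrams(doc), tags=[i]) for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, epochs=40)
print(model.dv[0][:5])  # first document's embedding (first 5 dimensions)
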
Article
Full-text available
To improve classification performance for Chinese short text with automatic semantic feature selection, in this paper we propose Hybrid Attention Networks (HANs), which combine word- and character-level selective attention. The model first applies an RNN and a CNN to extract the semantic features of texts. It then captures class-related attentive representations from the word- and character-level features. Finally, all of the features are concatenated and fed into the output layer for classification. Experimental results on 32-class and 5-class datasets show that our model outperforms multiple baselines by combining not only the word- and character-level features of the texts but also class-related semantic features via the attention mechanism.
Article
Full-text available
At the present frontier of Big Data, prediction tasks over the nodes and edges of complex deep architectures need a careful representation of features, assigning hundreds of thousands, or even millions, of labels and samples for information access systems, especially for hierarchical extreme multi-label classification. We introduce edge2vec, an edge representation framework for learning discrete and continuous features of edges in deep architectures. In edge2vec, we learn a mapping of edges associated with nodes, where random samples are augmented by statistical and semantic representations of words and documents. We argue that infusing semantic representations of features for edges by exploiting word2vec and para2vec is the key to learning richer representations for exploring target nodes or labels in the hierarchy. Moreover, we design and implement a balanced stochastic dual coordinate ascent (DCA)-based support vector machine for speeding up training. We introduce global decision-based top-down walks instead of random walks to predict the most likely labels in the deep architecture. We assess the efficiency of edge2vec against existing state-of-the-art techniques on extreme multi-label hierarchical as well as flat classification tasks. The empirical results show that edge2vec is very promising and computationally very efficient in fast learning and prediction tasks. Within the deep learning workbench, edge2vec represents a new direction for statistical and semantic representations of features in task-independent networks.
Conference Paper
Full-text available
We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, log-entropy weighting, tf-idf), machine-learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, ensemble classifiers, meta-classifiers), and frequency threshold values. We adjusted system configurations for each of the languages and subtasks.
Conference Paper
Full-text available
This paper presents the CIC UALG's system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year's task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
Article
Full-text available
This article presents a method for computing the similarity between programs (source code). The task is useful, for example, for topic-based classification of programs or for detecting code reuse (say, in the case of plagiarism). We use the Karel programming language for the experiments. To determine the similarity between programs and/or similar solution ideas, we use an approach based on natural language processing and information retrieval techniques. These techniques represent a document as a vector of feature values. Usually, the features are word or character n-grams. Latent semantic analysis can then be applied to reduce the dimensionality of this vector space. Finally, supervised machine learning is used to classify similar texts (or programs, which are texts as well). To validate the proposed method, we compiled a corpus of programs for 100 different tasks with a total of 9,341 source codes, and another corpus for 34 tasks, additionally classified by solution idea, consisting of 374 codes. The experimental results show that for the corpus with solution ideas the best representation is character trigrams, while for the complete corpus the best results are obtained with term trigrams and the application of latent semantic analysis.
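
A minimal sketch of the pipeline this abstract describes (character trigram features, latent semantic analysis via truncated SVD, then a supervised classifier). The toy Karel-like programs, labels, and component sizes are placeholders, not the authors' corpus or settings.

# Hedged sketch: character trigrams + LSA + linear SVM for program classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("trigrams", CountVectorizer(analyzer="char", ngram_range=(3, 3))),
    ("lsa", TruncatedSVD(n_components=2)),   # kept tiny for the toy example
    ("clf", LinearSVC()),
])

programs = ["move(); turnleft(); move();", "pickbeeper(); move();",
            "turnleft(); turnleft();", "move(); move(); putbeeper();"]
tasks = ["walk", "collect", "walk", "collect"]  # hypothetical task labels
pipeline.fit(programs, tasks)
print(pipeline.predict(["move(); turnleft();"]))
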
Article
Full-text available
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
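
A minimal gensim sketch of training such continuous word representations (skip-gram here); the toy corpus and hyperparameters are only for illustration.

# Hedged sketch: train word2vec (skip-gram) embeddings and query nearest neighbors.
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "country"],
             ["the", "queen", "rules", "the", "country"],
             ["the", "dog", "chases", "the", "cat"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)
print(model.wv.most_similar("king", topn=2))
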
Article
Full-text available
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
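
For context, scikit-learn's linear models use LIBLINEAR as a backend, so a quick way to try it from Python is the following sketch (synthetic data and default settings, purely illustrative):

# Hedged sketch: LIBLINEAR-backed classifiers via scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

svm = LinearSVC().fit(X, y)                                # L2-regularized linear SVM
logreg = LogisticRegression(solver="liblinear").fit(X, y)  # logistic regression
print(svm.score(X, y), logreg.score(X, y))
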
Article
Full-text available
Output coding is a general framework for solving multiclass categorization problems. Previous research on output codes has focused on building multiclass machines given predefined output codes. In this paper we discuss for the first time the problem of designing output codes for multiclass problems. For the design problem of discrete codes, which have been used extensively in previous works, we present mostly negative results. We then introduce the notion of continuous codes and cast the design problem of continuous codes as a constrained optimization problem. We describe three optimization problems corresponding to three different norms of the code matrix. Interestingly, for the l2 norm our formalism results in a quadratic program whose dual does not depend on the length of the code. A special case of our formalism provides a multiclass scheme for building support vector machines which can be solved efficiently. We give a time and space efficient algorithm for solving the quadratic program. We describe preliminary experiments with synthetic data showing that our algorithm is often two orders of magnitude faster than standard quadratic programming packages. We conclude with the generalization properties of the algorithm.
Conference Paper
Named entities (NE) are words that refer to names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. Sometimes, NE are used in authorship attribution studies as a stylometric feature. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just discovering the appearance of the same NE? We used a corpus that consists of 91 novels by 7 authors of the XVIII century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams, and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4 and 7% NE. All novels were divided into blocks, each block containing 10,000 terms. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we use only the most frequent n-grams (500, 2,000 and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO), and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but for specific authors NE do not help and even make the classification worse (about 10% of the experimental data).
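
The NE-masking setting can be reproduced in spirit with any NER tool; the sketch below uses spaCy (an illustrative choice, not necessarily the tool used in this study or in the present article) to replace every detected entity with the single symbol "NE", as in the topic-independent evaluation described in the abstract above.

# Hedged sketch: replace named entities with the symbol "NE".
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def mask_named_entities(text):
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        out.append(text[last:ent.start_char])
        out.append("NE")
        last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(mask_named_entities("Elizabeth Bennet visited London with Mr. Darcy."))
# e.g. "NE visited NE with NE." (exact spans depend on the model)
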
Article
This article presents a deep learning based method for determining the author's personality type from text: given a text, the presence or absence of the Big Five traits is detected in the author's psychological profile. For each of the five traits, the authors train a separate binary classifier, with identical architecture, based on a novel document modeling technique. Namely, the classifier is implemented as a specially designed deep convolutional neural network, with injection of the document-level Mairesse features, extracted directly from the text, into an inner layer. The first layers of the network treat each sentence of the text separately; then the sentences are aggregated into the document vector. Filtering out emotionally neutral input sentences improved the performance. This method outperformed the state of the art for all five traits, and the implementation is freely available for research purposes.
Chapter
One of the most striking contemporary developments in the field of machine learning is the meteoric ascent of what has been called “deep learning,” and this area is now at the forefront of current research. We discuss key differences between traditional neural network architectures and learning techniques, and those that have become popular in deep learning. A detailed derivation of the backpropagation algorithm in vector-matrix form is provided, and the relationship to computational graphs and deep learning software is discussed. Deep convolutional neural networks are covered, as well as autoencoders, recurrent neural networks, and stochastic approaches based on Boltzmann machines. Key practical aspects of training these models with large data sets are discussed, along with the role of GPU computing.
Article
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
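
A minimal gensim FastText sketch of this subword model; min_n and max_n bound the character n-gram lengths, and the toy corpus only illustrates that out-of-vocabulary words still receive vectors built from their n-grams.

# Hedged sketch: FastText with character n-grams of length 3-6.
from gensim.models import FastText

sentences = [["unhappiness", "is", "a", "feeling"],
             ["happiness", "is", "another", "feeling"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=6, epochs=50)
# Out-of-vocabulary words still get vectors, composed from their character n-grams.
print(model.wv["unhappily"][:5])
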
Article
We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We first show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static word vectors. The CNN models discussed herein improve upon the state-of-the-art on 4 out of 7 tasks, which include sentiment analysis and question classification.
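
A minimal sketch of this kind of architecture (parallel convolutions of widths 3-5 over word embeddings, max-over-time pooling, dropout, softmax); the sizes are illustrative, and the embedding layer could be initialized with pre-trained word2vec vectors as in the paper.

# Hedged sketch: CNN for sentence classification with multiple filter widths.
import tensorflow as tf

vocab_size, seq_len, emb_dim, num_classes = 20000, 50, 300, 2

inputs = tf.keras.Input(shape=(seq_len,))
emb = tf.keras.layers.Embedding(vocab_size, emb_dim)(inputs)  # could load pre-trained vectors
pooled = []
for width in (3, 4, 5):
    conv = tf.keras.layers.Conv1D(100, width, activation="relu")(emb)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))  # max-over-time pooling
merged = tf.keras.layers.Concatenate()(pooled)
merged = tf.keras.layers.Dropout(0.5)(merged)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(merged)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
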
Article
We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
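
For reference, the per-dimension ADADELTA update rules, with decay rate \rho and small constant \epsilon, are:

E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2
\Delta x_t = -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t
E[\Delta x^2]_t = \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2
x_{t+1} = x_t + \Delta x_t
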
Article
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
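
In symbols, the truncated SVD behind this method and the standard folding-in of a query vector q into the k-dimensional space (k is about 100) can be written as:

X \approx X_k = U_k \Sigma_k V_k^{\top}
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

Documents (columns of V_k, scaled by \Sigma_k) and the folded-in query \hat{q} are then compared by cosine similarity.
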
F. Rangel, P. Rosso, M. Potthast, B. Stein: Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter.

D. Kodiyan, F. Hardegger, S. Neuhaus, M. Cieliebak: Author Profiling with Bidirectional RNNs using Attention with GRUs.

M. A. Álvarez-Carmona, A. P. López-Monroy, M. Montes-y-Gómez, L. Villaseñor-Pineda, H. Jair-Escalante: INAOE's Participation at PAN'15: Author Profiling Task. In: Working Notes Papers of the CLEF.

S. Sierra, M. Montes-y-Gómez, T. Solorio, F. González: Convolutional Neural Networks for Author Profiling in PAN.