Chapter

Estimating Aggressiveness of Russian Texts by Means of Machine Learning


Abstract

This paper considers the emotional assessment of Russian-language texts using machine learning, taking aggression detection as an example. It summarizes related work, methods, models and datasets, describes current problems, and proposes a text processing pipeline and a software system for training neural networks on heterogeneous datasets. The experiments show that neural networks trained on annotated corpora in both Russian and English make it possible to determine whether a text item in Russian contains an aggressive message. The authors thoroughly compare different assessment methods, in particular corpus-based approaches, machine learning solutions and hybrid variants. The results can be used to estimate the probability of aggressiveness, for example, to rank messages for subsequent manual verification. They also enable feasibility studies on detecting a particular type of emotion in a text using corpora in other languages. The paper highlights further research directions in which Python toolkits (NLTK, Keras) could be used to improve model performance.
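As an illustration of the ranking use case mentioned in the abstract, the following minimal Python sketch orders messages by an estimated aggressiveness probability so that the most likely candidates reach a human reviewer first. The function name `rank_for_review`, the threshold value and the example scores are hypothetical; the probabilities are assumed to come from some already trained classifier, which is not shown here and is not the authors' system.

```python
from typing import List, Optional, Tuple


def rank_for_review(
    scored: List[Tuple[str, float]],
    threshold: float = 0.5,
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (message, probability) pairs at or above the threshold,
    sorted from the most to the least probably aggressive."""
    # Keep only messages whose estimated probability reaches the threshold.
    flagged = [(msg, prob) for msg, prob in scored if prob >= threshold]
    # Highest probability first, so reviewers see the likeliest cases early.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged if top_k is None else flagged[:top_k]


if __name__ == "__main__":
    # Hypothetical classifier outputs, for illustration only.
    example = [("message a", 0.2), ("message b", 0.9), ("message c", 0.6)]
    for msg, prob in rank_for_review(example):
        print(f"{prob:.2f}  {msg}")
```

The threshold trades reviewer workload against missed cases; any concrete value would need to be chosen on validation data.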


Chapter
The aim of our work is to develop software that can detect toxicity in the Russian segment of social media. In this paper, we investigate the problem of toxicity detection in Russian-language messages. We implemented a set of features using selected vector models, trained several classifiers on a dataset of about fourteen thousand annotated messages, and compared the results. The experiments were evaluated by accuracy, precision, and recall. The F1 measure reached 0.91 and accuracy 0.87.
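For readers unfamiliar with the metrics named in this abstract, the sketch below computes accuracy, precision, recall and F1 for a binary toxic / non-toxic task from lists of true and predicted labels. It uses the standard textbook definitions; the abstract does not say how the authors computed or averaged their scores, so the function name `classification_metrics`, the label convention (1 = toxic) and the toy data are assumptions for illustration only.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = toxic, 0 = not toxic)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts for the positive (toxic) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels, not taken from the paper.
    print(classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```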
Chapter
The article discusses the development of an online tool for moderating the content of social network groups. Classification by machine learning methods is proposed as the main element of the system. The feature set of a message is built from content features extracted from the text together with word embedding vectors. The authors conducted a series of experiments to find the best combination of vector representation, content features and classification method. Tests on a dataset of 11 thousand messages in Russian yielded 87% accuracy. The article also proposes an architecture for the group moderator’s web application, with the ability to automatically apply classification results to control users and display posts.
Chapter
The paper presents the distribution of pragmatic markers (PMs) of Russian everyday speech in two types of discourse: dialogic and monologic. PMs are an essential part of any oral discourse; therefore, quantitative data on their distribution are necessary for solving both theoretical and practical tasks related to studies of speech communication, as well as for translation and for teaching Russian as a foreign language. The article describes samples from two speech corpora: “One Speaker’s Day” (ORD corpus, consisting mostly of dialogue speech, with an annotated subcorpus of 321 504 tokens) and “Balanced Annotated Text Library” (SAT corpus, consisting only of monologues, with an annotated subcorpus of 50 128 tokens). It also presents statistical data on PM distributions obtained for 60 basic (invariant) markers. PMs common to both dialogue and monologue (for example, hesitation markers such as vot, tam, tak) are identified, as well as those more typical of monologues (boundary markers like znachit, nu vot, vs’o) or dialogues (‘xeno’-markers like takoj, grit, and meta-communicative markers vidish’, (ja) ne znaju). Special attention is paid to PM usage both in different communicative situations and in the speech of different sociolects.
Article
While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages and the information overload on the Web requires intelligent systems to identify potential risks automatically. The focus of this paper is on automatic cyberbullying detection in social media text by modelling posts written by bullies, victims, and bystanders of online bullying. We describe the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and perform a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection. We make use of linear support vector machines exploiting a rich feature set and investigate which information sources contribute the most for the task. Experiments on a hold-out test set reveal promising results for the detection of cyberbullying-related posts. After optimisation of the hyperparameters, the classifier yields an F1 score of 64% and 61% for English and Dutch respectively, and considerably outperforms baseline systems.
Conference Paper
Social media platforms allow users to share and discuss their opinions online. However, a minority of user posts is aggressive, thereby hinders respectful discussion, and, at an extreme level, is liable to prosecution. The automatic identification of such harmful posts is important, because it can support the costly manual moderation of online discussions. Further, the automation allows unprecedented analyses of discussion datasets that contain millions of posts. This system description paper presents our submission to the First Shared Task on Aggression Identification. We propose to augment the provided dataset to increase the number of labeled comments from 15,000 to 60,000. Thereby, we introduce linguistic variety into the dataset. As a consequence of the larger amount of training data, we are able to train a special deep neural net, which generalizes especially well to unseen data. To further boost the performance, we combine this neural net with three logistic regression classifiers trained on character and word n-grams, and hand-picked syntactic features. This ensemble is more robust than the individual single models. Our team named “Julian” achieves an F1-score of 60% on both English datasets, 63% on the Hindi Facebook dataset, and 38% on the Hindi Twitter dataset.
Article
The article presents a method of detecting prosodically prominent words, i.e. words that carry most of the information in the utterance. The method relies on lexical, grammatical and syntactic markers of prominence, and can be used in Text-to-Speech synthesis systems to make synthesized speech sound more natural. Three different classification methods were used: Naive Bayes, Maximum Entropy and Conditional Random Fields models. The results of the experiments show that discriminative models provide more balanced values of the performance metrics, while the generative model is potentially more useful for detecting prominent words in speech signal. The results of the study are comparable with the performances of similar systems developed for other languages, and in some cases surpass them.
Conference Paper
In recent years, bullying and aggression against social media users have grown significantly, causing serious consequences to victims of all demographics. Nowadays, cyberbullying affects more than half of young social media users worldwide, who suffer from prolonged and/or coordinated digital harassment. Also, tools and technologies geared to understand and mitigate it are scarce and mostly ineffective. In this paper, we present a principled and scalable approach to detect bullying and aggressive behavior on Twitter. We propose a robust methodology for extracting text, user, and network-based attributes, studying the properties of bullies and aggressors, and what features distinguish them from regular users. We find that bullies post less, participate in fewer online communities, and are less popular than normal users. Aggressors are relatively popular and tend to include more negativity in their posts. We evaluate our methodology using a corpus of 1.6M tweets posted over 3 months, and show that machine learning classification algorithms can accurately detect users exhibiting bullying and aggressive behavior, with over 90% AUC.
Article
The article presents a new approach to text classification that takes into account the existence of different types of classification features (binary, nominal, ordinal and interval). The distinctive feature of the approach is a phased classification process, which makes it unnecessary to bring the different types of classification features to a single range. The author describes a computational experiment using texts from the Russian National Corpus and suggests a set of classification features for classifying Russian texts by the age of their intended readers. The text documents in the sample are divided into two categories, for adults and for children, according to the views of experts.
Conference Paper
We describe EmoBank, a corpus of 10k English sentences balancing multiple genres, which we annotated with dimensional emotion metadata in the Valence-Arousal-Dominance (VAD) representation format. EmoBank excels with a bi-perspectival and bi-representational design. On the one hand, we distinguish between writer's and reader's emotions, on the other hand, a subset of the corpus complements dimensional VAD annotations with categorical ones based on Basic Emotions. We find evidence for the supremacy of the reader's perspective in terms of IAA and rating intensity, and achieve close-to-human performance when mapping between dimensional and categorical formats.
Conference Paper
The detection of aggressive behavior in online discussion communities is of great interest, due to the large number of users, especially of young age, who are frequently exposed to such behaviors in social networks. Research on cyberbullying prevention focuses on the detection of potentially harmful messages and the development of intelligent systems for the identification of verbal aggressiveness expressed with insults and threats. Text mining techniques are among the most promising tools used so far in the field of aggressive sentiments detection in short texts, such as comments, reviews, tweets etc. This article presents a novel approach which employs sentiment analysis at message level, but considers the whole communication thread (i.e., users discussions) as the context of the aggressive behavior. The suggested approach is able to detect aggressive, inappropriate or antisocial behavior, under the prism of the discussion context. Key aspects of the approach are the monitoring and analysis of the most recently published comments, and the application of text classification techniques for detecting whether an aggressive action actually emerges in a discussion thread. Thorough experimental validation of the suggested approach in a dataset for cyberbullying detection tasks demonstrates its applicability and advantages compared to other approaches.
Article
Sentiment analysis is one of the fastest growing research areas in computer science, making it challenging to keep track of all the activities in the area. We present a computer-assisted literature review, where we utilize both text mining and qualitative coding, and analyze 6,996 papers from Scopus. We find that the roots of sentiment analysis are in the studies on public opinion analysis at the beginning of the 20th century and in the text subjectivity analysis performed by the computational linguistics community in the 1990s. However, the outbreak of computer-based sentiment analysis only occurred with the availability of subjective texts on the Web. Consequently, 99% of the papers have been published after 2004. Sentiment analysis papers are scattered across multiple publication venues, and the combined number of papers in the top-15 venues represents only ca. 30% of the papers in total. We present the top-20 cited papers from Google Scholar and Scopus and a taxonomy of research topics. In recent years, sentiment analysis has shifted from analyzing online product reviews to social media texts from Twitter and Facebook. Many topics beyond product reviews like stock markets, elections, disasters, medicine, software development and cyberbullying extend the utilization of sentiment analysis.
Article
Sentiment Analysis (SA) is an ongoing field of research within text mining. SA is the computational treatment of opinions, sentiments and subjectivity of text. This survey paper gives a comprehensive overview of the latest updates in this field. Many recently proposed algorithm enhancements and various SA applications are investigated and presented briefly in this survey. These articles are categorized according to their contributions to the various SA techniques. The fields related to SA (transfer learning, emotion detection, and building resources) that have recently attracted researchers are discussed. The main goal of this survey is to give a nearly complete picture of SA techniques and the related fields in brief detail. The main contributions of this paper include the sophisticated categorization of a large number of recent articles and the illustration of recent research trends in sentiment analysis and its related areas.
Article
E-communication represents a major threat to users who are exposed to a number of risks and potential attacks. Detecting these risks with as much anticipation as possible is crucial for prevention. However, much research so far has focused on forensic tools that can be applied only when an attack has been performed. This paper proposes a novel and effective methodology for the early detection of threats in written social media. The goal is to recognize a potential attack before it is consummated, and using a minimum amount of information. The proposed approach considers the use of profile-based representations (PBRs) for this goal. PBRs have multiple benefits, including non-sparsity, low dimensionality, and a proved discriminative power. Moreover, representations for partial documents can be derived naturally with PBRs, which makes them suitable for the addressed problem. Results include empirical evidence on the usefulness of PBRs in the early recognition setting for two tasks in which anticipation is critical: sexual predator detection and aggressive text identification. These results reveal, on the one hand, that PBRs achieve state of the art performance when using full-length documents (i.e., the classical task), and, on the other hand, that the proposed methodology outperforms previous work on early recognition of sexual predators by a considerable margin, while obtaining state of the art performance in aggressive text identification. To the best of our knowledge, these are the best results reported on early recognition for the approached problems. We foresee this work will pave the way for the development of novel methodologies for the problem and will motivate further research from the intelligent systems and text mining communities.
Article
As a first step towards modelling the emotional state of a person, we build sentiment analysis models with existing deep neural network algorithms and compare the models with psychological measurements to shed light on their relationship. In the experiments, we first examined the psychological state of 64 participants and asked them to summarize the story of a book, Chronicle of a Death Foretold (Marquez, 1981). Secondly, we trained models on 365,802 crawled movie reviews; we then evaluated the participants' summaries using the pretrained model, in the spirit of transfer learning. Given that emotion affects memory, we investigated the relationship between the scores that the computational models assigned to the summaries and the psychological measurements. The results show that although CNN performed best among the deep neural network algorithms (LSTM, GRU), its results are not related to the psychological state. Rather, GRU shows more explainable results depending on the psychological state. The contributions of this paper can be summarized as follows: (1) we shed light on the relationship between computational models and psychological measurements; (2) we suggest this framework as an objective method for evaluating emotion, that is, the real sentiment of a person.
Article
This paper discusses the problems of applying and choosing cryptographic standards taking into account user requirements and preferences. User profiles are created by means of the ontology apparatus. On the basis of user profiles and document features an appropriate set of documents is formed, the elements of which are then arranged according to the degree of compliance with user requirements. Various filtering methods, such as collaborative filtering, content analysis and filtering, as well as hybrid methods combining both approaches, are used. Thus, a recommender system for choosing cryptographic standards and algorithms is built. If there are several user selection criteria, it is reasonable to apply an integral index of the object’s relevance to user preferences. This index is defined as the weighted sum of the particular indices.
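The integral index described in this abstract, a weighted sum of per-criterion indices, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `integral_relevance` and `rank_by_relevance`, the item names, the weights and the index values are all hypothetical, and the abstract does not specify how the authors scale or normalize either the indices or the weights.

```python
from typing import Dict, List, Sequence


def integral_relevance(indices: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the particular (per-criterion) relevance indices."""
    if len(indices) != len(weights):
        raise ValueError("indices and weights must have the same length")
    return sum(w * x for w, x in zip(weights, indices))


def rank_by_relevance(items: Dict[str, Sequence[float]], weights: Sequence[float]) -> List[str]:
    """Order item names by decreasing integral relevance."""
    return sorted(items, key=lambda name: integral_relevance(items[name], weights), reverse=True)


if __name__ == "__main__":
    # Hypothetical items scored on two hypothetical criteria.
    candidates = {"standard_A": [0.9, 0.2], "standard_B": [0.5, 0.8]}
    print(rank_by_relevance(candidates, weights=[0.5, 0.5]))
```

Changing the weights shifts the ranking towards the criteria a given user cares about most.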
Article
Sentiment Analysis (SA), also called Opinion Mining, is currently one of the most studied research fields. It aims to analyze people’s sentiments, opinions, attitudes, emotions, etc., towards elements such as topics, products, individuals, organizations, and services. Different techniques and software tools are being developed to carry out Sentiment Analysis. The goal of this work is to review and compare some free access web services, analyzing their capabilities to classify and score different pieces of text with respect to the sentiments contained therein. For that purpose, three well-known collections have been used to perform several experiments whose results are shown and commented upon, leading to some interesting conclusions about the capabilities of each analyzed tool.
Article
The research described in this work focuses on identifying key components for the task of irony detection. By means of analyzing a set of customer reviews, which are considered ironic both in social and mass media, we try to find hints about how to deal with this task from a computational point of view. Our objective is to gather a set of discriminating elements to represent irony, in particular, the kind of irony expressed in such reviews. To this end, we built a freely available data set with ironic reviews collected from Amazon. Such reviews were posted on the basis of an online viral effect; i.e. contents that trigger a chain reaction in people. The findings were assessed employing three classifiers. Initial results are largely positive, and provide valuable insights into the subjective issues of language facing tasks such as sentiment analysis, opinion mining and decision making.
Article
The general psychoevolutionary theory of emotion that is presented here has a number of important characteristics. First, it provides a broad evolutionary foundation for conceptualizing the domain of emotion as seen in animals and humans. Second, it provides a structural model which describes the interrelations among emotions. Third, it has demonstrated both theoretical and empirical relations among a number of derivative domains including personality traits, diagnoses, and ego defenses. Fourth, it has provided a theoretical rationale for the construction of tests and scales for the measurement of key dimensions within these various domains. Fifth, it has stimulated a good deal of empirical research using these tools and concepts. Finally, the theory provides useful insights into the relationships among emotions, adaptations, and evolution.
Article
Emotions are viewed as having evolved through their adaptive value in dealing with fundamental life-tasks. Each emotion has unique features: signal, physiology, and antecedent events. Each emotion also has characteristics in common with other emotions: rapid onset, short duration, unbidden occurrence, automatic appraisal, and coherence among responses. These shared and unique characteristics are the product of our evolution, and distinguish emotions from other affective phenomena.
Tommasel, A., Rodriguez, J.M., Godoy, D.: Textual aggression detection through deep learning. In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying, TRAC-2018, pp. 177-187 (2018)
Bostan, L.A.M., Klinger, R.: An analysis of annotated corpora for emotion classification in text. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2104-2119 (2018)
Golem, V., Karan, M., Šnajder, J.: Combining shallow and deep learning for aggressive text detection. In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying, TRAC-2018, pp. 188-198 (2018)
Smirnov, I.V., Shelmanov, A.O., Kuznecova, E.S., Hramoin, I.V.: Semantiko-sintaksicheskij analiz estestvennykh yazykov. Chast' II. Metod semantiko-sintaksicheskogo analiza tekstov (Semantic-syntactic analysis of natural languages. Part II. Method of semantic-syntactic analysis of texts). Iskusstvennyj intellekt i prinyatie reshenij, vol. 1, pp. 11-24. ISA RAS, Moscow (2014)
Yussupova, N., Bogdanova, D., Boyko, M.: Applying of sentiment analysis for texts in Russian based on machine learning approach. In: IMMM 2012: The Second International Conference on Advances in Information Mining and Management, pp. 8-14 (2012)
Levonevskii, D., Shumskaya, O., Velichko, Uzdyaev, M., Malov, D.: Methods for determination of psychophysiological condition of user within smart environment based on complex analysis of heterogeneous data. Paper presented at the 14th International Conference on Electromechanics and Robotics "Zavalishin's Readings", ER(ZR)-2019 (2019)
Rubtsova, Y.: Constructing a corpus for sentiment classification training. Softw. Syst. 1(109), 72-79 (2015)
An approach to text classification based on age groups of addressees
  • A V Glazkova