Article

Pinpointing Fine-Grained Relationships between Hateful Tweets and Replies

Authors: Abdullah Albanyan and Eduardo Blanco

Abstract

Recent studies in the hate and counter hate domain have provided the grounds for investigating how to detect this pervasive content in social media. These studies mostly work with synthetic replies to hateful content written by annotators on demand rather than replies written by real users. We argue that working with naturally occurring replies to hateful content is key to studying the problem. Building on this motivation, we create a corpus of 5,652 hateful tweets and replies. We analyze their fine-grained relationships by indicating whether the reply (a) is hate or counter hate speech, (b) provides a justification, (c) attacks the author of the tweet, and (d) adds additional hate. We also present linguistic insights into the language people use depending on these fine-grained relationships. Experimental results show improvements from (a) taking into account the hateful tweet in addition to the reply and (b) pretraining with related tasks.
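As a rough illustration of the pair-input setup implied by finding (a), a pretrained transformer can encode the hateful tweet and the reply jointly. This is a sketch under assumptions, not the authors' released code: the encoder name and the binary label set below are placeholders.

# Illustrative sketch only: classify a (hateful tweet, reply) pair.
# The model name and label set are assumptions, not the paper's.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

tweet = "example hateful tweet"
reply = "example reply to that tweet"

# Encoding the two texts as a sentence pair lets the model attend across both,
# i.e., it "takes into account the hateful tweet in addition to the reply".
inputs = tokenizer(tweet, reply, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prediction = logits.argmax(dim=-1).item()  # e.g., 0 = not counter hate, 1 = counter hate

A classifier like this would still need fine-tuning on the labeled tweet-reply pairs; first adapting the same encoder on related (counter) hate tasks corresponds to finding (b).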


... Recently, the NLP community has contributed corpora for the detection (Mathew et al., 2019; He et al., 2021; Albanyan and Blanco, 2022; Yu et al., 2022) and categorization (Mathew et al., 2019; Goffredo et al., 2022) of counterspeech. ...
... Claim is an assertion put forward (Toulmin et al., 1979) that shows disagreement with the hateful comment by rebutting its premise, evidence, or conclusion, or targeting the reasoning between them (Walton, 2009) without providing any justifications or reasons for supporting the claim. Justification refers to language that provides one or more justifications as evidence or reason to oppose the hateful content (Albanyan and Blanco, 2022). For example, the following reply is justified: "This is racist. ...
... Past work has found that lexicon-based features can differentiate Counter replies from Non-counter replies to hateful comments (Mathew et al., 2019; Albanyan and Blanco, 2022). In our study, we explore whether linguistic and lexicon-based features can distinguish Counter Author from Counter Content. ...
... Over the past decade, a wide range of abusive language detection approaches have been developed (Alrashidi et al., 2022; Ejaz et al., 2024; Gandhi et al., 2024; Haidar et al., 2016; Iwendi et al., 2023; Mansur et al., 2023; Poletto et al., 2021; Salawu et al., 2017; Yin and Zubiaga, 2021), accompanied by labeled datasets needed for training machine learning detection models (Albanyan et al., 2023; Albanyan and Blanco, 2022; Almohaimeed et al., 2023; Kennedy et al., 2022; Toraman et al., 2022; Vidgen et al., 2021a). Abusive language manifests in various forms, from explicit threatening hate speech to more subtle forms of verbal aggression. ...
... "Supportive" language refers to positive comments in support of identifiable communities of people. In some datasets, "Counter Speech" (Vidgen et al., 2021a) or "Counter Hate" (Albanyan et al., 2023;Albanyan and Blanco, 2022) labels are introduced that distinguish a form of communication that challenges, condemns, or calls out abusive language. As a result, text that contains hateful content directed at the author of the original hate speech may be labeled as "Counter Speech" or "Counter Hate". ...
Preprint
Full-text available
The proliferation of abusive language in online communications has posed significant risks to the health and wellbeing of individuals and communities. The growing concern regarding online abuse and its consequences necessitates methods for identifying and mitigating harmful content and facilitating continuous monitoring, moderation, and early intervention. This paper presents a taxonomy for distinguishing key characteristics of abusive language within online text. Our approach uses a systematic method for taxonomy development, integrating classification systems of 18 existing multi-label datasets to capture key characteristics relevant to online abusive language classification. The resulting taxonomy is hierarchical and faceted, comprising 5 categories and 17 dimensions. It classifies various facets of online abuse, including context, target, intensity, directness, and theme of abuse. This shared understanding can lead to more cohesive efforts, facilitate knowledge exchange, and accelerate progress in the field of online abuse detection and mitigation among researchers, policy makers, online platform owners, and other stakeholders.
... In order to achieve such methods, understanding online hate and predicting hate crimes in their various manifestations are very important [7]. Therefore, to improve hate detection, hate speech datasets [8-13] have been developed over the years; these mostly focus on hate speech attributes closely related to hate expressions (e.g., abusive comment, hate speech, offensive, and disrespectful content), the target of hate (e.g., origin, gender, religion, and sexual orientation), and retweets (now termed posts), and such resources are made available for the purpose of progressive research and development. ...
... Comprehensive studies [4, 28-30] have documented these effects, emphasizing the urgent need for effective detection mechanisms. Current research primarily focuses on developing computational models to detect explicit and implicit hate speech [11-13]. However, traditional methods often overlook nuanced expressions such as sarcasm and indirect speech that can mask hateful intentions [31-33], a critical gap highlighted in the literature. ...
Article
Full-text available
The growing number of social media users has contributed to the rise in hate comments and posts. While extensive research in hate speech detection attempts to combat this phenomenon by developing new datasets and detection models, reconciling classification accuracy with broader decision-making metrics like plausibility and faithfulness remains challenging. As restrictions on social media tighten to stop the spread of hate and offensive content, users have adapted by finding new approaches, often camouflaged in the form of sarcasm. Therefore, dealing with new trends such as the increased use of emoticons (negative emoticons in positive sentences) and sarcastic comments is necessary. This paper introduces sarcasm-based rationale (emoticons or portions of text that indicate sarcasm) combined with hate/offensive rationale for better detection of hidden hate comments/posts. A dataset was created by labeling texts and selecting rationales based on sarcasm from the existing benchmark hate dataset, HateXplain. The newly formed dataset was then applied to an existing state-of-the-art model. The model’s F1-score increased by 0.01 when the sarcasm rationale was combined with the hate/offensive rationale in a new attention mechanism introduced during preprocessing. Also, with the new data, a significant improvement was observed in explainability metrics such as plausibility and faithfulness.
... Before building our own dataset, we surveyed a variety of toxic datasets available for testing LLMs. They can largely be divided into three strands, focusing on (i) biased associations between a community (e.g., women) and semantic assignments (e.g., household) (e.g., Dhamala et al., 2021; Gehman et al., 2020; Parrish et al., 2021), (ii) online posts that are self-explanatory without the need for extra context (e.g., "this b**ch think she in I Am Legend LMAOOO"; Albanyan and Blanco, 2022; Albanyan et al., 2023; Toraman et al., 2022; Wijesiriwardene et al., 2020), or (iii) machine-generated responses to toxicity-induced instructions (e.g., Hartvigsen et al., 2022; Wen et al., 2023). While these datasets have contributed invaluably to the advancement of toxicity detection techniques, LLMs' success rate on them has also increased rapidly. ...
Preprint
Full-text available
The rapid development of large language models (LLMs) gives rise to ethical concerns about their performance, while opening new avenues for developing toxic language detection techniques. However, LLMs' unethical output and their capability of detecting toxicity have primarily been tested on language data that do not demand complex meaning inference, such as the biased associations of 'he' with programmer and 'she' with household. Nowadays, toxic language adopts a much more creative range of implicit forms, thanks to advanced censorship. In this study, we collect authentic toxic interactions that evade online censorship and that are verified by human annotators as inference-intensive. To evaluate and improve LLMs' reasoning about authentic implicit toxic language, we propose a new prompting method, Pragmatic Inference Chain (PIC), drawing on interdisciplinary findings from cognitive science and linguistics. PIC prompting significantly improves the success rate of GPT-4o, Llama-3.1-70B-Instruct, and DeepSeek-v2.5 in identifying implicit toxic language, compared to both direct prompting and Chain-of-Thought. In addition, it also helps the models produce more explicit and coherent reasoning processes, and hence can potentially be generalized to other inference-intensive tasks, e.g., understanding humour and metaphors.
Article
Full-text available
Social media users, including organizations, often struggle to acquire the maximum number of responses from other users, but predicting the responses that a post will receive before publication is highly desirable. Previous studies have analyzed why a given tweet may become more popular than others, and have used a variety of models trained to predict the response that a given tweet will receive. The present research addresses the prediction of response measures available on Twitter, including likes, replies and retweets. Data from a single publisher, the official US Navy Twitter account, were used to develop a feature-based model derived from structured tweet-related data. Most importantly, a deep learning feature extraction approach for analyzing unstructured tweet text was applied. A classification task with three classes, representing low, moderate and high responses to tweets, was defined and addressed using four machine learning classifiers. All proposed models were symmetrically trained in a fivefold cross-validation regime using various feature configurations, which allowed for the methodically sound comparison of prediction approaches. The best models achieved F1 scores of 0.655. Our study also used SHapley Additive exPlanations (SHAP) to demonstrate limitations in the research on explainable AI methods involving Deep Learning Language Modeling in NLP. We conclude that model performance can be significantly improved by leveraging additional information from the images and links included in tweets.
Article
Full-text available
This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks—but that their use makes the interpretation of the value of the coefficient even harder.
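All of the chance-corrected coefficients discussed here share the same skeleton; Cohen's kappa, for example, is kappa = (A_o - A_e) / (1 - A_e), where A_o is the observed agreement between two annotators and A_e is the agreement expected by chance, estimated from each annotator's own label distribution. Scott's pi and Krippendorff's alpha differ mainly in how the expected (dis)agreement is estimated, and alpha additionally accommodates weighted distances between categories.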
Conference Paper
Full-text available
We present an approach to detecting hate speech in online text, where hate speech is defined as abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation. While hate speech against any group may exhibit some common characteristics, we have observed that hatred against each different group is typically characterized by the use of a small set of high-frequency stereotypical words; however, such words may be used in either a positive or a negative sense, making our task similar to that of word sense disambiguation. In this paper we describe our definition of hate speech, the collection and annotation of our hate speech corpus, and a mechanism for detecting some commonly used methods of evading common "dirty word" filters. We describe pilot classification experiments in which we classify anti-semitic speech, reaching an accuracy of 94%, precision of 68%, and recall of 60%, for an F1 measure of 0.6375.
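For reference, the reported F1 is simply the harmonic mean of the reported precision and recall: F1 = 2PR / (P + R) = (2 x 0.68 x 0.60) / (0.68 + 0.60) ≈ 0.6375.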
Conference Paper
Full-text available
What makes a tweet worth sharing? We study the content of tweets to uncover linguistic tendencies of shared microblog posts (retweets), by examining surface linguistic features, deeper parse-based features and Twitter-specific conventions in tweet content. We show how these features correlate with a functional classification of tweets, thereby categorizing people's writing styles based on their different intentions on Twitter. We find that both linguistic features and functional classification contribute to re-tweeting. Our work shows that opinion tweets favor originality and pithiness and that update tweets favor direct statements of a tweeter's current activity. Judicious use of #hashtags also helps to encourage retweeting.
Article
Full-text available
Even though considerable attention has been given to the polarity of words (positive and negative) and the creation of large polarity lexicons, research in emotion analysis has had to rely on limited and small emotion lexicons. In this paper we show how the combined strength and wisdom of the crowds can be used to generate a large, high-quality, word-emotion and word-polarity association lexicon quickly and inexpensively. We enumerate the challenges in emotion annotation in a crowdsourcing scenario and propose solutions to address them. Most notably, in addition to questions about emotions associated with terms, we show how the inclusion of a word choice question can discourage malicious data entry, help identify instances where the annotator may not be familiar with the target term (allowing us to reject such annotations), and help obtain annotations at sense level (rather than at word level). We conducted experiments on how to formulate the emotion-annotation questions, and show that asking if a term is associated with an emotion leads to markedly higher inter-annotator agreement than that obtained by asking if a term evokes an emotion.
Article
In recent years online social networks have suffered an increase in sexism, racism, and other types of aggressive and cyberbullying behavior, often manifesting itself through offensive, abusive, or hateful language. Past scientific work focused on studying these forms of abusive activity in popular online social networks, such as Facebook and Twitter. Building on such work, we present an eight month study of the various forms of abusive behavior on Twitter, in a holistic fashion. Departing from past work, we examine a wide variety of labeling schemes, which cover different forms of abusive behavior. We propose an incremental and iterative methodology that leverages the power of crowdsourcing to annotate a large collection of tweets with a set of abuse-related labels. By applying our methodology and performing statistical analysis for label merging or elimination, we identify a reduced but robust set of labels to characterize abuse-related tweets. Finally, we offer a characterization of our annotated dataset of 80 thousand tweets, which we make publicly available for further scientific exploration.
Article
Should hate speech be banned? This article contends that the debate on this question must be disaggregated into discrete analytical stages, lest its participants continue to talk past one another. The first concerns the scope of the moral right to freedom of expression, and whether hate speech falls within the right's protective ambit. If it does, hate speech bans are necessarily unjust. If not, we turn to the second stage, which assesses whether speakers have moral duties to refrain from hate speech. The article canvasses several possible duties from which such a duty could be derived, including duties not to threaten, harass, offend, defame, or incite. If there is a duty to refrain from hate speech, it is yet a further question whether the duty should actually be enforced. This third stage depends on pragmatic concerns involving epistemic fallibility, the abuse of state power, and the benefits of counter-speech over coercion.
Article
The scientific study of hate speech, from a computer science point of view, is recent. This survey organizes and describes the current state of the field, providing a structured overview of previous approaches, including core algorithms, methods, and main features used. This work also discusses the complexity of the concept of hate speech, defined in many platforms and contexts, and provides a unifying definition. This area has an unquestionable potential for societal impact, particularly in online communities and digital media platforms. The development and systematization of shared resources, such as guidelines, annotated datasets in multiple languages, and algorithms, is a crucial step in advancing the automatic detection of hate speech.
Conference Paper
The damage personal attacks cause to online discourse motivates many platforms to try to curb the phenomenon. However, understanding the prevalence and impact of personal attacks in online platforms at scale remains surprisingly difficult. The contribution of this paper is to develop and illustrate a method that combines crowdsourcing and machine learning to analyze personal attacks at scale. We show an evaluation method for a classifier in terms of the aggregated number of crowd-workers it can approximate. We apply our methodology to English Wikipedia, generating a corpus of over 100k high quality human-labeled comments and 63M machine-labeled ones from a classifier that is as good as the aggregate of 3 crowd-workers, as measured by the area under the ROC curve and Spearman correlation. Using this corpus of machine-labeled scores, our methodology allows us to explore some of the open questions about the nature of online personal attacks. This reveals that the majority of personal attacks on Wikipedia are not the result of a few malicious users, nor primarily the consequence of allowing anonymous contributions from unregistered users.
Conference Paper
We address the problem of hate speech detection in online user comments. Hate speech, defined as an "abusive speech targeting specific group characteristics, such as ethnicity, religion, or gender", is an important problem plaguing websites that allow users to leave feedback, having a negative impact on their online business and overall user experience. We propose to learn distributed low-dimensional representations of comments using recently proposed neural language models, that can then be fed as inputs to a classification algorithm. Our approach addresses issues of high-dimensionality and sparsity that impact the current state-of-the-art, resulting in highly efficient and effective hate speech detectors.
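A minimal sketch of this embed-then-classify pipeline, assuming gensim's Doc2Vec (an implementation in the paragraph2vec family) and a scikit-learn classifier as stand-ins for the specific models used in the paper:

# Sketch only: learn low-dimensional comment representations, then feed
# them to a downstream classifier, mirroring the setup described above.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

comments = ["example hateful comment", "example harmless comment"]  # placeholder data
labels = [1, 0]                                                     # 1 = hate, 0 = clean

docs = [TaggedDocument(c.lower().split(), [i]) for i, c in enumerate(comments)]
embedder = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)

X = [embedder.infer_vector(c.lower().split()) for c in comments]
classifier = LogisticRegression(max_iter=1000).fit(X, labels)
print(classifier.predict([embedder.infer_vector("another example comment".split())]))

The low-dimensional dense vectors sidestep the high-dimensionality and sparsity issues of bag-of-words features that the paper highlights.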
Article
The use of “Big Data” in policy and decision making is a current topic of debate. The 2013 murder of Drummer Lee Rigby in Woolwich, London, UK led to an extensive public reaction on social media, providing the opportunity to study the spread of online hate speech (cyber hate) on Twitter. Human annotated Twitter data was collected in the immediate aftermath of Rigby's murder to train and test a supervised machine learning text classifier that distinguishes between hateful and/or antagonistic responses with a focus on race, ethnicity, or religion; and more general responses. Classification features were derived from the content of each tweet, including grammatical dependencies between words to recognize “othering” phrases, incitement to respond with antagonistic action, and claims of well-founded or justified discrimination against social groups. The results of the classifier were optimal using a combination of probabilistic, rule-based, and spatial-based classifiers with a voted ensemble meta-classifier. We demonstrate how the results of the classifier can be robustly utilized in a statistical model used to forecast the likely spread of cyber hate in a sample of Twitter data. The applications to policy and decision making are discussed.
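Very loosely, the voted ensemble meta-classifier described here can be sketched with scikit-learn; the base learners below are generic stand-ins, not the paper's actual probabilistic, rule-based, and spatial components:

# Loose illustration of a voted ensemble over n-gram features;
# the base learners are placeholders, not the paper's classifiers.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

ensemble = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),                      # probabilistic member
            ("lr", LogisticRegression(max_iter=1000)),    # linear member
            ("rf", RandomForestClassifier()),             # tree-based member
        ],
        voting="hard",  # majority vote across the members
    ),
)
# Usage: ensemble.fit(train_texts, train_labels); ensemble.predict(test_texts)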
Article
Two formulas are presented for judging the significance of the difference between correlated proportions. The chi square equivalent of one of the developed formulas is pointed out.
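The chi-square form referred to here, with b and c the counts of the two kinds of discordant pairs, is chi^2 = (b - c)^2 / (b + c) with one degree of freedom; the commonly used continuity-corrected version is chi^2 = (|b - c| - 1)^2 / (b + c).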
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Cécillon, N.; Labatut, V.; Dufour, R.; and Linarès, G. 2020. WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection. In Proceedings of the 12th Language Resources and Evaluation Conference. ISBN 979-10-95546-34-4.
Davidson, T.; Warmsley, D.; Macy, M.; and Weber, I. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. Proceedings of the International AAAI Conference on Web and Social Media, 11(1).
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Fernandes, C. M.; Dolenec, M.; Boche, T.; and Silva, V. 2014. Final Report on Online Hate Speech. ELSA International.
Fortuna, P.; Soler, J.; and Wanner, L. 2020. Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets. In Proceedings of the 12th Language Resources and Evaluation Conference, 6786-6794. ISBN 979-10-95546-34-4.
Gagliardone, I.; Gal, D.; Alves, T.; and Martinez, G. 2015. Countering online hate speech. Unesco Publishing.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
Mathew, B.; Kumar, N.; Goyal, P.; and Mukherjee, A. 2020. Interaction Dynamics between Hate and Counter Users on Twitter. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, CoDS COMAD 2020, 116-124. ISBN 9781450377386.
Matsumoto, K.; Hada, Y.; Yoshida, M.; and Kita, K. 2019. Analysis of Reply-Tweets for Buzz Tweet Detection. In Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation (PACLIC 33), Hakodate, Japan, 13-15.
Morante, R.; and Daelemans, W. 2012. ConanDoyle-neg: Annotation of negation cues and their scope in Conan Doyle stories. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 1563-1568.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, 8026-8037. Curran Associates, Inc.
Seetharaman, D. 2018. Facebook Throws More Money at Wiping Out Hate Speech and Bad Actors. The Wall Street Journal.
van Rosendaal, J.; Caselli, T.; and Nissim, M. 2020. Lower Bias, Higher Density Abusive Language Datasets: A Recipe. In Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language, 14-19. ISBN 979-10-95546-49-8.
Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In 2015 IEEE International Conference on Computer Vision (ICCV).