Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), pages 551–555
Minneapolis, Minnesota, USA, June 6–7, 2019. ©2019 Association for Computational Linguistics
BNU-HKBU UIC NLP Team 2 at SemEval-2019 Task 6: Detecting
Offensive Language Using BERT model
Zhenghao Wu Hao Zheng Jianming Wang Weifeng Su Jefferson Fong
Computer Science and Technology, Division of Science and Technology
BNU-HKBU United International College
Zhuhai, Guangdong, China
Abstract

In this study we deal with the problem of identifying and categorizing offensive language in social media. Our group, BNU-HKBU UIC NLP Team 2, uses supervised classification along with multiple versions of the data generated by different pre-processing methods. We then use the state-of-the-art model Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al. (2018)), to capture linguistic, syntactic and semantic features. Long-range dependencies between the parts of a sentence can be captured by BERT's bidirectional encoder representations. Our results show 85.12% accuracy and 80.57% macro F1 score in Subtask A (offensive language identification), 87.92% accuracy and 50% macro F1 score in Subtask B (categorization of offense types), and 69.95% accuracy and 50.47% macro F1 score in Subtask C (offense target identification). Analysis of the results shows that distinguishing between targeted and untargeted offensive language is not a simple task. More work needs to be done on the unbalanced data problem in Subtasks B and C. Some future work is also discussed.
1 Introduction
Social media is an essential part of human communication today. People can share their opinions on these platforms anonymously. Some people use offensive language and hate speech casually and frequently without taking any responsibility for their behavior. For this reason, SemEval 2019 (Zampieri et al. (2019b)) set up the OffensEval task: identifying and categorizing offensive language in social media. The task is divided into three subtasks: offensive language identification, automatic categorization of offense types, and offense target identification.
Our group uses the latest Natural Language Processing (NLP) model, Bidirectional Encoder Representations from Transformers (BERT). It is a general-purpose "language understanding" model trained on a large text corpus such as Wikipedia (Devlin et al. (2018)). After fine-tuning, the model can be used for downstream NLP tasks. Because BERT is very complex and is the state-of-the-art model, it is prudent for us not to change its internal structure. Hence, we focus on preprocessing the data and on error analysis.

After much experimentation with the data, such as translating emoji into words, putting more weight on some metaphorical words, and removing hashtags, we find that using the original data gives the best performance. The likely reason is that removing information from a sentence also removes features that affect the prediction. So we end up using the original data to train our model.
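The preprocessing variants we experimented with can be sketched as follows. This is a minimal illustration, not our exact pipeline: `preprocess` is a hypothetical helper name, and the precise symbol-stripping rules we used may have differed.

```python
import re

def preprocess(tweet, remove_tags=True, lowercase=True):
    # The OLID data already replaces user mentions and links with the
    # placeholders @USER and URL, so "removing tags" here means deleting
    # those placeholders along with stray symbols.
    if remove_tags:
        tweet = re.sub(r"@USER|URL", "", tweet)
        tweet = re.sub(r"[^\w\s']", "", tweet)  # drop remaining symbols
    if lowercase:
        tweet = tweet.lower()
    return re.sub(r"\s+", " ", tweet).strip()  # collapse whitespace

print(preprocess("@USER Crazy Russian dude owns all your data URL"))
# → crazy russian dude owns all your data
```

In the end we pass the untouched tweets to BERT, since every such variant lost information that the model could otherwise exploit.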
2 Related Work
Much research has been done on detecting offensive language, aggression, and hate speech in user-generated content. In recent years, this research has tended to follow several approaches: use a simple model such as logistic regression, use a neural network model, or use some other method.

For the simple-model approach, Davidson et al. (2017) used a sentiment lexicon designed for social media to assign sentiment scores to each tweet, an effective way to identify potentially offensive terms. They then used logistic regression with L2 regularization to detect hate speech in social networks.

Neural network models use n-grams, skip-grams or other methods to extract features from the data. These features are used to train different models, whose outputs in turn serve as input for training a meta-classifier (e.g. Malmasi and Zampieri (2018)).

Among other methods, bag-of-words is an effective way to detect hate speech, but it has difficulty distinguishing hate speech from text that contains offensive words without being hate speech (Kwok and Wang (2013)). For identifying the targets and intensity of hate speech, syntactic features work well (Burnap and Williams (2015)).
3 Methodology and Data
Only the training data provided by the organizers (Zampieri et al. (2019a)) are used to train our model. The data contain 13,240 tweets that had been desensitized (user names and website URLs were replaced with placeholders). Each tweet carries crowdsourced labels for the three subtasks; the gold labels were confirmed by three annotators. We split the provided data into 90% for the training set, 5% for the cross-validation set, and 5% for the test set.
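The 90/5/5 split of the 13,240 tweets can be sketched as below. The shuffling seed and integer-arithmetic rounding are our assumptions; the paper does not state how the split was drawn.

```python
import random

def split_dataset(examples, seed=42):
    # Shuffle once, then cut into 90% train / 5% validation / 5% test.
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    n_train = len(examples) * 90 // 100
    n_val = len(examples) * 5 // 100
    return (examples[:n_train],
            examples[n_train:n_train + n_val],
            examples[n_train + n_val:])

train, val, test = split_dataset(range(13240))
print(len(train), len(val), len(test))  # → 11916 662 662
```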
Because some offensive language is subtle, less ham-fisted, and sometimes crosses sentence boundaries, a model trained for this task must make full use of the whole sentence in order to extract the linguistic, syntactic and semantic features that support a deeper understanding of the sentence, while at the same time being less affected by the noisiness of the text. We therefore use BERT in all three subtasks. Unlike most other methods, BERT uses bidirectional representations that exploit both the left and the right context, gaining a deeper understanding of a sentence by capturing long-range dependencies between its parts.
The uncased base version of the pre-trained model files (1) is used during the entire training. The training data are processed in several ways to fine-tune the model. Processing methods include removing all username tags, URL tags and symbols, converting all text to lowercase, and translating emoji into text (2). One or more of the above methods is selected to process the training data, and the processed data are then used to train the model.

(1) BERT-Base, Uncased: https://storage.

In Subtask A, the accuracy after the various operations is shown in the following table.
Preprocessing Accuracy
Original Data 0.8184
Remove tag & symbols 0.8126
Emoji translation v1 0.8081
Emoji translation v2 0.7960
Table 1: Training results for Sub-task A.
After all attempts, the best performing model
for Subtask A is the model trained by the original
data. Therefore, the original data are also used in
the training of the Subtasks B and C models.
4 Results
For Subtask A, The BERT-Base, Uncased, orig-
inal training data model get macro F1 score of
0.8057 and total accuracy of 0.8512.
For Subtask B, The BERT-Base, Uncased,
original training data model get macro F1 score of
0.50 and total accuracy of 0.8792.
For Subtask C, The BERT-Base, Uncased,
original training data model get macro F1 score of
0.5047 and total accuracy of 0.6995.
Results table and confusion matrices for Sub-
tasks A, B and C are shown below.
(2) In translating emoji characters, two methods were used. v1: translate all emoji characters into the official character names listed in the Unicode 11.0.0 Standard. v2: in addition to the v1 processing of all emoji characters, 97 selected emotional emoji characters are translated into manually determined emotional words.
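The v1 translation described in the footnote can be sketched with the standard library's Unicode name table. The emoji-range check below is a simplification we assume for illustration; our actual implementation may have used a full emoji list.

```python
import unicodedata

def emoji_to_text_v1(text):
    # Replace each emoji character with its official Unicode character
    # name, lowercased, as in the v1 method described above.
    out = []
    for ch in text:
        if ord(ch) >= 0x1F300:  # crude emoji-range check (assumption)
            out.append(" " + unicodedata.name(ch, "emoji").lower() + " ")
        else:
            out.append(ch)
    return " ".join("".join(out).split())  # tidy whitespace

print(emoji_to_text_v1("I'm jell \U0001F642"))
# → I'm jell slightly smiling face
```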
System F1 (macro) Accuracy
All NOT baseline 0.4189 0.7209
All OFF baseline 0.2182 0.2790
Original training data 0.8057 0.8512

Table 2: Results for Sub-task A.

System F1 (macro) Accuracy
All TIN baseline 0.4702 0.8875
All UNT baseline 0.1011 0.1125
Original training data, 0.5 threshold 0.5000 0.8792
Original training data, 0.65 threshold 0.4702 0.8875

Table 3: Results for Sub-task B.

System F1 (macro) Accuracy
All GRP baseline 0.1787 0.3662
All IND baseline 0.2130 0.4695
All OTH baseline 0.0941 0.1643
Original training data 0.5047 0.6995

Table 4: Results for Sub-task C.
Figure 1: Sub-task A, BNU-HKBU UIC NLP Team 2, CodaLab 527070, BERT-Base, Uncased, original training data.

Figure 2: Sub-task B, BNU-HKBU UIC NLP Team 2, CodaLab 531958, BERT-Base, Uncased, original training data.

Figure 3: Sub-task C, BNU-HKBU UIC NLP Team 2, CodaLab 535873, BERT-Base, Uncased, original training data.
5 Conclusions
Our model for Subtask A ranked 6th out of 103 groups. This shows that BERT can identify offensive language. However, the results for Subtasks B and C are not as good. We try to explain the possible reasons through error analysis.

First, some tweets may be mislabeled by the annotators. In our test set, some predictions were judged as wrong, but our manual examination suggests the predictions were correct. For example, "@USER B***hhhhh I'm jell" and "@USER Crazy Russian dude owns all your data" are both labeled NOT (not offensive). The model, as well as our manual examination, deems these offensive.

Second, we also notice that it is hard for our model to understand specific nouns, such as people's names, when the training data are insufficient. For example, our model predicts the sentence "Hitler will be so proud of David Hogg" as not offensive. The word "Hitler" has a very special meaning that can make an otherwise innocent sentence offensive. Our model presently cannot detect this.
Another problem is emoji characters in offensive language, which often carry strong emotions and may be used to express irony. We therefore translated emoji characters with the two methods (2) to help the BERT model understand the meaning of tweets. But the results show that both translation methods lead to a drop in accuracy. The main reason is probably that some emoji characters mean different things in different contexts. For example, the Slightly Smiling Face emoji can express happiness but also banter. Thus, it is difficult to understand the meaning of emoji characters in context.
Moreover, unbalanced data is a big problem. In Subtask B, few sentences are predicted as untargeted, and in Subtask C, no sentence is predicted as belonging to the Others category. This leads to a low F1 score in these subtasks. Over-sampling the less numerous categories did not work well in our task, and threshold moving only slightly raises the F1 score. To deal with this problem as future work, we may have to remove the labels and use unsupervised learning.
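Threshold moving for the imbalanced Subtask B can be sketched as follows. The probabilities and the 0.65 value are illustrative; `p_tin` stands in for the model's predicted probability of the majority class TIN.

```python
def predict_label(p_tin, threshold=0.5):
    # Raising the threshold above 0.5 makes the majority class TIN
    # harder to predict, so the rare class UNT is chosen more often.
    return "TIN" if p_tin >= threshold else "UNT"

probs = [0.95, 0.60, 0.30]
print([predict_label(p) for p in probs])                  # → ['TIN', 'TIN', 'UNT']
print([predict_label(p, threshold=0.65) for p in probs])  # → ['TIN', 'UNT', 'UNT']
```

As Table 3 shows, this trade-off did not improve the macro F1 in our experiments, which is why we suggest unsupervised alternatives above.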
Figure 4
For future work, we note that offensive language often contains strong emotions such as anger, banter or taunting. Capturing this emotional and other hidden content may be improved by using DeepMoji (Felbo et al. (2017)), which translates a sentence into a list of emoji expressing the sentence's hidden information, such as sentiment and sarcasm. A list of emoji related to the meaning of a sentence, produced by DeepMoji, can be used to help BERT better classify the sentence, as shown in Figure 4. The last step is to feed the original sentence and the encoded new sentence as input to BERT's sentence-pair classification task.
References

Pete Burnap and Matthew L. Williams. 2015. Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making.

Thomas Davidson, Dana Warmsley, Michael W. Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. Bert: Pre-training of deep
bidirectional transformers for language understand-
ing. CoRR, abs/1810.04805.
Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad
Rahwan, and Sune Lehmann. 2017. Using millions
of emoji occurrences to learn any-domain represen-
tations for detecting sentiment, emotion and sar-
casm. arXiv preprint arXiv:1708.00524.
Irene Kwok and Yuzhou Wang. 2013. Locate the hate:
Detecting tweets against blacks. In AAAI.
Shervin Malmasi and Marcos Zampieri. 2018. Chal-
lenges in discriminating profanity from hate speech.
J. Exp. Theor. Artif. Intell., 30:187–202.
Marcos Zampieri, Shervin Malmasi, Preslav Nakov,
Sara Rosenthal, Noura Farra, and Ritesh Kumar.
2019a. Predicting the Type and Target of Offensive
Posts in Social Media. In Proceedings of NAACL.
Marcos Zampieri, Shervin Malmasi, Preslav Nakov,
Sara Rosenthal, Noura Farra, and Ritesh Kumar.
2019b. SemEval-2019 Task 6: Identifying and Cat-
egorizing Offensive Language in Social Media (Of-
fensEval). In Proceedings of The 13th International
Workshop on Semantic Evaluation (SemEval).