Figure 4 - uploaded by Zhenghao Wu

Source publication
Conference Paper
Full-text available
In this study we deal with the problem of identifying and categorizing offensive language in social media. Our group, BNU-HKBU UIC NLP Team2, uses supervised classification along with multiple versions of the data generated by different pre-processing methods. We then use the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model...

Context in source publication

Context 1
... emotion and other useful content may be captured better by using DeepMoji (Felbo et al., 2017), which translates a sentence into a list of emoji expressing the sentence's hidden information, such as sentiment and sarcasm. The emoji list that DeepMoji produces for a sentence can help BERT classify the sentence more accurately, as shown in Figure 4. The last step is to feed the original sentence and the encoded new sentence as input to BERT's sentence-pair classification task. ...
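The pipeline the excerpt describes can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: the DeepMoji step is stubbed out as a hypothetical top_emojis helper, and the BERT side uses the standard HuggingFace sentence-pair encoding (both assumptions beyond the excerpt itself).

```python
# Sketch: emoji-augmented sentence-pair classification with BERT.
# `top_emojis` is a hypothetical stand-in for DeepMoji inference.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def top_emojis(sentence: str) -> str:
    """Placeholder: return the most probable emoji for `sentence`
    as a space-separated string (plug in a DeepMoji/torchMoji model)."""
    raise NotImplementedError

def classify_with_emoji_context(sentence: str) -> int:
    # Encode the pair as [CLS] sentence [SEP] emoji list [SEP]
    enc = tokenizer(sentence, top_emojis(sentence),
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return int(logits.argmax(dim=-1))  # e.g. 0 = not offensive, 1 = offensive
```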

Similar publications

Conference Paper
Full-text available
This paper presents the models submitted by the Ghmerti team for subtasks A and B of the OffensEval shared task at SemEval 2019. OffensEval addresses the problem of identifying and categorizing offensive language in social media in three subtasks: whether or not content is offensive (subtask A), whether it is targeted (subtask B) towards an individua...

Citations

... The F-score obtained was 0.5282. Wu et al. applied a pre-trained uncased BERT model, reaching a 0.8057 F-score [22]. In [23], features extracted by a multi-layer RNN (drawing on ELMo embeddings, self-attention, character n-grams and node2vec) were fed to gradient-boosted decision trees (GBDT) for classification, and a 78.79% ...
Article
Full-text available
The classification of documents is a long-studied problem that continues to be studied. With social media becoming part of daily life, and with its misuse, the importance of text classification has grown. This paper investigates the effect of data augmentation via sentence generation on classification performance on an imbalanced dataset. We propose an LSTM-based sentence generation method, use Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec features, and apply Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Multilayer Perceptron (MLP), Extremely Randomized Trees (Extra Trees), Random Forest, eXtreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost) and Bagging. Our experimental results on the imbalanced Offensive Language Identification Dataset (OLID) show that machine learning with sentence generation significantly improves performance.
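As a rough sketch of the augmentation-then-classify setup described in the abstract (not the paper's code), the snippet below assumes a hypothetical generate_minority_sentences helper standing in for the LSTM generator, and pairs TF-IDF features with Logistic Regression, one of the classifiers listed.

```python
# Sketch: augment the minority class with generated sentences, then
# train a TF-IDF + Logistic Regression classifier on the enlarged set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def generate_minority_sentences(texts, labels, minority=1, n=1000):
    """Placeholder for an LSTM language model trained on minority-class
    text that samples `n` synthetic sentences (hypothetical helper)."""
    raise NotImplementedError

def train_augmented(texts, labels, minority=1):
    synthetic = generate_minority_sentences(texts, labels, minority)
    X = list(texts) + synthetic
    y = list(labels) + [minority] * len(synthetic)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(X, y)  # learns on original + generated sentences
    return clf
```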
... The most commonly used preprocessing techniques for text data in natural language processing are Tokenization [97], [13], Removal of Noise and Outliers [98], Lower Casing [21], Removal of Stop Words [58], Integration of Raw Data [99], Text Stemming [61], Dimensionality Reduction [100], Text Lemmatization [11], Parts of Speech Tagging [101], Removal of HTML Tags [102], Removal of URLs [103], Spelling Correction [104], Removal of Emoticons [105], [66], Removing Punctuation or Special Characters [11], [59], Removing Frequent Words [106], Removing Rare Words [107], Removing Single Characters [108], Removing Extra Whitespace [109], Removal of Numerical Values [59], Removing Alphabets [60], Data Compression [110], Converting Emojis to Words [111], Converting Numbers to Words [112], Text Normalization [17], [113], Text Standardization [114], Popping Wh-type Words [12], Anaphora [11], Verb Processing [115], Synonym Word Processing [116], N-gram Stemming [117] and many more. ...
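A handful of the steps listed above compose naturally with standard Python tooling. The sketch below is illustrative only; it assumes NLTK (with the stopwords and punkt data downloaded) and the third-party emoji package are available, and it implements a small subset of the listed techniques.

```python
# Sketch: a small subset of the preprocessing steps listed above.
import re
import string

import emoji  # third-party package for emoji <-> text conversion
from nltk.corpus import stopwords  # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                                  # lower casing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    text = emoji.demojize(text, delimiters=(" ", " "))   # emojis -> words
    text = text.translate(
        str.maketrans("", "", string.punctuation))       # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()             # extra whitespace
    tokens = word_tokenize(text)                         # tokenization
    return [t for t in tokens if t not in STOPWORDS]     # stop-word removal
```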
Article
Full-text available
The Bangla language is the seventh most spoken language, with 265 million native and non-native speakers worldwide. However, English is the predominant language for online resources, technical knowledge, journals, and documentation. Consequently, many Bangla-speaking people with limited command of English face hurdles in utilizing English resources. To bridge the gap between limited support and increasing demand, researchers have conducted many experiments and developed valuable tools and techniques to create and process Bangla language materials. Many efforts are also ongoing to make the Bangla language easy to use in online and technical domains. There are some review papers covering past, present, and future Bangla Natural Language Processing (BNLP) trends, but they mainly concentrate on specific domains of BNLP, such as sentiment analysis, speech recognition, optical character recognition, and text summarization. There is an apparent scarcity of resources that contain a comprehensive review of recent BNLP tools and methods. Therefore, in this paper, we present a thorough analysis of 75 BNLP research papers and categorize them into 11 categories, namely Information Extraction, Machine Translation, Named Entity Recognition, Parsing, Parts of Speech Tagging, Question Answering System, Sentiment Analysis, Spam and Fake Detection, Text Summarization, Word Sense Disambiguation, and Speech Processing and Recognition. We study articles published between 1999 and 2021, 50% of which were published after 2015. Furthermore, we discuss Classical, Machine Learning, and Deep Learning approaches with different datasets while addressing the limitations and the current and future trends of BNLP.
... There have been various shared tasks and competitions on this task, such as GermEval 2019 Task 2 (Struß et al., 2019), GermEval 2018 (Wiegand et al., 2018), SemEval 2019 Task 5 (Basile et al., 2019), SemEval 2019 Task 6 (OffensEval 2019) (Zampieri et al., 2019), SemEval 2020 (Zampieri et al., 2020), and Kaggle's Toxic Comment Classification Challenge. Wu et al. (2019) use the BERT model to detect and classify offensive language in English tweets and obtain good results. Risch and Krestel (2020b) discuss toxic comments in online news discussions, describe subclasses of toxicity, present various deep learning approaches, and propose augmenting training data via transfer learning when the training data is sparse. ...
Experiment Findings
Full-text available
We describe our participation in all the subtasks of the GermEval 2021 shared task on the identification of Toxic, Engaging, and Fact-Claiming Comments. Our system is an ensemble of state-of-the-art pre-trained models fine-tuned with carefully engineered features. We show that feature engineering and data augmentation can be helpful when the training data is sparse. We achieve F1 scores of 66.87, 68.93, and 73.91 in the Toxic, Engaging, and Fact-Claiming comment identification subtasks, respectively.
... A similar task was proposed in SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval) (Zampieri et al., 2019). Most of the top-ranked teams in this task used transformer language models (Liu et al., 2019a; Zhu et al., 2019; Pelicon et al., 2019; Wu et al., 2019) or an ensemble of CNN and RNN (Mahata et al., 2019; Mitrović et al., 2019) to classify the sentences. ...
... However, due to the characteristics of the German language, the process can be more complex, and results may score lower than their English counterparts. [102] used the BERT model to detect and classify offensive language in English tweets. They used the base, uncased version with 768-dimensional embeddings, with very good results in the binary classification task. ...
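A minimal sketch of that kind of setup, using the HuggingFace bert-base-uncased checkpoint (whose hidden states are 768-dimensional) for binary classification; the toy data stands in for the actual labeled tweets, and the hyperparameters are illustrative assumptions, not the cited system's settings.

```python
# Sketch: fine-tuning bert-base-uncased for binary offensive-language
# classification. Toy placeholder data; not the cited system.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

train_texts = ["have a great day", "you are an idiot"]  # toy placeholders
train_labels = [0, 1]                                   # 0 = NOT, 1 = OFF

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(2):  # illustrative epoch count
    enc = tokenizer(train_texts, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    loss = model(**enc, labels=torch.tensor(train_labels)).loss
    loss.backward()      # cross-entropy over the two classes
    optimizer.step()
    optimizer.zero_grad()
```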
... future harassment fine-grained classification and create a path toward understanding the motivation behind the verbal attacks. Gao and Huang (2017) [29] propose utilizing context information by employing a Bi-LSTM (Bidirectional Long Short-Term Memory network) with an attention layer, as defined by Bahdanau et al. [7], for hate speech detection. Wu et al. (2019) ...
Thesis
Full-text available
Text Classification is the process of taking a set of labeled text documents or snippets, learning the correlation between content and label, and creating an algorithm that can automatically predict the labels of new (previously unseen) and unlabeled documents or snippets. A fundamental task in the field of Natural Language Processing (NLP), Text Classification is the subject of many real-life applications. It remains a significant challenge in language processing, since the grammatical relations in a sentence must be encoded accurately without losing the semantic meaning of the snippet or paragraph. The advent of deep learning models in recent years has led to an increase in research on solving NLP tasks with deep learning methods, text classification being one of them. One of the most urgent problems in text classification is the detection of bad actors in social networks and news media. Be it fake news, propaganda, partisan news, misinformation, harassment, hate speech or sexism, these kinds of behaviours have taken root in our online culture, and building an automated mechanism to filter them out has proven to be a challenging task. This thesis presents several deep learning models capable of detecting such behaviours. Models using word embeddings like GloVe or Word2Vec still prove useful in most practical settings, but large language models like BERT are slowly taking their place in most applications. The thesis introduces two new specialized BERT models pre-trained to improve fake news, propaganda, and offensive tweet detection, one for English and one for German. Their performance has been compared to that of simpler deep learning models, and the results of our experiments show many instances where our models outperform the standard release of BERT by a relatively high margin.
... Offensive language identification has seen extensive usage of language modeling approaches like BERT (Pelicon et al., 2019; Pavlopoulos et al., 2019; Wu et al., 2019; Liu et al., 2019), GPT (Zampieri et al., 2019b) and ELMo (Indurthi et al., 2019) with varying hyperparameters and pre-processing steps. In this work, based on its widespread usage, BERT (Devlin et al., 2019) is used as the classifier. ...
Preprint
Full-text available
In this paper, we present our participation in SemEval-2020 Task 12 Subtask A (English), which focuses on offensive language identification from noisy labels. To this end, we developed a hybrid system with a BERT classifier trained on tweets selected using a Statistical Sampling Algorithm (SA) and Post-Processed (PP) using an offensive wordlist. Our system achieved 34th position with a Macro-averaged F1-score (Macro-F1) of 0.90913 over both offensive and non-offensive classes. We further present comprehensive results and error analysis to assist future research on offensive language identification with noisy labels.
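One plausible reading of the PP step is sketched below; the override direction, the wordlist entries, and the bert_predict stub are all assumptions rather than the authors' published code.

```python
# Sketch: wordlist post-processing over a BERT prediction. If the model
# says "not offensive" but a wordlist term appears, flip the label.
import re

OFFENSIVE_WORDLIST = {"idiot", "moron"}  # illustrative entries only

def bert_predict(tweet):
    """Placeholder for the trained BERT classifier (0 = NOT, 1 = OFF)."""
    raise NotImplementedError

def predict_with_postprocessing(tweet):
    label = bert_predict(tweet)
    tokens = set(re.findall(r"\w+", tweet.lower()))
    if label == 0 and tokens & OFFENSIVE_WORDLIST:
        return 1  # wordlist hit overrides the model's NOT decision
    return label
```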
... This model was trained on a large text corpus, such as Wikipedia, and can be applied to various NLP tasks without changing its core architecture. Zhenghao Wu et al. used BERT to capture linguistic, syntactic and semantic features (Wu et al., 2019). Andraz Pelicon et al. used a fine-tuned BERT model for offensive language identification (Pelicon et al., 2019). ...
... They also pointed out that the BERT model performed best in subtask A, achieving first place in subtask A of SemEval-2019 Task 6. In subtasks B and C, the class distribution is more skewed and there is less data, so the results are not as good as in subtask A. In contrast to other models, BERT uses a bidirectional representation to take advantage of both left and right context, deepening its understanding of a sentence by capturing long-term dependencies between its parts (Wu et al., 2019). Kumar et al. (2019) consider it very important to preprocess the words. ...
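The bidirectionality point can be illustrated with a generic masked-token query (a standard BERT demonstration, not code from the cited work): the right-hand context is what steers the prediction, which a purely left-to-right model cannot see.

```python
# Sketch: BERT fills in [MASK] using context on both sides.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The referee showed him a [MASK] card for the foul.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
# "card for the foul" (to the right of the mask) pushes predictions
# toward "red"/"yellow"; a left-to-right model would not see it.
```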