Conference Paper

Maximum Entropy Modelling of Bully Users in Social Networks

Authors: Maral Dadvar

Abstract

Cyberbullying is a widespread problem among children and adolescents. Detection of bully users is one of the main courses of action to combat cyberbullying in social networks. Advances in artificial intelligence, along with powerful computational facilities, have fueled a rapid increase in predictive modeling of bully users from massive social network data. However, the low prevalence of bullying incidents makes the labelling process costly and laborious. Commonly used cyberbullying detection methods have been criticized for being inherently dependent on prevalence, and it has been argued that a low number of target-class members (i.e. bully users) introduces statistical artefacts. To overcome this barrier, I propose the use of the Maximum Entropy (ME) method for modelling bully users in social networks. ME is a general-purpose machine learning method with a simple and precise mathematical formulation, and it has a number of aspects that make it well-suited for cyberbullying studies. In order to evaluate the proposed method, I performed a case study using a YouTube dataset. The posts in the dataset have been manually labeled as bullying or non-bullying. We compiled a set of 11 features in three categories to identify bullying users (user features, content features, and activity features), representing the characteristics, actions, and behaviour of the users, respectively. ME predictions were compared with those of commonly used modeling methods: Boosted Regression Trees (BRT), Random Forests (RF), and Support Vector Machine (SVM). Predictions were made in 10 steps, and in each step 10% of the remaining bullying posts were randomly excluded. All models provided reasonable predictions of the bullying incidents and were significantly better than random in both binomial tests of omission and receiver operating characteristic (ROC) analyses. The area under the ROC curve (AUC) was always higher for ME, indicating better discrimination of bullying posts. The performance of ME was more robust and less sensitive to prevalence. We believe that the ME method can be used in its present form for many applications with imbalanced datasets, and that it merits further research and development.
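The evaluation protocol described in the abstract (ten steps, each excluding 10% of the remaining bullying posts, with AUC as the discrimination measure) can be sketched as follows. This is a minimal illustration and not the author's code: the function names `auc` and `prevalence_schedule`, the fixed random seed, the rounding of the number of excluded posts, and the choice to exclude posts before the first evaluation are all assumptions. Labels are taken to be 1 for bullying and 0 for non-bullying.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prevalence_schedule(labels, steps=10, drop_fraction=0.10, seed=0):
    """Yield (step, kept_indices) while thinning the bullying class.

    At each step a random `drop_fraction` of the *remaining* bullying
    posts is excluded; non-bullying posts are never removed.
    Assumption: exclusion happens before the first evaluation.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    for step in range(1, steps + 1):
        n_drop = round(drop_fraction * len(positives))
        dropped = set(rng.sample(positives, n_drop))
        positives = [i for i in positives if i not in dropped]
        yield step, sorted(positives + negatives)
```

At each step, a model's scores for the kept indices would be passed to `auc` to trace how discrimination changes as the prevalence of bullying posts falls.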
MaximumEntropyModellingofBullyUsersinSocialNetworks
MaralDadvar
HumanMediaInteraction,UniversityofTwente
UnitedNationsGlobalPulse(UNGP)
dadvar.maral@gmail.com
Keywords.MaximumEntropy,Cyberbullying,SocialNetworks
... We expand our work by re-implementing the models on a new dataset. For this purpose, we have used a YouTube dataset which has been extensively used in cyberbullying studies [6], [15], [16]. The ultimate aim was to investigate the interoperability and the performance of the reproduced models on new datasets, to see how adaptable they are to different social media platforms and to what extent models trained on a dataset (i.e., social network) can be transferred to another one. ...
Preprint
Cyberbullying is a disturbing online misbehaviour with troubling consequences. It appears in different forms, and in most social networks it is in textual format. Automatic detection of such incidents requires intelligent systems. Most existing studies have approached this problem with conventional machine learning models, and the majority of the models developed in these studies are adaptable to a single social network at a time. In recent studies, deep learning based models have found their way into the detection of cyberbullying incidents, claiming that they can overcome the limitations of the conventional models and improve detection performance. In this paper, we investigate the findings of a recent study in this regard. We successfully reproduced the findings of this study and validated them using the same datasets, namely Wikipedia, Twitter, and Formspring, used by the authors. Then we expanded our work by applying the developed methods to a new YouTube dataset (~54k posts by ~4k users) and investigated the performance of the models on new social media platforms. We also transferred models trained on one platform to another platform and evaluated their performance. Our findings show that the deep learning based models outperform the machine learning models previously applied to the same YouTube dataset. We believe that the deep learning based models can also benefit from integrating other sources of information and from looking into the impact of the profile information of users in social networks.