Details of the Cresci-2015 dataset.

Source publication
Article
Social media bots pose potential threats to the online environment, and the continuously evolving anti-detection technologies require bot detection methods to be more reliable and general. Current detection methods encounter challenges, including limited generalization ability, susceptibility to evasion in traditional feature engineering, and insuf...

Similar publications

Preprint
Online communities are an increasingly important stakeholder for firms, and despite the growing body of research on them, much remains to be learned about them and about the factors that determine their attributes and sustainability. Whereas most of the literature focuses on predictors such as community activity, network structure, and platform int...

Citations

... that implemented a novel GNN combined with a Random Forest, generating subgraphs to train GNN classifiers augmented by a Fully Connected Network (FCN) to enhance both accuracy and robustness. Similarly, Zeng et al. (2023) presented a multidimensional learning approach integrating behavioral and relational analytics. ...
... While some studies extract data using X's API, its transition to a paid service poses challenges for experiments reliant on real-time data collection without incurring costs. Hence, recent studies (Shi et al. 2024; Lei et al. 2023; Zeng et al. 2023) have employed both contemporary and older public datasets (Cresci et al. 2015, 2017; Yang et al. 2020), with those outlined in Table 2 representing the latest and most frequently used in current literature. Notably, the TwiBot-20 and MGTAB datasets contain unlabeled users, indicating their significance in supporting semi-supervised learning inquiries. ...
... For instance, while Owais et al. (2023) achieve a high F1-score, the methodology and dataset usage are insufficiently described, limiting comparability and transparency. The same issue arises in subsequent studies, such as Ye et al. (2023), MRLBot by Zeng et al. (2023), and the work of Wu et al. (2025). RG2. ...
Article
Identifying bots on X (formerly Twitter) is essential for preventing misinformation and ensuring user safety. However, current models face several challenges: (i) the use of outdated techniques and attributes, particularly those trained on older datasets; (ii) reproducibility problems stemming from unique and insufficiently detailed methodologies; (iii) shallow analysis depth, with few studies investigating all combinations of feature-based, text-based, and graph-based methods; (iv) a lack of thorough literature reviews to pinpoint effective characteristics for future research. To overcome these gaps, this study proposes a novel multimodal bot detection framework which integrates user profile features, text analysis, and graph-based techniques. Utilizing the recent TwiBot-22 dataset, the model combines semantic text information from user profiles and tweets with graph representations of user interactions, including novel relationships like list ownership and all applicable user profile features, encompassing both those previously used in state-of-the-art models and newly introduced features designed to enhance detection accuracy. The study compares the proposed model against existing approaches and demonstrates a significant improvement in detection accuracy, exceeding the best-performing models by 5.48%. The research contributes a comprehensive evaluation of feature sets, highlights the advantages of incorporating multiple detection strategies, and identifies key parameters for optimizing bot detection models.
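The multimodal framework above combines user profile features, text embeddings, and graph representations. A minimal late-fusion sketch of that idea: per-user vectors from each modality are concatenated into one feature vector before classification. All names, dimensions, and values below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def fuse_modalities(profile_feats, text_emb, graph_emb):
    """Concatenate per-user vectors from the three modalities (late fusion)."""
    return np.concatenate([profile_feats, text_emb, graph_emb])

# Hypothetical per-user inputs (shapes chosen for illustration only)
profile = np.array([120.0, 0.8, 1.0])  # e.g. follower count, ratio, verified flag
text = np.zeros(8)                     # stand-in for a tweet/profile text embedding
graph = np.ones(4)                     # stand-in for a GNN node embedding

fused = fuse_modalities(profile, text, graph)
print(fused.shape)  # (15,)
```

The fused vector would then feed an ordinary classifier; richer fusion schemes (attention, gating) follow the same interface.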
... In their study, Zeng et al. [49] developed a social media bot detection framework called MRLBot. This system combined two distinct models: the DDTCN, which used a Transformer and CNN encoder-decoder to analyze user behavior, and the IB2V, which focused on mapping relationship networks through random walks in community contexts. ...
Article
In recent years, the proliferation of online communication platforms and social media has given rise to a new wave of challenges, including the rapid spread of malicious bots. These bots, often programmed to impersonate human users, can infiltrate online communities, disseminate misinformation, and engage in various activities detrimental to the integrity of digital discourse. It is becoming increasingly difficult to discern text produced by deep neural networks from text created by humans. Transformer-based Pre-trained Language Models (PLMs) have recently shown excellent results in natural language understanding (NLU) challenges. To reduce this threat, the suggested method detects bots at the tweet level by utilizing content and fine-tuning PLMs. Building on recent developments in BERT (Bidirectional Encoder Representations from Transformers) and GPT-3, the suggested model employs a text embedding approach that offers a high-quality representation capable of enhancing detection efficacy. In addition, a Feedforward Neural Network (FNN) was used on top of the PLMs for final classification. The model was experimentally evaluated using the Twitter bot dataset, with test data drawn from the same distribution as the training set. The methodology involves preprocessing Twitter data, generating contextual embeddings using PLMs, and designing a classification model that learns to differentiate between human users and bots. Experiments were carried out with advanced language models to construct tweet encodings as input vectors on top of BERT and its variants. By employing Transformer-based models, the approach achieves significant improvements in bot detection F1-score (93%) compared to traditional methods such as Word2Vec and Global Vectors for Word Representation (GloVe). Accuracy improvements ranging from 3% to 24% over baselines were achieved. The capability of GPT-4, an advanced Large Language Model (LLM), to interpret bot-generated content is also examined. Additionally, explainable artificial intelligence (XAI) was utilized alongside transformer-based models for detecting bots on social media, enhancing the transparency and reliability of these models.
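The "PLM embedding plus feedforward classifier" pattern this abstract describes can be sketched as follows. `embed_tweet` here is a hypothetical stand-in for a real PLM encoder (e.g. a BERT [CLS] vector), and the FNN is a single hidden layer with a sigmoid bot score; this is an illustrative sketch under those assumptions, not the study's implementation.

```python
import zlib
import numpy as np

def embed_tweet(text, dim=16):
    """Placeholder for a PLM encoder: deterministic pseudo-embedding.
    A real pipeline would call a fine-tuned BERT-family model here."""
    seed = zlib.crc32(text.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(dim)

class FNN:
    """One hidden ReLU layer, sigmoid output: P(account is a bot)."""
    def __init__(self, in_dim, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((in_dim, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, 1)) * 0.1

    def predict_proba(self, x):
        h = np.maximum(0.0, x @ self.w1)              # ReLU hidden layer
        return 1.0 / (1.0 + np.exp(-(h @ self.w2)))   # sigmoid bot score

clf = FNN(in_dim=16)
score = clf.predict_proba(embed_tweet("buy followers now!!!"))
print(0.0 <= score.item() <= 1.0)  # True
```

Training (not shown) would fit `w1`/`w2` by minimizing binary cross-entropy against labeled human/bot tweets.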
Article
Harmful Twitter Bots (HTBs) are widespread and adaptable to a wide range of social network platforms, and their use is increasing. As the popularity and utility of social networking bots grow, attacks using social network-based automated accounts are becoming more coordinated, resulting in crimes that may endanger democracy, the financial market, and public health. HTB designers develop their bots to elude detection, while researchers create algorithms to identify social media bot accounts; this field is active and necessitates ongoing improvement due to the never-ending cat-and-mouse game. X, previously known as Twitter, is among the biggest social network platforms that have been plagued by automated accounts. Even though new research continues to tackle this issue, the number of bots on Twitter keeps increasing. In this research, we establish a robust theoretical foundation in the continuously evolving domain of Harmful Twitter Bot (HTB) detection by analyzing existing HTB detection techniques. Our research provides an extensive literature review and introduces an enhanced taxonomy that can help the scientific community form better generalizations for HTB detection. Furthermore, we discuss the domain's obstacles and open challenges to direct and improve future research. To the best of our knowledge, this study marks the first comprehensive examination of HTB detection covering articles published between June 2013 and August 2023. The review's findings include a more thorough classification of detection approaches, a spotlight on ways to spot Twitter bots, and a comparison of recent HTB detection methods. Moreover, we provide a comprehensive list of publicly available datasets for HTB detection. As bots evolve, efforts must be made to raise awareness, equip legitimate users with information, and support future researchers in the field of social network bot detection.
Article
The widespread use of online social networks (OSNs) has made them prime targets for cyber attackers, who exploit these platforms for various malicious activities. As a result, a whole industry of black-market services has emerged, centered on the sale of fake accounts. With the massive rise of OSNs, the number of fraudulent accounts has expanded rapidly. Hence, this research focuses on detecting fraudulent profiles on Instagram and Facebook and aims to find an optimal subset of features that can effectively differentiate between real and fake accounts. The problem is formulated as a multiobjective optimization task: maximize classification accuracy while minimizing the number of selected features. NSGA-II (non-dominated sorting genetic algorithm II) is employed as the optimization algorithm to explore the trade-offs between these conflicting objectives. The proposed methodology relies on input data comprising features characterizing the profiles under investigation; the selected features are then used to train a machine learning model. The model's performance is evaluated using various metrics, including precision, recall, F1-score, and the receiver operating characteristic (ROC) curve. The final prediction model achieved accuracy values ranging from 90% to 99.88%. The results indicated that the model, utilizing features selected by the NSGA-II algorithm, delivered high prediction accuracy while using less than 31% of the total feature space. This efficient feature selection allowed for precise differentiation between fake and real users, demonstrating the model's effectiveness with a minimal number of input variables. Experimental results further demonstrate that the proposed approach outperforms existing approaches. The paper also focuses on explainability, the ability to understand and interpret the decisions and outcomes of machine learning models.
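At the core of NSGA-II-style feature selection is the Pareto dominance test over the two conflicting objectives (accuracy to maximize, feature count to minimize). A minimal sketch of that test and of extracting the first non-dominated front follows; the candidate scores are invented for illustration and are not from the paper.

```python
def dominates(a, b):
    """a dominates b if a is no worse on both objectives and strictly
    better on at least one. Each point is (accuracy, n_features)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    better = a[0] > b[0] or a[1] < b[1]
    return no_worse and better

def pareto_front(points):
    """Return the non-dominated subset (NSGA-II's first front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical feature-subset evaluations: (classification accuracy, feature count)
candidates = [(0.95, 10), (0.92, 5), (0.95, 12), (0.88, 5), (0.99, 30)]
print(pareto_front(candidates))  # [(0.95, 10), (0.92, 5), (0.99, 30)]
```

Full NSGA-II adds crowding-distance sorting and genetic operators on top of this ranking, but every generation's selection pressure comes from exactly this dominance relation.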