Towards Understanding Trends Manipulation in Pakistan Twitter

Soufia Kausar, Bilal Tahir, Muhammad Amir Mehmood
{soufia.kausar, bilal.tahir, amir.mehmood}@kics.edu.pk
Al-Khawarizmi Institute of Computer Science
University of Engineering and Technology, Lahore, Pakistan.
Abstract—The rapid adoption of online social media platforms
has transformed the way people communicate and interact.
On these platforms, discussions in the form of trending topics
provide a glimpse of events happening around the world in
real-time. Also, these trends are used for political campaigns,
public awareness, and brand promotions. Consequently, these
trends are susceptible to manipulation by malicious users who aim
to mislead the mass audience. In this article, we identify and
study the characteristics of users involved in the manipulation
of Twitter trends in Pakistan. We propose ‘Manipify’ – a
framework for automatic detection and analysis of malicious
users for Twitter trends. Our framework consists of three distinct
modules: i) user classifier, ii) hashtag classifier, and iii) trend
analyzer. The user classifier introduces a novel approach to
automatically detect manipulators using tweet content and user
behaviour features. Also, the module classifies human and bot
users. Next, the hashtag classifier categorizes trending hashtags
into six categories, which helps examine manipulators' behaviour
across different categories. Finally, the trend analyzer module
examines users, hashtags, and tweets for hashtag reach, linguistic
features and user behaviour. Our user classifier module achieves
0.91 accuracy in classifying the manipulators. We further test
Manipify on a dataset comprising 665 trending hashtags with
5.4 million tweets and 1.9 million users. The analysis of trends
reveals that the trending panel is mostly dominated by political
hashtags. In addition, our results show a higher contribution
of human accounts in trend manipulation as compared to bots.
Furthermore, we present two case studies of hashtag-wars and
anti-state propaganda to demonstrate the real-world application of
our research.
I. INTRODUCTION
Online social media platforms have emerged as a key source
of information and socializing during the last decade. These
platforms strive to maximize user engagement through rapid
information dissemination to information-savvy users. In this
regard, Twitter – a micro-blogging platform – provides real-
time trends of the most discussed topics in the trending
panel [1]. Due to their extensive reach, such trends have enabled
journalists and business analysts to explore breaking news,
predict candidate popularity, and analyze product reviews [2]–[5]. In
addition, a survey reports that 74% of Twitter users utilise
this platform as a source of daily news while 34% of these
users focus on trending topics for this purpose [6]. On one
hand, Twitter trends are being used to detect breakthrough
events, product marketing, and crisis management [4], [7]. On
the other hand, these trends are subjected to manipulation by
malicious users to spread false narratives [8].
Recent research reveals that Twitter trends can easily be ma-
nipulated by using a small number of automated accounts [9].
A new business has emerged where companies are selling
Twitter trends “manipulation as a service”. These services use
bots and trolls to generate scripted conversations to produce
false trends [10]. A survey reveals that 23-27% of political
conversations on Twitter during the 2016 US elections were
carried out by bot accounts [11]. Another study identifies
40% of the users disseminating information related to
COVID-19 on Twitter as bots [12]. Due to such manipulation, critics
have also demanded the removal of the trending panel, as it does
not reflect genuinely trending topics [13].
In general, researchers have focused on examining the trend
manipulation by analyzing the activity of human and bot
accounts [14], [15]. Also, the pattern of deletion of tweets
related to a trend is studied to detect the possibility of
manipulation [16]. Moreover, a limited number of trending
hashtags are manually examined for manipulation [17]. However,
these approaches have three major limitations. First,
manipulators cannot be equated with bot accounts, as human
accounts are also involved in trend manipulation [17]. Second,
manipulators do not necessarily delete tweets after creating
a trend. Finally, manual analysis of constantly emerging
new trends is not feasible. Moreover, existing techniques
for the identification of spam, bot, fake, compromised, and
cloned accounts do not extend to manipulators due to
the dissimilarity between their behaviours. To the best of our
knowledge, no research has been conducted for the automatic
identification of malicious users involved in the manipulation
of Twitter trends.
This paper is an effort to study the manipulation of trends
in Pakistan. In this regard, we propose a novel framework,
‘Manipify’, to automatically identify and examine the
manipulators. The framework consists of three major modules:
i) user classifier, ii) hashtag classifier, and iii) trend analyzer.
The first module, the user classifier, identifies bots, humans,
and manipulators using our developed datasets. Specifically, we
introduce a novel method to detect manipulators using content
and user behaviour features. Our method achieves an accuracy
of 0.91. Also, the user classifier leverages the profile features
of description, URL, followers, friends, status count, and
geo-location to identify bot accounts with 0.84 accuracy. In
addition, the hashtag classifier categorises hashtags into six
classes of i) politics, ii) sports, iii) religious, iv) campaign,
arXiv:2109.14872v1 [cs.IR] 30 Sep 2021
v) entertainment, and vi) military using our labelled dataset of
2,384 hashtags. Finally, the trend analyzer examines the trends for
language distribution of tweets, the reach of a hashtag, and the
behaviour of manipulators and bots. To test our framework, we
build the PK-Trends dataset containing 665 trending hashtags, 5.4
million tweets, and 1.9 million users from Pakistan. Specifically,
we collect trending hashtags and their related tweets for
one week each in November 2020, December 2020, and January
2021, with a gap of five weeks between samples. Our major
contributions and key findings are summarized as follows:
• We introduce a machine learning-based method to automatically
  detect users involved in the manipulation of
  Twitter trends. Our approach achieves an accuracy of
  0.91 for manipulator detection.
• Our analysis of trending hashtags from Pakistan reveals that
  political and campaign hashtags dominate the trending
  panel with 32-42% and 16-32% of hashtags, respectively.
  In contrast, only 1-14% of hashtags are from
  categories such as sports, entertainment, and religion.
• We found distinct patterns in user preferences of natural
  languages with respect to hashtag categories. While
  English is preferred for the entertainment and
  sports categories, Urdu is the more frequently used language in
  tweets related to political and religious hashtags.
• Our user analysis highlights that on average our dataset
  contains 374K (52%) bot and 31K (4.4%) manipulator
  accounts. Moreover, 51.5% of manipulator
  accounts are human accounts.
The rest of the paper is structured as follows: Section II
presents the related work and Section III introduces the devel-
oped datasets. In Section IV, we describe the Manipify frame-
work while its evaluation is presented in Section V. Section VI
presents the analysis conducted for Twitter trends in Pakistan
using Manipify. Further, we present case studies in Section VII
to show the real-life applications of our framework. Finally,
we conclude our work in Section VIII.
II. RELATED WORK
Recently, social media analysis has been adopted to perform
topic-based sentiment analysis [18], [19], public opinion
mining [20], emotion analysis [21], health surveillance [22],
crime monitoring [23], [24], spam detection [25], [26], crisis
management [27]–[29], and business marketing [4]. In addition,
social media users have been examined for malicious user
identification [30]–[33], location inference [34], and influencer
identification [35].
A. Twitter trend analysis
In the last decade, researchers have focused on examining
Twitter trends due to their impact on society. For instance,
a real-time system was developed for the classification of
trending topics into news, current events, memes, and
commemoratives [36]. The authors used features from the tweet
text and its metadata for the classification of trends. This
categorization of trends was a milestone towards identifying
the breaking news and viral memes from Twitter which was
helpful for journalists and news agencies. In addition, Zhang et
al. investigated the possibility of trend manipulation on
Twitter [14]. The authors experimented with features such as popularity
and coverage of Twitter trending topics to identify the features that
contribute most to predicting whether a topic will trend. Their
analysis also indicated the presence of malicious and spam users
manipulating Twitter trends. Similarly, Khan et al. proposed
a real-time trend detection method by analyzing a stream of
tweets [37]. They used statistical information retrieval methods
to extract important terms. Furthermore, the first large-scale study
on manipulated/fake Twitter trends was conducted by Elmas et
al. [16]. The authors found that nearly 20% of the
global Twitter trends were a result of manipulation. They also
observed that both bots and compromised accounts were
involved in the manipulation of trends. In addition, the authors
discovered that attacked keywords reach the trending panel
much faster than normal trends.
B. Bot classification
The identification of bots is a hot topic of research due to
their usage and impact on social media content. For example,
supervised machine learning techniques have been proposed
with user-based features such as the age of the account, verification
status, and the number of followers and followees [38]. Among
the user-based features, the feature of geo-location turned
out to be the most informative feature for the identification
of bot users. Similarly, Anwar et al. used an unsupervised
learning approach for bot detection on Twitter [33]. Data
from the 2019 Canadian elections was used to perform k-means
clustering with a feature set containing the number of daily
tweets, retweet percentage, and daily favourite count. In addition,
a one-class classification approach was adopted for
bot classification [39]. The authors also compared the performance
of binary-class and multi-class classification with one-class
classification. Also, the real-time systems BotSlayer [40]
and BotWalk [41] have been developed for bot detection.
BotWalk used an unsupervised adaptive algorithm with
network, content, and user-based features for classification
with 0.90 precision value. Furthermore, Sayyadiharikandeh et
al. have proposed Ensembles of Specialized Classifiers (ESC)
for bot detection [42]. The proposed methodology was also
deployed in the latest version of Botometer – an online bot
detection tool.
C. Twitter User Analysis
Twitter users are key to disseminating information and creating
trends. To study the characteristics of these users,
Motamedi et al. conducted a detailed study on two snapshots
of elite user accounts on Twitter [43]. They investigated
features of elite users such as changes in followers,
followees, and rank over time. Their findings also showed that
the relationship graph between elite users formed 14–20 communities.
Similarly, one million Twitter users were analyzed to study
the behaviour of demographic groups [44].
The authors’ analysis of demographic attributes, including
gender, ethnicity, and account age, highlighted that
various demographic groups show differences
in behaviour. For instance, male users tended to specify
their location in their Twitter profile while keeping
geo-tagging features disabled. Similarly, female users preferred
to use iPhone or Android devices for Twitter instead of a
web browser. In addition, Yaqub et al. conducted sentiment
analysis of two political candidates during the 2016 US presidential
elections [45]. The analysis showed that Trump received more
positive sentiment than Hillary Clinton. In addition,
the authors analyzed tweets of one million Twitter users to
identify their opinion. Their findings revealed that existing
opinions were re-shared using the retweet feature instead of
building new opinions and arguments.
Table I: Description of hashtag categories.
| Category | Description |
|---|---|
| Political | Hashtags related to political figures, events or slogans. |
| Sports | Sports hashtags are those that highlight a sports event, teams or players. |
| Religious | The hashtags highlighting religious topics. |
| Campaign | All hashtags that are promotional or social campaigns, e.g., demanding justice. |
| Entertainment | Entertainment hashtags include discussion of media celebrities, music, movies etc. |
| Military | The hashtags discussing military events or personnel. |
| Other | All hashtags that do not match the description of the above-mentioned categories. |
D. Hashtag classification
The automatic understanding of hashtags is a challenging
task because hashtags are inconsistent and lack standard
vocabulary [46]. Previously, researchers adopted the
labour-intensive and time-consuming approach of manually labelling
trending hashtags. In this regard, Romero et al. categorised
hashtags into eight pre-defined categories manually to examine
the patterns of information dissemination [47]. Jeon et
al. experimented with building a hashtag recommendation system
after classifying hashtags by topic [48]. TF-IDF lexical features
were extracted from tweets and used to train a Naive Bayes classifier
for hashtag classification. Another algorithm was proposed
for the hashtag classification after combining lexical and
pragmatic features [49]. The pragmatic features are related to
user profiles such as the number of followers or followees.
In addition, open-source content like Wikipedia and Open
Directory was utilized for the classification of hashtags [50].
For instance, Ferragina et al. used the Wikipedia graph to devise the
Hashtag-Entities (HE) graph, which represented the semantic
relation between hashtags and their entities [50]. However,
this approach is limited to the hashtags that are available in
the Wikipedia graph only.
The literature review revealed that the research community
has focused on exploring the possibility of manipulation of
Twitter trends but no technique has been presented for the
automatic detection of manipulators. In contrast, we propose
a machine learning-based model for automatic manipulator
detection. In addition, we build an analyzer module for a
comprehensive analysis of Twitter trends.
III. DATASET
In this section, first, we focus on building separate datasets
to detect manipulators and bots. Next, we describe the process
of developing a dataset for the hashtag classification. Finally,
we discuss our PK-Trends dataset to study Pakistan Twitter
trends.
Table II: MT-Dat – Statistics.
| | Manipulators | Non-Manipulators | Total |
|---|---|---|---|
| #Users | 510 | 500 | 1,010 |
| %Users | 50.4% | 49.6% | 100% |
Table III: BT-Dat – Statistics.
| Dataset | #Accounts | #Bot | %Bot | #Human | %Human |
|---|---|---|---|---|---|
| midterm-2018 | 20,000 | 11,908 | 60% | 8,092 | 40% |
| celebrity-2019 | 4,589 | 0 | 0% | 4,589 | 100% |
| Cresci-2017 | 11,017 | 7,543 | 68% | 3,474 | 32% |
| BT-Dat | 35,606 | 19,451 | 55% | 16,155 | 45% |
A. Manipulator Detection – MT-Dat
A gold-standard labelled dataset is needed to train a supervised
machine learning model for manipulator identification.
Due to the absence of such a dataset, we develop the MT-Dat
dataset by manually labelling users as ‘manipulators’ or
‘organic’ users. To label a user, we observe the
features of velocity, volume, and content similarity of tweets
posted by a user. Generally, the velocity and volume of tweets
related to a topic are considered as key features to determine
the trending topic [51]. Exploiting this, manipulators
post tweets with high volume and velocity to create a fake
trend. In addition, we notice that manipulators generate a large
number of posts with similar content to increase the volume
of tweets. Therefore, we also manually observe the similarity
of the tweets posted when assigning a label. Also, the aim of
manipulators is to force a fake trend into the trending panel.
Therefore, for labelling, we only observe the tweets posted by
users before the hashtag was first seen in the trending panel.
Examining these features, two annotators manually labelled
1,010 randomly selected users with a mutual agreement of
98%. In MT-Dat, 510 users were labelled as manipulators and
500 were labelled as non-manipulators. Table II shows the
statistics of our dataset.
B. Bot detection dataset – BT-Dat
Next, we use three publicly available datasets to develop
our bot detection dataset (BT-Dat). The first labelled dataset,
midterm-2018, contains information on users and tweets
from the 2018 US midterm elections [52], [53]. Annotators
labelled a user as human if they were actively involved in
any political discussion. To label bot accounts, features of
tweet time and account creation time were manually observed.
The midterm-2018 dataset contains a total of 50,538 user
accounts, of which 42,446 are bot and 8,092 are human
accounts. We only include 11,908 bots from midterm-2018
in our dataset. Due to the insufficient number of human
accounts in midterm-2018, we also use the celebrity-2019
dataset [54]. This dataset contains 4,589 human accounts
belonging to prominent public figures. In addition, the third
dataset, Cresci-2017, contains 7,543 bot and 3,474 human
accounts. The accounts in the dataset were labelled by
Crowdflower contributors [55]. To label human accounts,
annotators contacted random Twitter users and asked a question
in natural language. Accounts that answered the question
properly were labelled as human accounts. Moreover, Cresci-2017
contains three classes of bots: i) traditional bots, ii) fake
followers, and iii) social spambots [56]. We combine the users
in all three datasets to develop a comprehensive bot detection
dataset (BT-Dat) that contains 35,606 labelled Twitter
accounts. Table III shows statistics for the three source datasets
and the combined BT-Dat dataset.
Table IV: Ha-Dat – Statistics.
| Hashtags | Political | Sports | Religious | Campaign | Entertainment | Military | Other | Total |
|---|---|---|---|---|---|---|---|---|
| English | 292 | 120 | 77 | 185 | 117 | 102 | 44 | 937 |
| Urdu | 212 | 18 | 102 | 105 | 35 | 38 | 56 | 566 |
| English-Urdu | 359 | 89 | 75 | 160 | 75 | 29 | 94 | 881 |
| Total | 863 | 227 | 254 | 450 | 227 | 169 | 194 | 2,384 |
Table V: PK-Trends – Statistics.
| Sr# | Dataset | PK-Nov-20 | PK-Dec-20 | PK-Jan-21 |
|---|---|---|---|---|
| 1 | Time Period | 08 - 15 Nov, 2020 | 13 - 20 Dec, 2020 | 21 - 28 Jan, 2021 |
| 2 | Unique Trends | 1,542 | 1,454 | 1,391 |
| 3 | Unique Hashtags | 284 | 188 | 193 |
| 4 | Local Hashtags | 231 | 161 | 141 |
| 5 | Global Hashtags | 53 | 27 | 52 |
| 6 | Unique Keywords | 1,258 | 1,266 | 1,198 |
| 7 | Unique Users | 1,359,406 | 554,513 | 75,628 |
| 8 | Unique Tweets | 2,990,850 | 2,045,448 | 458,799 |
C. Hashtag Classification dataset – Ha-Dat
Hashtags contain symbols, names of organizations,
people, or events joined without using any space [46].
A labelled hashtag dataset is required to understand and
classify these hashtags into their respective categories. To
this end, we develop a labelled hashtag dataset by manually
annotating trending hashtags into six categories: (i) politics,
(ii) sports, (iii) religious, (iv) campaign, (v) entertainment,
and (vi) military. We collected trending hashtags from Twitter
using the Twitter API [57] from July 25, 2018, to August
6, 2021. Using the definitions of hashtag categories given in
Table I, two annotators manually labelled 2,384 randomly
selected hashtags. If a hashtag does not match the description of
any category, it is labelled as ‘other’. We fetch tweets related
to these hashtags in the English and Urdu languages. Hashtags
that contain tweets in the English language only are referred
to as ‘English hashtags’. Similarly, hashtags that contain Urdu
language tweets only are ‘Urdu hashtags’. Finally, ‘English-Urdu
hashtags’ are those that contain tweets in both languages.
The detailed statistics of our dataset are shown in Table IV.
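The language grouping described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `hashtag_language_group` and the language codes "en" and "ur" are assumptions made for the example.

```python
def hashtag_language_group(tweet_langs):
    """Return the language group of one hashtag.

    tweet_langs: iterable with one language code per fetched tweet.
    The codes "en" (English) and "ur" (Urdu) are assumed for illustration.
    """
    langs = set(tweet_langs)
    has_english = "en" in langs
    has_urdu = "ur" in langs
    if has_english and has_urdu:
        return "English-Urdu"
    if has_english:
        return "English"
    if has_urdu:
        return "Urdu"
    return None  # no English or Urdu tweets were fetched for this hashtag


print(hashtag_language_group(["en", "ur", "en"]))
```

Grouping by the set of languages, rather than by the majority language, follows the definition in the text: a single Urdu tweet is enough to move an otherwise English hashtag into the English-Urdu group.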
D. PK-Trends dataset
Finally, we discuss the PK-Trends dataset, collected to study
different aspects of trending hashtags in Pakistan. For this purpose,
we first collect trending hashtags in Pakistan from the
online service GetDayTrends [58], which provides the list of
trending hashtags on an hourly basis. We create three datasets
by fetching trending hashtags for one week in November 2020
(PK-Nov-20), December 2020 (PK-Dec-20), and January
2021 (PK-Jan-21). We deliberately take these samples with
a gap of five weeks to explore the dynamics of trending hashtags. In
addition, we define a time window to fetch tweets related to the
trending hashtags. This time window includes tweets from one
day before to one day after the hashtag is seen in the trending
panel. Moreover, the Python library Twint [59] is
used to fetch the tweets for this three-day time window. We
note that Twint could not scrape retweets. Therefore,
cumulatively, the PK-Trends dataset contains 5.4 million “original”
tweets posted by 1.9 million users. Furthermore, PK-Nov-20,
PK-Dec-20, and PK-Jan-21 contain 284, 188, and 193 unique
trending hashtags, respectively. Table V shows statistics of
the PK-Trends dataset. Figure 1 shows the number of trending
hashtags containing related tweets in PK-Trends for seven days
with 2-hour bins for all datasets. We observe a periodic pattern
in the number of unique trending hashtags with a peak around
midday (11 AM-3 PM), consistent with prior studies [60]. This
observation highlights that users discuss more unique hashtags
during the daytime. On average, PK-Nov-20, PK-Dec-20, and
PK-Jan-21 contain tweets related to 104 (36.6%), 64 (34.1%),
and 42 (21.7%) trending hashtags in each bin, respectively.
Also, the PK-Trends dataset contains tweets posted in multiple
languages. To study the natural language distribution, we use
the meta-information of tweet language provided by Twitter.
We found that English is the most frequent language used in
the PK-Trends dataset, with 3.2 million tweets, whereas only 0.48
million tweets are posted in the Urdu language. The dataset has 0.67
million tweets marked as unknown language and 1.1 million
tweets from “other” languages.
Figure 1: Number of hashtags – PK-Trends dataset.
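The three-day collection window and the 2-hour binning can be illustrated as follows. This is a sketch under stated assumptions: the helper names `three_day_window`, `in_window`, and `two_hour_bin` are hypothetical, and aligning the window to calendar days is an assumption, since the text does not say whether the window is aligned to days or to the exact trending hour.

```python
from datetime import date, datetime, timedelta


def three_day_window(trend_day):
    """Return (start, end) covering the day before, the day of, and the
    day after `trend_day`. `end` is exclusive. Calendar-day alignment is
    an assumption of this sketch."""
    start = datetime.combine(trend_day - timedelta(days=1), datetime.min.time())
    end = start + timedelta(days=3)
    return start, end


def in_window(tweet_time, trend_day):
    """True if a tweet timestamp falls inside the three-day window."""
    start, end = three_day_window(trend_day)
    return start <= tweet_time < end


def two_hour_bin(tweet_time):
    """Index (0 to 11) of the 2-hour bin of the day containing tweet_time."""
    return tweet_time.hour // 2


start, end = three_day_window(date(2021, 1, 24))
print(start, end)
print(two_hour_bin(datetime(2021, 1, 24, 13, 30)))
```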
IV. MANIPIFY – FRAMEWORK
In this section, we describe the architecture of our proposed
Manipify framework. First, we explain the user classifica-
tion modules. Next, we discuss the methodology of hashtag
classification. Finally, we present the trend analyzer module
proposed to understand the various dynamics of users.
A. Manipulator Detection
The classification of manipulators requires distinct features
related to users to train the machine learning model. In the
literature, there is no study available which automatically
identifies manipulators using such features. Hence, we
design five features related to users: 1) number
of total tweets by a user ($Tweets$), 2) number of tweets
before trend time ($Tweets_{before}$), 3) average time between
consecutive tweets after trend time ($Time_{after}$), 4) average time
between consecutive tweets before trend time ($Time_{before}$), and
5) content similarity score ($Sim_{score}$). The volume
and velocity of tweets containing a hashtag are key factors in
determining the trending hashtags [51]. In the literature, manual
analysis of manipulators reveals that they post a large number
of tweets ($Tweets$) using a hashtag to create a trend [17].
In particular, it is observed that these users post tweets before
trend time ($Tweets_{before}$), as ‘organic users’ generally use the
hashtag after it is seen in the trending panel. In addition,
the velocity of tweets containing a hashtag is a key factor
in determining the trend. Therefore, we consider the average
time between tweets posted by a user before ($Time_{before}$) and
after ($Time_{after}$) the trending time of a hashtag. We calculate the
velocity of tweets before and after trending time separately, as we believe
manipulators limit their activity after trending, which impacts the
overall value of tweet velocity.
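The four activity features can be derived from a user's tweet timestamps for one hashtag. The sketch below is illustrative only: the function names are hypothetical, returning `None` when fewer than two tweets fall on one side of the trend time is an assumption (the text does not specify this case), and a tweet posted exactly at trend time is counted as "after".

```python
from datetime import datetime


def mean_gap_seconds(times):
    """Average time in seconds between consecutive tweets.
    Returns None when fewer than two tweets are available (assumption)."""
    if len(times) < 2:
        return None
    ordered = sorted(times)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def activity_features(tweet_times, trend_time):
    """Volume and velocity features of one user for one hashtag."""
    before = [t for t in tweet_times if t < trend_time]
    after = [t for t in tweet_times if t >= trend_time]
    return {
        "tweets": len(tweet_times),
        "tweets_before": len(before),
        "time_before": mean_gap_seconds(before),
        "time_after": mean_gap_seconds(after),
    }


trend = datetime(2021, 1, 24, 14, 0)
times = [
    datetime(2021, 1, 24, 13, 0),
    datetime(2021, 1, 24, 13, 10),
    datetime(2021, 1, 24, 13, 20),
    datetime(2021, 1, 24, 15, 0),
]
print(activity_features(times, trend))
```

In this example the user posts three tweets ten minutes apart before the hashtag trends and only one afterwards, which is the burst-then-stop pattern the features are meant to capture.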
Finally, we calculate the similarity score ($Sim_{score}$) of
tweets posted by a user with the aim of identifying
manipulators who post similar tweets to increase the volume.
In addition, posting and deleting a large number of tweets
using the same hashtag is a violation of Twitter’s platform
policy [61]. To calculate the content similarity score, we make
use of natural language processing techniques. The similarity
score of all tweets posted by a user related to a hashtag
is computed using the overlapping n-grams in tweets.
It is worth noting that all possible n-grams
of the tweets with length greater than one are compared for
the score calculation. Let a user (U) post n tweets
related to a hashtag (H). We calculate the similarity score
($Sim_{score}$) of the user for the hashtag by computing the ratio
of overlapping n-grams to the total n-grams of all tweets posted
by the user using Equation 1.
$$Sim_{score}(U) = \frac{\sum_{w=2}^{n} \left( ngram_{W} \, freq(ngram) \right)}{\#tweets} \qquad (1)$$
It should be noted that we only utilise the original tweets of
a user for calculating the classification features and retweets
are not considered. Utilising these five features, we train a
Logistic Regression classifier to classify the manipulators.
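One possible reading of the content similarity score is sketched below. It follows the prose description (the share of n-grams of length two or more that recur across a user's tweets) rather than the exact normalization of Equation 1, whose denominator is the number of tweets. Whitespace tokenization, lower-casing, and the function names are assumptions of this sketch.

```python
from collections import Counter


def ngrams(tokens, min_n=2):
    """All contiguous n-grams of length >= min_n in a token list."""
    return [
        tuple(tokens[i:i + n])
        for n in range(min_n, len(tokens) + 1)
        for i in range(len(tokens) - n + 1)
    ]


def similarity_score(tweets):
    """Share of n-gram occurrences that are reused across a user's tweets."""
    counts = Counter()
    for text in tweets:
        # Count each n-gram once per tweet, so a count above one means
        # the n-gram was reused in another tweet.
        counts.update(set(ngrams(text.lower().split())))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    overlapping = sum(c for c in counts.values() if c > 1)
    return overlapping / total


print(similarity_score(["vote for x now", "vote for x now"]))
print(similarity_score(["a b", "c d"]))
```

Under this reading, identical tweets score 1.0 and tweets sharing no multi-word sequence score 0.0, so copy-pasted campaign text pushes a user towards the upper end of the range.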
B. Bot Detection
Next, we use user behaviour and activity features for
bot classification. We extract nine binary features related to
the user account [38]: 1) User description, 2) URL exists
in description, 3) Friend count>1000, 4) Follower count<30,
5) User geo-location, 6) List count>0, 7) Statuses count>0, 8)
URL exists in profile, and 9) User verified account. In addition
to these, we also use the features of followers count and friends
count. The presence of a description increases the probability
of a human user because bot accounts lack such customized
information. Similarly, human users tend to have a lower
number of friends, and users with a large friend count are
considered bots. In addition, bot accounts generally do not
enable geo-location for interaction. Moreover, the verification
of an account by Twitter is an indication of a human account.
These verified accounts belong to celebrities, organizations, or
political parties. We refer to the bot classifier as “BotCat” in
the rest of the paper. We train and evaluate the performance of
a Logistic Regression classifier with a train-test split ratio of 70:30
for bot detection.
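The feature vector listed above can be assembled from a user profile as in the following sketch. The dictionary keys (`description`, `url`, `friends_count`, `followers_count`, `geo_enabled`, `listed_count`, `statuses_count`, `verified`) are assumed field names chosen for illustration, and the function name `bot_features` is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def bot_features(profile):
    """Nine binary profile features followed by the raw follower and
    friend counts. `profile` is a dict with assumed key names."""
    description = profile.get("description") or ""
    friends = profile.get("friends_count", 0)
    followers = profile.get("followers_count", 0)
    return [
        int(bool(description.strip())),            # 1) description present
        int(bool(URL_RE.search(description))),     # 2) URL in description
        int(friends > 1000),                       # 3) friend count > 1000
        int(followers < 30),                       # 4) follower count < 30
        int(bool(profile.get("geo_enabled"))),     # 5) geo-location enabled
        int(profile.get("listed_count", 0) > 0),   # 6) list count > 0
        int(profile.get("statuses_count", 0) > 0), # 7) statuses count > 0
        int(bool(profile.get("url"))),             # 8) URL in profile
        int(bool(profile.get("verified"))),        # 9) verified account
        followers,
        friends,
    ]


example = {
    "description": "Journalist. https://example.com",
    "url": None,
    "friends_count": 1500,
    "followers_count": 12,
    "geo_enabled": False,
    "listed_count": 0,
    "statuses_count": 40,
    "verified": False,
}
print(bot_features(example))
```

Vectors of this form could then be split 70:30 and passed to any logistic regression implementation; the training step is omitted here to keep the sketch dependency-free.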
C. Hashtag Classification
Figure 2 shows the architecture diagram of Manipify. To
build the hashtag classification module, we leverage the lexical
features of tweets for the classification of hashtags, as done by
Jeon et al. [48]. First, during the pre-processing of tweet
text, we remove all symbols, hashtags, and URLs. Next, we
extract 1,3-ngrams utilising the statistical technique of Term
Frequency Inverse Document Frequency (TF-IDF) from all
pre-processed tweets related to a hashtag. Using the extracted
TF-IDF features, we train a Logistic Regression classifier with
default parameters for classification. It is important to mention
that we train separate TF-IDF vectorizers and models for
English and Urdu hashtags. Moreover, we train a series of
binary classifiers, one for each hashtag category, using the
‘one-vs-all’ approach due to its better performance compared to
multi-class classifiers [62], [63]. We assign to the hashtag the
category of the classifier with the highest probability. The
hashtag is assigned the label ‘Other’ if the binary classifiers of
all categories give a probability less than 0.5. Moreover, we
need to determine the minimum number of tweets required to
accurately classify the topic of a hashtag. In this regard, we
use 100 as the minimum number of tweets required to classify
a hashtag, in accordance with the literature [48], [64].
Figure 2: Research methodology – Flow diagram
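The final decision rule of the one-vs-all scheme can be sketched independently of the classifiers themselves. In this illustration the per-category probabilities are taken as given; the function name `assign_category` is hypothetical, and returning `None` for hashtags with fewer than 100 tweets is an assumption about how such hashtags are handled.

```python
MIN_TWEETS = 100   # minimum number of tweets per hashtag, as stated in the text
THRESHOLD = 0.5    # if every binary classifier is below this, label 'Other'


def assign_category(probabilities, n_tweets):
    """Pick a hashtag category from one-vs-all classifier outputs.

    probabilities: dict mapping category name to the probability given
    by that category's binary classifier.
    """
    if n_tweets < MIN_TWEETS:
        return None  # assumption: too few tweets, hashtag is not classified
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] < THRESHOLD:
        return "Other"
    return best


scores = {"Political": 0.81, "Sports": 0.10, "Religious": 0.22,
          "Campaign": 0.47, "Entertainment": 0.05, "Military": 0.12}
print(assign_category(scores, n_tweets=250))
```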
D. Trend Analyzer
Next, we turn our attention to the analyzer module which
is designed to perform an in-depth analysis of trending hash-
tags and users. First, the module focuses on analyzing the
content of hashtags by calculating the distribution of local
trends, natural language, and hashtag reach. The Twitter trending
panel often contains hashtags related to both local and global
topics. Hashtags related to global topics, such as #MUFC and
#CristianoRonaldo, are discussed all around the world and
are not specifically related to local topics. We
limit the scope of Manipify to local trending hashtags
for two reasons. First,
global hashtags contain tweets in multiple languages. Second,
these hashtags are adopted internationally and their meaning
varies with the cultural context of users [65].
These features make the understanding and
classification of global hashtags a challenging task. Therefore,
we propose a classification model, PK-Hash-local, to identify
local trends for analysis. For classification, the following
meta-information related to a hashtag is fetched: i) trending
time of the hashtag in the target country, ii) trending time of
the hashtag in other countries, iii) list of countries where the
hashtag is seen in the trending panel, and iv) whether the hashtag
has trended worldwide. We use this meta-information to derive
three features for the classification. The “1st trend” feature
identifies the country where a hashtag first
trended. The hashtag is likely to be related to a local topic
if it is first seen in the trending panel of the target country.
Similarly, the “number of countries trended” feature counts
the number of locations where the hashtag is seen in the
trending panel. The higher this count, the less likely
the hashtag is to be a local trend. Finally, the
binary “Worldwide trended” feature indicates
whether the hashtag has trended worldwide. Table VII shows sample
labelled hashtags with features from PK-Trends. We
use these meta-information features to train a decision
tree classifier.
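The feature derivation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the input layout (a list of location and first-seen timestamp pairs), the "Worldwide" pseudo-location, and the choice to count only countries other than the target are all assumptions of this sketch, and the sample sightings are hypothetical.

```python
from datetime import datetime

def derive_local_features(sightings, target="Pakistan"):
    """Derive the three PK-Hash-local style features for one hashtag.

    sightings: list of (location, first_seen) tuples. The pseudo-location
    "Worldwide" marks a worldwide trend (an assumption of this sketch).
    At least one real country is expected in the list.
    """
    countries = [(loc, ts) for loc, ts in sightings if loc != "Worldwide"]
    # "1st trend": the country whose trending panel showed the hashtag earliest.
    first_trend = min(countries, key=lambda s: s[1])[0]
    # "Number of countries trended": counted here as countries other than the target.
    others = {loc for loc, _ in countries if loc != target}
    # "Worldwide trended": binary flag.
    worldwide = any(loc == "Worldwide" for loc, _ in sightings)
    return {"first_trend": first_trend,
            "num_countries": len(others),
            "worldwide": worldwide}

# Hypothetical sightings for a hashtag first seen in Pakistan.
example = [("Pakistan", datetime(2021, 1, 26, 2)),
           ("India", datetime(2021, 1, 26, 4))]
features = derive_local_features(example)
```

The resulting dictionary would then be encoded numerically and passed to a decision tree learner together with the manual local/global labels.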
To study the natural language of tweets related to a hashtag,
we use the tweet language meta-information provided
by Twitter to calculate the language distribution, as described
in Section III. Finally, the analyzer calculates the reach of a
hashtag, which estimates the number of users
who have potentially viewed the hashtag [66]. It also
approximates the extent to which a trending hashtag
can affect the Twitter community. For instance, a hashtag
used in the tweet of a celebrity with millions of followers has
a greater effect on the Twitter community than one used by a
regular user. The possible reach of a hashtag is computed by
summing the number of followers of all users who tweeted
the hashtag. To calculate the reach (Reach) of a hashtag (H)
tweeted by n unique users, Equation 2 is used.
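Table VI: Hashtag classification results on HT-Dat dataset.
| Dataset | Measure | Political | Sports | Religious | Campaign | Entertainment | Military | Other |
|---|---|---|---|---|---|---|---|---|
| English | Precision | 0.82 | 0.93 | 0.96 | 0.78 | 0.90 | 0.97 | 0.78 |
| English | Recall | 0.82 | 0.92 | 0.92 | 0.72 | 0.88 | 0.97 | 0.71 |
| English | F1-score | 0.82 | 0.93 | 0.94 | 0.76 | 0.89 | 0.97 | 0.66 |
| Urdu | Precision | 0.83 | 0.91 | 0.95 | 0.78 | 0.80 | 0.89 | 0.63 |
| Urdu | Recall | 0.82 | 0.83 | 0.92 | 0.72 | 0.65 | 0.79 | 0.61 |
| Urdu | F1-score | 0.82 | 0.85 | 0.93 | 0.74 | 0.65 | 0.81 | 0.58 |
| English-Urdu | Precision | 0.88 | 0.92 | 0.94 | 0.85 | 0.76 | 0.83 | 0.38 |
| English-Urdu | Recall | 0.91 | 0.89 | 0.85 | 0.72 | 0.83 | 0.86 | 0.44 |
| English-Urdu | F1-score | 0.89 | 0.90 | 0.89 | 0.78 | 0.79 | 0.84 | 0.41 |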
Table VII: Sample of local and global trending hashtags.
| Sr# | Hashtag | Trending date | Trending hours | 1st trend | # countries | Worldwide | Local |
|---|---|---|---|---|---|---|---|
| 1 | #SamsungPakistan | 24-01-2021 | 14:00-15:00 | Pakistan | 0 | False | True |
| 2 | #Fajr | 26-01-2021 | 02:00-05:00 | Pakistan | 1 | False | True |
| 3 | #MotivationalQuotes | 25-01-2021 | 02:00-05:00 | Pakistan | 2 | False | True |
| 4 | #POTUS | 21-01-2021 | 00:00-04:00 | Australia | 12 | False | False |
| 5 | #UFC257 | 24-01-2021 | 02:00-12:00 | Japan | 58 | True | False |
| 6 | #T10League | 28-01-2021 | 16:00-17:00 | India | 2 | False | False |
Table VIII: Classification performance of manipulator and bot detection.
| Sr. | Class | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|---|
| 1 | Manipulators | 0.90 | 0.92 | 0.91 | 0.91 |
| 2 | Organic | 0.92 | 0.89 | 0.90 | 0.91 |
| | Total | 0.91 | 0.91 | 0.91 | 0.91 |
| 3 | Human | 0.79 | 0.82 | 0.81 | 0.86 |
| 4 | Bots | 0.89 | 0.88 | 0.89 | 0.86 |
| | Total | 0.84 | 0.85 | 0.85 | 0.86 |
$$\mathrm{Reach}_H = n + \sum_{i=1}^{n} \mathrm{Followers}_i \qquad (2)$$
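As a minimal sketch of Equation 2, the reach can be computed from a list of tweets once each user is counted only once. The function name and the input layout (pairs of user id and follower count) are assumptions made for illustration, and the sample values are hypothetical.

```python
def hashtag_reach(tweets):
    """Reach of a hashtag following Equation 2: n + sum of follower counts.

    tweets: iterable of (user_id, follower_count) pairs, one per tweet.
    A user who tweets several times is counted once.
    """
    followers = {}
    for user_id, follower_count in tweets:
        followers[user_id] = follower_count  # keep one entry per unique user
    n = len(followers)                       # number of unique users
    return n + sum(followers.values())

# Hypothetical tweets: user "a" posts twice but contributes once.
sample = [("a", 1000), ("b", 50), ("a", 1000), ("c", 0)]
reach = hashtag_reach(sample)  # 3 unique users + 1050 followers
```

Because the measure sums follower counts, a single account with millions of followers can dominate the reach of a hashtag.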
Furthermore, the trend analyzer analyses
users by studying the behaviour of bots and manipulators.
For user analysis, first, we detect manipulators and bots
to investigate their characteristics. Next, their distribution is
analysed across the different hashtag categories. In addition, a time series
analysis is presented for a sample of hashtags to scrutinize the
behaviour of bots, humans, manipulators and organic users.
V. MANIPIFY EVALUATION
In this section, first, we describe the performance of the
manipulator and bot classifiers. Next, we dive into the detailed
results of hashtag classification. Finally, we discuss the
PK-Hash-local classifier used to create a dataset of hashtags
related to Pakistan only.
A. Manipulator Detection
Table VIII presents the results of manipulator detection.
The classifier achieves an overall accuracy and F1-score of
0.91 each. Manipulators are classified with a slightly
higher F1-score (0.91) than organic users (0.90).
The examination of weights assigned to the classification
features reveals that the Tweets_before feature contributes
the most. This feature is evidence of manipulation, as
our manual analysis shows that manipulators are more active
before the trend time of a hashtag. The second-highest
weight is assigned to the Sim_score feature, as
large similarity between the tweets of a user indicates manipulative
behaviour [61]. On the other hand, the most informative
features for classifying organic users are Time_after and Time_before.
Manipulators tend to post tweets with the intent to increase
volume and velocity, as these are the key factors for
determining a trend [51], whereas organic users do not
show any pattern in the average time between their consecutive
tweets.
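To make these behavioural features concrete, the sketch below computes them for a single user and hashtag. The exact definitions used by Manipify are not restated in this section, so the ones here are assumptions: Tweets_before is the count of tweets before the trend time, Time_before and Time_after are mean gaps (in seconds) between consecutive tweets, and Sim_score is approximated by the mean pairwise Jaccard similarity of token sets. All names and sample tweets are hypothetical.

```python
from datetime import datetime
from itertools import combinations

def mean_gap_seconds(times):
    """Average time between consecutive tweets, or None with fewer than two tweets."""
    times = sorted(times)
    if len(times) < 2:
        return None
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

def jaccard(a, b):
    """Token-set Jaccard similarity (a stand-in for the paper's similarity score)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def user_features(tweets, trend_time):
    """tweets: list of (timestamp, text) pairs posted by one user on one hashtag."""
    before = [ts for ts, _ in tweets if ts < trend_time]
    after = [ts for ts, _ in tweets if ts >= trend_time]
    pairs = list(combinations([text for _, text in tweets], 2))
    sim = sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return {"tweets_before": len(before),
            "time_before": mean_gap_seconds(before),
            "time_after": mean_gap_seconds(after),
            "sim_score": sim}

# Hypothetical user: three near-identical tweets in quick succession before the trend.
trend = datetime(2021, 1, 10, 11, 0)
posts = [(datetime(2021, 1, 10, 10, 0), "vote now #tag"),
         (datetime(2021, 1, 10, 10, 1), "vote now #tag"),
         (datetime(2021, 1, 10, 10, 2), "vote today #tag"),
         (datetime(2021, 1, 10, 12, 30), "hello")]
feats = user_features(posts, trend)
```

A burst of similar tweets before the trend time yields a high tweets_before, a small time_before, and a raised sim_score, which is the pattern the text associates with manipulators.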
B. BotCat
Table VIII shows the classification performance of
BotCat on the BT-Dat dataset. The bot classifier
achieves 0.86 accuracy with precision,
recall, and F1-score of 0.84, 0.85 and 0.85, respectively. The
0.89 precision for the bot class, compared to 0.79
for the human class, indicates that the classifier identifies bot
accounts more accurately than human accounts. To
investigate the reason for the lower performance of BotCat
on human accounts, we analyze the weights assigned by the
trained classifier to the classification features. We observe that,
for the bot class, the classifier assigns the highest weight to the
feature follower count <30, indicating that users with fewer
than 30 followers are more likely to be bot accounts. Similarly, the
user verified feature is assigned the highest weight for
human accounts. In addition, we analyze bot accounts and
observe that a few bot accounts have a fake name, profile
picture, and description. Moreover, these accounts
replicate the behaviour of human users, which results in the
misclassification of these users as human accounts [67].
Figure 3: Percentage of hashtags for each category – PK-Trends-Local dataset.
C. Hashtag Classification
We train and evaluate the hashtag classifiers for English
and Urdu hashtags separately. Table VI gives the detailed
classification results. First, we discuss the performance of the
classifier for English hashtags. The English hashtag classifier
achieves an accuracy of 0.84 with a 0.70 F1-score.
We observe the highest F1-score of 0.97 for the military
hashtags. The sports and religious hashtags have
comparable F1-scores of 0.93 and 0.94, whereas the Other
category achieves the lowest F1-score of 0.66. Finally,
the political, campaign and entertainment hashtags have F1-scores
of 0.82, 0.76 and 0.89, respectively. The low F1-score for
some categories is due to the high lexical diversity in samples
of these classes. In addition, the number of training samples
also affects the classification performance of the binary classifiers.
Next, we turn to the classification
results of Urdu hashtags. The Urdu hashtag classifier attains
0.79 accuracy and 0.63 F1-score. Here, the religious hashtags
are classified with the highest F1-score of 0.93. Similar to the
English hashtags, the Other hashtags are classified with the
lowest F1-score of 0.58. The entertainment hashtags also
achieve a low F1-score of 0.65. Moreover, the political,
sports, campaign and military categories attain F1-scores of
0.82, 0.85, 0.74 and 0.81, respectively. Overall, we observe lower
F1-scores for the Urdu binary classifiers than for the
English classifiers. We attribute this result to the lower number
of samples in the training data for Urdu hashtags.
The classification of English-Urdu hashtags with our framework
raises an interesting question: which classifier
(English or Urdu) should be used for such hashtags? In this
regard, first, we classify the hashtag with both the English and
Urdu classifiers using tweets of the respective languages. Next,
we compare the probabilities for each class assigned by both
classifiers. Finally, we assign the label of the category with
the highest probability assigned by either classifier. Using
this approach, the English-Urdu classifier achieves
0.84 accuracy and 0.79 F1-score. We notice the
best performance for the sports hashtags with 0.87 F1-score.
The religious and political category hashtags have comparable
performance with 0.89 F1-score each, whereas the campaign,
entertainment and military hashtags are classified with 0.78,
0.79 and 0.84 F1-scores, respectively. Finally, the Other
category hashtags are classified with the lowest F1-score of 0.41.
We notice that this approach achieves better performance
than the Urdu language classifier.
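The label selection for English-Urdu hashtags can be sketched as below. The function name, the dictionary layout of class probabilities, and the tie-breaking rule (English wins a tie) are assumptions of this illustration; the probabilities shown are hypothetical.

```python
def combine_predictions(probs_en, probs_ur):
    """Pick the category with the highest probability from either classifier.

    probs_en, probs_ur: dicts mapping category name to predicted probability,
    produced by the English and Urdu classifiers respectively.
    Returns (category, probability, source_classifier).
    """
    best_en = max(probs_en.items(), key=lambda kv: kv[1])
    best_ur = max(probs_ur.items(), key=lambda kv: kv[1])
    # Ties are resolved in favour of the English classifier (an arbitrary choice).
    if best_en[1] >= best_ur[1]:
        return best_en[0], best_en[1], "english"
    return best_ur[0], best_ur[1], "urdu"

# Hypothetical outputs for one mixed-language hashtag.
english_probs = {"political": 0.55, "sports": 0.30, "other": 0.15}
urdu_probs = {"political": 0.20, "religious": 0.70, "other": 0.10}
label = combine_predictions(english_probs, urdu_probs)
```

Here the Urdu classifier is more confident, so its top category is assigned to the hashtag.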
D. PK-Trends-Local
Leveraging the PK-Hash-local classifier, we create the
PK-Trends-Local dataset containing the trends related to Pakistan.
To classify ‘global’ and ‘local’ trends, first, we
create a labelled dataset by manually labelling the 193
hashtags of the PK-Jan-21 dataset into local and global
categories. With manual classification, 141 (73%) hashtags
are labelled as local while 52 (27%) are labelled as global
hashtags. Next, the meta-information related to hashtags in
the PK-Trends dataset is fetched from GetDayTrends. Using
the hashtag features (explained in Section IV) of the labelled
dataset and a split ratio of 70:30, the classifier achieves an
accuracy of 0.97. Next, we classify the trends of PK-Nov-20 and
PK-Dec-20 using the trained classifier and identify 231 and
161 local hashtags in PK-Nov-20 and PK-Dec-20, respectively.
Table V shows the distribution of local and global trends in
PK-Trends.
VI. PK-TRENDS ANALYSIS
In this section, first, we classify and analyse the content of
hashtags in the PK-Trends-Local dataset. Next, we identify the
malicious users for each hashtag and discuss their distribution
in PK-Trends-Local. Finally, we present the category-wise
analysis of users.
Figure 4: Per day distribution of trends and tweets for each category – PK-Trends-Local dataset.
A. Content Analysis
To begin the analysis, first, we classify the hashtags
in PK-Trends-Local. This is done in order to conduct the user
analysis for various hashtag categories. Figure 3 shows the
distribution of hashtags and tweets related to each category in
the PK-Trends-Local dataset. The classification results show
that 32-40% of hashtags belong to the political category. This
result highlights the interest of the general public in politics.
In addition, 15-32% of Twitter trending hashtags belong to
the campaign category. Interestingly, 8-40% of tweets belong
to this category, showing the promotional efforts of users for
campaign hashtags. Moreover, only 2-11% of hashtags belong to
the sports, 8-10% to the religious, 5-15% to the entertainment,
and 1-7% to the military category. Also, the ‘other’ category
contains less than 15% of hashtags in the three datasets of
PK-Trends-Local, showing the large coverage of Manipify for
analysing the trending panel with six pre-determined categories.
Zooming into the detailed analysis, Figure 4 provides the
category-wise distribution of hashtags for each day. We note
that the distribution of hashtag categories is intermittent due
to the influence of real-world events on the trending panel. For
instance, on the anniversary of the shooting at Army Public
School (APS) in Pakistan on 16 December, 9 (35%) hashtags
related to military class are seen in the trending panel.
Next, we analyze the distribution of natural languages of
tweets, hashtag reach, and sentiments for different hashtag
categories. As Manipify processes data in the Urdu and
English languages only, we inspect the ratio of tweets in
English, Urdu, unknown, and other languages as described
in Section III. Figure 5(a) shows the percentage of tweets
in each language for the seven categories in PK-Nov-20.
We notice that 15-75% of tweets are posted in the English
language. Also, the political and religious categories contain
56% and 60% tweets in the Urdu language, respectively. On
the other hand, the sports category contains 77% while the
entertainment category contains 50% English tweets. Moreover,
60-80% of the tweets in the dataset are posted in English or Urdu,
showing that the Manipify framework analyzes the
predominant part of the tweets related to the trending panel. The
PK-Dec-20 and PK-Jan-21 datasets show a similar pattern
for language distribution. From these results, we conclude that
users prefer the local language, Urdu, to discuss topics
related to the political and religious categories. In addition, the
sports and entertainment categories contain a high percentage of
English tweets because hashtags in these categories are discussed
by international users as well.
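The language distribution for one category can be sketched with a simple counter over the per-tweet language codes. The codes "en", "ur" and "und" (undetermined) follow the convention of Twitter's tweet language field; grouping everything else under "other" mirrors the four groups named above. The function name and the sample codes are hypothetical.

```python
from collections import Counter

def language_distribution(lang_codes, tracked=("en", "ur", "und")):
    """Percentage of tweets per language group.

    lang_codes: iterable of per-tweet language codes.
    Codes outside `tracked` are grouped under "other".
    """
    counts = Counter(code if code in tracked else "other" for code in lang_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: 100.0 * count / total for group, count in counts.items()}

# Hypothetical language codes for the tweets of one hashtag category.
codes = ["en", "ur", "ur", "und", "hi", "en", "ur", "ur"]
distribution = language_distribution(codes)
```

Summing the "en" and "ur" shares gives the portion of tweets that the English and Urdu pipelines can actually process.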
Figure 5(b) shows the average reach of hashtags related
to each category. We observe that the sports category has the
maximum reach with a limited number of hashtags and tweets,
as shown in Figure 3. This result suggests that the substantial
reach of the sports category is attributable to the usage of
such hashtags by international celebrities. For example, the
hashtags #PakvsSA and #NZvPAK are used by international
cricket players referring to the cricket matches of Pakistan versus
South Africa and New Zealand versus Pakistan, respectively.
Moreover, the religious and campaign category hashtags have
a lower value for reach. These results suggest
that religious category hashtags are generally used
by regular Twitter users rather than celebrity users. Also, the
campaign category hashtags reach a limited audience
within Pakistan.
Figure 5: Category distribution according to language and average reach. (a) Category-wise language distribution of tweets – PK-Nov-20 dataset. (b) Average reach of trends for each category – PK-Trends-Local dataset.
Figure 6: Percentage of bots, humans and manipulators – PK-Trends-Local dataset.
For sentiment analysis of hashtags in PK-Trends-Local,
we manually label the sentiment of all hashtags. Annotators
assign the label positive, negative, or neutral by analyzing
the text of the hashtag only. After labelling, we notice that
41% of hashtags express positive and 23% express negative
sentiment in the dataset. Furthermore, we investigate the
category-wise distribution and find that 40% of the hashtags containing
negative sentiment belong to the campaign category and 36%
belong to the political category. Besides sentiment labelling,
the annotators are also asked to label the political hashtags
into four classes: 1) faction, 2) personality, 3) slogan, and 4)
general discussion. The ‘faction’ class represents political
parties like #PTI, while hashtags discussing political
persons such as #ImranKhan are marked as ‘personalities’.
Similarly, slogans of political entities, e.g., #JeetKaNishanTeer,
are organized into ‘slogans’. Finally, the remaining political
hashtags are labelled as ‘general’ discussion. Interestingly,
all political slogans are labelled as having positive
sentiment. In addition, we observe negative sentiment in 20%,
37%, and 26% of political hashtags related to factions, personalities,
and general discussions, respectively. From these results, we
draw three conclusions. First, the political and
campaign category hashtags are used for negative campaigning
and mudslinging, as highlighted by the large percentage
of negative sentiment hashtags in these categories. Second,
hashtags related to political slogans are generated by political
factions themselves for their promotion; therefore, they carry
positive sentiment. Finally, hashtags related to factions and
personalities are used by political rivals to create political
polarization on online social media.
B. User Analysis
We initiate the user analysis by exploring the behaviour of
users in PK-Trends-Local to determine the patterns of
manipulation. Figure 6 shows the distribution of manipulators and
organic users in PK-Trends-Local. Interestingly, PK-Nov-20,
PK-Dec-20, and PK-Jan-21 contain 4.5, 4, and 5% manipulators,
respectively. Exploring the manipulators further, Figure 7
shows the category-wise distribution of manipulators and organic users.
Moreover, the percentage of tweets posted by these users
is also provided. We observe the presence of only 15%
manipulating users in sports hashtags. This result is anticipated
because sports hashtags like #PakvsSA trend during
the real-world event of a cricket match. However, a higher
percentage of manipulators is observed in the political and
entertainment hashtags with 5-10% and 8-12% manipulators,
respectively. On the contrary, the religious, campaign, and
military categories contain only 2-8% manipulators. However,
focusing on the tweets, we see that a very low percentage of
manipulative users posts a much higher percentage of tweets. To
sum up the results, we conclude that the higher percentage
of manipulators in the political hashtags is due to targeted
mudslinging generated by rival political factions [68]. The
manipulators generate fake trends of such political hashtags
to increase the exposure of their content to Twitter users. In
addition, the trends of entertainment hashtags are generated
with pre-planned coordinated efforts to promote TV shows,
movies and music.
Figure 7: Category-wise percentage of users and tweets of bots, humans and manipulators – PK-Trends-Local dataset.
Figure 6 shows that PK-Nov-20, PK-Dec-20, and PK-Jan-21
contain 50, 52, and 64% bot accounts, respectively. In addition,
bot accounts are highly suspected to play a key role in the
manipulation of trending hashtags [69]. Therefore, we
further investigate the users identified as both bots and
manipulators. Figure 6 shows the percentage of such users.
We observe that 2.1%, 2.0%, and 2.7% of accounts are identified as
both bots and manipulators in PK-Nov-20, PK-Dec-20, and
PK-Jan-21, respectively. Interestingly, the percentage of bot
users involved in manipulation is consistent across all three datasets,
whereas the overall percentage of bots is lowest in PK-Nov-20
and highest in PK-Jan-21. This result is expected because
accounts generating automated activity are suspended by
Twitter [10]. Since the data for all three
datasets is fetched in February 2021, the tweets posted by
such deactivated or suspended accounts are not fetched in
PK-Trends-Local.
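The overlap between the two classifier outputs reduces to a set intersection over user identifiers. The sketch below assumes each classifier yields a set of flagged user ids; the function name and the toy population are hypothetical and do not reproduce the paper's data.

```python
def overlap_percentages(all_users, bots, manipulators):
    """Share of bots, manipulators, and users flagged as both, in percent.

    all_users, bots, manipulators: sets of user ids; the last two are
    assumed to be subsets of all_users.
    """
    total = len(all_users)
    if total == 0:
        raise ValueError("no users to analyse")
    both = bots & manipulators  # users flagged by both classifiers
    return {"bots": 100.0 * len(bots) / total,
            "manipulators": 100.0 * len(manipulators) / total,
            "bot_manipulators": 100.0 * len(both) / total}

# Hypothetical population of 200 users.
users = set(range(200))
bot_ids = set(range(100))               # 100 users flagged as bots
manipulator_ids = set(range(96, 105))   # 9 users flagged as manipulators
shares = overlap_percentages(users, bot_ids, manipulator_ids)
```

Comparing the "bot_manipulators" share with the "manipulators" share indicates how much of the manipulation is attributed to automated accounts rather than humans.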
Furthermore, Figure 7 also shows the category-wise percentage
distribution of bot and human users along with the tweets
posted by these users. We notice that the campaign category
has the largest percentage of bots with 60-78% bots. Similarly,
the sports hashtags contain 45-50% bots. In addition, 42-50% of
tweets related to political hashtags are created by bot users.
From these results, we conclude that bot accounts are used for
the promotion of campaign and entertainment hashtags [70].
Moreover, sports hashtags like #PakvsSA are used by bot
accounts to provide live updates related to the match. Interestingly,
for political hashtags, bot accounts are used for promotion
as well as to provide live updates related to political activity.
For instance, the political hashtag #JeetKaNishanTeer is
promoted using bot accounts with political motives, while the
hashtag #GBElection2020 is used to provide live updates
regarding the by-election.
To provide an in-depth analysis of PK-Trends-Local,
Figure 8 shows the time series plots of six hashtags related to the
political, sports, campaign, entertainment, military, and ‘Other’
categories. In particular, the time-series plots of the number of
tweets posted by bots and humans as well as by manipulators
and organic users are presented, while the shaded region
highlights the time a hashtag is part of the trending panel.
First, focusing on manipulators, we observe that the political
hashtag #JeetKaNishanTeer contains more tweets posted
by manipulators before the trend time, while organic users
discuss the hashtag after the trending time, highlighting that
manipulators limit their activity after pushing the hashtag into
the trending panel. The military hashtag #APSMartyrsDay
shows a similar pattern. Similarly, the entertainment hashtag
#BB14 has a higher number of tweets by manipulators before
the hashtag is seen in the trending panel. In addition, the
manipulators remain active after the hashtag is seen in the
trending panel. For the entertainment category, this result
highlights that manipulators not only trend the hashtags but also put
effort into disseminating content related to the hashtag. However, this
observation does not hold for the sports hashtags. The sports
hashtag #pakvsa contains very few tweets from manipulators.
Similarly, only a limited number of tweets related to the #covidvaccine
and #JusticeForStudents hashtags are posted by manipulators.
This lower percentage of tweets generated by manipulators
shows that these topics are actually discussed by organic
Twitter users.
Figure 8: Time series of selected hashtags (highlighted regions represent the time in trending).
We further analyse the time-series plots for the tweets created
by bot and human users. For example, the time-series plot
for the hashtag #covidvaccine from the ‘Other’ category
shows that this trend is largely used by human accounts,
while bot accounts have little participation in generating
the content for this hashtag. In addition, the sports hashtag
#pakvsa is used by bots after being initiated by human users.
Similarly, the hashtag #JeetKaNishanTeer from the political
category has a small number of bot accounts. In contrast, the
time-series of the campaign hashtag #JusticeForStudents shows
different behaviour. The content of this hashtag is mostly
posted by bot accounts, and humans join the discussion later,
after seeing it in the trending panel. The hashtag
#APSMartyrsDay from the military category has a longer time-span (24
hours) in the trending panel, with 10-50% of its tweets posted by
bot users. Finally, the entertainment hashtag #BB14 shows a
large number of tweets by bots. Interestingly, this hashtag also
contains a large number of tweets generated by manipulators.
From this result, we conclude that the manipulation is performed
by bot accounts for the entertainment category.
From the analysis of sample hashtags, we draw three
observations. First, the behaviour of manipulator and bot
accounts varies for each category, emphasizing that
generalized patterns cannot be observed among different categories.
Second, the entertainment and campaign hashtags rely on
bot accounts for the manipulation, while political hashtags
are manipulated by human users. Finally, for entertainment
hashtags, manipulators not only push the trend into the
trending panel but also continue generating content to keep
users engaged.
VII. CASE STUDY
Figure 2 shows the applications of our framework. In this
regard, this section presents two real-world case studies to
show the efficacy of Manipify. First, we identify and analyze
hashtag-wars in the trending panel of Pakistan using the PK-Trends
datasets. Next, we examine a fake trend spreading anti-state
propaganda on Twitter.
Table IX: Hashtag pairs generated by political rivals on Twitter.
| Pair-Hashtag | Trend Time | Bots %Users | Bots %Tweets | Manipulators %Users | Manipulators %Tweets | Bot Manipulators %Users | Bot Manipulators %Tweets |
|---|---|---|---|---|---|---|---|
| 1-ORIG | 2020-12-14 06:00 | 58 | 52 | 1 | 8 | 1 | 2 |
| 1-RESP | 2020-12-14 10:00 | 52 | 47 | 1 | 9 | 0.4 | 2 |
| 2-ORIG | 2020-12-20 14:00 | 52 | 42 | 10 | 38 | 4 | 15 |
| 2-RESP | 2020-12-20 18:00 | 56 | 57 | 11 | 51 | 5 | 29 |
| 3-ORIG | 2021-01-28 13:00 | 48 | 46 | 9 | 39 | 3 | 17 |
| 3-RESP | 2021-01-28 19:00 | 56 | 48 | 9 | 55 | 4 | 24 |
| 4-ORIG | 2021-01-24 13:00 | 53 | 47 | 5 | 22 | 2 | 10 |
| 4-RESP | 2021-01-24 17:00 | 56 | 55 | 4 | 25 | 2 | 17 |
A. Hashtag wars
A hashtag war is the phenomenon of disseminating a hegemonic
narrative on a social media platform to sabotage the
visibility of conversations with opposing views. In this regard,
we manually analyze trending hashtags in the PK-Trends dataset
and notice that a few hashtags trending at the same time contain
the opposing narratives of rival parties. Table IX shows four
pairs of trending hashtags seen at the same time in the trending
panel. Each pair contains hashtags with opposite narratives
related to two major political parties of Pakistan, Pakistan Tehreek-e-Insaf
(PTI) and Pakistan Muslim League (PMLN). The hashtag
that appears first in the trending panel is the original (ORIG)
hashtag and the one that appears later is the response (RESP) hashtag.
We initiate the analysis of these hashtags by comparing the
trending time of the ‘ORIG’ hashtag with the time when the
first tweet of the ‘RESP’ hashtag is posted. We notice that, for each
pair, the first tweet of the ‘RESP’ hashtag is posted after the ‘ORIG’
hashtag is seen in the trending panel. This analysis indicates
that the trend of the ‘RESP’ hashtag is deliberately created by
manipulation with the aim of spreading the opposing political
narrative. In addition to manipulators, we also identify three
types of highly active users posting content related to both
hashtags in a hashtag-war. First, a few users belong to online
trend services and periodically post all current trends in their
tweets. Second, we discover news channel and journalist
accounts that actively post content related to trending hashtags.
Finally, users also post memes using both trending hashtags
for maximum visibility and reach.
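The timing check used to identify a response hashtag can be sketched as follows. The function name is hypothetical, the ORIG trend time is taken from pair 1 in Table IX, and the RESP tweet timestamps are invented for illustration only.

```python
from datetime import datetime

def response_lag(orig_trend_time, resp_tweet_times):
    """Delay between ORIG entering the trending panel and the first RESP tweet.

    Returns a timedelta when the first RESP tweet comes after the ORIG trend
    time, or None when RESP tweets already existed before ORIG trended.
    """
    first_resp = min(resp_tweet_times)
    if first_resp <= orig_trend_time:
        return None
    return first_resp - orig_trend_time

orig_trend = datetime(2020, 12, 14, 6, 0)       # trend time of 1-ORIG (Table IX)
resp_tweets = [datetime(2020, 12, 14, 8, 5),    # hypothetical RESP tweet timestamps
               datetime(2020, 12, 14, 7, 30)]
lag = response_lag(orig_trend, resp_tweets)
```

A non-None lag for every pair corresponds to the observation above that each RESP hashtag only starts after its ORIG hashtag is already trending.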
B. Anti-state Propaganda
Recently, squads of manipulators have been identified creating fake
trends to malign the reputation of different countries [71]. The
identification of such users is a prime real-world application
of our research. In this regard, we notice that the trend of the
hashtag #SanctionPakistan is created to damage the reputation
of Pakistan worldwide. This trend also caught the attention
of the Pakistani government, and a detailed analysis report
was issued by the government highlighting the key factors in
creating this trend [72]. This provides us with an excellent
opportunity to compare the results of our user classifier module
with the statistics shared in the government report. According to
the report, coordinated efforts were made to spread anti-Pakistan
propaganda using this hashtag [73].
This hashtag trended in Pakistan on Aug 9, 2021. On Aug 12,
2021, we fetch all tweets for #SanctionPakistan posted before
the date of data collection using Twint [59]. We collected
a total of 113,855 original tweets posted by 23,012 unique
users related to this hashtag. Using our manipulator classifier,
we identify 800 (3.5%) users involved in manipulation. These
users posted 17,425 (15.3%) tweets. Figure 9 shows the
distribution of manipulators and organic users along with the
tweets posted by these users with a time bin of one hour. We
also plot the average number of tweets per user (Tweets_user)
for each hour. Moreover, the trending period of the hashtag
is highlighted as a shaded area. We observe that the volume
of tweets by organic users increases enormously after the
trending time of the hashtag. Moreover, we notice that Tweets_user
for manipulators lies in the range of 7 to 75, while for
organic users Tweets_user varies from 1 to 5. Surprisingly,
we observe 35-468 tweets per hour before the hashtag is
seen in the trending panel. This volume is explained by
manipulators posting a large number of tweets before
the trending time. For in-depth analysis, we manually
analyze and compare the tweets of the manipulators identified in
the report [73]. In the report, only the top 9 contributors for the
#SanctionPakistan hashtag are presented on the basis of total
tweet and retweet count. We observe that 2 of these users have been
removed from Twitter. Therefore, our collected data of 113K
tweets does not contain the tweets of the removed users. This
result provides two insightful findings. First, Manipify needs
complete data for precise performance, which can be obtained
by collecting tweets in real-time. Passive collection of data
results in the loss of valuable data. Second, deletion of tweets is an
important feature that can be included in Manipify to enhance
its performance. However, in the presence of complete data, we
believe that Manipify can automatically detect a large portion
of manipulators using user behaviour and content features.
In contrast, manual analysis relying only on the number of
tweets/retweets is a time-consuming and inefficient approach
to detect all users manipulating the trend. Moreover, Manipify
can be adapted to detect manipulators posting content in any
natural language.
Figure 9: Time series of #SanctionPakistan (Trending period is highlighted).
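The hourly binning behind such a time series, together with the average number of tweets per user in each bin, can be sketched as below. The function name and the sample tweets are hypothetical; the final two lines only re-derive the percentages reported above from the stated counts.

```python
from collections import defaultdict
from datetime import datetime

def hourly_tweets_per_user(tweets):
    """Bin tweets into one-hour windows.

    tweets: iterable of (timestamp, user_id) pairs for one user group,
    e.g. manipulators or organic users.
    Returns {hour_start: (tweet_count, unique_users, tweets_per_user)}.
    """
    counts = defaultdict(int)
    users = defaultdict(set)
    for ts, user_id in tweets:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
        users[hour].add(user_id)
    return {h: (counts[h], len(users[h]), counts[h] / len(users[h]))
            for h in sorted(counts)}

# Hypothetical tweets from one user group.
sample = [(datetime(2021, 8, 9, 10, 5), "u1"),
          (datetime(2021, 8, 9, 10, 20), "u1"),
          (datetime(2021, 8, 9, 10, 40), "u2"),
          (datetime(2021, 8, 9, 11, 10), "u3")]
series = hourly_tweets_per_user(sample)

# Shares reported in the text, recomputed from the stated counts.
manipulator_share = round(100 * 800 / 23012, 1)     # percent of users
manipulator_tweets = round(100 * 17425 / 113855, 1) # percent of tweets
```

Running the function separately on the tweets of manipulators and of organic users yields the two curves that are compared around the trending period.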
VIII. CONCLUSION
The dynamic behaviour of manipulators on the Twitter
platform makes the automatic detection of such users a
challenging task. This challenge is further exacerbated by the
involvement of both human and bot accounts in manipulation.
In this paper, we identify and study the characteristics of
users manipulating the trending panel of Pakistani Twitter. For
this purpose, we propose a novel framework, ‘Manipify’, to
detect and analyze the manipulators with 0.91 accuracy. We
further identify bot accounts and notice that human accounts
are more involved in manipulation. Manipify also classifies
hashtags into six categories to analyze the behaviour of users
across different categories. Testing the framework on our
PK-Trends dataset highlights that political and entertainment
category hashtags are the most manipulated trends. Also, we
find that 4.4% of user accounts are manipulators generating 25.6%
of tweets. This shows that a large percentage of content is
posted by a small number of manipulators. Finally, we demonstrate
the significance of the research by presenting two real-world
case studies of hashtag-wars and anti-state propaganda.
In the future, we plan to extend the framework to identify
users who spread hate speech and propaganda in a coordinated
manner. In addition, considering the multi-faceted 5th generation
war on social media, we will work on identifying the location of
manipulators working for rival countries to create polarization
in society.
IX. ACKNOWLEDGEMENT
This research work was funded by the Higher Education
Commission (HEC), Pakistan, and the Ministry of Planning,
Development and Reforms under the National Center in Big Data and
Cloud Computing.
REFERENCES
[1] “Twitter trends faqs,” https://help.twitter.com/en/using-twitter/twitter-trending-faqs,
2021, Accessed: July 08, 2021.
[2] B. Arias, “How to cover breaking news on twitter,”
https://media.twitter.com/en/articles/best-practice/2018/how-to-cover-breaking-news-on-twitter.html,
Accessed: Jan 28, 2021.
[3] A. Karami, L. S. Bennett, and X. He, “Mining public opinion about
economic issues: Twitter and the us presidential election,” Int. J.
Strategic Decision Sciences (IJSDS), vol. 9, no. 1, pp. 18–28, 2018.
[4] V. Taecharungroj, “Starbucks’ marketing communications strategy on
twitter,” J. Marketing Communications, vol. 23, no. 6, pp. 552–571,
2017.
[5] D. McCorkle and J. Payan, “Using twitter in the marketing and advertis-
ing classroom to develop skills for social media marketing and personal
branding,” J. Advertising Education, vol. 21, no. 1, pp. 33–43, 2017.
[6] T. Rosenstiel, J. Sonderman, K. Loker, M. Ivancin, and N. Kjarval,
“Twitter and the news: How people use the social network to learn
about the world,” Online at www. americanpressinstitute. org, 2015.
[7] O. Gencoglu and M. Gruber, “Causal modeling of twitter activity during
covid-19,” Computation, vol. 8, no. 4, p. 85, 2020.
[8] D. Assenmacher, L. Clever, J. S. Pohl, H. Trautmann, and C. Grimme,
“A two-phase framework for detecting manipulation campaigns in social
media,” in International Conference on Human-Computer Interaction.
Springer, 2020, pp. 201–214.
[9] N. Abu-El-Rub and A. Mueen, “Botcamp: Bot-driven interactions in
social campaigns,” in The world wide web conference, 2019, pp. 2529–
2535.
[10] E. Gallagher, “Manipulating trends & gaming twitter,”
https://erin-gallagher.medium.com/manipulating-trends-gaming-twitter-6fd31714c06c,
2016, Accessed: Mar 25, 2021.
[11] P. N. Howard, B. Kollanyi, and S. Woolley, “Bots and automation
over twitter during the us election,” Computational Propaganda Project:
Working Paper Series, pp. 1–5, 2016.
[12] J. Uyheng and K. M. Carley, “Bots and online hate during the covid-19
pandemic: case studies in the united states and the philippines,” Journal
of computational social science, vol. 3, no. 2, pp. 445–468, 2020.
[13] D. Ingram, “Critics want twitter to halt its trending lists. instead,
twitter will make tweaks.”
https://www.nbcnews.com/tech/tech-news/critics-want-twitter-halt-its-trending-lists-instead-twitter-will-n1238996,
2020, Accessed: Mar 25, 2021.
[14] Y. Zhang, X. Ruan, H. Wang, H. Wang, and S. He, “Twitter trends
manipulation: a first look inside the security of twitter trending,” IEEE
Trans. Inf. Forensics Secur., vol. 12, no. 1, pp. 144–156, 2016.
[15] B. Nimmo, “Measuring traffic manipulation on twitter,” Working Paper
2019.1. Oxford: Project on Computational Propaganda, Tech. Rep.,
2019.
[16] T. Elmas, R. Overdorf, A. F. Ozkalay, and K. Aberer, “Ephemeral
astroturfing attacks: The case of fake twitter trends,” in IEEE European
Symposium on Security and Privacy (EuroS&P), 2021.
[17] D. Assenmacher, L. Clever, J. S. Pohl, H. Trautmann, and C. Grimme,
“A two-phase framework for detecting manipulation campaigns in social
media,” in Social Computing and Social Media. Design, Ethics, User
Behavior, and Social Network Analysis, G. Meiselwitz, Ed. Cham:
Springer Int. Publishing, 2020, pp. 201–214.
[18] A. Reyes-Menendez, J. R. Saura, and C. Alvarez-Alonso, “Under-
standing #worldenvironmentday user opinions in twitter: A topic-based
sentiment analysis approach,” Int. J. Environ. Res. and Public Health,
vol. 15, no. 11, p. 2537, 2018.
[19] A. S. M. Alharbi and E. de Doncker, “Twitter sentiment analysis with
a deep neural network: An enhanced approach using user behavioral
information,” Cognitive Syst. Res., vol. 54, pp. 50–61, 2019.
[20] L. Tavoschi, F. Quattrone, E. D’Andrea, P. Ducange, M. Vabanesi,
F. Marcelloni, and P. L. Lopalco, “Twitter as a sentinel tool to
monitor public opinion on vaccination: an opinion mining analysis
from september 2016 to august 2017 in italy,” Human Vaccines &
Immunotherapeutics, vol. 16, no. 5, pp. 1062–1069, 2020.
[21] M. O. Lwin, J. Lu, A. Sheldenkar, P. J. Schulz, W. Shin, R. Gupta,
and Y. Yang, “Global sentiments surrounding the covid-19 pandemic on
twitter: analysis of twitter trends,” JMIR public health and surveillance,
vol. 6, no. 2, p. e19447, 2020.
[22] H. Hönings, D. Knapp, B. C. Nguyen, D. Richter, K. Williams, I. Dorsch,
and K. J. Fietkiewicz, “Health information diffusion on twitter: The
content and design of WHO tweets matter,” Health Information &
Libraries J., 2021.
[23] W. Chung, E. Mustaine, and D. Zeng, “Criminal intelligence surveillance
and monitoring on social media: Cases of cyber-trafficking,” in 2017
IEEE International Conference on Intelligence and Security Informatics
(ISI), 2017, pp. 191–193.
[24] Z. Abbass, Z. Ali, M. Ali, B. Akbar, and A. Saleem, “A framework to
predict social crime through twitter tweets by using machine learning,”
in 2020 IEEE 14th Int. Conf. on Semantic Computing (ICSC). IEEE,
2020, pp. 363–368.
[25] Z. Alom, B. Carminati, and E. Ferrari, “A deep learning model for twitter
spam detection,” Online Social Networks and Media, vol. 18, p. 100079,
2020.
[26] S. Madisetty and M. S. Desarkar, “A neural network-based ensemble
approach for spam detection in twitter,” IEEE Transactions on
Computational Social Systems, vol. 5, no. 4, pp. 973–984, 2018.
[27] H. Purohit, C. Castillo, and R. Pandey, “Ranking and grouping social
media requests for emergency services using serviceability model,”
Social Network Analysis and Mining, vol. 10, no. 1, pp. 1–17, 2020.
[28] J. Kersten and F. Klan, “What happens where during disasters? a
workflow for the multifaceted characterization of crisis events based
on twitter data,” J. of Contingencies and Crisis Management, vol. 28,
no. 3, pp. 262–280, 2020.
[29] V. Lorini, C. Castillo, F. Dottori, M. Kalas, D. Nappo, and P. Salamon,
“Integrating social media into a pan-european flood awareness system:
A multilingual approach,” arXiv preprint arXiv:1904.10876, 2019.
[30] X. Zhang, Z. Li, S. Zhu, and W. Liang, “Detecting spam and promoting
campaigns in twitter,” ACM Trans. on the Web (TWEB), vol. 10, no. 1,
pp. 1–28, 2016.
[31] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, “Detecting
spammers on twitter,” in Collab., Electron. Messaging, Anti-abuse and
Spam Conf. (CEAS), vol. 6, no. 2010, 2010, p. 12.
[32] B. Kollanyi, P. N. Howard, and S. C. Woolley, “Bots and automation
over twitter during the first us presidential debate,” Comprop data memo,
vol. 1, pp. 1–4, 2016.
[33] A. Anwar and U. Yaqub, “Bot detection in twitter landscape using
unsupervised learning,” in The 21st Annual Int. Conf. Dig. Gov. Res.,
2020, pp. 329–330.
[34] H. Reelfs, T. Mohaupt, O. Hohlfeld, and N. Henckell, “Hashtag usage
in a geographically-local microblogging app,” in Companion Proc. of
The 2019 WWW Conf., 2019, pp. 919–927.
[35] O. Z. Gökçe, E. Hatipoğlu, G. Göktürk, B. Luetgert, and Y. Saygin,
“Twitter and politics: Identifying turkish opinion leaders in new social
media,” Turkish Studies, vol. 15, no. 4, pp. 671–688, 2014.
[36] A. Zubiaga, D. Spina, V. Fresno, and R. Martínez, “Classifying trending
topics: a typology of conversation triggers on twitter,” in Proc. 20th ACM
Int. Conf. Inf. and knowl. Manage., 2011, pp. 2461–2464.
[37] H. U. Khan, S. Nasir, K. Nasim, D. Shabbir, and A. Mahmood, “Twitter
trends: a ranking algorithm analysis on real time data,” Expert Syst. with
Appl., vol. 164, p. 113990, 2021.
[38] P. G. Efthimion, S. Payne, and N. Proferes, “Supervised machine
learning bot detection techniques to identify social twitter bots,” SMU
Data Sci. Rev., vol. 1, no. 2, p. 5, 2018.
[39] J. Rodríguez-Ruiz, J. I. Mata-Sánchez, R. Monroy, O. Loyola-Gonzalez,
and A. López-Cuevas, “A one-class classification approach for bot
detection on twitter,” Comput. & Security, vol. 91, p. 101715, 2020.
[40] P.-M. Hui, K.-C. Yang, C. Torres-Lugo, Z. Monroe, M. McCarty, B. D.
Serrette, V. Pentchev, and F. Menczer, “Botslayer: real-time detection of
bot amplification on twitter,” J. Open Source Software, vol. 4, no. 42,
p. 1706, 2019.
[41] A. Minnich, N. Chavoshi, D. Koutra, and A. Mueen, “Botwalk: Efficient
adaptive exploration of twitter bot networks,” in Proc. 2017 IEEE/ACM
Int. Conf. Adv. Social Netw. Anal. and Mining 2017, 2017, pp. 467–474.
[42] M. Sayyadiharikandeh, O. Varol, K.-C. Yang, A. Flammini, and
F. Menczer, “Detection of novel social bots by ensembles of specialized
classifiers,” in Proc. 29th ACM Int. Conf. Inf. & Knowl. Manage., 2020,
pp. 2725–2732.
[43] R. Motamedi, S. Jamshidi, R. Rejaie, and W. Willinger, “Examining
the evolution of the twitter elite network,” Social Network Analysis and
Mining, vol. 10, no. 1, pp. 1–18, 2020.
[44] Z. Wood-Doughty, M. Smith, D. Broniatowski, and M. Dredze, “How
does twitter user behavior vary across demographic groups?” in Proc.
of 2nd Workshop on NLP and Computational Social Science, 2017, pp.
83–89.
[45] U. Yaqub, S. A. Chun, V. Atluri, and J. Vaidya, “Analysis of political
discourse on twitter in the context of the 2016 us presidential elections,”
Government Information Quarterly, vol. 34, no. 4, pp. 613–626, 2017.
[46] V. Gupta and R. Hewett, “Real-time tweet analytics using hybrid
hashtags on twitter big data streams,” Information, vol. 11, no. 7, p.
341, 2020.
[47] D. M. Romero, B. Meeder, and J. Kleinberg, “Differences in the
mechanics of information diffusion across topics: idioms, political hashtags,
and complex contagion on twitter,” in Proc. 20th Int. Conf. World wide
web, 2011, pp. 695–704.
[48] M. Jeon, S. Jun, and E. Hwang, “Hashtag recommendation based on
user tweet and hashtag classification on twitter,” in Int. Conf. Web-Age
Inf. Manage. Springer, 2014, pp. 325–336.
[49] L. Posch, C. Wagner, P. Singer, and M. Strohmaier, “Meaning as
collective use: predicting semantic hashtag categories on twitter,” in
Proc. 22nd Int. Conf. World Wide Web, 2013, pp. 621–628.
[50] P. Ferragina, F. Piccinno, and R. Santoro, “On analyzing hashtags in
twitter,” in 9th Int. AAAI Conf. Web and Social Media. Citeseer, 2015.
[51] S. Needle, “How does twitter decide what is trending?”
https://rethinkmedia.org/blog/how-does-twitter-decide-what-trending, 2016,
Accessed: Mar 18, 2021.
[52] Y. Hua, M. Naaman, and T. Ristenpart, “Characterizing twitter users
who engage in adversarial interactions against political candidates,” in
Proc. 2020 CHI Conf. Human Factors in Comput. Syst., 2020, pp. 1–13.
[53] Y. Hua, T. Ristenpart, and M. Naaman, “Towards measuring adversarial
twitter interactions against candidates in the us midterm elections,” in
Proc. Int. AAAI Conf. Web and Social Media, vol. 14, 2020, pp. 272–
282.
[54] K.-C. Yang, O. Varol, C. A. Davis, E. Ferrara, A. Flammini, and
F. Menczer, “Arming the public with artificial intelligence to counter
social bots,” Human Behavior and Emerging Techno., vol. 1, no. 1, pp.
48–61, 2019.
[55] “Crowdflower,” http://faircrowd.work/platform/crowdflower, Accessed:
Mar 09, 2021.
[56] M. Mazza, S. Cresci, M. Avvenuti, W. Quattrociocchi, and M. Tesconi,
“Rtbust: Exploiting temporal patterns for botnet detection on twitter,” in
Proc. 10th ACM Conf. Web Sci., 2019, pp. 183–192.
[57] K. Makice, Twitter API: Up and Running: Learn How to Build
Applications with the Twitter API, 1st ed. O’Reilly Media, Inc., 2009.
[58] “Pakistan twitter trending hashtags and topics,”
https://getdaytrends.com/pakistan/, Accessed: Mar 09, 2021.
[59] C. Zacharias and F. Poldi, “Twint · pypi,” https://pypi.org/project/twint/,
2018, Accessed: Mar 09, 2021.
[60] A. Gotter, “Best time to post on twitter in 2021?”
https://adespresso.com/blog/best-time-to-post-on-twitter/, 2021, Accessed: Jan 28, 2021.
[61] “Twitter’s platform manipulation and spam policy - twitter help,”
https://help.twitter.com/en/rules-and-policies/platform-manipulation,
Accessed: Aug 31, 2021.
[62] F. Pérez-Hernández, S. Tabik, A. Lamas, R. Olmos, H. Fujita, and
F. Herrera, “Object detection binary classifiers methodology based on
deep learning to identify small objects handled similarly: Application
in video surveillance,” Knowledge-Based Systems, vol. 194, p. 105590,
2020.
[63] T. Takenouchi and S. Ishii, “Binary classifiers ensemble based on
bregman divergence for multi-class classification,” Neurocomputing, vol.
273, pp. 424–434, 2018.
[64] K. Lee, D. Palsetia, R. Narayanan, M. M. A. Patwary, A. Agrawal, and
A. Choudhary, “Twitter trending topic classification,” in 2011 IEEE 11th
Int. Conf. Data Mining Workshops. IEEE, 2011, pp. 251–258.
[65] P. Sheldon, E. Herzfeldt, and P. A. Rauschnabel, “Culture and social
media: the relationship between cultural values and hashtagging styles,”
Behaviour & Inf. Technol., vol. 39, no. 7, pp. 758–770, 2020.
[66] M. Binder, “How to calculate the twitter impressions and reach,”
https://www.tweetbinder.com/blog/twitter-impressions, 2020, Accessed:
Sep 21, 2020.
[67] S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi,
“The paradigm-shift of social spambots: Evidence, theories, and tools
for the arms race,” in Proc. 26th Int. Conf. world wide web companion,
2017, pp. 963–972.
[68] H. K. Evans, S. Smith, A. Gonzales, and K. Strouse, “Mudslinging on
twitter during the 2014 election,” Social Media + Society, vol. 3, no. 2,
p. 2056305117704408, 2017.
[69] F. Abdulrahman and A. Subedar, “How much to fake a trend on twitter? -
bbc news,” https://www.bbc.com/news/blogs-trending-43218939, 2018,
Accessed: Mar 25, 2021.
[70] Z. Gilani, L. Wang, J. Crowcroft, M. Almeida, and R. Farahbakhsh,
“Stweeler: A framework for twitter bot analysis,” in Proc. 25th Int.
Conf. Companion World Wide Web, 2016, pp. 37–38.
[71] Y. Golovchenko, C. Buntain, G. Eady, M. A. Brown, and J. A. Tucker,
“Cross-platform state propaganda: Russian trolls on twitter and youtube
during the 2016 u.s. presidential election,” The Int. J. of Press/Politics,
vol. 25, no. 3, pp. 357–389, 2020.
[72] M. Sarmad, “Analysis shows ‘#sanctionpakistan’ trend on twitter
was fake - pakistan defence,”
https://defence.pk/pdf/threads/analysis-shows-sanctionpakistan-trend-on-twitter-was-fake.719732/,
2021, Accessed: Aug 20, 2021.
[73] “Anti-state trends - ptm, political parties, indian and fake news nexus,”
https://dmw.gov.pk/deep-analytics, Aug 2021, Accessed: Aug 20, 2021.