A Model for Identifying Misinformation
in Online Social Networks
Sotirios Antoniadis1, Iouliana Litou2, and Vana Kalogeraki2
1Nokia Solutions and Networks Hellas A.E., Athens, Greece
sotiris.antoniadis@nsn.com
2Department of Informatics,
Athens University of Economics and Business, Athens, Greece
{litou,vana}@aueb.gr
Abstract. Online Social Networks (OSNs) have become increasingly
popular means of information sharing among users. The spread of news
regarding emergency events is common in OSNs and so is the spread
of misinformation related to the event. We define as misinformation
any false or inaccurate information that is spread either intentionally
or unintentionally. In this paper we study the problem of misinformation
identification in OSNs, and we focus in particular on the Twitter social
network. Based on user and tweets characteristics, we build a misin-
formation detection model that identifies suspicious behavioral patterns
and exploits supervised learning techniques to detect misinformation.
Our extensive experimental results on 80294 unique tweets and 59660
users illustrate that our approach effectively identifies misinformation
during emergencies. Furthermore, our model manages to timely identify
misinformation, a feature that can be used to limit the spread of the
misinformation.
1 Introduction
Online Social Networks (OSNs) have evolved into major means of communica-
tion and information spreading. They host over 1.61 billion users, which
corresponds to 22% of the world’s population. However, one major challenge is
that the information communicated through the network is not always credible.
Previous studies confirm the existence of spam campaigns in OSNs [1,2]. Spam
campaigns are organized attempts towards spreading false or malicious content
through the coordination of accounts or other illicit means in the network. It is
estimated that, among the messages published on Twitter, 1% of the messages
are spam, while 5% of the accounts are spammers¹.
Twitter² has evolved into one of the most popular microblogging services. Users
publish short messages (tweets) of at most 140 characters and follow any other
S. Antoniadis—Part of this work was performed when this author was at Athens
University of Economics and Business.
¹ http://digital.cs.usu.edu/~kyumin/tutorial/www-tutorial.pdf
² https://twitter.com/
© Springer International Publishing Switzerland 2015
C. Debruyne et al. (Eds.): OTM 2015 Conferences, LNCS 9415, pp. 473–482, 2015.
DOI: 10.1007/978-3-319-26148-5_32
registered users to receive status updates. Twitter offers users the opportunity
to report a tweet as spam, compromised or abusive. Other filters to detect
spam (e.g. the number of followers relative to followees, random favorites and
retweets etc.) are also used. Still, tweets containing misinformation regarding
an emergency event may not be identified based solely on the aforementioned
mechanisms.
The credibility of images propagated in the network has been the focus of
recent work: Zubiaga and Ji [3] and Gupta et al. [4] examine images shared
during emergency events, but not the
content of the information itself. The works closest to ours are those of Castillo et al.
[5] and Xia et al. [6]. Both works use supervised learning and Bayesian Net-
work classification to identify credible information propagated in the network.
Castillo et al. [5] cluster the instances into newsworthy items or chat and later perform
credibility analysis on the newsworthy clusters. Xia et al. [6] propose a model to
detect an emergency event and identify credible tweets. As we illustrate in our
experimental evaluation, our approach performs better than both approaches in
identifying misinformative tweets with over 14% higher accuracy.
In this work we suggest a methodology for identifying and limiting misin-
formation spread in OSNs during emergency events by identifying tweets that
are most likely to be inaccurate or irrelevant to an event. Our work makes the
following contributions:
– We present a novel filtering process for identifying misinformation during
emergencies that is fast and effective. The filters are identified based on an
extensive analysis conducted on a large dataset of users and tweets related to
the emergency event of Hurricane Sandy. As our experimental results illus-
trate, the filtering process recovers over 81% of the misinformative tweets
(recall), while over 23% of the tweets it returns indeed contain misinformation
(precision).
– We employ a number of supervised learning algorithms that effectively
classify tweets as credible or misinformative. Based on the features we propose,
classification techniques achieve a weighted average accuracy of 77%.
– Our experiments suggest that, even without considering tweet propagation,
our classification methodology achieves 77.8% weighted average accuracy,
offering the ability to limit the spread of false news in time, before it cascades.
Furthermore, the filtering process and the classification algorithms complete
in less than 2 seconds, making our methodology appropriate for real-time
applications.
2 Problem Description, Parameters and Methodology
Several studies reveal that news spreads faster on Twitter than through
traditional news media [7,8]. Yet, a fundamental challenge is the quality of
content published in OSNs. Distinguishing between credible and inaccurate infor-
mation regarding emergency events is important, since misguidance or inability
to timely detect useful information may have critical effects.
Fig. 1. Example of misinformation for Hurricane Sandy.
Objective: The objective of our work is to detect misinformation related to
emergency events in the Twitter social network. We define as misinformation
any false or inaccurate information that is spread either intentionally or uninten-
tionally. An example of misinformation concerning Hurricane Sandy
is presented in Figure 1.
Our Approach: Our approach for solving the problem of misinformation iden-
tification follows three discrete steps: (i) Given a set of tweets T related to an event
and a set of users U that published at least one tweet t ∈ T, we conduct an
extensive analysis of the characteristics of the tweets and of the users who published them. Our
analysis focuses on a number of features and combinations of them and assists in
identifying abnormal behaviors of users and characteristics of tweets. (ii) Based
on the findings of the analysis, we extract extreme or suspicious behaviors and
exploit them to filter tweets that are more likely to constitute misinformation.
(iii) We then apply a series of learning algorithms implemented in Weka [9] to
identify misinformative tweets, using supervised learning techniques.
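For concreteness, the three steps can be sketched in Python as follows. The Tweet fields, feature choices, filter thresholds and the use of bagging are illustrative assumptions of ours; the paper's own implementation relies on Weka.

```python
# Minimal sketch of the three-step pipeline. Field names, features and
# thresholds are illustrative stand-ins, not the paper's exact setup.
from dataclasses import dataclass
from sklearn.ensemble import BaggingClassifier

@dataclass
class Tweet:
    text: str
    hashtags: int
    urls: int
    followers: int   # of the publishing user
    followees: int   # of the publishing user

def extract_features(t: Tweet) -> list:
    # Step (i): a per-tweet feature vector mixing tweet and user traits
    ff_ratio = t.followers / max(t.followers + t.followees, 1)
    return [len(t.text), t.hashtags, t.urls, ff_ratio]

def is_suspicious(f: list) -> bool:
    # Step (ii): flag tweets exhibiting extreme values (placeholder rules)
    return f[1] >= 4 or f[2] >= 3

def train_classifier(feature_vectors, labels):
    # Step (iii): supervised learning on the filtered, labelled tweets;
    # bagging stands in for Weka's Bootstrap Aggregating
    return BaggingClassifier(n_estimators=50).fit(feature_vectors, labels)
```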
2.1 Parameters
Tweet Features: Each tweet t ∈ T is represented as a feature vector I_t that
includes information about the tweet and the user who published it. Thus,
each tweet t ∈ T is characterized by the following information: (i) Number
of characters - words: Short messages may not contain useful information, while
long messages may cover unrelated topics. (ii) Number of favorites - retweets
- replies: The popularity of a tweet may be an indication of its content. We
expect that tweets of interest will be cascaded in the network and thus be more
476 S. Antoniadis et al.
retweeted or favorited. (iii) Number of mentions - hashtags - URLs - media:
Features related to the structure of the tweet are considered to draw conclusions
about the quality of the tweet.
User Features: For each user u ∈ U that published at least one tweet t ∈ T we
consider the following characteristics: (i) Number of followers - followees: Trust-
worthy users such as news agencies are expected to have many followers [7], while
spammers may have more followees. We define as followees the number of users an
account follows. (ii) Followers-Followees Ratio (FF-Ratio): We compute the FF-
Ratio of a user u as FF-Ratio(u) = followers(u)/(followers(u) + followees(u)).
(iii) Total tweets - Tweets during the event: We suspect illegitimate users may
be more active for a short time (e.g. the time of the event), thus we also consider
the number of tweets users publish. (iv) Days Registered: The days the user is
registered in the network before publishing a tweet. Recently registered users
are more likely to be spammers than long-standing ones.
Additional Features: Finally, for each tweet we extract the following set of fea-
tures: (i) URLs to Tweets (UtT) - Media to Tweets (MtT): For tweets published
by a user we compute the ratio of tweets containing URLs and Media separately.
We suspect that users frequently publishing URLs are candidate spammers, while
media may be irrelevant to the event. (ii) Followers to Replies (FtR) - Retweets
(FtRt) - Favorited (FtFav): Less popular tweets may indicate disapproval from
followers. Therefore we consider the ratio of followers to the features indicating
the popularity. (iii) Average Tweets per Day (ATpD): The average number
of tweets published by a user u that has been registered da(u) days in the network is
computed as ATpD = t(u)/da(u), where t(u) is the total number of tweets u has
published. (iv) Positive / Negative / Average Sentiment: We use SentiStrength
[10] to extract the positive and negative sentiment rate of the tweet text and
compute the average sentiment.
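As an illustration of the ratio features above, a minimal sketch follows; the function and argument names are ours, and SentiStrength itself is not reproduced here.

```python
# Sketch of the user-level ratio features defined above. Names are
# illustrative; sentiment scores would come from SentiStrength.

def ff_ratio(followers: int, followees: int) -> float:
    # FF-Ratio(u) = followers(u) / (followers(u) + followees(u))
    total = followers + followees
    return followers / total if total else 0.0

def avg_tweets_per_day(total_tweets: int, days_registered: int) -> float:
    # ATpD = t(u) / da(u)
    return total_tweets / max(days_registered, 1)

def urls_to_tweets(tweets_with_urls: int, total_tweets: int) -> float:
    # UtT: fraction of a user's tweets containing at least one URL
    return tweets_with_urls / max(total_tweets, 1)

# Example: a news agency with 90000 followers and 10000 followees has
# FF-Ratio = 90000 / 100000 = 0.9, while a typical account sits near 0.5.
```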
3 Data Analysis
In order to evaluate the performance of our approach for detecting misinforma-
tion during emergency events, we used a dataset of tweets related to Hurricane
Sandy, a major emergency event that unfolded in 2012, from October 22 to
November 2, and severely affected the area of New York City³. Tweets related
to the event were collected based on the keywords “sandy” and “hurricane”, as
described in [3]. We then use the findings of the analysis to decide which values
constitute a possible indication of misinformation.
Analysis of User Characteristics: In Figures 2 and 3 we present the number
of tweets published by users, both in total and during the event. We split the
number of tweets into buckets of 100, i.e., bucket 0 contains the number of users
that published 1 to 99 tweets in total. The power law distribution shown in the
Figures is in accordance with the findings of Bagrow et al. in [7]. Most of the users
³ http://en.wikipedia.org/wiki/Hurricane_Sandy
published tweets related to the event at intervals of 60 to 1000 seconds,
while there are users that published more than one tweet per minute. The number
of users' followers and followees is presented in Figures 4 and 5 respectively.
The trend is similar for both connection types, with the majority of users having
few followers and followees. The peak in the number of users with up to
2000 followees is due to the Twitter policy that caps the accounts a user may
follow at 2000, a limit later relaxed based on the followers-to-followees
fraction. In Figure 6 we also present the FF-Ratio. The FF-Ratio approaches a Gaussian
distribution, with most users having a ratio of around 0.5, meaning
that they have roughly equal numbers of followers and followees, although we can observe
another peak from 0.9 to 1.
Analysis of Tweet Characteristics: In Figure 7 we present an analysis of the
number of words contained in a tweet. The majority of the tweets include 20 to 120
characters and 5 to 20 words. We further consider the retweets, favorites and replies
of a tweet to determine its popularity. As observed in Figures 8 and 9, the num-
ber of retweets and favorites follows a power law distribution. Finally, most of the
tweets have fewer than 20 replies, although past that point the number of tweets
with more than 20 replies rises again (Figure 10).
Fig. 2. User total tweets. Fig. 3. User Sandy tweets. Fig. 4. User followers.
Fig. 5. User followees. Fig. 6. User FF-Ratio. Fig. 7. Words in tweets.
Fig. 8. Retweets received by tweets. Fig. 9. Favorites received by tweets.
Fig. 10. Replies received by tweets.
Fig. 11. Tweets with mentions. Fig. 12. Tweets with URLs.
Fig. 13. Tweets with hashtags. Fig. 14. Tweets with media.
Figures 11 through 14 depict the number of mentions,
URLs, hashtags and media present in a tweet. Most of the tweets contain at most
one mention, and the majority of the tweets related to the event contain no link,
while there are tweets with over two links. 70% of the tweets contain no hashtags,
while the number of tweets containing media is restricted to less than 10%.
4 Experimental Evaluation
We evaluated the performance of our approach on 80294 tweets related to hur-
ricane Sandy from 59660 users. In the first set of experiments we focus on
estimating the performance of the filtering process. By applying the filters of
Table 1, 12955 tweets are returned. We manually annotate a sample of 4000
randomly selected tweets among them. Humorous, irrelevant and deleted tweets
and accounts are treated as misinformation (assuming they were reported or
deleted due to violations [11]). For 176 of the tweets we could not draw conclu-
sions. Out of the remaining 3824 tweets, 898 constitute misinformation. Since
tweets are randomly selected, we conclude that over 23% of the filtered tweets
are indeed misinformation. We also annotated 4000 random tweets among those
that did not meet the filtering criteria. Of the 3559 tweets that could be clas-
sified, 212 were identified as misinformation, i.e., less than 6%. Overall, 1110 out
of the total 7383 labelled tweets constitute misinformation. The filtering process
captures 898 of them, yielding a recall of over 81%.
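Both percentages follow directly from the annotation counts above; a quick sanity check:

```python
# Check of the reported filter precision and recall, computed from the
# annotation counts given in the text.
misinfo_in_filtered = 898   # misinformation among 3824 labelled filtered tweets
labelled_filtered = 3824
misinfo_total = 1110        # misinformation among all 7383 labelled tweets

precision = misinfo_in_filtered / labelled_filtered  # ~0.235, "over 23%"
recall = misinfo_in_filtered / misinfo_total         # ~0.809, "over 81%"
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```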
Supervised Learning: We exploited a set of different supervised learning
algorithms implemented in Weka [9] to evaluate the performance of misinformation
identification on the set of features we considered. We use 10-fold cross validation
to evaluate the classification results. The labelled dataset of the 3824 filtered
tweets is used as input to Weka. In Table 2 we present the classification results.
The weighted average F1 measure indicates that Bootstrap Aggregating has the
best performance. Regarding average precision, Random Forest achieves the best
results, with 0.792 average precision.
Table 1. Filters applied during the filtering process.

Words ≥ 30            Characters == 140              Favorites ((≥ 2 && ≤ 10) ∨ (≥ 1100))
Hashtags ≥ 4          Mentions ≥ 4                   Retweets ((≥ 2 && ≤ 10) ∨ (≥ 1000))
Media ≥ 2             Followees (≤ 10 ∨ ≥ 100000)    Followers (≤ 10 ∨ ≥ 200000)
Replies ≥ 11          URLs ≥ 3                       Followers/Followees ≥ 30000
Event tweets ≥ 7      Total tweets ≥ 500000          Interval (≤ 300 sec ∨ ≥ 70000 sec)
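A minimal sketch of the Table 1 predicate, assuming per-tweet counts are available as a dictionary; the thresholds follow the table above, but the exact comparison operators are our reading of the partly garbled original.

```python
# Sketch of the Table 1 filtering step: a tweet is kept for manual
# annotation and classification if any of the extreme-value filters
# fires. Thresholds and operators mirror our reading of Table 1.

def passes_filters(t: dict) -> bool:
    return (
        t["words"] >= 30
        or t["characters"] == 140                      # hit the length cap
        or t["hashtags"] >= 4
        or t["mentions"] >= 4
        or t["media"] >= 2
        or t["urls"] >= 3
        or t["replies"] >= 11
        or t["followees"] <= 10 or t["followees"] >= 100_000
        or t["followers"] <= 10 or t["followers"] >= 200_000
        or t["event_tweets"] >= 7
        or t["total_tweets"] >= 500_000
    )
```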
Table 2. Summary of classification using supervised learning algorithms.

                 Precision  Recall  F-Measure       Precision  Recall  F-Measure
                 Bayes Network                      J48
Credible         0.845      0.834   0.839           0.825      0.889   0.856
Misinformation   0.480      0.500   0.490           0.516      0.385   0.440
Weighted Avg.    0.759      0.755   0.757           0.752      0.771   0.758
                 k-Nearest Neighbors                Random Forest
Credible         0.800      0.951   0.869           0.821      0.958   0.884
Misinformation   0.586      0.225   0.325           0.699      0.318   0.438
Weighted Avg.    0.750      0.781   0.741           0.792      0.808   0.779
                 Adaptive Boosting                  Bootstrap Aggregating
Credible         0.839      0.888   0.863           0.828      0.931   0.877
Misinformation   0.549      0.447   0.493           0.622      0.372   0.466
Weighted Avg.    0.771      0.784   0.776           0.780      0.799   0.780
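The paper runs these algorithms in Weka; an equivalent 10-fold cross-validation protocol can be sketched in scikit-learn, with random placeholder data standing in for the labelled feature matrix and credible/misinformation labels.

```python
# Sketch of the evaluation protocol, mirroring the paper's Weka setup:
# 10-fold cross-validation over the labelled, filtered tweets. X and y
# are random placeholders for the real features and labels.
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((3824, 15))        # placeholder feature matrix
y = rng.integers(0, 2, 3824)      # placeholder binary labels

for clf in (BaggingClassifier(), RandomForestClassifier()):
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1_weighted")
    print(type(clf).__name__, round(scores.mean(), 3))
```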
Real-Time Misinformation Identification: The values of retweets, favorites
and replies are unknown at the time the tweet is published. Thus, to evaluate the
performance of our approach at detecting misinformation in a timely manner, we conducted
another set of experiments ignoring the above attributes and features related to
them. The results for Bootstrap Aggregating and Random Forest are presented in
Table 3. The table shows that precision and recall drop only slightly. Still, the weighted
average precision is over 0.77, indicating that the approach is appropriate for real-
time misinformation identification. The filtering process requires just 963 ms and,
adding the execution times of the algorithms, less than 2 seconds are needed to
extract the tweets containing misinformation, showing that the method is
efficient under real-time constraints.
Table 3. Classification with features known at run time.

                 Precision  Recall  F-Measure       Precision  Recall  F-Measure
                 Random Forest                      Bootstrap Aggregating
Credible         0.812      0.949   0.875           0.816      0.951   0.879
Misinformation   0.631      0.286   0.394           0.655      0.301   0.412
Weighted Avg.    0.770      0.793   0.762           0.778      0.799   0.769
Execution time   0.41 sec                           0.89 sec
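Restricting the model to features known at posting time amounts to dropping the propagation-related columns before training; a minimal sketch follows, with column names of our own choosing for the Section 2.1 features.

```python
# Sketch: for real-time detection, drop the features unknown at posting
# time (retweets, favorites, replies and the ratios derived from them)
# before training. Column names are our own labels.
import pandas as pd

UNKNOWN_AT_POST_TIME = ["retweets", "favorites", "replies",
                        "FtR", "FtRt", "FtFav"]

def realtime_view(features: pd.DataFrame) -> pd.DataFrame:
    return features.drop(columns=UNKNOWN_AT_POST_TIME, errors="ignore")
```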
5 Related Work
Castillo et al. [5] aim at automatically assessing the credibility of information
on Twitter. They use a number of features related to tweets and
supervised learning to first identify newsworthy topics and later perform
credibility analysis on them. Gupta et al. [12] present TweetCred, an extension
of the previous work that enables users feedback. Xia et al. [6] also study the
problem of information credibility on Twitter after the event is detected and
relevant tweets are retrieved. Bosma et al. [13] suggest a framework for spam
detection using unsupervised learning. Anagnostopoulos et al. [14] study the
role of homophily in misinformation spread. McCord and Chuah [15] use
traditional classifiers to detect spam on Twitter. Stringhini et al. [11] aim at
identifying spammers in social networks. Identifying spammers on the network
of Twitter is also the objective of Benevenuto et al. [16]. They extract account
features that may indicate spamming behavior and use an SVM learning
model to verify their approach. Zubiaga and Ji [3] and Gupta et al. [4]
focus on the credibility of images propagated in the network during emergency
events. They consider a number of features related to the image and the tweet.
Budak et al. [17] address the problem of misinformation spread limitation by
performing an extensive study on influence limitation. Faloutsos [18] developed
a botnet-detection method and a Facebook application, called MyPageKeeper,
that quantifies the presence of malware on Facebook and protects end-users.
Ghosh et al. [19] examine suspended accounts on Twitter, investigate link
farming, and propose ways to discourage spammers from acquiring large numbers of following
links. Thomas et al. [1] identify the behaviors of spammers by analyzing tweets
of suspended users in retrospect. Mendoza et al. [20] focus on cascades of tweets
during emergency events and study the propagation of rumours. They conclude
that this differs from the propagation of news tweets and that it is possible to detect
rumors through aggregate analysis of the tweets. Liu et al. [21] propose a hybrid
model that utilizes user behavior information, network attributes and text con-
tent to identify spam.
6 Conclusions
In this work we presented a methodology for identifying misinformation in social
networks during emergency events. As we illustrate in our experiments, our
approach manages to correctly identify misinformation, achieving accuracy of up
to 77%. The filtering process suggested in this work identifies over 81% of misin-
formative tweets. Our approach is fast and effective and identifies misinformation
in a timely manner, offering the ability to limit its spread in the network.
Acknowledgment. This research has been co-financed by the European Union (Euro-
pean Social Fund ESF) and Greek national funds through the Operational Program
“Education and Lifelong Learning” of the National Strategic Reference Framework
(NSRF) - Research Funding Program: Thalis-DISFER, Aristeia-MMD, Investing in
knowledge society through the European Social Fund, the FP7 INSIGHT project and
the ERC IDEAS NGHCS project.
References
1. Thomas, K., Grier, C., Song, D., Paxson, V.: Suspended accounts in retrospect: an
analysis of twitter spam. In: Internet Measurement Conference, pp. 243–258 (2011)
2. Gao, H., Hu, J., Wilson, C., Li, Z., Chen, Y., Zhao, B.Y.: Detecting and charac-
terizing social spam campaigns. In: ACM Conference on Computer and Commu-
nications Security, pp. 681–683 (2010)
3. Zubiaga, A., Ji, H.: Tweet, but verify: Epistemic study of information verification
on twitter (2013). CoRR, vol. abs/1312.5297
4. Gupta, A., Lamba, H., Kumaraguru, P., Joshi, A.: Faking sandy: characterizing
and identifying fake images on twitter during hurricane sandy. In: WWW 2013
Companion (2013)
5. Castillo, C., Mendoza, M., Poblete, B.: Predicting information credibility in time-
sensitive social media. Internet Research 23(5), 560–588 (2013)
6. Xia, X., Yang, X., Wu, C., Li, S., Bao, L.: Information credibility on twitter in
emergency situation. In: Chau, M., Wang, G.A., Yue, W.T., Chen, H. (eds.) PAISI
2012. LNCS, vol. 7299, pp. 45–59. Springer, Heidelberg (2012)
7. Bagrow, J.P., Wang, D., Barabasi, A.-L.: Collective response of human populations
to large-scale emergencies (2011). CoRR, vol. abs/1106.0560
8. Guy, M., Earle, P., Ostrum, C., Gruchalla, K., Horvath, S.: Integration and dis-
semination of citizen reported and seismically derived earthquake information via
social network technologies. In: Cohen, P.R., Adams, N.M., Berthold, M.R. (eds.)
IDA 2010. LNCS, vol. 6065, pp. 42–53. Springer, Heidelberg (2010)
9. Weka. http://www.cs.waikato.ac.nz/ml/weka/
10. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength
detection in short informal text. J. Am. Soc. Inf. Sci. Technol. 61(12), 2544–2558
(2010)
11. Stringhini, G., Kruegel, C., Vigna, G.: Detecting spammers on social networks. In:
ACSAC, pp. 1–9 (2010)
12. Gupta, A., Kumaraguru, P., Castillo, C., Meier, P.: Tweetcred: real-time credibility
assessment of content on twitter. In: Aiello, L.M., McFarland, D. (eds.) SocInfo
2014. LNCS, vol. 8851, pp. 228–243. Springer, Heidelberg (2014)
13. Bosma, M., Meij, E., Weerkamp, W.: A framework for unsupervised spam detec-
tion in social networking sites. In: Baeza-Yates, R., de Vries, A.P., Zaragoza, H.,
Cambazoglu, B.B., Murdock, V., Lempel, R., Silvestri, F. (eds.) ECIR 2012. LNCS,
vol. 7224, pp. 364–375. Springer, Heidelberg (2012)
14. Anagnostopoulos, A., Bessi, A., Caldarelli, G., Vicario, M.D., Petroni, F.,
Scala, A., Zollo, F., Quattrociocchi, W.: Viral misinformation: The role of
homophily and polarization (2014). CoRR, vol. abs/1411.2893
15. McCord, M., Chuah, M.: Spam detection on twitter using traditional classifiers.
In: Calero, J.M.A., Yang, L.T., Mármol, F.G., García Villalba, L.J., Li, A.X.,
Wang, Y. (eds.) ATC 2011. LNCS, vol. 6906, pp. 175–186. Springer, Heidelberg
(2011)
16. Benevenuto, F., Magno, G., Rodrigues, T., Almeida, V.: Detecting spammers on
twitter. In: CEAS (2010)
17. Budak, C., Agrawal, D., El Abbadi, A.: Limiting the spread of misinformation in
social networks. In: WWW, pp. 665–674 (2011)
18. Faloutsos, M.: Detecting malware with graph-based methods: traffic classification,
botnets, and facebook scams. In: WWW (Companion Volume), pp. 495–496 (2013)
19. Ghosh, S., Viswanath, B., Kooti, F., Sharma, N.K., Korlam, G., Benevenuto, F.,
Ganguly, N., Gummadi, P.K.: Understanding and combating link farming in the
twitter social network. In: WWW, pp. 61–70 (2012)
20. Mendoza, M., Poblete, B., Castillo, C.: Twitter under crisis: can we trust what we
RT? In: Proceedings of the First Workshop on Social Media Analytics, SOMA
2010, pp. 71–79. ACM, New York (2010)
21. Liu, Y., Wu, B., Wang, B., Li, G.: SDHM: a hybrid model for spammer detection in
Weibo. In: 2014 IEEE/ACM International Conference on ASONAM, pp. 942–947,
August 2014
Twitter has shown its greatest power of influence for its fast information diffusion. Previous research has shown that most of the tweets posted are truthful, but as some people post the rumors and spams on Twitter in emergence situation, the direction of public opinion can be misled and even the riots are caused. In this paper, we focus on the methods for the information credibility in emergency situation. More precisely, we build a novel Twitter monitor model to monitoring Twitter online. Within the novel monitor model, an unsupervised learning algorithm is proposed to detect the emergency situation. A collection of training dataset which includes the tweets of typical events is gathered through the Twitter monitor. Then we manually dispatch the dataset to experts who label each tweet into two classes: credibility or incredibility. With the classified tweets, a number of features related to the user social behavior, the tweet content, the tweet topic and the tweet diffusion are extracted. A supervised method using learning Bayesian Network is used to predict the tweets credibility in emergency situation. Experiments with the tweets of UK Riots related topics show that our procedure achieves good performance to classify the tweets compared with other state-of-art algorithms.