ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research
Xinyi Zhou
zhouxinyi@data.syr.edu
Data Lab, EECS Department
Syracuse University
Apurva Mulay
asmulay@syr.edu
Data Lab, EECS Department
Syracuse University
Emilio Ferrara
emiliofe@usc.edu
Information Sciences Institute,
University of Southern California
Reza Zafarani
reza@data.syr.edu
Data Lab, EECS Department
Syracuse University
ABSTRACT
First identied in Wuhan, China, in December 2019, the outbreak
of COVID-19 has been declared as a global emergency in January,
and a pandemic in March 2020 by the World Health Organization
(WHO). Along with this pandemic, we are also experiencing an
“infodemic” of information with low credibility such as fake news
and conspiracies. In this work, we present
ReCOVery
, a reposi-
tory designed and constructed to facilitate research on combating
such information regarding COVID-19. We rst broadly search and
investigate
2,000 news publishers, from which 60 are identied
with extreme [high or low] levels of credibility. By inheriting the
credibility of the media on which they were published, a total of
2,029 news articles on coronavirus, published from January to May
2020, are collected in the repository, along with 140,820 tweets that
reveal how these news articles have spread on the Twitter social
network. The repository provides multimodal information of news
articles on coronavirus, including textual, visual, temporal, and
network information. The way that news credibility is obtained
allows a trade-o between dataset scalability and label accuracy.
Extensive experiments are conducted to present data statistics and
distributions, as well as to provide baseline performances for pre-
dicting news credibility so that future methods can be compared.
Our repository is available at http://coronavirus-fakenews.com.
CCS CONCEPTS
• Information systems → Collaborative and social computing systems and tools; Clustering and classification; • Security and privacy → Social aspects of security and privacy.
KEYWORDS
Repository; COVID-19; coronavirus; pandemic; infodemic; information credibility; fake news; multimodal; social media
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CIKM ’20, October 19–23, 2020, Virtual Event, Ireland
©2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-6859-9/20/10...$15.00
https://doi.org/10.1145/3340531.3412880
ACM Reference Format:
Xinyi Zhou, Apurva Mulay, Emilio Ferrara, and Reza Zafarani. 2020. ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM ’20), October 19–23, 2020, Virtual Event, Ireland. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3340531.3412880
1 INTRODUCTION
As of June 4th, the COVID-19 pandemic has resulted in over 6.4 million confirmed cases and over 380,000 deaths globally.¹ Governments have enforced border shutdowns, travel restrictions, and quarantines to “flatten the curve”. The COVID-19 outbreak has had a detrimental impact on not only the healthcare sector but also every aspect of human life, such as the education and economic sectors [11]. For example, over 100 countries have imposed nationwide (even complete) closures of education facilities, which has led to over 900 million learners being affected.² Statistics indicate that 3.3 million Americans applied for unemployment benefits in the week ending on March 21st and that the number doubled in the following week; before that, the highest number of unemployment applications ever received in one week was 695,000 in 1982.³
Along with the COVID-19 pandemic, we are also experiencing an “infodemic” of information with low credibility regarding COVID-19.⁴ Hundreds of news websites have contributed to publishing false coronavirus information.⁵ Individuals who believe false news articles (e.g., claiming that eating boiled garlic or drinking chlorine dioxide, an industrial bleach, can cure or prevent coronavirus) might take ineffective or extremely dangerous actions to protect themselves from the virus.⁶
¹ https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200604-covid-19-sitrep-136.pdf
² https://en.unesco.org/covid19/educationresponse
³ https://www.npr.org/2020/03/26/821580191/unemployment-claims-expected-to-shatter-records
⁴ https://www.un.org/en/un-coronavirus-communications-team/un-tackling-%E2%80%98infodemic%E2%80%99-misinformation-and-cybercrime-covid-19
⁵ https://www.newsguardtech.com/coronavirus-misinformation-tracking-center/
⁶ https://www.factcheck.org/2020/02/fake-coronavirus-cures-part-1-mms-is-industrial-bleach/
arXiv:2006.05557v2 [cs.SI] 17 Aug 2020
Given this background, research is motivated to combat this infodemic. Hence, we design and construct a multimodal repository, ReCOVery, to facilitate reliability assessment of news on COVID-19. We first broadly search and investigate ~2,000 news publishers, from which 60 with various political polarizations and from different countries are identified with extreme [high or low] credibility. As past literature has indicated, there is a close relationship between the credibility of news articles and their publication sources [20]. In total, 2,029 news articles on coronavirus are finally collected in the repository along with 140,820 tweets that reveal how these news articles are spread on the social network. The main contributions of this work are summarized as follows:
(1) We construct a repository to support research that investigates (i) how news with low credibility is created and spread in the COVID-19 pandemic and (ii) ways to predict such “fake” news. The manner in which the ground truth of news credibility is obtained allows a scalable repository, as annotators need not label each news article, which is time-consuming; instead, they can directly label the news site;
(2) ReCOVery provides multimodal information on COVID-19 news articles. For each news article, we collect its news content and social information revealing how it spreads on social media, which covers textual, visual, temporal, and network information; and
(3) We conduct extensive experiments using ReCOVery data, which include data analyses (data statistics and distributions) and baseline performances for predicting news credibility. These baselines give future methods a point of comparison. Baselines are obtained using either single-modal or multimodal information of news articles and utilize either traditional statistical learning or deep learning.
The rest of this paper is organized as follows. We first review the related datasets in Section 2. Then, we detail how the data is collected in Section 3. The statistics and distributions of the data are presented and analyzed in Section 4. Experiments that use the data to predict news credibility are designed and conducted in Section 5, whose results can be used as benchmarks. We conclude in Section 6.
2 RELATED WORK
Related datasets can be generally grouped as (I) COVID-19 datasets
and (II) “fake” news and rumor datasets.
COVID-19 Datasets. As a global emergency, the outbreak of COVID-19 has been labelled as a black swan event and likened to the economic scene of World War II [11]. With this background, a group of datasets has emerged, whose contributions range from real-time tracking of COVID-19 to help epidemiological forecasting (e.g., [5] and [18]) and collecting scholarly COVID-19 articles for literature-based discoveries (e.g., CORD-19⁷), to tracking the spreading of COVID-19 information on Twitter (e.g., [2]).
Specically, researchers at Johns Hopkins University develop
a Web-based dashboard
8
to visualize and track reported cases of
COVID-19 in real-time. The dashboard is released on January 22
nd
,
presenting the location and number of conrmed COVID-19 cases,
deaths, and recoveries for all aected countries [
5
]. Another dataset
shared publicly on March 24
th
is constructed to aid the analysis
and tracking of the COVID-19 epidemic, which provides real-time
individual-level data (e.g., symptoms; date of onset, admission, and
7https://www.semanticscholar.org/cord19
8https://coronavirus.jhu.edu/map.html
conrmation; and travel history) from national, provincial, and
municipal health reports [
18
]. The Allen Institute for AI has con-
tributed a free and dynamic database of more than 128,000 scholarly
articles about COVID-19, named CORD-19, to the global research
community.
7
The intention is to mobilize researchers to apply re-
cent advances in Natural Language Processing (NLP) to generate
new insights to support the ght against COVID-19. Furthemore,
Chen et al. [
2
] release the rst large-scale COVID-19 twitter dataset.
The dataset, updated regularly, collects COVID-19 tweets that are
posted from January 21st and across languages.
Though these datasets have been broadly investigated and have contributed to the research on the coronavirus pandemic, they do not provide the ground truth on the credibility of information on coronavirus to help fight the coronavirus infodemic.
“Fake” News and Rumor Datasets. Existing “fake” news and rumor datasets are collected with various focuses. These datasets may (i) only contain news content that can be full articles (e.g., NELA-GT-2018 [12]) or short claims (e.g., FEVER [16]); (ii) only contain social media information (e.g., CREDBANK [10]), where news refers to user posts; or (iii) contain both content and social media information (e.g., LIAR [17], FakeNewsNet [14], and FakeHealth [4]).
Specically, NELA-GT-2018 [
12
] is a large-scale dataset of around
713,000 news articles from February to November 2018. News
articles are collected from 194 news medium with multiple la-
bels directly obtained from NewsGuard, Pew Research Center,
Wikipedia, OpenSources, MBFC, AllSides, BuzzFeed News, and
PolitiFact. These labels refer to news credibility, transparency, po-
litical polarizations, and authenticity. FEVER dataset [
16
] consists
of
185,000 claims and is constructed following two steps: claim
generation and annotation. First, the authors extract sentences
from Wikipedia, and then the annotators manually generate a set
of claims based on the extracted sentences. Then, the annotators
label each claim as “supported”, “refuted”, or “not enough informa-
tion” by comparing it with the original sentence from which it is
developed. On the other hand, some datasets focus on user posts on
social media, for example, CREDBANK [
10
] includes more than 60
million tweets grouped into 1049 real-world events, each of which is
annotated by 30 human annotators, while some contain both news
content and social media information. By collecting both claims
and fact-check results (labels, i.e., “true”, “mostly true”, “half-true”,
“mostly false”, and “pants on re”) directly from PolitiFact, Wang
establishes the LIAR dataset [
17
] containing around 12,800 veried
statements made in public speeches and social medium. The afore-
mentioned datasets only contain textual information valuable for
NLP research with limited information on how “fake” news and
rumors spread on social networks, which motivate the construction
of FakeNewsNet and FakeHealth dataset [
4
,
14
]. The FakeNewsNet
dataset collects fact-checked (real or fake) full news articles from
PolitiFact (#=1,056) and GossipCop (#=22,140), respectively and
tracks news spreading on Twitter. The FakeHealth dataset collects
veried (real or fake) news reviews from HealthNewsReview.org
with detailed explanations and social engagements regarding news
spreading on Twitter that includes a user-user social network.
Note that FakeHealth concentrates on healthcare data, so does
CoAID, a recently released dataset for COVID-19 misinformation
research [3].
Figure 1: Data Collection Process for ReCOVery (stages shown: annotating the credibility of news sites; collecting COVID-19 news articles; tracking news spread on social media via tweets, users, and social context)
In general, compared to datasets such as NELA-GT-2018, FEVER, and LIAR, our repository provides multimodal information and social engagements of news articles. Compared to CREDBANK and FakeNewsNet, ReCOVery aims to fight the coronavirus infodemic and presents a novel approach to collecting and annotating data, which allows the trade-off between data scalability and label accuracy. Compared to FakeHealth and CoAID, news articles in ReCOVery are from a mix of domains that include healthcare.
3 DATA COLLECTION
The overall process by which we collect the data, including news content and social media information, is presented in Figure 1. To facilitate scalability, news credibility is assessed based on the credibility of the media (site) that publishes the news article. Based on the process outlined in Figure 1, we further detail how the data is collected by answering the following three questions: (1) how do we identify reliable (or unreliable) news sites that mainly release real news (or fake news)? (Section 3.1); (2) having determined such news sites, how do we crawl COVID-19 news articles from these sites, and which news components are valuable for collection? (Section 3.2); and (3) given COVID-19 news articles, how can we track their spread on social networks? (Section 3.3).
3.1 Filtering News Sites
To determine a list of reliable and unreliable news sites, we primarily
rely on two resources: NewsGuard and Media Bias/Fact Check.
NewsGuard.⁹ NewsGuard is developed to review and rate news websites. Its reliability rating team is formed by trained journalists and experienced editors, whose credentials and backgrounds are all transparent and available on the site. The performance (credibility) of each news website is assessed based on the following nine journalistic criteria:
(1) Does not repeatedly publish false content (22 points),
(2) Gathers and presents information responsibly (18 points),
(3) Regularly corrects or clarifies errors (12.5 points),
(4) Handles the difference between news and opinion responsibly (12.5 points),
(5) Avoids deceptive headlines (10 points),
(6) Website discloses ownership and financing (7.5 points),
(7) Clearly labels advertising (7.5 points),
(8) Reveals who's in charge, including possible conflicts of interest (5 points), and
(9) The site provides the names of content creators, along with either contact or biographical information (5 points),
where the overall score of a site is between 0 and 100; 0 indicates the lowest credibility, and 100 indicates the highest credibility. A news website with a NewsGuard score higher than 60 is often labeled as reliable; otherwise, it is unreliable. NewsGuard has provided ground truth for the construction of news datasets such as NELA-GT-2018 [12] for studying misinformation.
⁹ https://www.newsguardtech.com/
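The scoring scheme above can be sketched in a few lines. This is an illustrative sketch only: the criterion keys are hypothetical names introduced here, and it assumes each criterion is awarded all-or-nothing, which the text does not state explicitly.

```python
# Points of the nine criteria as listed above; they sum to 100.
# The dictionary keys are hypothetical labels, not NewsGuard identifiers.
CRITERIA_POINTS = {
    "no_repeated_false_content": 22.0,
    "responsible_information": 18.0,
    "corrects_errors": 12.5,
    "separates_news_and_opinion": 12.5,
    "no_deceptive_headlines": 10.0,
    "discloses_ownership": 7.5,
    "labels_advertising": 7.5,
    "reveals_who_is_in_charge": 5.0,
    "names_content_creators": 5.0,
}


def newsguard_score(passed):
    """Sum the points of the criteria a site satisfies (assumed all-or-nothing)."""
    unknown = set(passed) - set(CRITERIA_POINTS)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return sum(CRITERIA_POINTS[name] for name in set(passed))


def newsguard_label(score, threshold=60.0):
    """Default rule from the text: a score higher than 60 is reliable."""
    return "reliable" if score > threshold else "unreliable"


full = newsguard_score(CRITERIA_POINTS)  # a site meeting all nine criteria
print(full, newsguard_label(full))
```

A site that fails only the first criterion would score 78 and still pass the default threshold of 60, which illustrates why a stricter threshold is attractive when labels are inherited by every article of a site.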
Media Bias/Fact Check (MBFC).¹⁰ MBFC is a website that rates the factual accuracy and political bias of news media. The fact-checking team consists of Dave Van Zandt, the primary editor and the website owner, and some journalists and researchers (more details can be found on its “About” page). MBFC assigns each news outlet one of six factual-accuracy levels based on the fact-checking results of the news articles it has published (more details can be found on its “Methodology” page): (i) very high, (ii) high, (iii) mostly factual, (iv) mixed, (v) low, and (vi) very low. Such information has been used as ground truth for automatic fact-checking studies [1].
What Are Our Criteria? With reference to NewsGuard and MBFC, our criteria for determining reliable and unreliable news sites are:
✓ Reliable: A news site is reliable if its NewsGuard score is greater than 90 and its factual reporting on MBFC is very high or high.
× Unreliable: A news site is unreliable if its NewsGuard score is less than 30 and its factual reporting on MBFC is below mixed.
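The two criteria above amount to a joint threshold rule, sketched below. The function name and the `None` return for sites that meet neither criterion are assumptions made for illustration; the ordering of MBFC levels follows the six levels listed earlier.

```python
# MBFC factual-reporting levels, ordered from least to most factual.
MBFC_LEVELS = ["very low", "low", "mixed", "mostly factual", "high", "very high"]


def site_label(newsguard_score, mbfc_factual):
    """Return 'reliable', 'unreliable', or None when a site is not extreme enough."""
    rank = MBFC_LEVELS.index(mbfc_factual.lower())
    if newsguard_score > 90 and rank >= MBFC_LEVELS.index("high"):
        return "reliable"
    if newsguard_score < 30 and rank < MBFC_LEVELS.index("mixed"):
        return "unreliable"
    return None  # assumption: such a site is simply left out of the repository


print(site_label(100, "High"))   # reliable
print(site_label(15, "low"))     # unreliable
print(site_label(75, "high"))    # None
```

Requiring both resources to agree means a site is dropped whenever either signal is moderate, which trades coverage for label accuracy.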
Our search for news media with high credibility is conducted among the news media listed in MBFC (~2,000). To find news media with low credibility, we search in MBFC and the newly released “Coronavirus Misinformation Tracking Center”⁵ of NewsGuard, which provides a list of websites publishing false coronavirus information. Ultimately, we obtain a total of 60 news sites, of which 22 are sources of reliable news articles (e.g., National Public Radio¹¹ and Reuters¹²) and the remaining 38 are sources of unreliable news articles (e.g., Humans Are Free¹³ and Natural News¹⁴). The full list of sites considered in our repository is also available at http://coronavirus-fakenews.com. Note that several “fake” news media are not included, such as 70 News, Conservative 101, and Denver Guardian, since they no longer exist or their domains have become unavailable.
¹⁰ https://mediabiasfactcheck.com/
¹¹ https://www.npr.org
¹² https://www.reuters.com
¹³ http://humansarefree.com/
Figure 2: Credibility Distribution of Determined News Sites: (a) Reliable News Sites, (b) Unreliable News Sites
Figure 3: Examples of News Articles Collected: (a) Reliable News,¹⁵ (b) Unreliable News¹⁶
Also note that to achieve a good trade-off between dataset scalability and label accuracy, we utilize more extreme threshold scores (30 and 90) than the default one provided by NewsGuard (60). In this way, the selected news sites exhibit extreme reliability (or unreliability), which helps reduce the number of false positives and false negatives in the news labels in our repository; ideally, each news article published on a reliable site is factual, and each one published on an unreliable site is false. Figure 2 illustrates the credibility distributions of reliable and unreliable news sites. It can be observed from the figure that most reliable news sites have a full mark on NewsGuard and are labeled as “high” in factual reporting by MBFC; “very high” is rare for all sites listed in MBFC. In contrast, unreliable news sites share an average NewsGuard score of ~15 and a “low” factual label by MBFC; similarly, “very low” is rarely given on MBFC.
¹⁴ https://www.naturalnews.com
¹⁵ https://www.npr.org/sections/coronavirus-live-updates/2020/05/17/857512288/obama-malala-jonas-brothers-send-off-class-of-2020-in-virtual-graduation
¹⁶ https://humansarefree.com/2020/05/researchers-100-covid-19-cure-rate-using-intravenous-chlorine-dioxide.html
3.2 Collecting COVID-19 News Content
To crawl COVID-19 news articles from selected news sites, we first determine whether the news article is about COVID-19; the process is detailed in Section 3.2.1. Next, we detail how the data is crawled and the news content components that are included in our repository in Section 3.2.2.
3.2.1 News Topic Identification. To identify news articles on COVID-19, we use a list of keywords:
SARS-CoV-2,
COVID-19, and
Coronavirus.
News articles whose content contains any of the keywords (case-insensitive) are considered related to COVID-19. These three keywords are the official names announced by the WHO on February 11th, where SARS-CoV-2 (standing for Severe Acute Respiratory Syndrome CoronaVirus 2) is the name of the virus, and Coronavirus and COVID-19 are the names of the disease that the virus causes. Before the WHO announcement, COVID-19 was known as the “2019 novel coronavirus”,¹⁷ which also includes the coronavirus keyword that we consider. We consider only official names as keywords to avoid potential biases or even discrimination in naming. Furthermore, a news outlet (or article) that is credible, or pretends to be credible, often acts professionally and adopts the official name(s) of the disease/virus. Compared to articles that use biased and/or inaccurate terms, false news pretending to be professional is more detrimental and challenging to detect, which has become the focus of current fake news studies [24]. Examples of such news articles are illustrated in Figure 3.
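The keyword test described above can be sketched as a case-insensitive substring match. The function name is hypothetical, and matching on substrings (so that "coronaviruses" also counts) is an assumption about how "contains" is interpreted.

```python
import re

# The three official names used as keywords in the text.
KEYWORDS = ("SARS-CoV-2", "COVID-19", "Coronavirus")

# One alternation pattern; re.escape guards the hyphens and digits.
_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def is_covid_related(text):
    """True if the text contains any keyword, ignoring case."""
    return _PATTERN.search(text) is not None


print(is_covid_related("The 2019 novel coronavirus spreads"))  # True
print(is_covid_related("Flu season begins"))                   # False
```

The first call shows why January articles are still captured: the pre-announcement name already contains the keyword "coronavirus".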
3.2.2 Crawling News Content. The content crawler relies on the Newspaper Python library.¹⁸ The content of each news article corresponds to twelve components:
(C1) News ID: Each news article is assigned a unique ID as its identity;
(C2) News URL: The URL of the news article. The URL helps us verify the correctness of the collected data. It can also be used as the reference and source when repository users would like to extend the repository by fetching additional information;
(C3) Publisher: The name of the news media (site) that publishes the news article;
(C4) Publication Date: The date (in yyyy-mm-dd format) on which the news article was published on the site, which provides temporal information to support the investigation of, e.g., the relationship between the misinformation volume and the outbreak of COVID-19 over time;
(C5) Author: The author(s) of the news article, whose number can be none, one, or more than one. Note that some news articles might have fictional author names. Author information is valuable in evaluating news credibility by either investigating the collaboration network of authors [15] or exploring its relationships with news publishers and content [21];
(C6-7) News Title and Bodytext as the main textual information;
(C8) News Image as the main visual information, which is provided in the form of a link (URL). Note that most images within the news page are noise: they can be advertisements, images belonging to other news articles due to the recommender systems embedded in news sites, logos of news sites, and/or social media icons, such as Twitter and Facebook logos for sharing. Hence, we specifically fetch the main/head/top image for each news article to reduce noise;
(C9) Country: The name of the country where the news is published;
(C10) Political Bias: Each news article is labeled as one of ‘extremely left’, ‘left’, ‘left-center’, ‘center’, ‘right-center’, ‘right’, and ‘extremely right’, which is equivalent to the political bias of its publisher. News political bias is verified by two resources, AllSides¹⁹ and MBFC, both of which rely on domain experts to label media bias; and
(C11-12) NewsGuard score and MBFC factual reporting as the original ground truth of news credibility, which has been detailed in Section 3.1.
¹⁷ https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/naming-the-coronavirus-disease-(covid-2019)-and-the-virus-that-causes-it
¹⁸ https://github.com/codelucas/newspaper
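One way to picture the twelve components is as a single record type. The sketch below is not the repository's actual schema: the class name, field names, and validation are hypothetical and only mirror components C1 to C12 as described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

POLITICAL_BIAS = (
    "extremely left", "left", "left-center", "center",
    "right-center", "right", "extremely right",
)


@dataclass
class NewsRecord:
    """Hypothetical container for components C1-C12."""
    news_id: int                         # C1
    url: str                             # C2
    publisher: str                       # C3
    title: str                           # C6
    body_text: str                       # C7
    country: str                         # C9
    political_bias: str                  # C10, inherited from the publisher
    newsguard_score: float               # C11
    mbfc_factual: str                    # C12
    publish_date: Optional[str] = None   # C4, yyyy-mm-dd; None if not accessible
    authors: List[str] = field(default_factory=list)  # C5, may be empty
    image_url: Optional[str] = None      # C8, link to the main image

    def __post_init__(self):
        if self.political_bias not in POLITICAL_BIAS:
            raise ValueError(f"unknown political bias: {self.political_bias!r}")

    @property
    def is_multimodal(self):
        """True when both body text and a main image are available."""
        return bool(self.body_text) and self.image_url is not None


record = NewsRecord(1, "https://example.org/a", "Example Site", "Title",
                    "Body", "US", "center", 95.0, "high")
print(record.is_multimodal)  # False until an image URL is attached
```

Keeping the publication date, authors, and image optional reflects the cases noted in the text where such information is missing or deliberately left blank.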
3.3 Tracking News Spreading on Social Media
We rst use Twitter Premium Search API
20
to track the spread of
collected news articles on Twitter. Specically, our search is based
on the URL of each news article and looks for tweets posted after
the date when the news article was published to the current date
(for the current version of the dataset, this date is May 26
th
). Twit-
ter Search API can return the corresponding tweets with detailed
information such as their IDs, text, languages of text, times of being
created, statistics on retweeted/replied/liked. Also, it returns the
information of users who post these tweets, such as user IDs and
their number of followers/friends. To comply with Twitter’s Terms
of Service,
21
we only publicly release the IDs of the collected data
for non-commercial research use, but provide the instructions for
obtaining the tweets using the released IDs for user convenience.
More details can be seen in http://coronavirus-fakenews.com.
4 DATA STATISTICS AND DISTRIBUTIONS
The general statistics of our dataset are presented in Table 1. The dataset contains 2,029 news articles, most of which have both textual and visual information for multimodal studies (#=2,017) [23] and have been shared on social media (#=1,747). The dataset is imbalanced in news class: the proportion of reliable versus unreliable news articles is around 2:1. The number of users who spread reliable news (#=78,659) plus that of users spreading unreliable news (#=17,323) is greater than the total number of users included in the dataset (#=93,761). This observation indicates that some users engage in spreading both reliable and unreliable news articles.
¹⁹ https://www.allsides.com/unbiased-balanced-news
²⁰ https://developer.twitter.com/en/docs/tweets/search/overview/premium
²¹ https://developer.twitter.com/en/developer-terms/agreement-and-policy
Table 1: Data Statistics
                          Reliable   Unreliable     Total
News articles                1,364          665     2,029
  w/ images                  1,354          663     2,017
  w/ social information      1,219          528     1,747
Tweets                     114,402       26,418   140,820
Users                       78,659       17,323    93,761
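The size of the user overlap follows from the counts in Table 1 by inclusion-exclusion. The short calculation below assumes that the "Total" user count refers to distinct users, which the text implies but does not state outright.

```python
# User counts from Table 1.
reliable_spreaders = 78_659
unreliable_spreaders = 17_323
distinct_users = 93_761  # assumed to count each user once

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
spread_both = reliable_spreaders + unreliable_spreaders - distinct_users
share_of_all = spread_both / distinct_users

print(spread_both)             # 2221
print(f"{share_of_all:.1%}")   # 2.4%
```

Under that assumption, roughly 2,200 users shared at least one reliable and at least one unreliable article.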
Next, we visualize the distributions of data features/attributes.
Distribution of News Publishers. Figure 4 shows the number of COVID-19 news articles published in each [extremely reliable or extremely unreliable] news site. There are five unreliable publishers with no news on COVID-19; hence, they are not presented in the figure. We keep these publishers in our repository as the data will be updated over time and these publishers may publish news articles on COVID-19 in the future.
News Publication Dates. The distribution of news publication dates is presented in Figure 5, where all articles are published in 2020. We point out that from January to May, the number of COVID-19 news articles published increases significantly (exponentially). The possible explanation for this phenomenon is three-fold. First, from the time that the outbreak was first identified in Wuhan, China (December 2019) [7] to May 2020, the numbers of confirmed cases and deaths caused by SARS-CoV-2 have grown exponentially worldwide.¹ Meanwhile, the virus has become a world topic and has triggered more and more discussions on a worldwide scale. Second, some older news articles are no longer available, which has motivated us to update the dataset in a timely manner. Third, the keywords we have used to identify COVID-19 news articles are the official ones provided by the WHO in February.¹⁷ Some news articles published in January are also collected, as before the WHO announcement COVID-19 was known as the “2019 novel coronavirus,” which also includes one of our keywords, “coronavirus.” We have detailed the reasons behind our keyword selection in Section 3.2.1. Note that there is a small group of news articles whose publication dates are not accessible, which we denote as N/A in Figure 5.
News Authors and Author Collaborations. Figure 6 presents the distribution of the number of authors contributing to news articles, which is governed by a long-tail distribution: most articles are contributed by ≤5 authors. Instead of including the [real or virtual] names of the authors, some articles provide publisher names as authors. Since publisher information is already available in the repository, we leave the author information of these news articles blank, i.e., their number of authors is zero. Furthermore, we construct the coauthorship network, shown in Figure 7. It can be observed from the network that node degrees also follow a power-law-like distribution: among 1,095 nodes (authors), over 90% have at most two collaborators.
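A coauthorship network of this kind can be derived from the per-article author lists. The sketch below is an assumption about the construction, not the authors' code: it links every pair of authors who share an article and keeps solo authors as isolated nodes; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_degrees(author_lists):
    """Map each author to the number of distinct collaborators."""
    neighbors = defaultdict(set)
    for authors in author_lists:
        unique = sorted(set(authors))
        for a in unique:
            neighbors[a]  # touch the key so solo authors become isolated nodes
        for a, b in combinations(unique, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {a: len(n) for a, n in neighbors.items()}


def share_with_at_most(degrees, k):
    """Fraction of authors with at most k collaborators."""
    return sum(1 for d in degrees.values() if d <= k) / len(degrees)


# Toy input: articles with zero, one, or two authors.
articles = [["A", "B"], ["A", "C"], ["D"], []]
degrees = coauthor_degrees(articles)
print(degrees)                         # {'A': 2, 'B': 1, 'C': 1, 'D': 0}
print(share_with_at_most(degrees, 1))  # 0.75
```

Applied to the repository, `share_with_at_most(degrees, 2)` would be the quantity the paragraph above reports as exceeding 90%.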
News Content Statistics. Figures 8 and 9 reveal textual characteristics of news content (including news title and bodytext). It can be observed from Figure 8 that the number of words within news content follows a long-tail (power-law-like) distribution, with an average value of ~800 and a median value of ~600. On the other hand, Figure 9 provides the word cloud for the entire repository. As the news articles collected share the same COVID-19 topic, some relevant topics and vocabularies have been naturally and frequently used by the news authors, such as “coronavirus” (#=6,465), “COVID” (#=5,413), “state” (#=4,432), “test” (#=4,274), “health” (#=3,714), “pandemic” (#=3,427), “virus” (#=2,903), “home” (#=2,871), “case” (#=2,676), and “Trump” (#=2,431), which are illustrated with word font size scaled to their frequencies.
Figure 4: Distribution of News Publishers
Figure 5: Publication Date
Figure 6: Author Count
Country Distribution. Figure 10 reveals the countries that news
and news publishers belong to. It can be observed that in total
six countries – United States (abbr. US), Russia (abbr. RU), United
Kingdom (abbr. UK), Iran (abbr. IR), Cyprus (abbr. CY), and Canada
(abbr. CA) – are covered, where US news and news publishers
constitute the majority of the population.
Figure 7: Author Collaborations: (a) Network, (b) Degree Distribution
Figure 8: Word Count
Figure 9: Word Cloud
Figure 10: Country: (a) News Publishers, (b) News Articles
Figure 11: Political Bias: (a) News Publishers, (b) News Articles
Figure 12: Spreading Frequency
Figure 13: News Spreaders
Figure 14: Follower Distribution
Figure 15: Friend Distribution
Political Bias. Figure 11 provides the distribution of political bias of news and news media (publishers). It can be observed from the figure that for both news and publishers, the distribution for those exhibiting a right bias (including extremely right (abbr. Ex. R), right (abbr. R), and right-center (abbr. R-C)) is more balanced compared to those exhibiting a left bias (including extremely left (abbr. Ex. L), left (abbr. L), and left-center (abbr. L-C)).
News Spreading Frequencies. Figure 12 shows the distribution of
the number of tweets sharing each news article. The distribution
exhibits a long tail – over 80% of news articles are spread less than
100 times while a few have been shared by thousands of tweets.
News Spreaders. The distribution of the number of spreaders for
each news article is shown in Figure 13. It differs from the
distribution in Figure 12 because one user can spread a news article
multiple times. As for the social connections of news spreaders, the
distributions of their followers and friends are presented in Figures
14 and 15, respectively; the most popular spreader has over 40 million
followers (or 600,000 friends).
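The distinction between the number of tweets sharing an article (Figure 12) and the number of distinct spreaders (Figure 13) can be illustrated with a minimal sketch. It assumes the tweets are available as (news_id, user_id) pairs; these field names and the toy records are hypothetical and do not reflect the repository's actual schema.

```python
from collections import defaultdict


def spreading_stats(tweets):
    """Return ({news_id: tweet count}, {news_id: distinct spreader count})."""
    tweet_counts = defaultdict(int)
    spreaders = defaultdict(set)
    for news_id, user_id in tweets:
        tweet_counts[news_id] += 1          # every tweet counts once
        spreaders[news_id].add(user_id)     # a user counts once per article
    return dict(tweet_counts), {n: len(u) for n, u in spreaders.items()}


def share_below(counts, threshold):
    """Fraction of news articles shared by fewer than `threshold` tweets."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c < threshold) / len(counts)


# Toy data (hypothetical): user "u1" shares article "n1" twice.
toy_tweets = [("n1", "u1"), ("n1", "u1"), ("n1", "u2"), ("n2", "u3")]
tweet_counts, spreader_counts = spreading_stats(toy_tweets)
```

On the toy data, article "n1" is shared by three tweets but only two spreaders, which is the reason the two distributions differ. Calling `share_below(tweet_counts, 100)` on the real data would give the fraction behind the long-tail observation.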
5 FORMING BASELINES: USING ReCOVery TO
PREDICT COVID-19 NEWS CREDIBILITY
In this section, several methods that often act as baselines are
developed and used to predict COVID-19 news credibility using
ReCOVery data, in the hope of facilitating future studies. These methods
(baselines) are first specified in Section 5.1. The implementation
details of experiments are then provided in Section 5.2. Finally, we
present the performance results for these methods in Section 5.3.
5.1 Methods
We use the following methods as our baselines. These methods
can be grouped by their learning framework, which is either a
traditional statistical learner such as SVM (e.g., LIWC) or a neural
network (e.g., Text-CNN and SAFE). Baselines can also be grouped
as single-modal methods (e.g., LIWC, RST, and Text-CNN) or
multimodal methods (e.g., SAFE).
LIWC [13] (see footnote 22). LIWC (Linguistic Inquiry and Word Count) is
a widely accepted psycholinguistic lexicon. Given a news story,
LIWC can count the words in the text falling into one or more of
93 linguistic, psychological, and topical categories, based on which
93 features are extracted and often classified within a traditional
statistical learning framework [22].
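The lexicon-based feature extraction described above can be sketched as follows. The actual LIWC dictionary is proprietary and has 93 categories; the three-category `TOY_LEXICON` below is entirely hypothetical, and normalizing counts by text length is an assumption made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; NOT the LIWC dictionary.
TOY_LEXICON = {
    "negemo": {"fear", "panic", "hoax"},
    "health": {"virus", "vaccine", "hospital"},
    "certain": {"always", "never", "definitely"},
}


def category_features(text, lexicon=TOY_LEXICON):
    """Return, per category (sorted by name), the share of tokens in it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:          # a word may fall into several categories
                counts[category] += 1
    total = len(tokens) or 1            # avoid division by zero on empty text
    return [counts[c] / total for c in sorted(lexicon)]


features = category_features("The virus is never a hoax")
```

The resulting fixed-length vector (one value per category) is what a traditional classifier such as a decision tree would take as input.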
RST. RST (Rhetorical Structure Theory) organizes a piece of content
as a tree that captures the rhetorical relations among its phrases
and sentences. We use a pretrained RST parser [8] (see footnote 23) to
obtain the tree for each news article and count each rhetorical relation
(in total, 45) within a tree, based on which 45 features are extracted
and classified in a traditional statistical learning framework.
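Counting rhetorical relations in a parsed tree can be sketched as below. The nested-dictionary tree format and the three relation labels are hypothetical stand-ins; the output format of the parser used in the paper is not described here, and the real feature vector has 45 entries.

```python
from collections import Counter

# Hypothetical subset of relation labels (the paper counts 45).
RELATIONS = ["attribution", "contrast", "elaboration"]


def count_relations(tree, counts=None):
    """Recursively count relation labels.

    Internal nodes are assumed to look like
    {"relation": str, "children": [...]}; leaves (text spans) are strings.
    """
    if counts is None:
        counts = Counter()
    if isinstance(tree, dict):
        counts[tree["relation"]] += 1
        for child in tree["children"]:
            count_relations(child, counts)
    return counts


def relation_features(tree, relations=RELATIONS):
    """One count per relation label, in a fixed order."""
    counts = count_relations(tree)
    return [counts[r] for r in relations]


toy_tree = {
    "relation": "elaboration",
    "children": [
        "span A",
        {"relation": "attribution", "children": ["span B", "span C"]},
        {"relation": "elaboration", "children": ["span D", "span E"]},
    ],
}
toy_features = relation_features(toy_tree)
```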
Text-CNN [9]. Text-CNN relies on a convolutional neural network
for text classification, which contains a convolutional layer
and max pooling.
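The two operations named above, a convolution over consecutive tokens followed by max pooling, can be sketched in plain Python. This is an illustration of the idea only, not the implementation used in the experiments; the ReLU activation, the toy embeddings, and the kernel weights are assumptions.

```python
def conv1d_valid(embeddings, kernel, bias=0.0):
    """Slide a kernel covering len(kernel) consecutive tokens over the text.

    embeddings: list of token vectors (lists of floats).
    kernel: list of weight vectors, one per token position in the window.
    Returns one ReLU-activated value per window position.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        total = bias
        for offset in range(width):
            token, weights = embeddings[start + offset], kernel[offset]
            total += sum(t * w for t, w in zip(token, weights))
        feature_map.append(max(0.0, total))  # ReLU (assumed activation)
    return feature_map


def max_over_time(feature_map):
    """Keep the strongest response; 0.0 if the text is shorter than the kernel."""
    return max(feature_map, default=0.0)


def text_cnn_features(embeddings, kernels):
    """One pooled feature per kernel, ready for a final classification layer."""
    return [max_over_time(conv1d_valid(embeddings, k)) for k in kernels]


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_kernels = [
    [[1.0, 0.0], [0.0, 1.0]],  # width-2 kernel
    [[-1.0, -1.0]],            # width-1 kernel
]
pooled = text_cnn_features(toy_embeddings, toy_kernels)
```

Max pooling makes the output length independent of the article length, which is why texts of different sizes can share one classifier.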
SAFE [23] (see footnote 24). SAFE is a neural-network-based method that
utilizes multimodal news information for fake news detection, where
the news representation is learned jointly from the textual and visual
information of the news along with their relationship. SAFE helps
recognize the falseness of news in its text, its images, and/or the
“irrelevance” between the text and images.
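One plausible way to quantify the "irrelevance" between text and images is through the cosine similarity of their learned representations. The sketch below is an assumption made for illustration and should not be read as the SAFE implementation; the function names and the rescaling to [0, 1] are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def irrelevance(text_vec, image_vec):
    """Map cosine similarity in [-1, 1] to an irrelevance score in [0, 1]."""
    return (1.0 - cosine_similarity(text_vec, image_vec)) / 2.0
```

Under this sketch, a news article whose image representation points away from its text representation receives a score close to 1, which a classifier could use as one signal alongside the textual and visual features themselves.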
22. https://liwc.wpengine.com/
23. https://github.com/jiyfeng/DPLP
24. https://github.com/Jindi0/SAFE
Table 2: Baseline Performance in Predicting COVID-19 News Credibility Using ReCOVery Data

            Reliable news            Unreliable news
Method      Pre.    Rec.    F1       Pre.    Rec.    F1
LIWC+DT     0.779   0.771   0.775    0.540   0.552   0.545
RST+DT      0.721   0.705   0.712    0.421   0.441   0.430
Text-CNN    0.746   0.782   0.764    0.522   0.472   0.496
SAFE        0.836   0.829   0.833    0.667   0.677   0.672
5.2 Implementation Details
The overall dataset is randomly divided into training and testing
sets in a 0.8:0.2 proportion. Because the dataset is unbalanced
between reliable and unreliable news articles (2:1), we evaluate the
prediction results in terms of precision, recall, and the F1 score.
For methods relying on traditional statistical learners, multiple
well-established classifiers are adopted in our experiments: Logistic
Regression (LR), Naïve Bayes (NB), k-Nearest Neighbor (k-NN), Random
Forest (RF), Decision Tree (DT), and Support Vector Machines (SVM).
Due to space limitations, we present only the best-performing one.
All code is available at http://coronavirus-fakenews.com.
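The evaluation protocol can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released code: the function names, the fixed random seed, and the toy labels ("r" for reliable, "u" for unreliable) are hypothetical.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Randomly divide items into training and testing sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 with `positive` treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


train, test = split_dataset(range(10))
y_true = ["r", "r", "r", "u"]
y_pred = ["r", "r", "u", "u"]
reliable_scores = precision_recall_f1(y_true, y_pred, positive="r")
unreliable_scores = precision_recall_f1(y_true, y_pred, positive="u")
```

Reporting the three scores separately for each class matters on unbalanced data, because a classifier that always predicts the majority class would still reach a high overall accuracy.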
5.3 Experimental Results
Prediction results are provided in Table 2. The four baselines
achieve an F1 score (precision, recall) ranging from 71% (72%,
71%) to 83% (84%, 83%) in identifying reliable news, and from
43% (42%, 44%) to 67% (67%, 68%) in identifying unreliable news.
Additionally, multimodal features are generally more representative
than single-modal features in predicting news credibility: SAFE, the
only multimodal baseline, scores highest on every metric in Table 2.
We point out that the four baselines are content-based methods;
developing more advanced methods by mining social media [19] is
encouraged.
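As a sanity check on the reported numbers, the F1 scores in Table 2 can be recomputed from the reported precision and recall using F1 = 2PR / (P + R). The values below are copied from Table 2; the helper names are hypothetical, and small differences are expected because the table reports rounded values.

```python
# (precision, recall, F1) per method and class, copied from Table 2.
TABLE_2 = {
    "LIWC+DT": {"reliable": (0.779, 0.771, 0.775), "unreliable": (0.540, 0.552, 0.545)},
    "RST+DT": {"reliable": (0.721, 0.705, 0.712), "unreliable": (0.421, 0.441, 0.430)},
    "Text-CNN": {"reliable": (0.746, 0.782, 0.764), "unreliable": (0.522, 0.472, 0.496)},
    "SAFE": {"reliable": (0.836, 0.829, 0.833), "unreliable": (0.667, 0.677, 0.672)},
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def max_f1_discrepancy(table=TABLE_2):
    """Largest gap between a reported F1 and the one recomputed from P and R."""
    return max(
        abs(f1_from(p, r) - f1)
        for row in table.values()
        for p, r, f1 in row.values()
    )


def best_method(table, label):
    """Method with the highest reported F1 for the given class."""
    return max(table, key=lambda method: table[method][label][2])
```

The recomputed values agree with the reported ones to within rounding, and `best_method` returns SAFE for both classes.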
6 CONCLUSION
To fight the coronavirus infodemic, we construct a multimodal
repository for COVID-19 news credibility research, which provides
textual, visual, temporal, and network information regarding news
content and how news spreads on social media. The repository
balances data scalability and label accuracy. To facilitate future
studies, benchmarks are developed and their performance in predicting
news credibility using the repository data is presented. The data
could be further enhanced in two ways: (1) by including COVID-19 news
articles in various languages such as Chinese, Russian, Spanish, and
Italian, together with information on how these articles spread on the
social media platforms popular among speakers of those languages, e.g.,
Sina Weibo (China); and (2) by introducing ground truth for, for example,
hate speech, clickbait, and social bots [6], which would help study the
bias and discrimination bred by the virus, as well as the correlation
among all information and accounts with low credibility. Both (1)
and (2) will be our future work.
ACKNOWLEDGMENTS
Emilio Ferrara is supported by the Defense Advanced Research
Projects Agency (DARPA, grant number W911NF-17-C-0094).
REFERENCES
[1] Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, James Glass, and Preslav Nakov. 2018. Predicting factuality of reporting and bias of news media sources. arXiv preprint arXiv:1810.01765 (2018).
[2] Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health and Surveillance 6, 2 (2020), e19273.
[3] Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 Healthcare Misinformation Dataset. arXiv preprint arXiv:2006.00885 (2020).
[4] Enyan Dai, Yiwei Sun, and Suhang Wang. 2020. Ginger Cannot Cure Cancer: Battling Fake Health News with a Comprehensive Data Repository. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14. 853–862.
[5] Ensheng Dong, Hongru Du, and Lauren Gardner. 2020. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases 20, 5 (2020), 533–534.
[6] Emilio Ferrara. 2019. The history of digital spam. Commun. ACM 62, 8 (2019), 82–91.
[7] Chaolin Huang, Yeming Wang, Xingwang Li, Lili Ren, Jianping Zhao, Yi Hu, Li Zhang, Guohui Fan, Jiuyang Xu, Xiaoying Gu, et al. 2020. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet 395, 10223 (2020), 497–506.
[8] Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 13–24.
[9] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[10] Tanushree Mitra and Eric Gilbert. 2015. CREDBANK: A Large-scale Social Media Corpus with Associated Credibility Annotations. In Ninth International AAAI Conference on Web and Social Media.
[11] Maria Nicola, Zaid Alsafi, Catrin Sohrabi, Ahmed Kerwan, Ahmed Al-Jabir, Christos Iosifidis, Maliha Agha, and Riaz Agha. 2020. The socio-economic implications of the coronavirus and COVID-19 pandemic: A review. International Journal of Surgery (2020).
[12] Jeppe Nørregaard, Benjamin D Horne, and Sibel Adalı. 2019. NELA-GT-2018: A Large Multi-Labelled News Dataset for the Study of Misinformation in News Articles. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13. 630–638.
[13] James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. 2015. The development and psychometric properties of LIWC2015. Technical Report.
[14] Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2018. FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media. arXiv preprint arXiv:1809.01286 (2018).
[15] Niraj Sitaula, Chilukuri K Mohan, Jennifer Grygiel, Xinyi Zhou, and Reza Zafarani. 2020. Credibility-based Fake News Detection. In Disinformation, Misinformation and Fake News in Social Media: Emerging Research Challenges and Opportunities. Springer.
[16] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355 (2018).
[17] William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648 (2017).
[18] Bo Xu, Bernardo Gutierrez, Sumiko Mekaru, Kara Sewalk, Lauren Goodwin, Alyssa Loskill, Emily L Cohn, Yulin Hswen, Sarah C Hill, Maria M Cobo, et al. 2020. Epidemiological data from the COVID-19 outbreak, real-time case information. Scientific Data 7, 1 (2020), 1–6.
[19] Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu. 2014. Social media mining: an introduction. Cambridge University Press.
[20] Reza Zafarani, Xinyi Zhou, Kai Shu, and Huan Liu. 2019. Fake News Research: Theories, Detection Strategies, and Open Problems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 3207–3208.
[21] Jiawei Zhang, Bowen Dong, and Philip S Yu. 2020. Fakedetector: Effective fake news detection with deep diffusive neural network. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1826–1829.
[22] Xinyi Zhou, Atishay Jain, Vir V Phoha, and Reza Zafarani. 2020. Fake News Early Detection: A Theory-driven Model. Digital Threats: Research and Practice 1, 2 (2020), 1–25.
[23] Xinyi Zhou, Jindi Wu, and Reza Zafarani. 2020. SAFE: Similarity-Aware Multi-Modal Fake News Detection. In The 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer.
[24] Xinyi Zhou and Reza Zafarani. 2020. A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities. ACM Computing Surveys (CSUR) (2020).