ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research
Data Lab, EECS Department
Information Sciences Institute, University of Southern California
ABSTRACT
First identified in Wuhan, China, in December 2019, the outbreak of COVID-19 was declared a global emergency in January, and a pandemic in March 2020, by the World Health Organization (WHO). Along with this pandemic, we are also experiencing an "infodemic" of information with low credibility, such as fake news and conspiracies. In this work, we present ReCOVery, a repository designed and constructed to facilitate research on combating such information regarding COVID-19. We first broadly search and investigate around 2,000 news publishers, from which 60 are identified with extreme [high or low] levels of credibility. By inheriting the credibility of the media on which they were published, a total of 2,029 news articles on coronavirus, published from January to May 2020, are collected in the repository, along with 140,820 tweets that reveal how these news articles have spread on the Twitter social network. The repository provides multimodal information on coronavirus news articles, including textual, visual, temporal, and network information. The way that news credibility is obtained allows a trade-off between dataset scalability and label accuracy. Extensive experiments are conducted to present data statistics and distributions, as well as to provide baseline performances for predicting news credibility so that future methods can be compared. Our repository is available at http://coronavirus-fakenews.com.
Our repository is available at http://coronavirus-fakenews.com.
CCS CONCEPTS
• Information systems → Collaborative and social computing systems and tools; Clustering and classification; • Security and privacy → Social aspects of security and privacy.

KEYWORDS
Repository; COVID-19; coronavirus; pandemic; infodemic; information credibility; fake news; multimodal; social media
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from firstname.lastname@example.org.
CIKM ’20, October 19–23, 2020, Virtual Event, Ireland
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-6859-9/20/10…$15.00
ACM Reference Format:
Xinyi Zhou, Apurva Mulay, Emilio Ferrara, and Reza Zafarani. 2020.
ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research. In Pro-
ceedings of the 29th ACM International Conference on Information and Knowl-
edge Management (CIKM ’20), October 19–23, 2020, Virtual Event, Ireland.
ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3340531.3412880
1 INTRODUCTION
As of June 4, 2020, the COVID-19 pandemic has resulted in over 6.4 million confirmed cases and over 380,000 deaths globally. Governments have enforced border shutdowns, travel restrictions, and quarantines to "flatten the curve". The COVID-19 outbreak has had a detrimental impact not only on the healthcare sector but on every aspect of human life, such as the education and economic sectors. For example, over 100 countries have imposed nationwide (even complete) closures of education facilities, which has led to over 900 million learners being affected. Statistics indicate that ~3.3 million Americans applied for unemployment benefits in the week ending on March 21, and the number doubled in the following week; before that, the highest number of unemployment applications ever received in one week was 695,000, in 1982.
Along with the COVID-19 pandemic, we are also experiencing an "infodemic" of information with low credibility regarding COVID-19. Hundreds of news websites have contributed to publishing false coronavirus information. Individuals who believe false news articles (e.g., claiming that eating boiled garlic or drinking chlorine dioxide, an industrial bleach, can cure or prevent coronavirus) might take ineffective or even extremely dangerous actions to protect themselves from the virus.
Given this background, research is motivated to combat this infodemic. Hence, we design and construct a multimodal repository, ReCOVery, to facilitate reliability assessment of news on COVID-19. We first broadly search and investigate around 2,000 news publishers, from which 60 with various political polarizations and from different countries are identified with extreme [high or low] credibility. As past literature has indicated, there is a close relationship between the credibility of news articles and their publication sources. In total, 2,029 news articles on coronavirus are finally collected in the repository, along with 140,820 tweets that reveal how these news articles are spread on the social network. The main contributions of this work are summarized as follows:
• We construct a repository to support research that investigates (i) how news with low credibility is created and spread during the COVID-19 pandemic and (ii) ways to predict such "fake" news. The manner in which the ground truth of news credibility is obtained allows a scalable repository: annotators need not label each news article, which is time-consuming; instead, they can directly label the news site;
• ReCOVery provides multimodal information on COVID-19 news articles. For each news article, we collect its news content and the social information revealing how it spreads on social media, which covers textual, visual, temporal, and network information; and
• We conduct extensive experiments using ReCOVery data, which include data analyses (data statistics and distributions) and baseline performances for predicting news credibility. These baselines allow future methods to be compared to. Baselines are obtained using either single-modal or multimodal information of news articles and utilize either traditional statistical learning or deep learning.
The rest of this paper is organized as follows. We first review related datasets in Section 2. Then, we detail how the data is collected in Section 3. The statistics and distributions of the data are presented and analyzed in Section 4. Experiments that use the data to predict news credibility are designed and conducted in Section 5, whose results can be used as benchmarks. We conclude in Section 6.
2 RELATED WORK
Related datasets can be generally grouped as (I) COVID-19 datasets
and (II) “fake” news and rumor datasets.
COVID-19 Datasets. As a global emergency, the outbreak of COVID-19 has been labelled as a black swan event and likened to the economic scene of World War II. With this background, a group of datasets have emerged, whose contributions range from real-time tracking of COVID-19 to help epidemiological forecasting, and collecting scholarly COVID-19 articles for literature-based discoveries (e.g., CORD-19), to tracking the spread of COVID-19 information on Twitter.
Specifically, researchers at Johns Hopkins University developed a Web-based dashboard to visualize and track reported cases of COVID-19 in real time. The dashboard, released on January 22, presents the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries. Another dataset, shared publicly on March 24, was constructed to aid the analysis and tracking of the COVID-19 epidemic; it provides real-time individual-level data (e.g., symptoms; dates of onset, admission, and confirmation; and travel history) from national, provincial, and municipal health reports. The Allen Institute for AI has contributed a free and dynamic database of more than 128,000 scholarly articles about COVID-19, named CORD-19, to the global research community. The intention is to mobilize researchers to apply recent advances in Natural Language Processing (NLP) to generate new insights to support the fight against COVID-19. Furthermore, Chen et al. released the first large-scale COVID-19 Twitter dataset. The dataset, updated regularly, collects COVID-19 tweets posted from January 21st onward, across languages.
Though these datasets have been broadly investigated and have contributed to research on the coronavirus pandemic, they do not provide ground truth on the credibility of coronavirus information to help fight the coronavirus infodemic.
“Fake” News and Rumor Datasets. Existing “fake” news and rumor datasets are collected with various focuses. These datasets may (i) only contain news content, which can be full articles (e.g., NELA-GT-2018) or short claims (e.g., FEVER); (ii) only contain social media information (e.g., CREDBANK), where news refers to user posts; or (iii) contain both content and social media information (e.g., LIAR, FakeNewsNet, and FakeHealth).
Specifically, NELA-GT-2018 is a large-scale dataset of around 713,000 news articles from February to November 2018. News articles are collected from 194 news media, with multiple labels directly obtained from NewsGuard, Pew Research Center, Wikipedia, OpenSources, MBFC, AllSides, BuzzFeed News, and PolitiFact. These labels refer to news credibility, transparency, political polarization, and authenticity. The FEVER dataset contains around 185,000 claims and is constructed in two steps: claim generation and annotation. First, the authors extract sentences from Wikipedia, and the annotators manually generate a set of claims based on the extracted sentences. Then, the annotators label each claim as “supported”, “refuted”, or “not enough information” by comparing it with the original sentence from which it was developed. On the other hand, some datasets focus on user posts on social media; for example, CREDBANK includes more than 60 million tweets grouped into 1,049 real-world events, each of which is annotated by 30 human annotators. Some datasets contain both news content and social media information. By collecting both claims and fact-check results (labels, i.e., “true”, “mostly true”, “half-true”, “mostly false”, and “pants on fire”) directly from PolitiFact, Wang establishes the LIAR dataset, containing around 12,800 verified statements made in public speeches and on social media. The aforementioned datasets only contain textual information valuable for NLP research, with limited information on how “fake” news and rumors spread on social networks, which motivated the construction of the FakeNewsNet and FakeHealth datasets. The FakeNewsNet dataset collects fact-checked (real or fake) full news articles from PolitiFact (#=1,056) and GossipCop (#=22,140), respectively, and tracks news spreading on Twitter. The FakeHealth dataset collects verified (real or fake) news reviews from HealthNewsReview.org, with detailed explanations and social engagements regarding news spreading on Twitter that include a user-user social network. Note that FakeHealth concentrates on healthcare data, as does CoAID, a recently released dataset for COVID-19 misinformation.
Figure 1: Data Collection Process for ReCOVery (from the credibility of news sites to news articles, and from news articles to tweets and users)
In general, compared to datasets such as NELA-GT-2018, FEVER, and LIAR, our repository provides multimodal information and social engagements of news articles. Compared to CREDBANK and FakeNewsNet, ReCOVery aims to fight the coronavirus infodemic and presents a novel approach to collecting and annotating data, which allows a trade-off between data scalability and label accuracy. Compared to FakeHealth and CoAID, news articles in ReCOVery come from a mix of domains that include healthcare.
3 DATA COLLECTION
The overall process that we collect the data, including news content
and social media information, is presented in Figure 1. To facilitate
scalability, news credibility is assessed based on the credibility
of the media (site) that publishes the news article. Based on the
process outlined in Figure 1, we will further detail how the data
is collected, answering the following three questions: (1) how to
identify reliable (or unreliable) news sites mainly releasing real
news (or fake news)? (which we address in Section 3.1); having
determined such news sites, (2) how do we crawl COVID-19 news
articles from these sites and what news components are valuable
for collection? (Section 3.2); and given COVID-19 news articles, (3)
how can we track their spread on social networks? (Section 3.3)
3.1 Filtering News Sites
To determine a list of reliable and unreliable news sites, we primarily
rely on two resources: NewsGuard and Media Bias/Fact Check.
NewsGuard. NewsGuard is developed to review and rate news websites. Its reliability rating team is formed by trained journalists and experienced editors, whose credentials and backgrounds are all transparent and available on the site. The performance (credibility) of each news website is assessed based on the following nine criteria:
(1) Does not repeatedly publish false content, (22 points)
(2) Gathers and presents information responsibly, (18 points)
(3) Regularly corrects or clarifies errors, (12.5 points)
(4) Handles the difference between news and opinion responsibly, (12.5 points)
(5) Avoids deceptive headlines, (10 points)
(6) Website discloses ownership and financing, (7.5 points)
(7) Clearly labels advertising, (7.5 points)
(8) Reveals who's in charge, including possible conflicts of interest, and (5 points)
(9) Provides the names of content creators, along with either contact or biographical information, (5 points)
where the overall score of a site is between 0 and 100; 0 indicates the lowest credibility and 100 the highest. A news website with a NewsGuard score higher than 60 is often labeled as reliable; otherwise, it is unreliable. NewsGuard has provided ground truth for the construction of news datasets such as NELA-GT-2018 for studying misinformation.
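As a rough illustration (not NewsGuard's actual implementation), the nine weighted criteria above can be aggregated as follows; the criterion weights come from the list, while the function and dictionary names are our own:

```python
# Sketch of NewsGuard-style scoring: each criterion a site satisfies
# contributes its point weight; the nine weights sum to 100.
CRITERIA_POINTS = {
    "no_false_content": 22.0,
    "responsible_reporting": 18.0,
    "corrects_errors": 12.5,
    "separates_news_opinion": 12.5,
    "no_deceptive_headlines": 10.0,
    "discloses_ownership": 7.5,
    "labels_advertising": 7.5,
    "reveals_whos_in_charge": 5.0,
    "names_content_creators": 5.0,
}

def newsguard_style_score(satisfied):
    """Sum the points of the satisfied criteria (a set of criterion keys)."""
    return sum(pts for name, pts in CRITERIA_POINTS.items() if name in satisfied)

def label_site(score, threshold=60.0):
    """NewsGuard's default rule: a score above 60 is labeled reliable."""
    return "reliable" if score > threshold else "unreliable"
```

For instance, a site satisfying all nine criteria scores 100 and is labeled reliable, while one satisfying only the first two scores 40 and is labeled unreliable.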
Media Bias/Fact Check (MBFC). MBFC is a website that rates the factual accuracy and political bias of news media. The fact-checking team consists of Dave Van Zandt, the primary editor and website owner, and several journalists and researchers (more details can be found on its "About" page). MBFC labels each news medium with one of six factual-accuracy levels, based on the fact-checking results of the news articles it has published (more details can be found on its "Methodology" page): (i) very high, (ii) high, (iii) mostly factual, (iv) mixed, (v) low, and (vi) very low. Such information has been used as ground truth for automatic fact-checking studies.
What Are Our Criteria? Referenced by NewsGuard and MBFC, our criteria for determining reliable and unreliable news sites are as follows. A news site is reliable if its NewsGuard score is greater than 90 and its factual reporting on MBFC is very high or high. A news site is unreliable if its NewsGuard score is less than 30 and its factual reporting on MBFC is low or very low.
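These stricter thresholds (NewsGuard above 90 or below 30, combined with the extreme MBFC factual levels) can be expressed as a simple filter. This is an illustrative sketch under the assumption that both signals must agree on an extreme level; the function name is ours, and MBFC labels are assumed normalized to lowercase strings:

```python
def site_label(newsguard_score, mbfc_factual):
    """Keep a site only if both credibility signals agree on an extreme
    level of (un)reliability; anything in between is excluded from the
    repository."""
    if newsguard_score > 90 and mbfc_factual in {"very high", "high"}:
        return "reliable"
    if newsguard_score < 30 and mbfc_factual in {"low", "very low"}:
        return "unreliable"
    return None  # not extreme enough -> excluded
```

Sites with middling scores (e.g., a NewsGuard score of 75 with a "mixed" MBFC label) are simply dropped, which is what trades scalability for label accuracy.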
Our search for news media with high credibility is conducted among the news media listed on MBFC (~2,000). To find news media with low credibility, we search MBFC and the newly released "Coronavirus Misinformation Tracking Center" by NewsGuard, which provides a list of websites publishing false coronavirus information. Ultimately, we obtain a total of 60 news sites, of which 22 are sources of reliable news articles (e.g., National ) and the remaining 38 are sources of unreliable news articles (e.g., Human Are Free ). The full list of sites considered in our repository is also available at http://coronavirus-fakenews.com. Note that several "fake" news media are not included, such as 70 News, Conservative 101, and Denver Guardian, since they no longer exist or their domains have become unavailable.

Figure 2: Credibility Distribution of Determined News Sites; (a) Reliable News Sites, (b) Unreliable News Sites

Figure 3: Examples of News Articles Collected; (a) Reliable News, (b) Unreliable News
Also note that, to achieve a good trade-off between dataset scalability and label accuracy, we utilize more extreme threshold scores (30 and 90) compared to the initial one provided by NewsGuard (60). In this way, the selected news sites exhibit extreme reliability (or unreliability), which helps reduce the number of false positives and false negatives in news labels in our repository; ideally, each news article published on a reliable site is factual, and each article on an unreliable site is false. Figure 2 illustrates the credibility distributions of reliable and unreliable news sites. It can be observed from the figure that most reliable news sites have a full mark on NewsGuard and are labeled as "high" in factual reporting by MBFC; "very high" is rare for all sites listed in MBFC. In contrast, unreliable news sites share an average NewsGuard score of ~15 and a "low" factual label by MBFC; similarly, "very low" is rarely given on MBFC.
3.2 Collecting COVID-19 News Content
To crawl COVID-19 news articles from the selected news sites, we first determine whether a news article is about COVID-19; this process is detailed in Section 3.2.1. Next, we detail how the data is crawled and the news content components that are included in our repository in Section 3.2.2.
3.2.1 News Topic Identification. To identify news articles on COVID-19, we use a list of three keywords: Coronavirus, COVID-19, and SARS-CoV-2. News articles whose content contains any of the keywords (case-insensitive) are considered related to COVID-19. These three keywords are the official names announced by the WHO in February 2020, where SARS-CoV-2 (standing for Severe Acute Respiratory Syndrome CoronaVirus 2) is the name of the virus, and Coronavirus and COVID-19 are names of the disease that the virus causes. Before the WHO announcement, COVID-19 was known as the "2019 novel coronavirus," which also includes the coronavirus keyword we are considering. We merely consider official names as keywords to avoid potential biases or even discrimination in naming. Furthermore, a news medium (article) that is credible, or pretends to be credible, often acts professionally and adopts the official name(s) of the disease/virus. Compared to articles that use biased and/or inaccurate terms, false news pretending to be professional is more detrimental and challenging to detect, which has become the focus of current fake news studies. Examples of such news articles are illustrated in Figure 3.
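The case-insensitive keyword filter described in this subsection amounts to a substring check; a minimal sketch (the function name is ours):

```python
# Official WHO names used as topic keywords (matched case-insensitively).
COVID_KEYWORDS = ("coronavirus", "covid-19", "sars-cov-2")

def is_covid_news(text):
    """True if the article text mentions any official name.
    A plain substring check also catches pre-announcement phrases such as
    "2019 novel coronavirus", which contains the "coronavirus" keyword."""
    lowered = text.lower()
    return any(kw in lowered for kw in COVID_KEYWORDS)
```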
3.2.2 Crawling News Content. The content crawler relies on the Newspaper Python library. The content of each news article corresponds to twelve components:
(C1) News ID: Each news article is assigned a unique ID as its identifier;
(C2) News URL: The URL of the news article. The URL helps us verify the correctness of the collected data. It can also be used as the reference and source when repository users would like to extend the repository by fetching additional information;
(C3) Publisher: The name of the news medium (site) that publishes the news article;
(C4) Publication Date: The date on which the news article was published on the site, which provides temporal information to support the investigation of, e.g., the relationship between misinformation volume and the outbreak of COVID-19 over time;
(C5) Author: The author(s) of the news article, of which there can be none, one, or more than one. Note that some news articles might have fictional author names. Author information is valuable in evaluating news credibility, by either investigating the collaboration network of authors or exploring its relationships with news publishers and content;
(C6-C7) News Title and Body Text as the main textual information;
(C8) News Image as the main visual information, provided in the form of a link (URL). Note that most images within a news page are noise – they can be advertisements, images belonging to other news articles due to the recommender systems embedded in news sites, logos of news sites, and/or social media icons, such as Twitter and Facebook logos for sharing. Hence, we specifically fetch the main/head/top image of each news article to reduce noise;
(C9) Country: The name of the country where the news is published;
(C10) Political Bias: Each news article is labeled as one of 'extremely left', 'left', 'left-center', 'center', 'right-center', 'right', and 'extremely right', equivalent to the political bias of its publisher. News political bias is verified by two resources, including MBFC, both of which rely on domain experts to label media bias; and
(C11-C12) NewsGuard score and MBFC factual reporting as the original ground truth of news credibility, which has been detailed in Section 3.1.
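A crawl of one article with the Newspaper library, mapped onto the twelve components, might look like the sketch below. The `to_record` function, the component key names, and the `site_meta` structure are our own; the site-level country, bias, and credibility labels come from the tables built in Section 3.1:

```python
def to_record(news_id, url, publisher, article, site_meta):
    """Assemble the twelve components (C1-C12) for one crawled article.
    `article` is a parsed newspaper.Article; `site_meta` holds site-level
    country, political bias, and credibility labels."""
    return {
        "news_id": news_id,                        # C1
        "url": url,                                # C2
        "publisher": publisher,                    # C3
        "publish_date": article.publish_date,      # C4
        "authors": article.authors,                # C5 (may be empty)
        "title": article.title,                    # C6
        "body_text": article.text,                 # C7
        "image": article.top_image,                # C8 (main image URL)
        "country": site_meta["country"],           # C9
        "political_bias": site_meta["bias"],       # C10
        "newsguard_score": site_meta["newsguard"], # C11
        "mbfc_factual": site_meta["mbfc"],         # C12
    }

def crawl(news_id, url, publisher, site_meta):
    """Download and parse one article (requires network access)."""
    from newspaper import Article  # pip install newspaper3k
    article = Article(url)
    article.download()
    article.parse()
    return to_record(news_id, url, publisher, article, site_meta)
```

The library's `top_image` attribute is what provides the single main/head/top image described in (C8).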
3.3 Tracking News Spreading on Social Media
We first use the Twitter Premium Search API to track the spread of the collected news articles on Twitter. Specifically, our search is based on the URL of each news article and looks for tweets posted between the date the news article was published and the current date (for the current version of the dataset, this date is May 26, 2020). The Twitter Search API returns the corresponding tweets with detailed information, such as their IDs, text, language, creation times, and statistics on being retweeted, replied to, and liked. It also returns information on the users who post these tweets, such as user IDs and their numbers of followers and friends. To comply with Twitter's Terms of Service, we only publicly release the IDs of the collected data for non-commercial research use, but we provide instructions for obtaining the tweets from the released IDs for user convenience. More details can be seen at http://coronavirus-fakenews.com.
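Building one such per-article request could be sketched as below. The payload shape mirrors our understanding of Twitter's premium search API (a `url:` operator and `fromDate`/`toDate` in `YYYYMMDDhhmm` format); treat the field names as assumptions to verify against Twitter's documentation, and the function name as ours:

```python
def build_search_request(news_url, published, collected="202005260000"):
    """Build a premium-search payload for one article: tweets linking the
    article, from its publication date up to the data-collection date
    (May 26, 2020 for the current dataset version)."""
    return {
        "query": 'url:"%s"' % news_url,  # tweets containing this article URL
        "fromDate": published,           # YYYYMMDDhhmm
        "toDate": collected,             # YYYYMMDDhhmm
        "maxResults": 100,               # page size; paginate via "next" token
    }
```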
4 DATA STATISTICS AND DISTRIBUTIONS
The general statistics of our dataset are presented in Table 1. The dataset contains 2,029 news articles, most of which have both textual and visual information for multimodal studies (#=2,017) and have been shared on social media (#=1,747). The dataset is imbalanced in news class – the proportion of reliable versus unreliable news articles is around 2:1. The number of users who spread reliable news (#=78,659) plus the number of users spreading unreliable news (#=17,323) is greater than the total number of users included in the dataset (#=93,761). This observation indicates that users can engage in spreading both reliable and unreliable news articles.
Table 1: Data Statistics
Reliable Unreliable Total
News articles 1,364 665 2,029
w/ images 1,354 663 2,017
w/ social information 1,219 528 1,747
Tweets 114,402 26,418 140,820
Users 78,659 17,323 93,761
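The overlap implied by Table 1 can be recovered by inclusion-exclusion: with 78,659 reliable-news spreaders, 17,323 unreliable-news spreaders, and 93,761 distinct users in total, 78,659 + 17,323 − 93,761 = 2,221 users spread both kinds of news. A one-line check:

```python
def both_class_spreaders(n_reliable, n_unreliable, n_total):
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return n_reliable + n_unreliable - n_total

# Figures from Table 1.
overlap = both_class_spreaders(78_659, 17_323, 93_761)  # → 2221
```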
Next, we visualize the distributions of data features/attributes.
Distribution of News Publishers. Figure 4 shows the number of COVID-19 news articles published by each [extremely reliable or extremely unreliable] news site. There are five unreliable publishers with no news on COVID-19; hence, they are not presented in the figure. We keep these publishers in our repository, as the data will be updated over time and these publishers may publish news articles on COVID-19 in the future.
News Publication Dates. The distribution of news publication dates is presented in Figure 5, where all articles are published in 2020. We point out that from January to May, the number of COVID-19 news articles published increased significantly (exponentially). The possible explanation for this phenomenon is three-fold. First, from the time the outbreak was first identified in Wuhan, China (December 2019) to May 2020, the number of confirmed cases and deaths caused by SARS-CoV-2 grew exponentially. Meanwhile, the virus became a global topic and triggered more and more discussion on a worldwide scale. Second, some older news articles are no longer available, which has motivated us to update the dataset in a timely manner. Third, the keywords we have used to identify COVID-19 news articles are the official ones provided by the WHO in February. Some news articles published in January are also collected, as before the WHO announcement COVID-19 was known as the "2019 novel coronavirus," which also includes one of our keywords, "coronavirus." We have detailed the reasons behind our keyword selection in Section 3.2.1. Note that there is a small group of news articles whose publication dates are not accessible, which we denote as N/A in Figure 5.
News Authors and Author Collaborations. Figure 6 presents the distribution of the number of authors contributing to news articles, which is governed by a long-tail distribution: most articles are contributed by at most five authors. Instead of including the [real or virtual] names of the authors, some articles provide publisher names as authors. Considering that such information is already available in the repository, we leave the author information of these news articles blank, i.e., their number of authors is zero. Furthermore, we construct the coauthorship network, shown in Figure 7. It can be observed from the network that node degrees also follow a power-law-like distribution: among 1,095 nodes (authors), over 90% have at most two collaborators.
News Content Statistics. Both Figures 8 and 9 reveal textual characteristics of news content (including the news title and body text). It can be observed from Figure 8 that the number of words within news content follows a long-tail (power-law-like) distribution, with an average value of ~800 and a median value of ~600. On the other hand, Figure 9 provides the word cloud for the entire repository. As the collected news articles share the same COVID-19 topic, some relevant topics and vocabularies have been naturally and frequently used by the news authors, such as "coronavirus" (#=6465), "COVID" (#=5413), "state" (#=4432), "test" (#=4274), "health" (#=3714), "pandemic" (#=3427), "virus" (#=2903), "home" (#=2871), "case" (#=2676), and "Trump" (#=2431), illustrated with word font size scaled to their frequencies.

Figure 4: Distribution of News Publishers
Figure 5: Publication Date  Figure 6: Author Count
Country Distribution. Figure 10 reveals the countries that news
and news publishers belong to. It can be observed that in total
six countries – United States (abbr. US), Russia (abbr. RU), United
Kingdom (abbr. UK), Iran (abbr. IR), Cyprus (abbr. CY), and Canada
(abbr. CA) – are covered, where US news and news publishers
constitute the majority of the population.
(a) Network (b) Degree Distribution
Figure 7: Author Collaborations
Figure 8: Word Count Figure 9: Word Cloud
(a) News Publishers (b) News Articles
Figure 10: Country
(a) News Publishers (b) News Articles
Figure 11: Political Bias
Figure 12: Spreading Frequency Figure 13: News Spreaders Figure 14: Follower Distribution Figure 15: Friend Distribution
Political Bias. Figure 11 provides the distribution of the political bias of news and news media (publishers). It can be observed from the figure that, for both news and publishers, the distribution across those exhibiting a right bias (including extremely right (abbr. Ex. R), right (abbr. R), and right-center (abbr. R-C)) is more balanced compared to those exhibiting a left bias (including extremely left (abbr. Ex. L), left (abbr. L), and left-center (abbr. L-C)).
News Spreading Frequencies. Figure 12 shows the distribution of
the number of tweets sharing each news article. The distribution
exhibits a long tail – over 80% of news articles are spread less than
100 times while a few have been shared by thousands of tweets.
News Spreaders. The distribution of the number of spreaders of each news article is shown in Figure 13. It differs from the distribution in Figure 12, as one user can spread a news article multiple times. As for the social connections of news spreaders, the distributions of their followers and friends are respectively presented in Figures 14 and 15, where the most popular spreader has over 40 million followers (or 600,000 friends).
5 FORMING BASELINES: USING ReCOVery TO
PREDICT COVID-19 NEWS CREDIBILITY
In this section, several methods that often act as baselines are utilized and developed to predict COVID-19 news credibility using ReCOVery data, hoping to facilitate future studies. These methods (baselines) are first specified in Section 5.1. The implementation details of the experiments are then provided in Section 5.2. Finally, we present the performance results for these methods in Section 5.3.

5.1 Baselines
We involve the following methods as our baselines. These methods can be grouped by their learning framework, which is either a traditional statistical learner such as SVM (e.g., LIWC) or a neural network (e.g., Text-CNN and SAFE). Baselines can also be grouped into single-modal methods (e.g., LIWC, RST, and Text-CNN) and multimodal methods (e.g., SAFE).
LIWC. LIWC (Linguistic Inquiry and Word Count) is a widely accepted psycholinguistic lexicon. Given a news story, LIWC counts the words in the text falling into one or more of 93 linguistic, psychological, and topical categories, based on which 93 features are extracted and often classified within a traditional statistical learning framework.
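The LIWC pipeline reduces to counting category hits per document. A toy sketch with a made-up two-category lexicon (the real LIWC dictionary is proprietary and has 93 categories; all names below are ours):

```python
# Toy stand-in for a LIWC-style dictionary: word -> categories it belongs to.
TOY_LEXICON = {
    "cure": {"health"},
    "virus": {"health"},
    "afraid": {"negemo"},
    "danger": {"negemo"},
}
CATEGORIES = ("health", "negemo")

def liwc_style_features(text):
    """Count, per category, how many tokens of `text` fall into it. With
    the real lexicon this yields the 93-dimensional feature vector fed to
    a traditional classifier (e.g., a decision tree, as in LIWC+DT)."""
    counts = dict.fromkeys(CATEGORIES, 0)
    for token in text.lower().split():
        for cat in TOY_LEXICON.get(token, ()):
            counts[cat] += 1
    return [counts[c] for c in CATEGORIES]
```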
RST. RST (Rhetorical Structure Theory) organizes a piece of content as a tree that captures the rhetorical relations among its phrases and sentences. We use a pretrained RST parser to obtain the tree of each news article and count each rhetorical relation (45 in total) within the tree, based on which 45 features are extracted and classified in a traditional statistical learning framework.
Text-CNN. Text-CNN relies on a Convolutional Neural Network for text classification, which contains a convolutional layer and max pooling.
SAFE. SAFE is a neural-network-based method that utilizes multimodal news information for fake news detection, where the news representation is learned jointly from news textual and visual information along with their relationship. SAFE facilitates recognizing the falseness of news in its text, its images, and/or the "irrelevance" between the text and images.
Table 2: Baseline Performance in Predicting COVID-19 News Credibility Using ReCOVery Data

Method    |  Reliable news      |  Unreliable news
          |  Pre.  Rec.  F1     |  Pre.  Rec.  F1
LIWC+DT   |  0.779 0.771 0.775  |  0.540 0.552 0.545
RST+DT    |  0.721 0.705 0.712  |  0.421 0.441 0.430
Text-CNN  |  0.746 0.782 0.764  |  0.522 0.472 0.496
SAFE      |  0.836 0.829 0.833  |  0.667 0.677 0.672
5.2 Implementation Details
The overall dataset is randomly divided into training and testing sets with a proportion of 0.8:0.2. As the dataset has an imbalanced distribution between reliable and unreliable news articles (~2:1), we evaluate the prediction results in terms of precision, recall, and the F1 score. For methods relying on traditional statistical learners, multiple well-established classifiers are adopted in our experiments: Logistic Regression (LR), Naïve Bayes (NB), k-Nearest Neighbors (k-NN), Random Forest (RF), Decision Tree (DT), and Support Vector Machines (SVM). We present only the best-performing one due to space limitations. Code is available online.
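The per-class precision, recall, and F1 reported in Table 2 can be computed as follows (a minimal sketch; in practice a library such as scikit-learn provides these metrics):

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class metrics, treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Reporting both classes separately, as in Table 2, matters here because the 2:1 class imbalance makes a single accuracy number misleading.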
5.3 Experimental Results
Prediction results are provided in Table 2. We observe that the four baselines achieve an F1 score (precision, recall) of 71% (72%, 71%) to 83% (84%, 83%) in identifying reliable news, and between 43% (42%, 44%) and 67% (67%, 68%) for unreliable news. Additionally, multimodal features are generally more representative than single-modal features in predicting news credibility. We point out that the four baselines are all content-based methods; developing more advanced methods by mining social media is encouraged.
6 CONCLUSION
To fight the coronavirus infodemic, we construct a multimodal repository for COVID-19 news credibility research, which provides textual, visual, temporal, and network information regarding news content and how news spreads on social media. The repository balances data scalability and label accuracy. To facilitate future studies, benchmarks are developed, and their performances on predicting news credibility using the data available in the repository are presented. We point out that the data could be further enhanced (1) by including COVID-19 news articles in various languages such as Chinese, Russian, Spanish, and Italian, as well as information on how these news articles spread on the popular local social media for those languages, e.g., Sina Weibo (China). Furthermore, (2) extending the dataset by introducing the ground truth of, for example, hate speech, clickbait, and social bots would help study the bias and discrimination bred by the virus, as well as the correlation among all information and accounts with low credibility. Both (1) and (2) will be our future work.
Acknowledgments
Emilio Ferrara is supported by the Defense Advanced Research Projects Agency (DARPA, grant number W911NF-17-C-0094).
References
Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, James Glass, and Preslav Nakov. 2018. Predicting factuality of reporting and bias of news media sources. arXiv preprint arXiv:1810.01765 (2018).
Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. Tracking Social Media
Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus
Twitter Data Set. JMIR Public Health and Surveillance 6, 2 (2020), e19273.
Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 Healthcare Misinforma-
tion Dataset. arXiv preprint arXiv:2006.00885 (2020).
Enyan Dai, Yiwei Sun, and Suhang Wang. 2020. Ginger Cannot Cure Cancer:
Battling Fake Health News with a Comprehensive Data Repository. In Proceedings
of the International AAAI Conference on Web and Social Media, Vol. 14. 853–862.
Ensheng Dong, Hongru Du, and Lauren Gardner. 2020. An interactive web-based
dashboard to track COVID-19 in real time. The Lancet Infectious Diseases 20, 5 (2020).
Emilio Ferrara. 2019. The history of digital spam. Commun. ACM 62, 8 (2019).
Chaolin Huang, Yeming Wang, Xingwang Li, Lili Ren, Jianping Zhao, Yi Hu, Li
Zhang, Guohui Fan, Jiuyang Xu, Xiaoying Gu, et al. 2020. Clinical features of
patients infected with 2019 novel coronavirus in Wuhan, China. The lancet 395,
10223 (2020), 497–506.
Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level
discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 13–24.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv
preprint arXiv:1408.5882 (2014).
Tanushree Mitra and Eric Gilbert. 2015. CREDBANK: A Large-scale Social Media
Corpus with Associated Credibility Annotations. In Ninth International AAAI
Conference on Web and Social Media.
Maria Nicola, Zaid Alsafi, Catrin Sohrabi, Ahmed Kerwan, Ahmed Al-Jabir, Christos Iosifidis, Maliha Agha, and Riaz Agha. 2020. The socio-economic implications of the coronavirus and COVID-19 pandemic: A review. International Journal of Surgery.
 Jeppe Nørregaard, Benjamin D Horne, and Sibel Adalı. 2019. NELA-GT-2018: A
Large Multi-Labelled News Dataset for the Study of Misinformation in News
Articles. In Proceedings of the International AAAI Conference on Web and Social
Media, Vol. 13. 630–638.
James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. 2015. The
development and psychometric properties of LIWC2015. Technical Report.
Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu.
2018. FakeNewsNet: A Data Repository with News Content, Social Context and
Dynamic Information for Studying Fake News on Social Media. arXiv preprint
Niraj Sitaula, Chilukuri K Mohan, Jennifer Grygiel, Xinyi Zhou, and Reza Zafarani.
2020. Credibility-based Fake News Detection. In Disinformation, Misinformation
and Fake News in Social Media: Emerging Research Challenges and Opportunities.
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal.
2018. FEVER: a large-scale dataset for fact extraction and verification. arXiv
preprint arXiv:1803.05355 (2018).
William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset
for fake news detection. arXiv preprint arXiv:1705.00648 (2017).
Bo Xu, Bernardo Gutierrez, Sumiko Mekaru, Kara Sewalk, Lauren Goodwin,
Alyssa Loskill, Emily L Cohn, Yulin Hswen, Sarah C Hill, Maria M Cobo, et al. 2020. Epidemiological data from the COVID-19 outbreak, real-time case information. Scientific Data 7, 1 (2020), 1–6.
Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu. 2014. Social media mining:
an introduction. Cambridge University Press.
Reza Zafarani, Xinyi Zhou, Kai Shu, and Huan Liu. 2019. Fake News Research:
Theories, Detection Strategies, and Open Problems. In Proceedings of the 25th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Jiawei Zhang, Bowen Dong, and S Yu Philip. 2020. Fakedetector: Effective fake news detection with deep diffusive neural network. In 2020 IEEE 36th International
Conference on Data Engineering (ICDE). IEEE, 1826–1829.
Xinyi Zhou, Atishay Jain, Vir V Phoha, and Reza Zafarani. 2020. Fake News Early
Detection: A Theory-driven Model. Digital Threats: Research and Practice 1, 2
Xinyi Zhou, Jindi Wu, and Reza Zafarani. 2020. SAFE: Similarity-Aware Multi-
Modal Fake News Detection. In The 24th Pacific-Asia Conference on Knowledge
Discovery and Data Mining. Springer.
Xinyi Zhou and Reza Zafarani. 2020. A Survey of Fake News: Fundamental
Theories, Detection Methods, and Opportunities. ACM Computing Surveys (CSUR).