
CHECKED: Chinese COVID-19 fake news dataset


Abstract

COVID-19 has affected everyone's life. To maintain social distancing and avoid exposure, work and daily life have gradually moved online. As a result, the use of social media to obtain COVID-19 news has increased, and misinformation about COVID-19 is frequently spread there. In this work, we develop CHECKED, the first Chinese dataset on COVID-19 misinformation. CHECKED provides a total of 2,104 verified microblogs related to COVID-19 from December 2019 to August 2020, identified using a specific list of keywords. Correspondingly, CHECKED includes 1,868,175 reposts, 1,185,702 comments, and 56,852,736 likes that reveal how these verified microblogs spread and were reacted to on Weibo. The dataset contains a rich set of multimedia information for each microblog, including its ground-truth label and textual, visual, temporal, and network information. Extensive experiments have been conducted to analyze the CHECKED data and to provide benchmark results for well-established methods of predicting fake news using CHECKED. We hope that CHECKED can facilitate studies that target misinformation on the coronavirus. The dataset is available at https://github.com/cyang03/CHECKED.
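To give a rough sense of scale, the aggregate counts quoted in the abstract can be turned into average engagement per verified microblog. The sketch below is only arithmetic on those four published totals; it does not read the dataset, and the names `MICROBLOGS`, `ENGAGEMENT`, and `mean_per_microblog` are hypothetical helpers introduced for illustration.

```python
# Aggregate counts quoted in the abstract. Nothing here is loaded from
# the CHECKED repository itself; all names are illustrative assumptions.
MICROBLOGS = 2_104
ENGAGEMENT = {
    "reposts": 1_868_175,
    "comments": 1_185_702,
    "likes": 56_852_736,
}


def mean_per_microblog(total: int, n_posts: int = MICROBLOGS) -> float:
    """Return the average engagement count per verified microblog."""
    return total / n_posts


# Average reposts, comments, and likes per microblog.
averages = {kind: mean_per_microblog(total) for kind, total in ENGAGEMENT.items()}

for kind, value in averages.items():
    print(f"{kind}: {value:,.1f} per microblog")
```

These are plain means over the whole collection; they say nothing about how engagement is distributed across individual microblogs or between true and fake ones.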
Social Network Analysis and Mining (2021) 11:58
https://doi.org/10.1007/s13278-021-00766-8
ORIGINAL ARTICLE
ChenYang1· XinyiZhou1 · RezaZafarani1
Received: 15 October 2020 / Revised: 1 June 2021 / Accepted: 4 June 2021 / Published online: 22 June 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2021
Keywords Dataset· COVID-19· Infodemic· Information credibility· Fake news· Multimedia· Social media
1 Introduction
Since its first case was confirmed on December 31, 2019
(Sohrabi et al. 2020), the novel coronavirus has rapidly
become a worldwide phenomenon. On January 30, 2020, the World
Health Organization (WHO) declared the outbreak a
global emergency (Sohrabi et al. 2020). As of October
13, 2020, the COVID-19 outbreak had caused over 37 million
confirmed cases and over 1 million deaths worldwide.1 To
combat the epidemic, maintaining social distance has been
considered effective. In turn, working and studying from
home has become a new trend. With the decrease in physical
social contact and rising anxiety about the pandemic, the
frequency of social media usage has increased. As an
international public health emergency, the COVID-19 outbreak is
closely connected with individuals’ health and lives. Any
news or information about COVID-19 or a potential cure
attracts considerable public attention and influences social media.
Therefore, it is crucial to ensure that the information
spread about COVID-19 is credible.
The more information about COVID-19 is available, the
better people can understand it. To that end, a number of
COVID-related datasets have been released and studied. Existing
datasets collect either (i) Chinese COVID data without
assessing news credibility (e.g.,
see Hu et al. 2020; Gao et al. 2020); or (ii) non-Chinese
COVID data with news credibility (e.g., see Zhou et al.
2020a; Cui and Lee 2020; Li et al. 2020). This gap
motivates us to build a dataset that contains data from
Chinese social media and includes ground-truth labels
(i.e., true/false).
Weibo (weibo.com), a platform for sharing, disseminating,
and acquiring information based on user relations,
is one of the most popular social media platforms in China. According
to Weibo’s first-quarter earnings report for 2020,2 Weibo
passed 500 million monthly active users and 200 million
daily active users in March 2020. Due to the large number of
active users, we consider Weibo as one of the most used
* Xinyi Zhou
zhouxinyi@data.syr.edu
Chen Yang
cyang03@syr.edu
Reza Zafarani
reza@data.syr.edu
1 Data Lab, Department ofElectrical Engineering
andComputer Science, Syracuse University, Syracuse, USA
1 https:// covid 19. who. int/.
2 http:// ir. weibo. com/ news- relea ses/ news- relea se- detai ls/ weibo- repor
ts- first- quart er- 2020- unaud ited- finan cial- resul ts.
... Coronavirus disease 2019 rumors in China can be grouped into five categories: the nature of the virus, pandemic areas and confirmed cases, COVID-19 policies, authorities and organizations (e.g., WHO), and medical supplies (10). Rumors undermine the government's efforts to control the pandemic because many social media users cannot discern their authenticity (3,11). ...
... Similarly, Zhou et al. (24) and Luo et al. (25) found that the false statement was longer because deceivers must provide supporting evidence and details to persuade their receivers to construe the message. In addition, some studies have examined misinformation detection using content-based features such as paralanguage features (10, 26–29). For example, Qazvinian et al. (27) reported that tweets with more hashtags were more likely to be misinformation than those without them. ...
... Chua and Banerjee (34) indicated that information using more exclusive words was more likely to be false. Furthermore, previous studies investigated whether the frequency of question marks, sentiment markers, arbitrary words, and tentative words (e.g., "maybe") could differentiate false rumors from true ones (10,13,35,36). ...
Article
Rumors regarding COVID-19 have been prevalent on the Internet and affect the control of the COVID-19 pandemic. Using 1,296 COVID-19 rumors collected from an online platform (piyao.org.cn) in China, we found measurable differences in the content characteristics between true and false rumors. We revealed that the length of a rumor's headline is negatively related to the probability of a rumor being true [odds ratio (OR) = 0.37, 95% CI (0.30, 0.44)]. In contrast, the length of a rumor's statement is positively related to this probability [OR = 1.11, 95% CI (1.09, 1.13)]. In addition, we found that a rumor is more likely to be true if it contains concrete places [OR = 20.83, 95% CI (9.60, 48.98)] and it specifies the date or time of events [OR = 22.31, 95% CI (9.63, 57.92)]. The rumor is also likely to be true when it does not evoke positive or negative emotions [OR = 0.15, 95% CI (0.08, 0.29)] and does not include a call for action [OR = 0.06, 95% CI (0.02, 0.12)]. By contrast, the presence of source cues [OR = 0.64, 95% CI (0.31, 1.28)] and visuals [OR = 1.41, 95% CI (0.53, 3.73)] is related to this probability with limited significance. Our findings provide some clues for identifying COVID-19 rumors using their content characteristics.
... It can be observed that most of the fake news studies were based on an empirical approach, specifically surveys (N = 43; 44%); content analysis (N = 23, 23.71%) and interviews (N = 3, 3.09%), followed by AI-based tools (N = 28, 28%). Only three studies (3%) focused on developing COVID-19 fake news datasets [28–30]. Specifically, [28,29] created datasets based on social media comments (i.e., Twitter and Sina Weibo, respectively), whereas Islam et al. [30] compiled comments from social media platforms, fact-checking websites, and Google for vaccine-related fake news. ...
Article
The spread of fake news increased dramatically worldwide during the COVID-19 pandemic. This study aims to synthesize the extant literature to understand the magnitude of this phenomenon in the wake of the pandemic in 2021, focusing on the motives and sociodemographic profiles, Artificial Intelligence (AI)-based tools developed, and the top trending topics related to fake news. A scoping review was adopted targeting articles published in five academic databases (January 2021–November 2021), resulting in 97 papers. Most of the studies were empirical in nature (N = 69) targeting the general population (N = 26) and social media users (N = 13), followed by AI-based detection tools (N = 27). Top motives for fake news sharing include low awareness, knowledge, and health/media literacy; entertainment, passing time, and socialization; altruism; and low trust in government and news media, whilst the phenomenon was more prominent among males, younger people, and those with low education. Machine and deep learning emerged as the widely explored techniques for detecting fake news, whereas the top topics were related to vaccines, the virus, cures/remedies, treatment, and prevention. Immediate intervention and prevention efforts are needed to curb this anti-social behavior, considering the world is still struggling to contain the spread of the COVID-19 virus.
... CHECKED is the first Chinese dataset on COVID-19 misinformation [37]. It contains 2,120 pieces of microblog posted from December 2019 to August 2020, of which 344 were "fake news" and 1,776 were "true news". ...
Article
Fake news detection mainly relies on the extraction of article content features with neural networks. However, it has brought some challenges to reduce the noisy data and redundant features, and learn the long-distance dependencies. To solve the above problems, Dual-channel Convolutional Neural Networks with Attention-pooling for Fake News Detection (abbreviated as DC-CNN) is proposed. This model benefits from Skip-Gram and Fasttext. It can effectively reduce noisy data and improve the learning ability of the model for non-derived words. A parallel dual-channel pooling layer was proposed to replace the traditional CNN pooling layer in DC-CNN. The Max-pooling layer, as one of the channels, maintains the advantages in learning local information between adjacent words. The Attention-pooling layer with multi-head attention mechanism serves as another pooling channel to enhance the learning of context semantics and global dependencies. This model benefits from the learning advantages of the two channels and solves the problem that pooling layer is easy to lose local-global feature correlation. This model is tested on two different COVID-19 fake news datasets, and the experimental results show that our model has the optimal performance in dealing with noisy data and balancing the correlation between local features and global features.
... The news media dominates the information available on the Internet or through digital media. While the Internet's existence has an impact on fake news (Hirlekar and Kumar, 2022; Yang, Zhou, and Zafarani, 2021), misleading information (Providel and Mendoza, 2021), and a source of legitimacy. Furthermore, a previous study has shown that headlines are a good indicator of whether or not something is false news (Grady, Ditto & Loftus, 2021). ...
... A myriad of misinformation datasets have emerged before the COVID-19 pandemic, which mainly concern fake news of the political discourse, such as LIAR, FEVER, and CRED-BANK [23–25]. Since the outbreak of COVID-19, a growing number of COVID-19 misinformation datasets have been compiled for research in combating the proliferation of COVID-19 misinformation online, e.g., CoAID, ReCOVery, COVID Fake News Dataset, and so on [21, 26–35]. We review the most relevant datasets to this work in Table 1. ...
Article
The COVID-19 infodemic has run rampant almost simultaneously with the outbreak of the pandemic. Many concerted efforts have been made to mitigate its negative effect on information credibility and data legitimacy. Existing work mainly focuses on fact-checking algorithms or multi-class labeling models that are less aware of the intrinsic characteristics of the language. Nor does it discuss how such representations can account for the common psycho-socio-behavior of information consumers. This work takes a data-driven analytical approach to (1) describe the prominent lexical and grammatical features of COVID-19 misinformation; (2) interpret the underlying (psycho-)linguistic triggers in terms of sentiment, power and activity based on the affective control theory; (3) study the feature indexing for anti-infodemic modeling. The results show distinct language generalization patterns of misinformation, which favors evaluative terms and multimedia devices in delivering a negative sentiment. Such appeals are effective in arousing people's sympathy toward the vulnerable community and fomenting their spreading behavior.
Chapter
The rapid outbreak of COVID-19 has heightened interest in news about the pandemic. In addition to obtaining real-time developments about COVID-19, people have learned about prevention methods through the news media. Ironically, false COVID-19 news has spread faster than the virus, posing an additional health threat with advice being as dangerous as infection. In this study, we developed a Chinese news article dataset on COVID-19 misinformation, which contained 1266 verified articles from 118 Chinese digital newspaper platforms from January 2020 to January 2021. This dataset uses machine learning methods to detect false news in the Chinese language. Because automated classification methods, combined with human computation-based approaches, are effective for combating digital misinformation, we applied and evaluated a collaborative intelligence approach that leverages human fact-checking skills with feedback on news stories using four criteria: source, author, message, and spelling. The results show that reliable human feedback can help detect false news with high accuracy.
Article
Several people around the world have died from the coronavirus (COVID-19) disease. With the increase in COVID-19 cases, distribution, and deaths, much has occurred regarding the ban on travel, border closure, curfews, and the disturbance in the supply of services and goods. The world economy was severely affected by the spread of the virus. Every day, new discussions and debates started, and more people were in fear. Occasionally, unconfirmed information is shared on social media sites as if it were accurate information. Sometimes, it becomes viral and disturbs people's emotions and beliefs. Fake news and rumors are widespread forms of unconfirmed and false information. This type of news should be tracked speedily to prevent its negative impact on society. An ideal system is the dire need of modern-day society to evaluate the Internet rumors on COVID. Therefore, the current study has considered a probabilistic approach for evaluating the Internet rumors about COVID. The fuzzy logic tool in MATLAB was used for experimental and simulation purposes. The results revealed the effectiveness of the proposed work.
Preprint
The paper presents the outcomes of AI-COVID19, our project aimed at better understanding of misinformation flow about COVID-19 across social media platforms. The specific focus of the study reported in this paper is on collecting data from Telegram groups which are active in promotion of COVID-related misinformation. Our corpus collected so far contains around 28 million words, from almost one million messages. Given that a substantial portion of misinformation flow in social media is spread via multimodal means, such as images and video, we have also developed a mechanism for utilising such channels via producing automatic transcripts for videos and automatic classification for images into such categories as memes, screenshots of posts and other kinds of images. The accuracy of the image classification pipeline is around 87%.
Preprint
First identified in Wuhan, China, in December 2019, the outbreak of COVID-19 was declared a global emergency in January 2020 and a pandemic in March 2020 by the World Health Organization (WHO). Along with this pandemic, we are also experiencing an "infodemic" of information with low credibility such as fake news and conspiracies. In this work, we present ReCOVery, a repository designed and constructed to facilitate research on combating such information regarding COVID-19. We first broadly search and investigate ~2,000 news publishers, from which 60 are identified with extreme [high or low] levels of credibility. By inheriting the credibility of the media on which they were published, a total of 2,029 news articles on coronavirus, published from January to May 2020, are collected in the repository, along with 140,820 tweets that reveal how these news articles have spread on the Twitter social network. The repository provides multimodal information of news articles on coronavirus, including textual, visual, temporal, and network information. The way that news credibility is obtained allows a trade-off between dataset scalability and label accuracy. Extensive experiments are conducted to present data statistics and distributions, as well as to provide baseline performances for predicting news credibility so that future methods can be compared. Our repository is available at http://coronavirus-fakenews.com.
Chapter
Effective detection of fake news has recently attracted significant attention. Current studies have made significant contributions to predicting fake news with less focus on exploiting the relationship (similarity) between the textual and visual information in news articles. Attaching importance to such similarity helps identify fake news stories that, for example, attempt to use irrelevant images to attract readers’ attention. In this work, we propose a Similarity-Aware FakE news detection method (SAFE) which investigates multi-modal (textual and visual) information of news articles. First, neural networks are adopted to separately extract textual and visual features for news representation. We further investigate the relationship between the extracted features across modalities. Such representations of news textual and visual information along with their relationship are jointly learned and used to predict fake news. The proposed method facilitates recognizing the falsity of news articles based on their text, images, or their “mismatches.” We conduct extensive experiments on large-scale real-world data, which demonstrate the effectiveness of the proposed method.
Article
Background: A recent cluster of pneumonia cases in Wuhan, China, was caused by a novel betacoronavirus, the 2019 novel coronavirus (2019-nCoV). We report the epidemiological, clinical, laboratory, and radiological characteristics and treatment and clinical outcomes of these patients. Methods: All patients with suspected 2019-nCoV were admitted to a designated hospital in Wuhan. We prospectively collected and analysed data on patients with laboratory-confirmed 2019-nCoV infection by real-time RT-PCR and next-generation sequencing. Data were obtained with standardised data collection forms shared by the International Severe Acute Respiratory and Emerging Infection Consortium from electronic medical records. Researchers also directly communicated with patients or their families to ascertain epidemiological and symptom data. Outcomes were also compared between patients who had been admitted to the intensive care unit (ICU) and those who had not. Findings: By Jan 2, 2020, 41 admitted hospital patients had been identified as having laboratory-confirmed 2019-nCoV infection. Most of the infected patients were men (30 [73%] of 41); less than half had underlying diseases (13 [32%]), including diabetes (eight [20%]), hypertension (six [15%]), and cardiovascular disease (six [15%]). Median age was 49·0 years (IQR 41·0-58·0). 27 (66%) of 41 patients had been exposed to Huanan seafood market. One family cluster was found. Common symptoms at onset of illness were fever (40 [98%] of 41 patients), cough (31 [76%]), and myalgia or fatigue (18 [44%]); less common symptoms were sputum production (11 [28%] of 39), headache (three [8%] of 38), haemoptysis (two [5%] of 39), and diarrhoea (one [3%] of 38). Dyspnoea developed in 22 (55%) of 40 patients (median time from illness onset to dyspnoea 8·0 days [IQR 5·0-13·0]). 26 (63%) of 41 patients had lymphopenia. All 41 patients had pneumonia with abnormal findings on chest CT. 
Complications included acute respiratory distress syndrome (12 [29%]), RNAaemia (six [15%]), acute cardiac injury (five [12%]) and secondary infection (four [10%]). 13 (32%) patients were admitted to an ICU and six (15%) died. Compared with non-ICU patients, ICU patients had higher plasma levels of IL2, IL7, IL10, GSCF, IP10, MCP1, MIP1A, and TNFα. Interpretation: The 2019-nCoV infection caused clusters of severe respiratory illness similar to severe acute respiratory syndrome coronavirus and was associated with ICU admission and high mortality. Major gaps in our knowledge of the origin, epidemiology, duration of human transmission, and clinical spectrum of disease need fulfilment by future studies. Funding: Ministry of Science and Technology, Chinese Academy of Medical Sciences, National Natural Science Foundation of China, and Beijing Municipal Science and Technology Commission.
Conference Paper
As news reading on social media becomes more and more popular, fake news becomes a major issue concerning the public and government. The fake news can take advantage of multimedia content to mislead readers and get dissemination, which can cause negative effects or even manipulate the public events. One of the unique challenges for fake news detection on social media is how to identify fake news on newly emerged events. Unfortunately, most of the existing approaches can hardly handle this challenge, since they tend to learn event-specific features that can not be transferred to unseen events. In order to address this issue, we propose an end-to-end framework named Event Adversarial Neural Network (EANN), which can derive event-invariant features and thus benefit the detection of fake news on newly arrived events. It consists of three main components: the multi-modal feature extractor, the fake news detector, and the event discriminator. The multi-modal feature extractor is responsible for extracting the textual and visual features from posts. It cooperates with the fake news detector to learn the discriminable representation for the detection of fake news. The role of event discriminator is to remove the event-specific features and keep shared features among events. Extensive experiments are conducted on multimedia datasets collected from Weibo and Twitter. The experimental results show our proposed EANN model can outperform the state-of-the-art methods, and learn transferable feature representations.
Article
Background: At the time of this writing, the novel coronavirus (COVID-19) pandemic outbreak has already put tremendous strain on many countries' citizens, resources and economies around the world. Social distancing measures, travel bans, self-quarantines, and business closures are changing the very fabric of societies worldwide. With people forced out of public spaces, much conversation about these phenomena now occurs online, e.g., on social media platforms like Twitter. Objective: In this paper, we describe a multilingual coronavirus (COVID-19) Twitter dataset that we are making available to the research community via our COVID-19-TweetIDs Github repository. Methods: We started this ongoing data collection on January 28, 2020, leveraging Twitter's Streaming API and Tweepy to follow certain keywords and accounts that were trending at the time the collection began, and used Twitter's Search API to query for past tweets, resulting in the earliest tweets in our collection dating back to January 21, 2020. Results: Since the inception of our collection, we have actively maintained and updated our Github repository on a weekly basis. We have published over 123 million tweets, with over 60% of the tweets in English. This manuscript also presents basic analysis that shows that Twitter activity responds and reacts to coronavirus-related events. Conclusions: It is our hope that our contribution will enable the study of online conversation dynamics in the context of a planetary-scale epidemic outbreak of unprecedented proportions and implications. This dataset could also help track scientific coronavirus misinformation and unverified rumors or enable the understanding of fear and panic - and undoubtedly more.
Article
An unprecedented outbreak of pneumonia of unknown aetiology in Wuhan City, Hubei province in China emerged in December of 2019. A novel coronavirus was identified as the causative agent and was subsequently termed COVID-19 by the World Health Organization (WHO). Considered a relative of severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), COVID-19 is a betacoronavirus that affects the lower respiratory tract and manifests as pneumonia in humans. Despite rigorous global containment and quarantine efforts, the incidence of COVID-19 continues to rise, with 50,580 laboratory-confirmed cases and 1,526 deaths worldwide. In response to this global outbreak, we summarise the current state of knowledge surrounding COVID-19.
Conference Paper
Microblogs have become popular media for news propagation in recent years. Meanwhile, numerous rumors and fake news also bloom and spread wildly on the open social media platforms. Without verification, they could seriously jeopardize the credibility of microblogs. We observe that an increasing number of users are using images and videos to post news in addition to texts. Tweets or microblogs are commonly composed of text, image and social context. In this paper, we propose a novel Recurrent Neural Network with an attention mechanism (att-RNN) to fuse multimodal features for effective rumor detection. In this end-to-end network, image features are incorporated into the joint features of text and social context, which are obtained with an LSTM (Long-Short Term Memory) network, to produce a reliable fused classification. The neural attention from the outputs of the LSTM is utilized when fusing with the visual features. Extensive experiments are conducted on two multimedia rumor datasets collected from Weibo and Twitter. The results demonstrate the effectiveness of the proposed end-to-end att-RNN in detecting rumors with multimodal contents.
Article
This paper proposes a simple and efficient approach for text classification and representation learning. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.