Conference Paper

ComplexDataLab at W-NUT 2020 Task 2: Detecting Informative COVID-19 Tweets by Attending over Linked Documents

Article
Full-text available
Introduction: This study presents COVID-Twitter-BERT (CT-BERT), a transformer-based model pre-trained on a large corpus of COVID-19 related Twitter messages. CT-BERT is specifically designed for COVID-19 content, particularly from social media, and can be used for various natural language processing tasks such as classification, question answering, and chatbots. The paper evaluates the performance of CT-BERT on different classification datasets and compares it with BERT-LARGE, its base model. Methods: The authors evaluate CT-BERT on five classification datasets, including one in the target domain, and compare its performance with that of BERT-LARGE to measure the marginal improvement. They also provide detailed information on the training process and the technical specifications of the model. Results: CT-BERT outperforms BERT-LARGE with a marginal improvement of 10-30% on all five classification datasets, with the largest improvements observed in the target domain. The authors report detailed performance metrics and discuss the significance of these results. Discussion: The study demonstrates the potential of pre-trained transformer models, such as CT-BERT, for COVID-19 related natural language processing tasks. The results indicate that CT-BERT can improve classification performance on COVID-19 related content, especially from social media, with implications for applications such as monitoring public sentiment and developing chatbots to provide COVID-19 related information. The study also highlights the importance of domain-specific pre-trained models for specific natural language processing tasks. Overall, this work provides a valuable contribution to the development of COVID-19 related NLP models.
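
As an illustration of the fine-tuning setup such a model supports, the sketch below loads CT-BERT with Hugging Face Transformers for binary tweet classification. It is a minimal sketch, not taken from the paper: the checkpoint id is an assumption, and the tweets and labels are toy data.

    # Hedged sketch: fine-tuning a CT-BERT-style checkpoint for binary tweet
    # classification with Hugging Face Transformers. The model id is assumed.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_ID = "digitalepidemiologylab/covid-twitter-bert"  # assumed hub id

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

    tweets = ["New confirmed cases reported in the region today.",
              "Stay safe everyone!"]
    labels = torch.tensor([1, 0])  # 1 = informative, 0 = uninformative (toy labels)

    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()  # one gradient step of a standard fine-tuning loop
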
Article
Full-text available
Predicting the number of new suspected or confirmed cases of novel coronavirus disease 2019 (COVID-19) is crucial in the prevention and control of the COVID-19 outbreak. Social media search indexes (SMSI) for dry cough, fever, chest distress, coronavirus, and pneumonia were collected from 31 December 2019 to 9 February 2020. The new suspected cases of COVID-19 data were collected from 20 January 2020 to 9 February 2020. We used the lagged series of SMSI to predict new suspected COVID-19 case numbers during this period. To avoid overfitting, five methods, namely subset selection, forward selection, lasso regression, ridge regression, and elastic net, were used to estimate coefficients. We selected the optimal method to predict new suspected COVID-19 case numbers from 20 January 2020 to 9 February 2020. We further validated the optimal method for new confirmed cases of COVID-19 from 31 December 2019 to 17 February 2020. The new suspected COVID-19 case numbers correlated significantly with the lagged series of SMSI. SMSI could be detected 6–9 days earlier than new suspected cases of COVID-19. The optimal method was the subset selection method, which had the lowest estimation error and a moderate number of predictors. The subset selection method also significantly correlated with the new confirmed COVID-19 cases after validation. SMSI findings on lag day 10 were significantly correlated with new confirmed COVID-19 cases. SMSI could be a significant predictor of the number of COVID-19 infections. SMSI could be an effective early predictor, which would enable governments’ health departments to locate potential and high-risk outbreak areas.
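
A minimal sketch of the lagged-regression idea with scikit-learn, using synthetic data in place of the actual SMSI series: the predictors are shifted by a fixed lag and lasso, ridge, and elastic net coefficient estimates are compared, as in the method comparison described above.

    # Hedged sketch: regress case counts on lagged search-index features and
    # compare shrinkage estimators. All numbers below are synthetic.
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge, ElasticNet

    rng = np.random.default_rng(0)
    days, lag = 40, 8
    smsi = rng.random((days, 5))                       # e.g. cough, fever, chest distress, coronavirus, pneumonia
    cases = smsi[:, 1] * 50 + rng.normal(0, 2, days)   # synthetic case counts

    X, y = smsi[:-lag], cases[lag:]                    # lag the predictors by `lag` days
    for name, model in [("lasso", Lasso(alpha=0.1)),
                        ("ridge", Ridge(alpha=1.0)),
                        ("elastic net", ElasticNet(alpha=0.1))]:
        model.fit(X, y)
        print(name, model.score(X, y))                 # in-sample fit; the paper compares estimation error
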
Article
Full-text available
In December 2019, a new virus (initially called 'Novel Coronavirus 2019-nCoV' and later renamed to SARS-CoV-2) causing severe acute respiratory syndrome (coronavirus disease COVID-19) emerged in Wuhan, Hubei Province, China, and rapidly spread to other parts of China and other countries around the world, despite China's massive efforts to contain the disease within Hubei. As with the original SARS-CoV epidemic of 2002/2003 and with seasonal influenza, geographic information systems and methods, including, among other application possibilities, online real- or near-real-time mapping of disease cases and of social media reactions to disease spread, predictive risk mapping using population travel data, and tracing and mapping super-spreader trajectories and contacts across space and time, are proving indispensable for timely and effective epidemic monitoring and response. This paper offers pointers to, and describes, a range of practical online/mobile GIS and mapping dashboards and applications for tracking the 2019/2020 coronavirus epidemic and associated events as they unfold around the world. Some of these dashboards and applications are receiving data updates in near-real-time (at the time of writing), and one of them is meant for individual users (in China) to check if the app user has had any close contact with a person confirmed or suspected to have been infected with SARS-CoV-2 in the recent past. We also discuss additional ways GIS can support the fight against infectious disease outbreaks and epidemics.
Article
Full-text available
Low-dimensional embeddings of nodes in large graphs have proved extremely useful in a variety of prediction tasks, from content recommendation to identifying protein functions. However, most existing approaches require that all nodes in the graph are present during training of the embeddings; these previous approaches are inherently transductive and do not naturally generalize to unseen nodes. Here we present GraphSAGE, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, we learn a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Our algorithm outperforms strong baselines on three inductive node-classification benchmarks: we classify the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and we show that our algorithm generalizes to completely unseen graphs using a multi-graph dataset of protein-protein interactions.
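
A toy sketch of the core GraphSAGE idea in plain NumPy, assuming a hand-made graph and a random weight matrix: a node is embedded by sampling a fixed number of neighbours, aggregating their features (mean aggregator), and applying a shared transformation, so the same learned function can embed previously unseen nodes.

    # Hedged sketch of one GraphSAGE layer with a mean aggregator (toy data).
    import numpy as np

    rng = np.random.default_rng(0)
    features = {i: rng.random(8) for i in range(6)}           # node feature vectors
    neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 4], 3: [0], 4: [2, 5], 5: [4]}
    W = rng.random((16, 4))                                   # stands in for a learned weight matrix

    def sage_embed(node, num_samples=2):
        sampled = rng.choice(neighbors[node], size=num_samples, replace=True)
        agg = np.mean([features[n] for n in sampled], axis=0)   # aggregate sampled neighbourhood
        h = np.concatenate([features[node], agg]) @ W            # combine self and neighbourhood features
        return np.maximum(h, 0)                                   # ReLU nonlinearity

    print(sage_embed(0))
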
Article
Full-text available
Social media is a platform for expressing one's views in real time. This real-time nature makes social media an attractive tool for disaster management, as both victims and officials can post their problems and solutions in the same place in real time. We investigate Twitter posts during a flood-related disaster and propose an algorithm to identify victims asking for help. The developed system takes tweets as input and categorizes them into high- or low-priority tweets. The user location of high-priority tweets with no location information is predicted from the users' historical locations using a Markov model. The system works well, with a classification accuracy of 81% and a location prediction accuracy of 87%. The system can be extended to other natural disasters, such as earthquakes and tsunamis, as well as man-made disasters such as riots and terrorist attacks. It is the first system of its kind aimed at helping victims during disasters based on their tweets.
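
A small sketch of the location-prediction step described above, assuming a first-order Markov model over a toy sequence of a user's historical locations; the paper's actual features and training data are not reproduced here.

    # Hedged sketch: predict a user's next location from historical transitions.
    from collections import Counter, defaultdict

    history = ["home", "office", "home", "market", "home", "office", "home"]  # toy location history

    transitions = defaultdict(Counter)
    for prev, curr in zip(history, history[1:]):
        transitions[prev][curr] += 1          # count observed location transitions

    def predict_next(last_location):
        counts = transitions[last_location]
        return counts.most_common(1)[0][0] if counts else last_location

    print(predict_next(history[-1]))          # most likely next location given the last one
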
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
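
A minimal usage sketch in the spirit of the toolbox described above: a TF-IDF plus logistic regression text classifier built with scikit-learn's pipeline API. The tiny dataset is illustrative only.

    # Hedged sketch: a scikit-learn text-classification pipeline on toy data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["new cases confirmed today", "lovely weather this morning",
             "hospital reports rising admissions", "great match last night"]
    labels = [1, 0, 1, 0]                      # 1 = informative, 0 = not (toy labels)

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["more admissions confirmed at the hospital"]))
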
Article
Social media is becoming popular as a key source of information on disasters and crisis situations. Humanitarian Aid and Disaster Relief (HADR) responders can gain valuable insights and situational awareness by monitoring social media-based feeds. Specifically, the use of microblogs (i.e., Twitter) has been shown to provide new information not otherwise attainable. In this paper, we present a new application designed to help HADR relief organizations track, analyze, and monitor tweets. The purpose of this tool is to help first responders gain situational awareness immediately after a disaster or crisis. The tool is capable of monitoring and analyzing location- and keyword-specific tweets with near-real-time trending, data reduction, historical review, and integrated data mining tools. We discuss the utility of this tool through a case study on tweets related to the cholera crisis in Haiti.
Article
With the outbreak of the COVID-19 pandemic, people turned to social media to read and to share timely information including statistics, warnings, advice, and inspirational stories. Unfortunately, alongside all this useful information, there was also a new blending of medical and political misinformation and disinformation, which gave rise to the first global infodemic. While fighting this infodemic is typically thought of in terms of factuality, the problem is much broader as malicious content includes not only fake news, rumors, and conspiracy theories, but also promotion of fake cures, panic, racism, xenophobia, and mistrust in the authorities, among others. This is a complex problem that needs a holistic approach combining the perspectives of journalists, fact-checkers, policymakers, government entities, social media platforms, and society as a whole. With this in mind, we define an annotation schema and detailed annotation instructions that reflect these perspectives. We further deploy a multilingual annotation platform, and we issue a call to arms to the research community and beyond to join the fight by supporting our crowdsourcing annotation efforts. We perform initial annotations using the annotation schema, and our initial experiments demonstrated sizable improvements over the baselines.
Article
During the ongoing outbreak of coronavirus disease (COVID-19), people use social media to acquire and exchange various types of information at a historic and unprecedented scale. Only situational information is valuable for the public and authorities in responding to the epidemic. It is therefore important to identify such situational information and to understand how it is propagated on social media, so that appropriate information publishing strategies can be informed for the COVID-19 epidemic. This article sought to fill this gap by harnessing Weibo data and natural language processing techniques to classify COVID-19-related information into seven types of situational information. We found specific features that predict the reposted amount of each type of information. The results provide data-driven insights into information needs and public attention.
Article
Social media has been widely used for emergency communication in both disaster-affected and unaffected areas. Comparing emotional reaction and information propagation between on-site users and off-site users from a spatiotemporal perspective can help better comprehend collective human behavior during natural disasters. In this study, we investigate sentiment and retweet patterns in disaster-affected and disaster-unaffected areas at different stages of Hurricane Harvey. The results show that off-site tweets were more negative than on-site tweets, especially during the disaster. As for retweet patterns, indifferent-neutral and positive tweets spread more broadly than mixed-neutral and negative tweets. However, negative tweets spread faster than positive tweets, which reveals that social media users were more sensitive to negative information in disaster situations. As the disaster developed, social media users became more sensitive to on-site positive messages than to off-site negative posts. This data-driven study reveals the significant effect of sentiment expression on the publication and redistribution of disaster-related messages, and it generates implications for emergency communication and disaster management.
Article
Social media has been underutilised in disaster management practices, as it was not seen as a real-time, ground-level information harvesting tool during a disaster. In recent years, with the increasing popularity and use of social media, people have started to express their views, experiences, images, and video evidence through different social media platforms. Consequently, harnessing such crowdsourced information has become an opportunity for authorities to obtain enhanced situational awareness data for efficient disaster management practices. Nonetheless, current disaster-related Twitter analytics methods are not versatile enough to define disaster impact levels as interpreted by the local communities. This paper contributes to the existing knowledge by applying and extending a well-established data analysis framework and identifying highly impacted disaster areas as perceived by the local communities. For this, the study used real-time Twitter data posted during the 2010–2011 South East Queensland Floods. The findings reveal that: (a) utilising Twitter is a promising approach to reflecting citizen knowledge; (b) tweets can be used to identify fluctuations in disaster severity over time; and (c) the spatial analysis of tweets validates the applicability of geo-located messages to demarcating highly impacted disaster zones.
Article
We note that common implementations of adaptive gradient algorithms, such as Adam, limit the potential benefit of weight decay regularization, because the weights do not decay multiplicatively (as would be expected for standard weight decay) but by an additive constant factor. We propose a simple way to resolve this issue by decoupling weight decay and the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). We also demonstrate that longer optimization runs require smaller weight decay values for optimal results and introduce a normalized variant of weight decay to reduce this dependence. Finally, we propose a version of Adam with warm restarts (AdamWR) that has strong anytime performance while achieving state-of-the-art results on CIFAR-10 and ImageNet32x32. Our source code is available at https://github.com/loshchil/AdamW-and-SGDW
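
A short PyTorch sketch of the decoupled weight decay idea with a stand-in quadratic loss: the built-in AdamW optimizer applies the decay term directly to the weights, and the manual step afterwards makes the decoupling explicit. This is an illustration of the concept, not the authors' experimental code.

    # Hedged sketch: decoupled weight decay vs. a manual decoupled update.
    import torch

    w = torch.randn(10, requires_grad=True)
    opt = torch.optim.AdamW([w], lr=1e-3, weight_decay=0.01)   # decay decoupled from the loss gradient
    loss = (w ** 2).sum()                                      # stand-in loss
    loss.backward()
    opt.step()                                                 # Adam step plus separate multiplicative decay
    opt.zero_grad()

    # Manual one-step illustration of the decoupling: take the gradient step
    # from the loss alone, then shrink the weights multiplicatively.
    lr, wd = 1e-3, 0.01
    with torch.no_grad():
        w2 = torch.randn(10)
        grad = 2 * w2              # gradient of the stand-in loss sum(w2**2)
        w2 -= lr * grad            # optimization step w.r.t. the loss only
        w2 -= lr * wd * w2         # decoupled weight decay applied directly to the weights
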
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
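
A plain-NumPy sketch of the scaled dot-product attention at the heart of the Transformer, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with random toy matrices standing in for learned projections.

    # Hedged sketch: scaled dot-product attention on toy matrices.
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                        # similarity of queries to keys
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
        return weights @ V                                     # weighted sum of values

    rng = np.random.default_rng(0)
    Q, K, V = rng.random((3, 4)), rng.random((5, 4)), rng.random((5, 4))
    print(scaled_dot_product_attention(Q, K, V).shape)         # (3, 4)
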
Conference Paper
Social media such as Twitter is emerging as a platform contributing to situational awareness during disasters. Information shared on Twitter by both the affected population (e.g., requesting assistance, warnings) and those outside the impact zone (e.g., providing assistance) helps first responders, decision makers, and the public understand the situation first-hand. Effective use of such information requires the timely selection and analysis of tweets that are relevant to a particular disaster. Even though abundant tweets are promising as a data source, it is challenging to automatically identify relevant messages, because tweets are short and unstructured, resulting in unsatisfactory classification performance of conventional learning-based approaches. Thus, we propose a simple yet effective algorithm to identify relevant messages based on matching keywords and hashtags, and provide a comparison between matching-based and learning-based approaches. To evaluate the two approaches, we put them into a framework specifically proposed for analyzing disaster-related tweets. Analysis results on eleven datasets with various disaster types show that our technique provides relevant tweets of higher quality and more interpretable results of sentiment analysis tasks when compared to the learning-based approach. A matching-based filter of this kind is sketched below.
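
A minimal sketch of a matching-based relevance filter of the kind compared above, with illustrative keyword and hashtag lists rather than the paper's actual lexicon.

    # Hedged sketch: keep a tweet if it contains any disaster keyword or hashtag.
    KEYWORDS = {"flood", "evacuate", "rescue", "damage"}
    HASHTAGS = {"#flood", "#emergency"}

    def is_relevant(tweet: str) -> bool:
        tokens = tweet.lower().split()
        return any(t.strip(".,!?") in KEYWORDS for t in tokens) or \
               any(t in HASHTAGS for t in tokens)

    print(is_relevant("Roads closed, please evacuate now #emergency"))  # True
    print(is_relevant("Beautiful sunset tonight"))                      # False
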
Flair: An easy-to-use framework for state-of-the-art NLP
  • Alan Akbik
  • Tanja Bergmann
  • Duncan Blythe
  • Kashif Rasul
  • Stefan Schweter
  • Roland Vollgraf
Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. Flair: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54-59.
CoAID: COVID-19 healthcare misinformation dataset
  • Limeng Cui
  • Dongwon Lee
Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 healthcare misinformation dataset.
BERTweet: A pre-trained language model for English Tweets
  • Dat Quoc Nguyen
  • Thanh Vu
  • Anh Tuan Nguyen
Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.
BERT: pre-training of deep bidirectional transformers for language understanding
  • Jacob Devlin
  • Ming-Wei Chang
  • Kenton Lee
  • Kristina Toutanova
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
PyTorch Lightning. GitHub
  • WA Falcon
WA Falcon. 2019. PyTorch Lightning. GitHub. https://github.com/PyTorchLightning/pytorch-lightning.
Twitter as a lifeline: Human-annotated Twitter corpora for NLP of crisis-related messages
  • Muhammad Imran
  • Prasenjit Mitra
  • Carlos Castillo
Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a lifeline: Human-annotated Twitter corpora for NLP of crisis-related messages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).
VGCN-BERT: Augmenting BERT with graph embedding for text classification
  • Zhibin Lu
  • Pan Du
  • Jian-Yun Nie
Zhibin Lu, Pan Du, and Jian-Yun Nie. 2020. VGCN-BERT: Augmenting BERT with graph embedding for text classification. In European Conference on Information Retrieval, pages 369-382. Springer.
WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets
  • Dat Quoc Nguyen
  • Thanh Vu
  • Afshin Rahimi
  • Mai Hoang Dao
  • Linh The Nguyen
  • Long Doan
Dat Quoc Nguyen, Thanh Vu, Afshin Rahimi, Mai Hoang Dao, Linh The Nguyen, and Long Doan. 2020. WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets. In Proceedings of the 6th Workshop on Noisy User-generated Text.
Knowledge-aware language model pretraining
  • Corby Rosset
  • Chenyan Xiong
  • Minh Q Phan
  • Xia Song
  • Paul Bennett
  • Saurabh Tiwary
Corby Rosset, Chenyan Xiong, Minh Q. Phan, Xia Song, Paul Bennett, and Saurabh Tiwary. 2020. Knowledge-aware language model pretraining. ArXiv, abs/2007.00655.
HuggingFace's Transformers: State-of-the-art natural language processing
  • Thomas Wolf
  • Lysandre Debut
  • Victor Sanh
  • Julien Chaumond
  • Clement Delangue
  • Anthony Moi
  • Pierric Cistac
  • Tim Rault
  • Rémi Louf
  • Morgan Funtowicz
  • Jamie Brew
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
G5: A universal Graph-BERT for graph-to-graph transfer and apocalypse learning
  • Jiawei Zhang
Jiawei Zhang. 2020. G5: A universal Graph-BERT for graph-to-graph transfer and apocalypse learning. arXiv preprint arXiv:2006.06183.
Transformer-XH: Multi-evidence reasoning with extra hop attention
  • Chen Zhao
  • Chenyan Xiong
  • Corby Rosset
  • Xia Song
  • Paul N Bennett
  • Saurabh Tiwary
Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Song, Paul N. Bennett, and Saurabh Tiwary. 2020. Transformer-XH: Multi-evidence reasoning with extra hop attention. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.