Chapter

Discovering, Classification, and Localization of Emergency Events via Analyzing of Social Network Text Streams


Abstract

We present a text processing framework for the discovery, classification, and localization of emergency-related events via analysis of information sources such as social networks. The framework performs focused crawling of messages from social networks, text parsing, information extraction, detection of messages related to emergencies, automatic discovery of novel events, matching of events across different sources, as well as event localization and visualization on a geographical map. For detection of emergency-related messages, we use a convolutional neural network (CNN) and word embeddings. The components of the framework are experimentally evaluated on Twitter and Facebook data.
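As a rough illustration of the kind of classifier the abstract refers to (a CNN over word embeddings for flagging emergency-related messages), here is a minimal Keras sketch; the toy word-index sequences, labels, and layer sizes are placeholders rather than the authors' configuration, and in practice the embedding layer would be initialized from pre-trained word vectors.

```python
# Minimal sketch of a CNN-over-word-embeddings message classifier (Keras).
# All data and hyperparameters below are illustrative placeholders.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

X = np.array([[4, 12, 7, 0, 0], [3, 9, 2, 15, 0]])  # padded word-index sequences
y = np.array([1, 0])                                # 1 = emergency-related, 0 = other

model = tf.keras.Sequential([
    layers.Embedding(input_dim=1000, output_dim=100),  # could be seeded with word2vec weights
    layers.Conv1D(filters=128, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)
print(model.predict(X, verbose=0))  # probabilities of being emergency-related
```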


... Tasks and problems here are, for example, the automated detection of hate speech or cyber-bullying [17]. There are also efforts to detect texts that indicate health problems such as depression, as well as work on emergency event detection [18][19][20]. Social media researchers deal with more concise texts than other applications on (stream) text data, such as spam email detection. Therefore, an algorithm has less contextual information from which to learn the true meaning of texts for classifying, e.g., sentiments [21,22]. ...
Article
Full-text available
Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse—ranging, e.g., from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is related to developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, an efficient training and testing procedure, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured based on the stream classification process to facilitate coordination within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.
... In [9], [34], [62], tweets are processed to detect emergencies when previously defined keywords are found. A framework for processing textual input data when detecting emergencies and extracting emergency metadata is proposed in [171], considering the processing of posts on Twitter and Facebook. With more efficient techniques, particularly considering the tools of artificial intelligence, the accuracy and processing time of textual processing algorithms based on social media data may be significantly improved, benefiting the detection of emergencies. ...
Article
Full-text available
The rapid urbanization process in the last century has deeply changed the way we live and interact with each other. As most people now live in urban areas, cities are experiencing growing demands for more efficient and sustainable public services that may improve the perceived quality of life, especially with the anticipated impacts of climate change. In this already complex scenario with increasingly overcrowded urban areas, different types of emergency situations may happen anywhere and anytime, with unpredictable costs in human lives and economic losses. In order to cope with unexpected and potentially dangerous emergencies, smart city initiatives have been developed in different cities, addressing multiple aspects of emergency detection, alerting, and mitigation. In this context, this article surveys recent smart city solutions for crisis management, proposing definitions for emergency-oriented systems and classifying them according to the employed technologies and provided services. Additionally, recent developments in the domains of the Internet of Things, Artificial Intelligence and Big Data are also highlighted when associated with the management of urban emergencies, potentially paving the way for new developments, while classifying and organizing them according to different criteria. Finally, open research challenges are identified, indicating promising trends and research directions for the coming years.
Conference Paper
Full-text available
We present a text processing framework for discovering emergency-related events via analysis of information sources such as social networks. The framework performs focused crawling of messages, text parsing, information extraction, detection of messages related to emergencies, as well as automatic discovery of novel events and matching of events across different information sources. For detection of emergency-related messages, we use a CNN and word embeddings. For discovering novel events and matching them across different sources, we propose a multimodal topic model enriched with spatial information and a method based on Jensen-Shannon divergence. The components of the framework are experimentally evaluated on Twitter and Facebook data.
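The matching step mentioned here compares event descriptions via Jensen-Shannon divergence between their topic distributions. A small self-contained sketch of that comparison is shown below; the two distributions are hypothetical and not taken from the paper.

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base-2 logs)."""
    p = np.asarray(p, dtype=float); p /= p.sum()
    q = np.asarray(q, dtype=float); q /= q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic distributions of one event as reported on Twitter vs. Facebook
twitter_event  = [0.70, 0.20, 0.05, 0.05]
facebook_event = [0.65, 0.25, 0.05, 0.05]
print(jensen_shannon(twitter_event, facebook_event))  # near 0 => likely the same event
```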
Article
Full-text available
Emergency events affect human security and safety as well as the integrity of the local infrastructure. Emergency response officials are required to make decisions using limited information and time. During emergency events, people post updates to social media networks, such as tweets, containing information about their status, help requests, incident reports, and other useful information. In this research project, the Latent Dirichlet Allocation (LDA) model is used to automatically classify incident-related tweets and incident types using Twitter data. Unlike the previous social media information models proposed in the related literature, the LDA is an unsupervised learning model which can be utilized directly without prior knowledge and preparation for data in order to save time during emergencies. Twitter data including messages and geolocation information during two recent events in New York City, the Chelsea explosion and Hurricane Sandy, are used as two case studies to test the accuracy of the LDA model for extracting incident-related tweets and labeling them by incident type. Results showed that the model could extract emergency events and classify them for both small and large-scale events, and the model’s hyper-parameters can be shared in a similar language environment to save model training time. Furthermore, the list of keywords generated by the model can be used as prior knowledge for emergency event classification and training of supervised classification models such as support vector machine and recurrent neural network.
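To make the unsupervised setup concrete, a minimal LDA sketch in gensim on toy tweets follows; the corpus, topic count, and preprocessing are illustrative assumptions, not the study's actual configuration.

```python
from gensim import corpora, models

# Toy tokenized tweets (real input would be preprocessed Twitter messages)
tweets = [
    ["explosion", "reported", "chelsea", "injuries"],
    ["flooding", "subway", "hurricane", "sandy"],
    ["power", "outage", "downtown", "hurricane"],
    ["coffee", "morning", "office"],
]
dictionary = corpora.Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])

# Each tweet can then be labelled with its dominant topic:
print(lda.get_document_topics(corpus[0]))
```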
Conference Paper
Full-text available
Exploratory search is a paradigm of information retrieval, in which the user’s intention is to learn the subject domain better. To do this the user repeats “query–browse–refine” interactions with the search engine many times. We consider typical exploratory search tasks formulated by long text queries. People usually solve such a task in about half an hour and find dozens of documents using conventional search facilities iteratively. The goal of this paper is to reduce the time-consuming multi-step process to one step without impairing the quality of the search. Probabilistic topic modeling is a suitable text mining technique to retrieve documents, which are semantically relevant to a long text query. We use the additive regularization of topic models (ARTM) to build a model that meets multiple objectives. The model should have sparse, diverse and interpretable topics. Also, it should incorporate meta-data and multimodal data such as n-grams, authors, tags and categories. Balancing the regularization criteria is an important issue for ARTM. We tackle this problem with coordinate-wise optimization technique, which chooses the regularization trajectory automatically. We use the parallel online implementation of ARTM from the open source library BigARTM. Our evaluation technique is based on crowdsourcing and includes two tasks for assessors: the manual exploratory search and the explicit relevance feedback. Experiments on two popular tech news media show that our topic-based exploratory search outperforms assessors as well as simple baselines, achieving precision and recall of about 85–92%.
Conference Paper
Full-text available
We present ongoing work on a text processing system for detection and analysis of events related to emergencies in the Arctic zone. The task is complicated by data sparseness and by the scarcity of tools and language resources for processing such specific texts. The system performs focused crawling of documents related to emergencies in the Arctic region, text parsing including named entity recognition and geotagging, and indexing of texts with their metadata for faceted search. The system aims at processing both English and Russian text messages and documents. We report the preliminary results of the experimental evaluation of the system components on Twitter data.
Conference Paper
Full-text available
We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. The key insight is based on a novel proof illustrating the label bias problem and showing that globally normalized models can be strictly more expressive than locally normalized models.
Conference Paper
Full-text available
Social media proves to be a major source of timely information during mass emergencies. A considerable amount of recent research has aimed at developing methods to detect social media messages that report such disasters at early stages. In contrast to previous work, the goal of this paper is to identify messages relating to a very broad range of possible emergencies including technological and natural disasters. The challenge of this task is data heterogeneity: messages relating to different types of disasters tend to have different feature distributions. This makes it harder to learn the classification problem; a classifier trained on certain emergency types tends to perform poorly when tested on some other types of disasters. To counteract the negative effects of data heterogeneity, we present two novel methods. The first is an ensemble method, which combines multiple classifiers specific to each emergency type to classify previously unseen texts, and the second is a semi-supervised generic classification method which uses a large collection of unlabeled messages to acquire additional training data.
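One simple way to realize the ensemble idea described above is to train a classifier per source emergency type and average their probabilities on messages from an unseen type. A hedged sketch with toy data follows; the paper's actual ensemble may combine its type-specific models differently.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled messages from two source emergency types
flood_texts = ["river overflowing near bridge", "streets flooded, need boats", "nice weather today"]
flood_labels = [1, 1, 0]
quake_texts = ["building collapsed after tremor", "strong shaking felt downtown", "lunch was great"]
quake_labels = [1, 1, 0]

vec = TfidfVectorizer().fit(flood_texts + quake_texts)
clf_flood = LogisticRegression().fit(vec.transform(flood_texts), flood_labels)
clf_quake = LogisticRegression().fit(vec.transform(quake_texts), quake_labels)

# Classify a message from an unseen emergency type by averaging the
# type-specific classifiers' probabilities (one simple ensembling choice).
new = vec.transform(["wildfire smoke, evacuation ordered on 5th street"])
prob = np.mean([clf_flood.predict_proba(new)[:, 1],
                clf_quake.predict_proba(new)[:, 1]], axis=0)
print(prob)
```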
Conference Paper
Full-text available
Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning, for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracies of these algorithms largely depend upon the size of the labeled samples that are provided during the learning phase. Other factors such as class distribution and term distribution in the training set also play an important role in the classifier's accuracy. However, due to several reasons (money/time constraints, a limited number of skilled labelers, etc.), a large sample of labeled messages is often not available immediately for learning an efficient classification model. Consequently, a classifier trained on a poor model often misclassifies data and hence the applicability of such learning techniques (especially in the online setting) during ongoing crisis response remains limited. In this paper, we propose a post-classification processing step leveraging two additional content features, stable hashtag association and stable named entity association, to improve the classification accuracy of a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013), and compared our results against the results produced by a "best-in-class" baseline online classifier. By showing consistently better quality results than the baseline algorithm, i.e., by correctly classifying the misclassified data points from the prior step (false negatives and false positives to true positives and true negatives, respectively), we demonstrate the applicability of our approach in practice.
Article
Full-text available
Microblog is a popular and open platform for discovering and sharing the latest news about social issues and daily life. The quickly-updated microblog streams make it urgent to develop an effective tool to monitor such streams. Emerging topic tracking is one such tool to reveal what new events are attracting the most online attention at present. However, due to the fast changes, high noise and short length of the microblog feeds, two challenges should be addressed in emerging topic tracking. One is the problem of detecting emerging topics early, long before they become hot, and the other is how to effectively monitor evolving topics over time. In this study, we propose a novel emerging topic tracking method, which aligns emerging word detection from a temporal perspective with coherent topic mining from a spatial perspective. Specifically, we first design a metric to estimate word novelty and fading based on local weighted linear regression (LWLR), which can highlight the word novelty of expressing an emerging topic and suppress the word novelty of expressing an existing topic. We then track emerging topics by leveraging topic novelty and fading probabilities, which are learnt by designing and solving an optimization problem. We evaluate our method on a microblog stream containing over one million feeds. Experimental results show the promising performance of the proposed method in detecting emerging topics and tracking topic evolution over time in terms of both effectiveness and efficiency.
Conference Paper
Full-text available
Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. In this paper we announce the BigARTM open source project (http://bigartm.org) for regularized multimodal topic modeling of large collections. Several experiments on the Wikipedia corpus show that BigARTM performs faster and gives better perplexity compared to other popular packages, such as Vowpal Wabbit and Gensim. We also demonstrate several unique BigARTM features, such as additive combination of regularizers, topic sparsing and decorrelation, multimodal and multilanguage modeling, which are not available in the other software packages for topic modeling.
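A hedged usage sketch of the BigARTM Python API with an additive combination of regularizers (topic sparsing plus decorrelation) is given below; the input file name, regularizer weights, and topic count are placeholders, and exact parameter names may vary across library versions.

```python
import artm

# Hypothetical input: a collection converted to Vowpal Wabbit format in 'collection.vw'
bv = artm.BatchVectorizer(data_path='collection.vw',
                          data_format='vowpal_wabbit',
                          target_folder='batches')

model = artm.ARTM(num_topics=20, dictionary=bv.dictionary)
# Additive combination of regularizers: phi sparsing plus topic decorrelation
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.1))
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator', tau=1e4))
model.scores.add(artm.TopTokensScore(name='top_tokens', num_tokens=10))

model.fit_offline(batch_vectorizer=bv, num_collection_passes=10)
for topic, tokens in model.score_tracker['top_tokens'].last_tokens.items():
    print(topic, tokens)
```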
Article
Full-text available
The advent of online social networks (OSNs) paired with the ubiquitous proliferation of smartphones have enabled social sensing systems. In the last few years, the aptitude of humans to spontaneously collect and timely share context information has been exploited for emergency detection and crisis management. Apart from event-specific features, these systems share technical approaches and architectural solutions to address the issues with capturing, filtering and extracting meaningful information from data posted to OSNs by networks of human sensors. This paper proposes a conceptual and architectural framework for the design of emergency detection systems based on the “human as a sensor” (HaaS) paradigm. An ontology for the HaaS paradigm in the context of emergency detection is defined. Then, a modular architecture, independent of a specific emergency type, is designed. The proposed architecture is demonstrated by an implemented application for detecting earthquakes via Twitter. Validation and experimental results based on messages posted during earthquakes occurred in Italy are reported.
Conference Paper
Full-text available
Bursty topic discovery in microblogs is important for people to grasp essential and valuable information. However, the task is challenging since microblog posts are particularly short and noisy. This work develops a novel probabilistic model, namely the Bursty Biterm Topic Model (BBTM), to deal with the task. BBTM extends the Biterm Topic Model (BTM) by incorporating the burstiness of biterms as prior knowledge for bursty topic modeling, which enjoys the following merits: 1) it addresses the data sparsity problem in topic modeling over short texts as effectively as BTM; 2) it can automatically discover high-quality bursty topics in microblogs in a principled and efficient way. Extensive experiments on a standard Twitter dataset show that our approach outperforms the state-of-the-art baselines significantly.
Article
Full-text available
Locating timely, useful information during crises and mass emergencies is critical for those forced to make potentially life-altering decisions. As the use of Twitter to broadcast useful information during such situations becomes more widespread, the problem of finding it becomes more difficult. We describe an approach toward improving the recall in the sampling of Twitter communications that can lead to greater situational awareness during crisis situations. First, we create a lexicon of crisis-related terms that frequently appear in relevant messages posted during different types of crisis situations. Next, we demonstrate how we use the lexicon to automatically identify new terms that describe a given crisis. Finally, we explain how to efficiently query Twitter to extract crisis-related messages during emergency events. In our experiments, using a crisis lexicon leads to substantial improvements in terms of recall when added to a set of crisis-specific keywords manually chosen by experts; it also helps to preserve the original distribution of message types.
Conference Paper
Full-text available
The use of social media to communicate timely information during crisis situations has become a common practice in recent years. In particular, the one-to-many nature of Twitter has created an opportunity for stakeholders to disseminate crisis-relevant messages, and to access vast amounts of information they may not otherwise have. Our goal is to understand what affected populations, response agencies and other stakeholders can expect—and not expect—from these data in various types of disaster situations. Anecdotal evidence suggests that different types of crises elicit different reactions from Twitter users, but we have yet to see whether this is in fact the case. In this paper, we investigate several crises—including natural hazards and human-induced disasters—in a systematic manner and with a consistent methodology. This leads to insights about the prevalence of different information types and sources across a variety of crisis situations.
Article
Full-text available
With the increasing number of real-world events that originate and are discussed on social networks, event detection is becoming a compelling research issue. However, the traditional approaches to event detection on large text streams are not designed to deal with a large number of short and noisy messages. This paper proposes an approach for the early detection of emerging hotspot events in social networks with location sensitivity. We consider the locations mentioned in messages for identifying the locations of events. In our approach, we identify strong correlations between user locations and event locations in detecting the emerging events. We evaluate our approach based on a real-world Twitter dataset. Our experiments show that the proposed approach can effectively detect emerging events with respect to user locations that have different granularities.
Conference Paper
Full-text available
Event detection from tweets is an important task to understand the current events/topics attracting a large number of common users. However, the unique characteristics of tweets (e.g., short and noisy content, diverse and fast changing topics, and large data volume) make event detection a challenging task. Most existing techniques proposed for well-written documents (e.g., news articles) cannot be directly adopted. In this paper, we propose a segment-based event detection system for tweets, called Twevent. Twevent first detects bursty tweet segments as event segments and then clusters the event segments into events considering both their frequency distribution and content similarity. More specifically, each tweet is split into non-overlapping segments (i.e., phrases that possibly refer to named entities or semantically meaningful information units). The bursty segments are identified within a fixed time window based on their frequency patterns, and each bursty segment is described by the set of tweets containing the segment published within that time window. The similarity between a pair of bursty segments is computed using their associated tweets. After clustering bursty segments into candidate events, Wikipedia is exploited to identify the realistic events and to derive the most newsworthy segments to describe the identified events. We evaluate Twevent and compare it with the state-of-the-art method using 4.3 million tweets published by Singapore-based users in June 2010. In our experiments, Twevent outperforms the state-of-the-art method by a large margin in terms of both precision and recall. More importantly, the events detected by Twevent can be easily interpreted with little background knowledge because of the newsworthy segments. We also show that Twevent is efficient and scalable, leading to a desirable solution for event detection from tweets.
Conference Paper
Full-text available
We present AIDR (Artificial Intelligence for Disaster Response), a platform designed to perform automatic classification of crisis-related microblog communications. AIDR enables humans and machines to work together to apply human intelligence to large-scale data at high speed. The objective of AIDR is to classify messages that people post during disasters into a set of user-defined categories of information (e.g., "needs", "damage", etc.) For this purpose, the system continuously ingests data from Twitter, processes it (i.e., using machine learning classification techniques) and leverages human-participation (through crowdsourcing) in real-time. AIDR has been successfully tested to classify informative vs. non-informative tweets posted during the 2013 Pakistan Earthquake. Overall, we achieved a classification quality (measured using AUC) of 80%. AIDR is available at http://aidr.qcri.org/.
Article
Full-text available
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
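In gensim (4.x parameter names), a skip-gram model with negative sampling and frequent-word subsampling, as described above, can be trained roughly as follows; the toy corpus and hyperparameters are illustrative only.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized messages; a realistic corpus would be far larger.
sentences = [
    ["fire", "reported", "near", "the", "harbour"],
    ["firefighters", "responding", "to", "harbour", "fire"],
    ["earthquake", "felt", "across", "the", "city"],
]
# sg=1 selects the skip-gram architecture; negative=5 enables negative sampling;
# sample controls subsampling of frequent words.
model = Word2Vec(sentences, vector_size=50, window=3, sg=1, negative=5,
                 sample=1e-3, min_count=1, epochs=50, seed=0)
print(model.wv.most_similar("fire", topn=3))
```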
Article
Full-text available
Microblogs such as Twitter reflect the general public's reactions to major events. Bursty topics from microblogs reveal what events have attracted the most online attention. Although bursty event detection from text streams has been studied before, previous work may not be suitable for microblogs because, compared with other text streams such as news articles and scientific publications, microblog posts are particularly diverse and noisy. To find topics that have bursty patterns on microblogs, we propose a topic model that simultaneously captures two observations: (1) posts published around the same time are more likely to have the same topic, and (2) posts published by the same user are more likely to have the same topic. The former helps find event-driven posts while the latter helps identify and filter out "personal" posts. Our experiments on a large Twitter dataset show that there are more meaningful and unique bursty topics in the top-ranked results returned by our model than an LDA baseline and two degenerate variations of our model. We also show some case studies that demonstrate the importance of considering both the temporal information and users' personal interests for bursty topic detection from microblogs.
Article
Full-text available
Latent topic analysis has emerged as one of the most effective methods for classifying, clustering and retrieving textual data. However, existing models such as Latent Dirichlet Allocation (LDA) were developed for static corpora of relatively large documents. In contrast, much of the textual content on the web, and especially social media, is temporally sequenced, and comes in short fragments, including microblog posts on sites such as Twitter and Weibo, status updates on social networking sites such as Facebook and LinkedIn, or comments on content sharing sites such as YouTube. In this paper we propose a novel topic model, Temporal-LDA or TM-LDA, for efficiently mining text streams such as a sequence of posts from the same author, by modeling the topic transitions that naturally arise in these data. TM-LDA learns the transition parameters among topics by minimizing the prediction error on topic distribution in subsequent postings. After training, TM-LDA is thus able to accurately predict the expected topic distribution in future posts. To make these predictions more efficient for a realistic online setting, we develop an efficient updating algorithm to adjust the topic transition parameters, as new documents stream in. Our empirical results, over a corpus of over 30 million microblog posts, show that TM-LDA significantly outperforms state-of-the-art static LDA models for estimating the topic distribution of new documents over time. We also demonstrate that TM-LDA is able to highlight interesting variations of common topic transitions, such as the differences in the work-life rhythm of cities, and factors associated with area-specific problems and complaints.
Article
Full-text available
Twitter is a user-generated content system that allows its users to share short text messages, called tweets, for a variety of purposes, including daily conversations, URL sharing and news. Considering its world-wide distributed network of users of any age and social condition, it represents a low-level news flash portal whose impressively short response time is its principal advantage. In this paper we recognize this primary role of Twitter and propose a novel topic detection technique that retrieves in real time the most emergent topics expressed by the community. First, we extract the contents (set of terms) of the tweets and model the term life cycle according to a novel aging theory intended to mine the emerging ones. A term can be defined as emerging if it frequently occurs in the specified time interval and was relatively rare in the past. Moreover, considering that the importance of a content also depends on its source, we analyze the social relationships in the network with the well-known PageRank algorithm in order to determine the authority of the users. Finally, we leverage a navigable topic graph which connects the emerging terms with other semantically related keywords, allowing the detection of the emerging topics under user-specified time constraints. We provide different case studies which show the validity of the proposed approach.
Conference Paper
Full-text available
During a disastrous event, such as an earthquake or river flooding, information on what happened, who was affected and how, where help is needed, and how to aid people who were affected, is crucial. While communication is important in such times of crisis, damage to infrastructure such as telephone lines makes it difficult for authorities and victims to communicate. Microblogging has played a critical role as an important communication platform during crises when other media has failed. We demonstrate our ESA (Emergency Situation Awareness) system that mines microblogs in real-time to extract and visualise useful information about incidents and their impact on the community in order to equip the right authorities and the general public with situational awareness.
Conference Paper
Full-text available
Twitter, a popular microblogging service, has received much attention recently. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many Twitter posts (tweets) related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. As described in this paper, we investigate the real-time interaction of events such as earthquakes in Twitter, and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location. We consider each Twitter user as a sensor and apply Kalman filtering and particle filtering, which are widely used for location estimation in ubiquitous/pervasive computing. The particle filter works better than other compared methods in estimating the centers of earthquakes and the trajectories of typhoons. As an application, we construct an earthquake reporting system in Japan. Because of the numerous earthquakes and the large number of Twitter users throughout the country, we can detect an earthquake by monitoring tweets with high probability (96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more are detected). Our system detects earthquakes promptly and sends e-mails to registered users. Notification is delivered much faster than the announcements that are broadcast by the JMA.
Conference Paper
Full-text available
This paper presents an online topic model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the latent Dirichlet allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect the emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, the OLDA has discovered interesting patterns by just analyzing a fraction of data at a time. Our tests also prove the ability of OLDA to align the topics across the epochs with which the evolution of the topics over time is captured. The OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
Conference Paper
Full-text available
Geographically-grounded situational awareness (SA) is critical to crisis management and is essential in many other decision making domains that range from infectious disease monitoring, through regional planning, to political campaigning. Social media are becoming an important information input to support situational assessment (to produce awareness) in all domains. Here, we present a geovisual analytics approach to supporting SA for crisis events using one source of social media, Twitter. Specifically, we focus on leveraging explicit and implicit geographic information for tweets, on developing place-time-theme indexing schemes that support overview+detail methods and that scale analytical capabilities to relatively large tweet volumes, and on providing visual interface methods to enable understanding of place, time, and theme components of evolving situations. Our approach is user-centered, using scenario-based design methods that include formal scenarios to guide design and validate implementation as well as a systematic claims analysis to justify design choices and provide a framework for future testing. The work is informed by a structured survey of practitioners and the end product of Phase-I development is demonstrated / validated through implementation in SensePlace2, a map-based, web application initially focused on tweets but extensible to other media.
Article
Full-text available
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
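For orientation, the aspect model behind PLSI explains each observed document-word co-occurrence through a latent topic variable z; in standard notation (paraphrased, not quoted from the paper):

```latex
% Aspect model underlying PLSI: a latent topic z generates each (d, w) pair
P(d, w) = P(d) \sum_{z} P(w \mid z) \, P(z \mid d)
```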
Conference Paper
Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.
Article
This paper proposes a simple and efficient approach for text classification and representation learning. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
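A minimal supervised fastText sketch matching this description follows; the training file path and its __label__-prefixed line format are assumptions about typical usage, not taken from the article.

```python
import fasttext

# Hypothetical training file 'train.txt', one message per line, e.g.:
# __label__emergency flooding reported on main street
# __label__other great concert tonight
model = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5, wordNgrams=2)

# Predict the label (and its probability) for a new message
print(model.predict("power lines down after the storm"))
```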
Article
Social media platforms such as Twitter provide valuable information for aiding disaster response during emergency events. Machine learning could be used to identify such information. However, supervised learning algorithms rely on labelled data, which is not readily available for an emerging target disaster. While labelled data might be available for a prior source disaster, supervised classifiers learned only from the source disaster may not perform well on the target disaster, as each event has unique characteristics (e.g., type, location, and culture) and may cause different social media responses. To address this limitation, we propose to use a domain adaptation approach, which learns classifiers from unlabelled target data, in addition to source labelled data. Our approach uses the Naïve Bayes classifier, together with an iterative Self-Training strategy. Experimental results on the task of identifying tweets relevant to a disaster of interest show that the domain adaptation classifiers are better as compared to the supervised classifiers learned only from labelled source data.
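The domain adaptation recipe above (Naïve Bayes plus iterative self-training on unlabelled target data) can be sketched with scikit-learn as follows; the toy texts, confidence threshold, and number of iterations are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Labelled source-disaster tweets and unlabelled target-disaster tweets (toy data)
source_texts = ["bridge washed out by flood", "evacuation shelters open now",
                "enjoying the sunny afternoon", "new cafe opened downtown"]
source_labels = np.array([1, 1, 0, 0])        # 1 = disaster-relevant
target_texts = ["aftershocks continue, stay outside", "brunch with friends was lovely"]

vec = CountVectorizer().fit(source_texts + target_texts)
X_src = vec.transform(source_texts).toarray()
X_tgt = vec.transform(target_texts).toarray()

clf = MultinomialNB().fit(X_src, source_labels)
for _ in range(3):                             # iterative self-training
    proba = clf.predict_proba(X_tgt)
    confident = proba.max(axis=1) > 0.6        # keep only confident pseudo-labels
    pseudo = proba.argmax(axis=1)
    X_all = np.vstack([X_src, X_tgt[confident]])
    y_all = np.concatenate([source_labels, pseudo[confident]])
    clf = MultinomialNB().fit(X_all, y_all)
print(clf.predict(X_tgt))                      # predicted relevance on target tweets
```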
Article
This article presents a novel approach to detecting emergency events, such as power outages, that utilizes social media users as "social sensors" for virtual detection of such events. The proposed new method is based on the analysis of the Twitter data that leads to the detection of Twitter discussions about these emergency events. The method described in the article was implemented and deployed by one of the vendors in the context of detecting power outages as a part of their comprehensive social engagement platform. It was also field tested on Twitter users in an industrial setting and performed well during these tests.
Conference Paper
We present a text processing framework for detection and analysis of events related to emergencies in a specified region, taking the Arctic zone as a particular example. The task is complicated by data sparseness and by the scarcity of tools and language resources for processing such specific texts. The system performs focused crawling of texts related to emergencies in the Arctic region, information extraction including named entity recognition, geotagging, vessel name recognition, and detection of emergency-related messages, as well as indexing of texts with their metadata for faceted search. The framework aims at processing both English and Russian text messages and documents. We report the results of the experimental evaluation of the framework components on Twitter data.
Article
The first objective towards the effective use of microblogging services such as Twitter for situational awareness during emerging disasters is the discovery of disaster-related postings. Given the wide range of possible disasters, using a pre-selected set of disaster-related keywords for the discovery is suboptimal. An alternative that we focus on in this work is to train a classifier using a small set of labeled postings that become available as a disaster is emerging. Our hypothesis is that utilizing large quantities of historical microblogs could improve the quality of classification, as compared to training a classifier only on the labeled data. We propose to use unlabeled microblogs to cluster words into a limited number of clusters and use the word clusters as features for classification. To evaluate the proposed semi-supervised approach, we used Twitter data from 6 different disasters. Our results indicate that when the number of labeled tweets is 100 or less, the proposed approach is superior to the standard classification based on the bag-of-words feature representation. Our results also reveal that the choice of the unlabeled corpus, the choice of word clustering algorithm, and the choice of hyperparameters can have a significant impact on the classification accuracy.
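A hedged sketch of the word-cluster feature idea: cluster embeddings learned from unlabelled historical tweets, then represent each message as a bag of cluster counts. The toy corpus, cluster count, and embedding size below are placeholders, not the study's configuration.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Unlabelled historical tweets (toy) used only to build word clusters
unlabeled = [["flood", "water", "rising", "river"],
             ["earthquake", "tremor", "shaking"],
             ["party", "music", "friends", "fun"],
             ["rain", "storm", "wind", "flood"]]
w2v = Word2Vec(unlabeled, vector_size=25, min_count=1, epochs=50, seed=0)
words = list(w2v.wv.index_to_key)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(w2v.wv[words])
word2cluster = dict(zip(words, kmeans.labels_))

def cluster_features(tokens, n_clusters=3):
    """Represent a tweet as a bag of word-cluster counts (classifier features)."""
    feats = np.zeros(n_clusters)
    for t in tokens:
        if t in word2cluster:
            feats[word2cluster[t]] += 1
    return feats

print(cluster_features(["storm", "water", "rising"]))
```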
Article
Twitter has become one of the largest microblogging platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real-time. As existing methods can hardly scale to handle the task with the tweet stream in real-time, we propose in this paper TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experimental results show both the efficiency and effectiveness of our approach. In particular, it is also demonstrated that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets in Twitter, and present bursty events in finer granularity.
Article
Twitter, as a form of social media, has been fast emerging in recent years. Users are using Twitter to report real-life events. This paper focuses on detecting those events by analyzing the text stream in Twitter. Although event detection has long been a research topic, the characteristics of Twitter make it a non-trivial task. Tweets reporting such events are usually overwhelmed by a high flood of meaningless "babbles". Moreover, the event detection algorithm needs to be scalable given the sheer amount of tweets. This paper attempts to tackle these challenges with EDCoW (Event Detection with Clustering of Wavelet-based Signals). EDCoW builds signals for individual words by applying wavelet analysis on the frequency-based raw signals of the words. It then filters away the trivial words by looking at their corresponding signal auto-correlations. The remaining words are then clustered to form events with a modularity-based graph partitioning technique. Experimental studies show promising results for EDCoW. We also present the design of a proof-of-concept system, which was used to analyze netizens' online discussion about the Singapore General Election 2011.
Conference Paper
This paper presents an LDA-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. Unlike other recent work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. Thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. We present results on nine months of personal email, 17 years of NIPS research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends.
Article
Social media such as Twitter or weblogs are a popular source for live textual data. Much of this popularity is due to the fast rate at which this data arrives, and there are a number of global events - such as the Arab Spring - where Twitter is reported to have had a major influence. However, existing methods for emerging topic detection are often only able to detect events of a global magnitude such as natural disasters or celebrity deaths, and can monitor user-selected keywords or operate on a curated set of hashtags only. Interesting emerging topics may, however, be of much smaller magnitude and may involve the combination of two or more words that themselves are not unusually hot at that time. Our contributions to the detection of emerging trends are three-fold: first of all, we propose a significance measure that can be used to detect emerging topics early, long before they become "hot tags", by drawing upon experience from outlier detection. Secondly, by using hash tables in a heavy-hitters type algorithm for establishing a noise baseline, we show how to track even all keyword pairs using only a fixed amount of memory. Finally, we aggregate the detected co-trends into larger topics using clustering approaches, as often a single event will cause multiple word combinations to trend at the same time.
Article
We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We first show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static word vectors. The CNN models discussed herein improve upon the state-of-the-art on 4 out of 7 tasks, which include sentiment analysis and question classification.
Conference Paper
Streaming user-generated content in the form of blogs, microblogs, forums, and multimedia sharing sites provides a rich source of data from which invaluable information and insights may be gleaned. Given the vast volume of such social media data being continually generated, one of the challenges is to automatically tease apart the emerging topics of discussion from the constant background chatter. Such emerging topics can be identified by the appearance of multiple posts on a unique subject matter, which is distinct from previous online discourse. We address the problem of identifying emerging topics through the use of dictionary learning. We propose a two-stage approach based on detection and clustering of novel user-generated content, respectively. We derive a scalable approach by using the alternating directions method to solve the resulting optimization problems. Empirical results show that our proposed approach is more effective than several baselines in detecting emerging topics in traditional news story and newsgroup data. We also demonstrate the practical application to social media analysis, based on a study of streaming data from Twitter.
Conference Paper
A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. The approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. Variational approximations based on Kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. The models are demonstrated by analyzing the OCR'ed archives of the journal Science from 1880 through 2000.
CatBoost: unbiased boosting with categorical features
  • L Prokhorenkova
  • G Gusev
  • A Vorobev
  • A V Dorogush
  • A Gulin
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., Gulin, A.: CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems, pp. 6639-6649 (2018)
Tweedr: mining Twitter to inform disaster response
  • Z Ashktorab
  • C Brown
  • M Nandi
  • A Culotta
Ashktorab, Z., Brown, C., Nandi, M., Culotta, A.: Tweedr: mining Twitter to inform disaster response. In: Proceedings of ISCRAM, pp. 354-358 (2014)
LightGBM: a highly efficient gradient boosting decision tree
  • G Ke
Ke, G., et al.: LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, pp. 3149-3157 (2017)
Globally normalized transition-based neural networks
  • D Andor
Andor, D., et al.: Globally normalized transition-based neural networks. In: Proceedings of ACL (2016)
A C-LSTM neural network for text classification
  • C Zhou
  • C Sun
  • Z Liu
  • F Lau
Zhou, C., Sun, C., Liu, Z., Lau, F.: A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630 (2015)