Article

Event Detection in Twitter

Abstract

Twitter, as a form of social media, has grown rapidly in recent years, and users increasingly use it to report real-life events. This paper focuses on detecting those events by analyzing the text stream in Twitter. Although event detection has long been a research topic, the characteristics of Twitter make it a non-trivial task. Tweets reporting such events are usually drowned out by a flood of meaningless "babble". Moreover, an event detection algorithm needs to be scalable given the sheer volume of tweets. This paper tackles these challenges with EDCoW (Event Detection with Clustering of Wavelet-based Signals). EDCoW builds a signal for each individual word by applying wavelet analysis to the word's frequency-based raw signal. It then filters away trivial words by examining their signal auto-correlations. The remaining words are clustered to form events using a modularity-based graph partitioning technique. Experimental studies show promising results for EDCoW. We also present the design of a proof-of-concept system, which was used to analyze netizens' online discussion of the Singapore General Election 2011.
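The auto-correlation filter described in the abstract can be sketched in a few lines. This is an illustrative reading of the idea, not the authors' implementation: the threshold value and the use of raw (rather than wavelet-transformed) frequency signals are assumptions made for clarity.

```python
def autocorr(signal, lag=1):
    """Lag-`lag` auto-correlation of a word's frequency signal."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal)
    if var == 0:
        return 0.0
    cov = sum((signal[i] - mean) * (signal[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def filter_trivial_words(word_signals, threshold=0.2):
    """Keep words whose frequency signal shows temporal structure;
    flat or randomly fluctuating 'babble' words score low."""
    return {w: s for w, s in word_signals.items()
            if autocorr(s) >= threshold}
```

A bursty word such as "earthquake" (counts spiking over a few consecutive windows) passes the filter, while chatter whose counts fluctuate without structure is dropped. EDCoW's actual signals are wavelet-transformed first, which this sketch omits.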

... The high volume of posts on this social network and the breadth of its user base have attracted the interest of researchers. In recent years, this huge amount of data has provided a hotbed for data mining research to discover facts, trends, and events, and even to predict specific incidents [3,4,5,6,7]. ...
... Therefore, the process for storing and analyzing this inexhaustible stream of information should not depend on the hardware specification of the system, such as memory capacity. In other words, whether the event detection algorithm is applied to two months or two days of tweets, its accuracy should not degrade, nor should hardware limitations such as memory prevent it from running [5]. We therefore need an approach that detects events by analyzing tweets as they are received, requiring no further storage or processing. ...
... In fact, many papers make no distinct segregation between trending-topic detection and event detection. Papers discussing trending-topic detection frequently mention the term event detection, including [5,13,14,15,16,17]. The same holds for work on Twitter. ...
... The edges represent the semantic relationship or correlation strength between nodes. Graph-based event recognition is implemented with techniques such as dense subgraphs of tightly coupled entities (Angel et al. [13]), community detection [13], [240], and graph partitioning [241]. ...
... An event is represented as a cluster of keywords [242], [243]. The definition is similar to that of the feature-pivot event [133], [241], [244], where an event is a group of keywords. It assumes that, in massive streams of documents, a sharp change in word frequency indicates the burst of a new event. ...
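The burst assumption above can be made concrete with a simple trailing-window deviation score. The window size and threshold below are illustrative assumptions, not values from the cited works:

```python
def burst_scores(counts, window=3, eps=1e-9):
    """Score each time step by how far its count deviates from the
    mean of the preceding `window` steps (in standard deviations)."""
    scores = [0.0] * min(window, len(counts))
    for t in range(window, len(counts)):
        hist = counts[t - window:t]
        mean = sum(hist) / window
        sd = (sum((x - mean) ** 2 for x in hist) / window) ** 0.5
        scores.append((counts[t] - mean) / (sd + eps))
    return scores

def is_bursting(counts, threshold=3.0):
    """A word bursts if any step deviates sharply from its recent history."""
    return max(burst_scores(counts)) >= threshold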
... Weng and Lee [241] presented an event detection paradigm rooted in signal processing theory. In this model, the occurrence timeline of each word is transformed into a signal by wavelet analysis. ...
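A one-level Haar transform, the simplest wavelet, illustrates how such a signal exposes when a word's frequency shifts. This is a minimal sketch; EDCoW's actual wavelet choice and multi-scale analysis are richer.

```python
def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform.
    Returns (approximation, detail); a detail coefficient is large
    where the count changes sharply within a pair of adjacent windows."""
    assert len(signal) % 2 == 0, "pad the signal to even length first"
    root2 = 2 ** 0.5
    approx = [(signal[2 * i] + signal[2 * i + 1]) / root2
              for i in range(len(signal) // 2)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / root2
              for i in range(len(signal) // 2)]
    return approx, detail
```

For a word whose hourly counts drop from 4 to 0 mid-stream, the detail coefficients are zero everywhere except at the pair where the drop occurs, localizing the change in time.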
Article
There are large and growing amounts of textual data that contain information about human activities. Mining interesting knowledge from this textual data is a challenging task because it consists of unstructured or semi-structured text written in natural language. In the field of artificial intelligence, event-oriented techniques are helpful in addressing this problem, where information retrieval (IR), information extraction (IE) and graph methods (GMs) are three of the most important paradigms supporting event-oriented processing. In recent years, due to information explosions, textual event detection and recognition have received extensive research attention and achieved great success. Many surveys have been conducted to retrospectively assess the development of event detection. However, until now, all of these surveys have focused on only a single aspect of IR, IE or GMs. No research provides a complete introduction to, or a comparison of, IR, IE, and GMs. In this article, a survey of these techniques is provided from a broader perspective, and a convenient and comprehensive comparison of them is given. The hallmark of this article is that it is the first survey that combines IR, IE and GMs in a single frame and will therefore benefit researchers by acting as a reference in this field.
... These challenges are inherent in the event detection problem. Other existing approaches utilize platform-specific features, such as hashtags, retweets, and followers, and external knowledge to enhance event detection (Doulamis, Doulamis, Kokkinos, & Varvarigos, 2016), (Weng & Lee, 2011). The problem with these approaches is that when these attributes do not appear in the process, for whatever reason, the accuracy of the event detection is affected. ...
... In our work, we used a greedy search approach considering only important words. Weng and Lee (2011) proposed an approach that depends on clustering wavelet signals. A signal is built for each word using wavelet analysis to reduce space and storage. ...
Thesis
Recently, microblogs have become a new communication medium between users, allowing millions of users to post and share content about their activities and opinions on different topics. Posting about occurring real-world events has attracted people to follow events through microblogs instead of mainstream media. As a result, there is an urgent need to detect events from microblogs so that users can identify events quickly and, more importantly, so that higher authorities can respond faster to occurring events by taking proper actions. While considerable research has been conducted on event detection for the English language, the Arabic context has not received much attention even though there are millions of Arabic-speaking users. Moreover, existing approaches rely on platform-dependent features such as hashtags, mentions, and retweets, which makes them fail when these features are not present. In addition, approaches that depend only on the presence of frequently used words do not always detect real events, because they cannot differentiate between events and general viral topics. In this thesis, we propose an approach for Arabic event detection from microblogs. We first collect the data, then apply a preprocessing step to enhance data quality and reduce noise. The sentence text is analyzed and part-of-speech tags are identified. A set of rules is then used to extract event-indicator keywords called event triggers. The frequency of each event trigger is calculated; triggers with frequencies higher than the average are kept, and the rest are removed. We detect events by clustering similar event triggers together, applying an adapted soft frequent pattern mining algorithm to the remaining event triggers. We used a dataset called Evetar to evaluate the proposed approach. The dataset contains tweets covering different types of Arabic events that occurred in a one-month period. We split the dataset into subsets using different time intervals to mimic the streaming behavior of microblogs, and used precision, recall and f-measure as evaluation metrics. The highest average f-measure achieved was 0.717. Our results were acceptable compared to three popular approaches applied to the same dataset.
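The trigger-filtering step described above (keep triggers whose frequency exceeds the average) is easy to sketch. The co-occurrence grouping below is a deliberately simplified stand-in for the adapted soft frequent pattern mining the thesis actually uses, and the Jaccard threshold is an assumption:

```python
from collections import Counter

def frequent_triggers(occurrences):
    """Keep event triggers whose frequency is higher than the average."""
    counts = Counter(occurrences)
    avg = sum(counts.values()) / len(counts)
    return {t for t, c in counts.items() if c > avg}

def group_triggers(tweets, kept, jaccard=0.5):
    """Greedily group triggers that occur in largely the same tweets."""
    occ = {t: {i for i, tw in enumerate(tweets) if t in tw} for t in kept}
    groups = []
    for t in sorted(kept):
        for g in groups:
            rep = g[0]
            inter = len(occ[t] & occ[rep])
            union = len(occ[t] | occ[rep])
            if union and inter / union >= jaccard:
                g.append(t)
                break
        else:
            groups.append([t])
    return groups
```

Triggers that always appear together ("quake", "tsunami") end up in one event cluster, while an unrelated trigger ("election") forms its own.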
... Clusters were ranked based on the value of the entropy. Event detection in Twitter was carried out by [20]. The paper focused on detecting real-life events from tweets using Event Detection with Clustering of Wavelet-based Signals (EDCoW). ...
Article
Interactions via social media platforms have made it possible for anyone, irrespective of physical location, to gain access to quick information on events taking place all over the globe. However, the semantic processing of social media data is complicated due to challenges such as language complexity, unstructured data, and ambiguity. In this paper, we proposed the Social Media Analysis Framework for Event Detection (SMAFED). SMAFED aims to facilitate improved semantic analysis of noisy terms in social media streams, improved representation/embedding of social media stream content, and improved summarization of event clusters in social media streams. For this, we employed key concepts such as integrated knowledge base, resolving ambiguity, semantic representation of social media streams, and Semantic Histogram-based Incremental Clustering based on semantic relatedness. Two evaluation experiments were conducted to validate the approach. First, we evaluated the impact of the data enrichment layer of SMAFED. We found that SMAFED outperformed other pre-processing frameworks with a lower loss function of 0.15 on the first dataset and 0.05 on the second dataset. Second, we determined the accuracy of SMAFED at detecting events from social media streams. The result of this second experiment showed that SMAFED outperformed existing event detection approaches with better Precision (0.922), Recall (0.793), and F-Measure (0.853) metric scores. The findings of the study present SMAFED as a more efficient approach to event detection in social media.
... Fung et al. (2005) first proposed a parameter-free method to detect bursty words and entities in text streams, then detected events by clustering the detected words and entities. Similarly, EDCoW (Weng and Lee 2011) and Twevent (Li, Sun, and Datta 2012) are also based on word-burst detection. Specifically, to deal with the short and noisy contents of tweets, Twevent (Li, Sun, and Datta 2012) splits each tweet into non-overlapping segments, then detects and clusters bursty segments to fulfill event detection. ...
Article
Sub-event discovery is an effective method for social event analysis on Twitter. It can discover sub-events from large amounts of noisy event-related information on Twitter and represent them semantically. The task is challenging because tweets are short, informal and noisy. To solve this problem, we consider leveraging event-related hashtags, which contain many locations, dates and concise sub-event-related descriptions, to enhance sub-event discovery. To this end, we propose a hashtag-based mutually generative Latent Dirichlet Allocation model (MGe-LDA). In MGe-LDA, the hashtags and topics of a tweet are mutually generated by each other. The mutually generative process models the relationship between hashtags and topics of tweets, and highlights the role of hashtags as a semantic representation of the corresponding tweets. Experimental results show that MGe-LDA can significantly outperform state-of-the-art methods for sub-event discovery.
... Several studies have been carried out by various researchers in ED; some of the relevant works are mentioned below. Jianshu et al. [4] present a study on ED by clustering wavelet-based signals (EDCoW). They build signals for individual words by applying wavelet analysis, which provides precise measurements of when and how the frequency of a signal changes over time, to the frequency-based raw signals of the words, and then filter away the trivial words by looking at their corresponding signal auto-correlations. ...
Conference Paper
Automatically handling the enormous amount of text data that is being generated with mind-blowing speed is an ongoing line of work in text processing for various applications. Event Detection (ED) is one such application that aims to extract information about events in a given text based on the words which indicate the events. It acts as a preprocessing step for various Natural Language Processing (NLP) applications such as relation extraction, topic modeling, and decision making. In this paper, we, team MUCS, present an approach that uses a Linear SVC to identify pieces of text indicating events and then classifies those events into predefined categories using n-gram, suffix, and prefix features. The model was submitted to the Event Detection from News in Indian Languages (EDNIL) task at the Forum for Information Retrieval Evaluation (FIRE 2020).
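The feature side of such an approach can be sketched as follows. The exact n-gram order, affix length, and feature naming are assumptions for illustration; the actual Linear SVC training is omitted.

```python
def extract_features(token, n=3, affix_len=3):
    """Character n-gram plus prefix/suffix features for one token,
    of the kind fed to a linear classifier for event-word detection."""
    feats = {f"prefix={token[:affix_len]}", f"suffix={token[-affix_len:]}"}
    padded = f"^{token}$"          # mark word boundaries
    feats.update(f"ng={padded[i:i + n]}"
                 for i in range(len(padded) - n + 1))
    return feats
```

Each token is mapped to a sparse set of string features, which a linear model can weight to decide whether the token indicates an event.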
... Clusters were ranked based on the value of the entropy. Event detection in Twitter was carried out by [20]. The paper focused on detecting real-life events from tweets using Event Detection with Clustering of Wavelet-based Signals (EDCoW). ...
Preprint
Interactions via social media platforms have made it possible for anyone, irrespective of physical location, to gain access to quick information on events taking place all over the globe. However, the semantic processing of social media data is complicated due to challenges such as language complexity, unstructured data, and ambiguity. In this paper, we proposed the Social Media Analysis Framework for Event Detection (SMAFED). SMAFED aims to facilitate improved semantic analysis of noisy terms in social media streams, improved representation/embedding of social media stream content, and improved summarisation of event clusters in social media streams. For this, we employed key concepts such as integrated knowledge base, resolving ambiguity, semantic representation of social media streams, and Semantic Histogram-based Incremental Clustering based on semantic relatedness. Two evaluation experiments were conducted to validate the approach. First, we evaluated the impact of the data enrichment layer of SMAFED. We found that SMAFED outperformed other pre-processing frameworks with a lower loss function of 0.15 on the first dataset and 0.05 on the second dataset. Secondly, we determined the accuracy of SMAFED at detecting events from social media streams. The result of this second experiment showed that SMAFED outperformed existing event detection approaches with better Precision (0.922), Recall (0.793), and F-Measure (0.853) metric scores. The findings of the study present SMAFED as a more efficient approach to event detection in social media.
... Weng and Lee [99] targeted Twitter, where new events are tweeted and discussed. Twitter is a challenging target for event-detection tasks, as it operates at a very large scale and is full of noisy data and tweets that are not related to any new event. ...
Article
Tracking social media sentiment on a desired target is certainly an important query for many decision-makers in fields like services, politics, entertainment, manufacturing, etc. As a result, there has been a lot of focus on Sentiment Analysis. Moreover, some studies took one step further by analyzing subjective texts to understand possible motives behind extracted sentiments. A few other studies went several steps further by attempting to automatically interpret sentiment variations. Learning reasons for sentiment variations is indeed valuable, either to take necessary actions in a timely manner or to learn lessons from archived data. However, machines are still too immature to carry out the full Sentiment Variations' Reasoning task perfectly due to various technical hurdles. This paper explores the main approaches to Opinion Reason Mining, with a focus on Interpreting Sentiment Variations. Our objectives are investigating various methods for solving the Sentiment Variations' Reasoning problem and identifying some empirical research gaps. To identify these gaps, a real-life Twitter dataset is analyzed, and key hypotheses for interpreting public sentiment variations are examined.
... In a 2018 study, Kovacs-Györi et al. [44] investigated how social media data can be utilized to study planned large events in cities, using the 2012 Olympic Games in London. Similar studies have been performed for the Rio Olympics [45], for transportation during planned (and unplanned) events [46], and to detect events based on Twitter data [47][48][49]. ...
Article
Urban systems involve a multitude of closely intertwined components, which are more measurable than before due to new sensors, data collection, and spatio-temporal analysis methods. Turning these data into knowledge to facilitate planning efforts in addressing current challenges of urban complex systems requires advanced interdisciplinary analysis methods, such as urban informatics or urban data science. Yet, by applying a purely data-driven approach, it is too easy to get lost in the ‘forest’ of data, and to miss the ‘trees’ of successful, livable cities that are the ultimate aim of urban planning. This paper assesses how geospatial data and urban analysis, using a mixed-methods approach, can help to better understand urban dynamics and human behavior, and how they can assist planning efforts to improve livability. Based on a review of state-of-the-art research, the paper goes one step further and also addresses the potential as well as the limitations of new data sources in urban analytics, to get a better overview of the whole ‘forest’ of these new data sources and analysis methods. The main discussion revolves around the reliability of using big data from social media platforms or sensors, and how information can be extracted from massive amounts of data through novel analysis methods, such as machine learning, for better-informed decision making aimed at urban livability improvement.
... In recent years, there has been a renewal of interest in social text streaming data [53]. Some studies detect events from Twitter [19,54] and aim to harvest collective intelligence [55,56], and some research has focused on news and blogs [17]. When an event occurs, web users search for the latest information about it as well as publish blog posts to discuss it. ...
Article
More and more people are involved in sustainability-related activities through social networks to support or protect their ideas or motivations for sustainable development. Understanding the variety of issues in social pulsation is crucial to the development of social sustainability. However, issues in social media generally change over time. Issues not identified in advance may soon become popular topics discussed in society, particularly controversial issues. Previous studies have focused on the detection of hot topics and discussion of controversial issues, rather than the identification of potential controversial issues, which truly require attention for social sustainability. Furthermore, previous studies have focused on issue detection and tracking based on historical data. However, not all controversial issues are related to historical data that foster such cases. To address the above-mentioned research gap, Artificial Intelligence (AI) plays an essential role in issue detection at an early stage. In this study, an AI-based solution approach is proposed to resolve two practical problems in social media: (1) the impact caused by the number of fan pages on Facebook and (2) awareness of the levels of an issue. The proposed solution approach detects potential issues based on the popularity of public opinion in social media, using a Web crawler to collect daily posts related to issues in social media in a big data environment. Some analytical findings are obtained via the congregational rules proposed in this research, and the solution approach detects the attentive subjects in the early stages. A comparison of the proposed method with traditional methods is illustrated in the domain of green energy. The computational results demonstrate that the proposed approach is accurate and effective, and it therefore contributes significantly to boosting green energy deployment.
... PhraseNet [24] presents an architecture that follows a similar paradigm, using phrases instead of unigrams derived from microblog documents. EDCoW [40] and [9] present wavelet-based approaches. Some of the recent works in this area are [41], [25], [14] and [28]. ...
Preprint
A small survey on event detection using Twitter. This work first defines the problem statement, and then summarizes and collates the different research works towards solving the problem.
Conference Paper
A large amount of the social media data hosted on platforms like Twitter, Instagram, and Facebook is event-based and holds a substantial amount of real-world information. Event-based information can appear on any social media site in the form of news items, images, videos, audio clips, status updates, etc. The task of event detection refers to identifying data relevant to an event and classifying this relevant data into different event types. Traditional social media event detection techniques focused mainly on a single modality, as the data shared were mostly homogeneous. However, current social media data is multimodal and includes text, images, audio and video clips, and geolocations. Multimodal event detection techniques are essential for handling such heterogeneous data. Among all social media sites, Twitter is the most popular, as users share event-related short messages and photos in real time, generating several thousand tweets very frequently. In this paper, we provide a comprehensive survey of event detection from social media, especially from the widely used platform Twitter. The survey focuses mainly on research done on event detection using the two main modalities, single and multimodal. At the end of the paper, we discuss the relevance of multimodal event detection from social media data, which currently spans multiple dimensions.
Article
Introduction: The current evaluation processes of the burden of diabetes are incomplete and subject to bias. This study aimed to identify regional differences in the diabetes burden on a universal level from the perspective of people with diabetes. Research design and methods: We developed a worldwide online diabetes observatory based on 34 million diabetes-related tweets from 172 countries covering 41 languages, spanning from 2017 to 2021. After translating all tweets to English, we used machine learning algorithms to remove institutional tweets and jokes, geolocate users, identify topics of interest and quantify associated sentiments and emotions across the seven World Bank regions. Results: We identified four topics of interest for people with diabetes (PWD) in the Middle East and North Africa and another 18 topics in North America. Topics related to glycemic control and food are shared among six regions of the world. These topics were mainly associated with sadness (35% and 39% on average compared with levels of sadness in other topics). We also revealed several region-specific concerns (eg, insulin pricing in North America or the burden of daily diabetes management in Europe and Central Asia). Conclusions: The needs and concerns of PWD vary significantly worldwide, and the burden of diabetes is perceived differently. Our results will support better integration of these regional differences into diabetes programs to improve patient-centric diabetes research and care, focused on the most relevant concerns to enhance personalized medicine and self-management of PWD.
Article
Social media (Twitter, Facebook, WhatsApp, etc.) comprises internet-based applications that provide ways to create and exchange user-generated content. Twitter is a micro-blogging platform on which users share thoughts and opinions about different aspects. Sentiment analysis examines the sentiment expressed in a piece of text that conveys opinions towards a particular topic, product, etc. (positive, negative, or neutral). A primary issue is that previous techniques have biased classification accuracy due to non-balanced data distribution. Existing methods are applied to small datasets and cannot be generalized with the expected accuracy. The "curse of dimensionality" persists with the higher number of attributes in existing methods, along with weaker classification in non-linear contexts, limited use of transforms (kernels) for linearity in higher-dimensional spaces, and a lack of parameter tuning. In most real-world datasets the neutral class is very large. The proposed system is a text-mining framework for theme extraction from a Twitter opinion dataset. Learning models are built using the Support Vector Tool (SVT) classification method with a kernel trick, using unigram, bigram, and hybrid (unigram + bigram) features. Performance is obtained by tuning the internal parameters. The results show that the SVT linear kernel with hybrid features is the best classifier compared to the others, achieving maximum accuracy on the Twitter opinion dataset.
Chapter
Tracking news stories in documents is a way to deal with the large amount of information that surrounds us every day, to reduce the noise and to detect emergent topics in news. Since the Covid-19 outbreak, the world has known a new problem: infodemic. News article titles are massively shared on social networks, and the analysis of trends and growing topics is complex. Grouping documents into news stories lowers the number of topics to analyse and the information to ingest and/or evaluate. Our study proposes to analyse news tracking with the little information provided by titles on social networks. In this paper, we take advantage of datasets of public news article titles to experiment with news-tracking algorithms on short messages. We evaluate the clustering performance with a small amount of data per document. We examine document representation (sparse with TF-IDF and dense using Transformers [26]), its impact on the results, and why it is key to this type of work. We used a supervised algorithm proposed by Miranda et al. [22] and K-Means to provide evaluations for different use cases. We found that TF-IDF vectors are not always the best ones for grouping documents, and that the algorithms are sensitive to the type of representation. Knowing this, we recommend taking both aspects into account while tracking news stories in short messages. With this paper, we share all the source code and resources we handled. Keywords: Text Classification and Clustering; News; Social data.
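The sparse TF-IDF representation compared in that work can be sketched in plain Python. The tokenization and weighting variant (raw term frequency times log inverse document frequency) are common-default assumptions rather than the paper's exact setup:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> one {term: weight} dict per title."""
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return [{t: (c / len(d)) * math.log(n / df[t])
             for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Titles from the same story score higher than unrelated ones, which is what clustering algorithms such as K-Means exploit over these vectors.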
Article
Given the recent availability of large volumes of social media discussions, finding temporal unusual phenomena, which can be called events, from such data is of great interest. Previous works on social media event detection either assume a specific type of event, or assume certain behavior of observed variables. In this paper, we propose a general method for event detection on social media that makes few assumptions. The main assumption we make is that when an event occurs, affected semantic aspects will behave differently from their usual behavior, for a sustained period. We generalize the representation of time units based on word embeddings of social media text, and propose an algorithm to detect durative events in time series in a general sense. In addition, we also provide an incremental version of the algorithm for the purpose of real-time detection. We test our approaches on synthetic data and two real-world tasks. With the synthetic dataset, we compare the performance of retrospective and incremental versions of the algorithm. In the first real-world task, we use a novel setting to test if our method and baseline methods can exhaustively catch all real-world news in the test period. The evaluation results show that when the event is quite unusual with regard to the base social media discussion, it can be captured more effectively with our method. In the second real-world task, we use the event captured to help improve the accuracy of stock market movement prediction. We show that our event-based approach has a clear advantage compared to other ways of adding social media information.
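The core assumption above, sustained deviation from usual behavior, can be sketched as a simple run-length rule over a univariate time series. The baseline length, deviation threshold, and minimum duration below are illustrative assumptions, not the paper's method, which operates on word-embedding representations of time units:

```python
def durative_events(series, baseline_len=5, k=2.0, min_run=3):
    """Return (start, end) index spans where the series deviates from its
    baseline mean by more than k standard deviations for >= min_run steps."""
    base = series[:baseline_len]
    mean = sum(base) / len(base)
    sd = (sum((x - mean) ** 2 for x in base) / len(base)) ** 0.5 or 1.0
    events, run_start = [], None
    for t in range(baseline_len, len(series)):
        if abs(series[t] - mean) > k * sd:
            if run_start is None:
                run_start = t
        else:
            if run_start is not None and t - run_start >= min_run:
                events.append((run_start, t - 1))
            run_start = None
    if run_start is not None and len(series) - run_start >= min_run:
        events.append((run_start, len(series) - 1))
    return events
```

A brief one-step spike is ignored, while a deviation that persists for several steps is reported as a durative event span.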
Article
The article studies Turkey's twiplomacy: how Turkish diplomats use Twitter in performing their diplomatic outreach and public diplomacy. The literature review shows that there is a lack of a comprehensive large-N study of Turkey's twiplomacy. The article fills this gap by collecting and analyzing a data set of Twitter posts by 76 diplomats from 2010 to 2020. It helps understand how and to what extent Turkish diplomats maintain their presence on Twitter. We achieve this goal using two groups of methods. First, we derive descriptive statistics for several user metrics, including raw numbers of tweets per user and per date as well as retweet, reply, and like counts per user. Second, we analyze the content of tweets by calculating their sentiment scores. The main findings indicate that the Twitter presence of Turkish diplomats is relatively limited and reliant on a few prominent figures. Though Turkish diplomats are selected from well-educated individuals who can make the greatest use of the opportunities provided by social media, relatively few of them are active on Twitter. Another significant conclusion is that Turkey's twiplomacy is inconsistent and driven by individuals rather than being part of a wider strategy or framework. The online activities of different state institutions are not synchronized for efficient use of social media and so-called twiplomacy. Finally, according to the results of the content analysis, Turkish diplomats usually employ positive language in their tweets, as seen in the most frequently used terms, related emotions, and sentiment scores. This confirms the idea that Turkish diplomats tend to promote messages demonstrating Turkey's endorsement of international cooperation.
Chapter
The digital era has benefits in unearthing a large amount of imperative material. One such digital source is social media data, which, when processed, can give rise to information helpful to our society. One of the many things we can unearth from social media is events. Twitter is a very popular microblog that encompasses fruitful and rich information on real-world events and popular topics. Event detection for situational awareness in crisis response is an important need of the current world. The identification of tweets comprising information that may assist in help and rescue operations is crucial. The most pertinent features for this identification process are studied, and the inferences are given in this article. The efficiency and practicality of the features are discussed here. This article also presents the results of experiments carried out to assess the most relevant combination of features for improved performance in event detection from Twitter.
Chapter
This chapter provides a theoretical framework for burst detection, including its advantages, disadvantages, and other essential features. It further enumerates various open-source tools that can be used to conduct burst detection and discusses the use cases on how the information professionals can apply it in their daily lives. The chapter is followed by a case study using two different tools to demonstrate the application of burst detection in libraries.
Article
In this modern era, everything is computerized, and everyone has smart gadgets to communicate with others around the globe without range limitations. Most communication pathways belong to smart applications, call options in smartphones, and other channels, but e-mail is considered the main professional communication pathway, allowing business people as well as commercial and noncommercial organizations to communicate with one another and globally share important official documents and reports. This global pathway attracts many attackers and intruders, who generate false messages with attractive content and send them as e-mails to global users. Such unnecessary and unwanted advertisements or threatening mails are considered spam, which usually contains advertisements, promotions of a concern or institution, and so on. These mails are also called junk mails, which fall into the same category. In general, e-mail is the usual way of delivering messages for business as well as official needs, but in some cases voice instructions or messages must be transferred to the destination via the same e-mail pathway. This kind of voice-oriented e-mail is called voice mail. A voice mail is generally composed to deliver speech-based instructions or information to the receiver, to perform particular tasks or to convey important messages. A voice-mail-enabled system allows users to communicate with one another through speech input, so that the sender can communicate with the receiver via voice conversations used to deliver voice information to the recipient. 
These kinds of mails are usually generated using personal computers or laptops and exchanged via the general e-mail pathway, or via separate paid and unpaid mail gateways available to handle certain mail transactions. The above-mentioned e-mail spam has been considered in much past research, which attained some solutions, but for voice-based e-mail there are no options to manage such security parameters. In this paper, a hybrid data processing mechanism is applied to both text-enabled and voice-enabled e-mails, called Genetic Decision Tree Processing with Natural Language Processing (GDTPNLP). The proposed approach provides a way of identifying e-mail spam in both textual and speech-enabled e-mails. The proposed GDTPNLP approach provides a higher spam detection rate in terms of text extraction speed, performance, cost efficiency, and accuracy. All of these are explained in detail with graphical output views in the Results and Discussion.
Article
Linguistic landscapes are useful tools to decipher language ideologies that regulate public spaces in society, helping us to decode the semiotic messages that those landscapes transmit. Urban spaces also reveal social practices that organize people’s lives and unveil social discourses that legitimize, approve, erode, or eliminate different linguistic varieties that struggle to survive. This article examines the use of (mock) Lunfardo, a Spanish urban variety spoken in the Rio de la Plata area, Argentina, in a sign posted by the Buenos Aires’ city authorities and the impact this sign had on social media. The results of the analysis show that appealing to Lunfardo as a symbol of identity failed to establish a conversation between parties within a separated, fractured society.
Full-text available
Article
Twitter is one of the largest online platforms where people exchange information. In the first few years after its emergence, researchers explored ways to use Twitter data in various decision-making scenarios and showed promising results. In this review, we examine 28 newer papers published in the last five years (since 2016) that continued to advance Twitter-aided decision making. The application scenarios we cover include product sales prediction, stock selection, crime prevention, epidemic tracking, and traffic monitoring. We first discuss the findings presented in these papers, that is, how much decision-making performance has been improved with the help of Twitter data. Then we offer a methodological analysis that considers four aspects of the methods used in these papers: problem formulation, solution, Twitter features, and information transformation. This methodological analysis aims to enable researchers and decision makers to see the applicability of Twitter-aided methods in different application domains or platforms.
Full-text available
Article
Social media have become a reliable place for individuals to generalize, voice, and channel their ideas on any matter, from politics and the global economy to industrialization and new reforms. People's opinions are processed as data to analyse trends and propaganda on the political stage. Integrating individual views to polarize the context of a subject circulated among a large crowd is termed sentiment analysis. This paper targets VADER (Valence Aware Dictionary and sEntiment Reasoner), which uses rule-based categorization to produce labels that enhance sentiment outcomes. While sentiment analysis is carried out on many fronts, including volume-based, opinion-based, and network-based analysis, we focus on enhancing lexicon classification to strengthen the foundation for better performance. The VADER system is aimed at matching the LIWC (Linguistic Inquiry and Word Count) format so as to increase efficiency, improve the MSE score, and enhance F1 regularization.
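A lexicon-and-rules scorer in VADER's spirit can be sketched minimally. The tiny lexicon, its valence values, and the single negation rule below are invented for illustration; the real VADER lexicon contains thousands of human-rated terms and several additional heuristics (intensifiers, punctuation, capitalization).

```python
# Hypothetical mini-lexicon with made-up valence values; the real VADER
# lexicon ships roughly 7,500 human-rated terms.
LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -3.4,
           "love": 3.2, "hate": -2.7}
NEGATIONS = {"not", "never", "no"}

def score(text):
    """Sum lexicon valences over the tokens of a post, flipping the
    sign of a term that directly follows a negation word (a crude
    stand-in for VADER-style negation handling)."""
    tokens = text.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        v = LEXICON.get(tok, 0.0)
        if v and i > 0 and tokens[i - 1] in NEGATIONS:
            v = -v  # "not great" counts as negative
        total += v
    return total
```

A positive total suggests positive sentiment, a negative total the opposite; out-of-lexicon tokens contribute nothing, which is also how lexicon-based methods behave on unseen slang.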
Chapter
Event detection has proved important in various applications, such as selecting a route to avoid the congestion an event causes, or deciding whether to join an event one is interested in. While geotagged tweets are popular sources of information for event detection, they are usually too scarce for accurate detection. Non-geotagged tweets, on the other hand, are more abundant but include much noise that also deters accurate event detection. In this work, we aimed to enhance detection performance by combining aggregated smartphone GPS data and non-geotagged tweets. We propose a novel method that detects events based on deviations from inferred normal human mobility, selects event-related topics that correlate with human mobility, and extracts event-relevant tweets by scoring each tweet according to its relevance to an event. The relevance of each tweet is gauged from the tweet's meaning and posting time. We conducted empirical evaluations using data that include multiple events, such as a baseball game and airport congestion. Our proposed method detected 9 out of 10 events regardless of their type and scale, attesting to an improvement over the geotag-based method. We also confirmed that our model was able to extract the essential event-relevant tweets with an average accuracy of over 90%.
Full-text available
Conference Paper
In these times of increasing cybersecurity threats, monitoring and analysing cybersecurity events in a timely and effective way is key to promoting social media security. Twitter is one of the world's most widely used social media platforms, where users can share their preferences, images, opinions, and events. The Twitter platform can promptly aggregate cyber-related events and provide a source of information about cyber threats. Likewise, deep learning can play a critical role in helping social media providers achieve a more accurate assessment of cybersecurity threats. In this paper, we review various threats and discuss deep learning techniques for detecting cybersecurity threats on Twitter.
Chapter
The publicly available data, such as the massive and dynamically updated news and social media data streams (a.k.a. big data), cover a wide range of social activities, personal views, and expressions. Effective research and application rely heavily on the ability of comprehending and discovering the knowledge patterns underlying this big data, from which the notion of an event serves as a cornerstone in building up more complex knowledge structures. Establishing methodologies and techniques for discovering real-world events from such large amounts of data, as well as for managing and analyzing such events in an efficient and aesthetic manner, is crucial and challenging. In this paper, we present an event cube framework devised to support various collection, consolidation, fusion, and analysis tasks for suicidal events. More specifically, we present a mechanism for data collection over multiple data sources in both passive and active manners, and promote the mappings constructed from various representation spaces for data consolidation. Furthermore, multimodal fusion is devised to integrate multiple data intrinsic structures and learn discriminative data representations so as to process heterogeneous multimodal data efficiently. Finally, the event cube model is developed to support event organization and contextualization with hierarchical and analytical operations. A case study is provided to demonstrate the capabilities and benefits of our event cube facilities supporting on-line analytical processing of suicidal events and their relationships.
Article
The tremendous growth of event dissemination over social networks makes it very challenging to accurately discover and track exciting events, as well as their evolution and scope over space and time. People have migrated to social platforms and messaging apps, which represent an opportunity to create more accurate predictions of social developments by translating event-related streams into meaningful insights. However, the huge spread of 'noise' from unverified social media sources makes it difficult to accurately detect and track events. Over the last decade, multiple surveys on event detection from social media have been presented, with the aim of highlighting the different NLP, data management, and machine learning techniques used to discover specific types of events, such as social gatherings, natural disasters, and emergencies, among others. However, these surveys focus on only a few dimensions of event detection, such as emphasizing knowledge discovery from a single modality or a single social media platform, or applying only to one specific language. In this survey paper, we introduce multiple perspectives for event detection in the big social data era. This survey thoroughly investigates and summarizes the significant progress in social event detection and visualization techniques, emphasizing crucial challenges ranging from the management, fusion, and mining of big social data, to the applicability of these methods to different platforms, to multiple languages and dialects rather than a single language, and to multiple modalities. The survey also focuses on advanced features required for event extraction, such as spatial and temporal scopes, location inference from multi-modal data (i.e., text or image), and semantic analysis. Application-oriented challenges and opportunities are also discussed. Finally, quantitative and qualitative experimental procedures and results are presented to illustrate the effectiveness of, and gaps in, existing works.
Chapter
This is an overview paper of the NLPCC 2021 shared task on AutoIE2, which aims to evaluate sub-event identification systems under limited annotated data. Given definitions of specific sub-events, 100K unannotated samples, and 300 annotated seed samples, participants are required to build a sub-event identification system. 30 teams registered and 14 of them submitted results. The top system achieves an 8.43% and 8.25% accuracy improvement over the baseline system with and without extra annotated data, respectively. The evaluation result indicates that it is possible to build a sub-event identification system with less human annotation and large unlabeled corpora. All information about this task can be found at https://github.com/IIGROUP/AutoIE2.
Full-text available
Chapter
Event detection on social media has attracted a considerable amount of research, given the recent availability of large volumes of social media discussions. Previous works on social media event detection either assume a specific type of event or assume certain behavior of observed variables. In this paper, we propose a general method for event detection on social media that makes few assumptions. The main assumption we make is that when an event occurs, affected semantic aspects will behave differently from their usual behavior. We generalize the representation of time units based on word embeddings of social media text, and propose an algorithm to detect events in time series in a general sense. In the experimental evaluation, we use a novel setting to test whether our method and baseline methods can exhaustively catch all real-world news in the test period. The evaluation results show that when an event is quite unusual with regard to the base social media discussion, it can be captured more effectively with our method. Our method can be easily implemented and can be treated as a starting point for more specific applications.
Full-text available
Article
The widespread popularity of social networking is leading to the adoption of Twitter as an information dissemination tool. Existing research has shown that information dissemination over Twitter has a much broader reach than traditional media and can be used for effective post-incident measures. People use informal language on Twitter, including acronyms, misspelled words, synonyms, transliteration, and ambiguous terms. This makes incident-related information extraction a non-trivial task. However, this information can be valuable for public safety organizations that need to respond in an emergency. This paper proposes an early event-related information extraction and reporting framework that monitors Twitter streams, synthesizes event-specific information, e.g., a terrorist attack, and alerts law enforcement, emergency services, and media outlets. Specifically, the proposed framework, Tweet-to-Act (T2A), employs word embedding to transform tweets into a vector space model and then utilizes the Word Mover's Distance (WMD) to cluster tweets for the identification of incidents. To extract reliable and valuable information from a large dataset of short and informal tweets, the proposed framework employs sequence labeling with bidirectional Long Short-Term Memory based Recurrent Neural Networks (bLSTM-RNN). Extensive experimental results suggest that our proposed framework, T2A, outperforms other state-of-the-art methods that use vector space modeling and distance calculation techniques, e.g., Euclidean and Cosine distance. T2A achieves an accuracy of 96% and an F1-score of 86.2% on real-life datasets.
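As a point of reference for the vector-space baselines T2A is compared against, cosine distance over raw term-frequency vectors can be sketched in a few lines. This is the baseline distance measure, not WMD or the bLSTM-RNN labeler itself, and the tokenization is deliberately naive.

```python
import math
from collections import Counter

def cosine_distance(text_a, text_b):
    """1 - cosine similarity between the term-frequency vectors of two
    tweets (the vector-space baseline measure, not WMD itself)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)          # shared-term overlap
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0   # empty text -> max distance
```

Identical tweets score near 0, tweets with no shared vocabulary score 1.0; WMD improves on this by letting semantically close but lexically different words ("blast" vs. "explosion") count as near matches via embeddings.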
Full-text available
Article
A key challenge in mining social media data streams is to identify events that are actively discussed by a group of people in a specific local or global area. Such events are useful for early warning of accidents, protests, elections, or breaking news. However, neither the list of events nor the resolution of event time and space is fixed or known beforehand. In this work, we propose an online spatio-temporal event detection system using social media that is able to detect events at different time and space resolutions. First, to address the unknown spatial resolution of events, a quad-tree method is exploited to split the geographical space into multiscale regions based on the density of social media data. Then, a statistical unsupervised approach is applied that involves the Poisson distribution and a smoothing method to highlight regions with an unexpected density of social posts. Further, event duration is precisely estimated by merging events happening in the same region at consecutive time intervals. A post-processing stage is introduced to filter out events that are spam, fake, or wrong. Finally, we incorporate simple semantics by using social media entities to assess the integrity and accuracy of detected events. The proposed method is evaluated using different social media datasets, Twitter and Flickr, for different cities: Melbourne, London, Paris, and New York. To verify the effectiveness of the proposed method, we compare our results with two baseline algorithms based on a fixed split of geographical space and a clustering method. For performance evaluation, we manually compute recall and precision. We also propose a new quality measure named strength index, which automatically measures how accurate the reported event is.
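The Poisson step above, flagging regions with an unexpected density of posts, can be illustrated with a minimal tail-probability test: under a historical rate λ for a region and time slot, an observed count k is surprising when P(X ≥ k) is very small. The significance level and the counts below are made-up examples, not values from the paper.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - CDF(k - 1)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def is_unexpected(observed, historical_rate, alpha=0.001):
    """Flag a region/time cell whose observed post count is very
    unlikely under its historical Poisson rate (alpha is an
    illustrative significance level)."""
    return poisson_tail(observed, historical_rate) < alpha
```

A region that historically averages 5 posts per slot but suddenly receives 30 would be flagged, while 6 posts would not; in practice the rate would also be smoothed over neighbouring cells, as the abstract describes.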
Full-text available
Article
Social networks are real-time platforms formed by users engaging in conversations and interactions. This phenomenon of the new information era results in a huge amount of data in different forms and modalities, such as text, images, videos, and voice. Data with such characteristics are also known as big data with the 5-V properties, and in some cases are referred to as social big data. To extract useful information from such valuable data, many researchers have addressed its different aspects across modalities. In the case of text, NLP researchers have conducted many studies to extract valuable information such as topics. Many enlightening works on social media platforms like Twitter have addressed the problem of finding important topics from different angles and utilized it to propose solutions for diverse use cases. The importance of Twitter in this scope lies in its content and the behavior of its users: it is known as a first-hand news-reporting platform, used to break news ranging from political commentary to catastrophes. In this review article, we cover more than 50 research articles in the scope of topic detection from Twitter, including deep learning-based methods.
Article
Social media is growing at an explosive rate and it becomes increasingly difficult for users to locate useful information from massive and high-velocity social media data. Recently a few social media sites have employed timelines to organize historical data of entities, which greatly improve user experiences in rediscovering important timeline episodes and understanding their order and trends. However, timelines of entities are not explicitly available in most social media sites. In other words, a gap exists between the importance of timelines and their availability in social media. In this paper, we investigate the problem of mining timelines of entities in social media. We delineate its challenges and opportunities, and propose a principled framework Timeliner, which can automatically generate timelines for entities by exploiting their historical social media data. We conduct experiments on real-world datasets, and the experimental results demonstrate that Timeliner can accurately mine timelines of entities in social media.
Full-text available
Chapter
Our physical world is being projected into online cyberspace at an unprecedented rate. People nowadays visit different places and leave behind them million-scale digital traces such as tweets, check-ins, Yelp reviews, and Uber trajectories. Such digital data are a result of social sensing: namely people act as human sensors that probe different places in the physical world and share their activities online. The availability of massive social-sensing data provides a unique opportunity for understanding urban space in a data-driven manner and improving many urban computing applications, ranging from urban planning and traffic scheduling to disaster control and trip planning. In this chapter, we present recent developments in data-mining techniques for urban activity modeling, a fundamental task for extracting useful urban knowledge from social-sensing data. We first describe traditional approaches to urban activity modeling, including pattern discovery methods and statistical models. Then, we present the latest developments in multimodal embedding techniques for this task, which learns vector representations for different modalities to model people's spatiotemporal activities. We study the empirical performance of these methods and demonstrate how data-mining techniques can be successfully applied to social-sensing data to extract actionable knowledge and facilitate downstream applications.
Article
Online social media networks are gaining attention worldwide, with an increasing number of people relying on them to connect, communicate, and share pertinent event-related information from their daily lives. Event detection now increasingly leverages online social networks to highlight events happening around the world via the Internet of People. In this paper, a novel Event Detection model based on Scoring and Word Embedding (ED-SWE) is proposed for discovering key events from a large volume of data streams of tweets and for generating an event summary using key words and top-k tweets. The proposed ED-SWE model can distill high-quality tweets, reduce the negative impact of spam, and identify latent events in the data streams automatically. Moreover, a word embedding algorithm is used to learn a real-valued vector representation for a predefined fixed-size vocabulary from a corpus of Twitter data. In order to further improve the performance of the Expectation-Maximization (EM) iteration algorithm, a novel initialization method based on the authority values of the tweets is also proposed in this paper to detect live events efficiently and precisely. Finally, a novel automatic identification method based on the cosine measure is used to evaluate whether a given topic can form a live event. Experiments conducted on a real-world dataset demonstrate that the ED-SWE model exhibits better efficiency and accuracy than several state-of-the-art event detection models.
Article
As the smart-city trend, especially artificial intelligence, data science, and the Internet of Things, has attracted much attention, many researchers have created smart applications for improving people's quality of life. Since automatically collecting and exploiting information is essential in the era of Industry 4.0, a variety of models have been proposed for solving storage problems and for efficient data mining. In this paper, we present our proposed system, the Trendy Keyword Extraction System (TKES), which is designed to extract trendy keywords from text streams; it also supports storing, analyzing, and visualizing documents arriving on those streams. The system first automatically collects daily articles, then ranks the importance of keywords by their frequency of occurrence in order to find trendy keywords, using a Burst Detection Algorithm proposed in this paper based on Kleinberg's idea. A burst is defined as a period of time when a keyword is continuously and unusually popular in the text stream, and identifying bursts is known as the burst detection procedure. The results of user requests can be displayed visually. Furthermore, we create a method for finding a trendy keyword set, defined as a set of keywords that belong to the same burst. This work also describes the datasets used for our experiments and processing-speed tests of our two proposed algorithms.
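Kleinberg's method models bursts with a two-state automaton over arrival rates. As a much-simplified illustration of the same idea, one can flag days whose keyword count exceeds the series mean by k standard deviations and merge consecutive flagged days into burst intervals; the threshold rule below is a stand-in sketch, not the algorithm TKES actually proposes.

```python
import statistics

def detect_bursts(counts, k=2.0):
    """Return (start, end) index pairs of bursty periods in a daily
    keyword-count series: days whose count exceeds mean + k*stdev,
    with consecutive bursty days merged into one interval."""
    mean = statistics.fmean(counts)
    sd = statistics.pstdev(counts)
    threshold = mean + k * sd
    bursts, start = [], None
    for i, c in enumerate(counts):
        if c > threshold and start is None:
            start = i                     # burst opens
        elif c <= threshold and start is not None:
            bursts.append((start, i - 1))  # burst closes
            start = None
    if start is not None:                  # series ends mid-burst
        bursts.append((start, len(counts) - 1))
    return bursts
```

A "trendy keyword set" in the paper's sense would then group keywords whose detected intervals coincide; Kleinberg's automaton improves on this sketch by assigning a cost to state transitions, which suppresses one-day blips.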
Chapter
This chapter introduces the application of information cascade analysis in social networks. We present a deep learning-based framework for social network information cascade analysis and show the challenges of applying the MDATA model. The phenomenon of information dissemination in social networks is widespread, and Social Network Information Cascade Analysis (SNICA) aims to acquire valuable knowledge from the process of information dissemination in social networks. As the number, volume, and resolution of social network data increase rapidly, traditional social network data analysis methods, especially methods for analyzing social network graph (SNG) data, have become overwhelmed in SNICA. At the same time, the MDATA model fuses data from multiple sources in a graph, which can be applied to SNICA problems. Recently, deep learning models have changed this situation, achieving success in SNICA with their powerful implicit feature-extraction capabilities. This chapter provides a comprehensive survey of recent progress in applying deep learning techniques to SNICA.
Article
Catastrophic events create uncertain situations for humanitarian organizations locating and providing aid to affected people. Many people turn to social media during disasters to request help and/or provide relief to others. However, the majority of social media posts seeking help cannot be properly detected and remain concealed, because they are often noisy and ill-formed. Existing systems lack an effective strategy for preprocessing tweets and grasping their contexts. This research first formally defines request tweets in the context of social networking sites, hereafter rweets, along with their primary types and sub-types. Our main contributions are the identification and categorization of rweets. For rweet identification, we employ two approaches, a rule-based one and logistic regression, and show their high precision and F1 scores. The classification of rweets into subtypes such as medical, food, and shelter using logistic regression shows promising results and outperforms existing works. Finally, we introduce an architecture that stores intermediate data to accelerate the development of the machine learning classifiers.
Full-text available
Article
In the current age of overwhelming information and massive production of textual data on the Web, Event Detection has become an increasingly important task in various application domains. Several research branches have been developed to tackle the problem from different perspectives, including Natural Language Processing and Big Data analysis, with the goal of providing valuable resources to support decision-making in a wide variety of fields. In this paper, we propose a real-time domain-specific clustering-based event-detection approach that integrates textual information coming, on one hand, from traditional newswires and, on the other hand, from microblogging platforms. The goal of the implemented pipeline is twofold: (i) providing insights to the user about the relevant events that are reported in the press on a daily basis; (ii) alerting the user about potentially important and impactful events, referred to as hot events, for some specific tasks or domains of interest. The algorithm identifies clusters of related news stories published by globally renowned press sources, which guarantee authoritative, noise-free information about current affairs; subsequently, the content extracted from microblogs is associated to the clusters in order to gain an assessment of the relevance of the event in the public opinion. To identify the events of a day d, the algorithm dynamically builds a lexicon by looking at news articles and stock data of previous days up to d − 1. Although the approach can be extended to a variety of domains (e.g. politics, economy, sports), we hereby present a specific implementation in the financial sector. We validated our solution through a qualitative and quantitative evaluation, performed on the Dow Jones' Data, News and Analytics dataset, on a stream of messages extracted from the microblogging platform Stocktwits, and on the Standard & Poor's 500 index time-series. 
The experiments demonstrate the effectiveness of our proposal in extracting meaningful information from real-world events and in spotting hot events in the financial sphere. An added value of the evaluation is given by the visual inspection of a selected number of significant real-world events, starting from the Brexit Referendum and reaching until the recent outbreak of the Covid-19 pandemic in early 2020.
Chapter
The ever-increasing popularity of social media platforms has transformed the way in which information is shared during disasters and mass emergencies. Information that emanates from social media, especially in the early hours of a disaster when little-to-no information is available from other traditional sources, can be extremely valuable for emergency responders and decision makers to gain situational awareness and plan relief efforts. To capitalize on this potential, extensive research and development activities have been conducted over the last decade to build technologies to support various humanitarian aid tasks. In this paper, we provide an overview of the literature on using artificial intelligence and social media for disaster response and management from three perspectives: datasets, research studies, and systems. Then, we present further discussion on open research problems and future directions in the crisis informatics domain.
Article
To determine the order in which to display web pages, the search engine Google computes the PageRank vector, whose entries are the PageRanks of the web pages. The PageRank vector is the stationary distribution of a stochastic matrix, the Google matrix. The Google matrix in turn is a convex combination of two stochastic matrices: one matrix represents the link structure of the web graph, and a second, rank-one matrix mimics the random behaviour of web surfers and can also be used to combat web spamming. As a consequence, PageRank depends mainly on the link structure of the web graph, but not on the contents of the web pages. We analyze the sensitivity of PageRank to changes in the Google matrix, including addition and deletion of links in the web graph. Due to the proliferation of web pages, the dimension of the Google matrix most likely exceeds ten billion. One of the simplest and most storage-efficient methods for computing PageRank is the power method. We present error bounds for the iterates of the power method and for their residuals.
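The power method described in the abstract can be sketched in a few lines; the damping factor 0.85, the tolerance, and the toy three-page graph below are illustrative choices, not details from the paper.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-method PageRank. links: dict node -> list of out-neighbours.
    Iterates r <- damping * P^T r + (1 - damping)/n until the residual
    (1-norm change between iterates) falls below tol."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        new = {u: (1.0 - damping) / n for u in nodes}  # teleportation part
        for u in nodes:
            out = links[u]
            if out:   # distribute rank along outgoing links
                share = damping * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:     # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        residual = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if residual < tol:  # the quantity the paper's error bounds control
            break
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph page "c" collects rank from both "a" and "b", so it ends up ranked above "b"; the iterates always sum to 1 because each step multiplies by a stochastic matrix.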