Article

Topic category analysis on Twitter via cross-media strategy

Springer Nature
Multimedia Tools and Applications

Abstract

The growing popularity of social media provides a huge volume of social data, including Tweets. These collections of social data are potentially useful, but the extent of meaningful data within them has not been sufficiently researched, especially for South Korean Twitter data, which has generally been studied as a source of political media. Previous research, however, has not adequately addressed which major topic categories, such as politics, economics, or sports, dominate South Korean Twitter. In this paper, we present a cross-media approach that characterizes South Korean Tweets by inferring their topic category distribution through short-text categorization. We select newspapers as the cross-media source, examine the categorization of news articles from major newspapers, and train our classifier on features drawn from each topic category. In addition, to graft news topics onto South Korean Tweets, we propose a word clustering and filtering approach that excludes words carrying no semantic content for the topic categories. Based on these procedures, we analyze South Korean Tweets to determine the primary topic category focus of Twitter users, and we observe distinctive behaviors of South Korean Twitter users across parameters such as date, time slot, and day of the week. Because our research provides a macroscopic analysis of Twitter data using a cross-media strategy, it can also serve as a useful resource for other social media analyses.
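As a rough, self-contained sketch of the cross-media idea described above, one could train a text classifier on category-labeled news articles and apply it to tweets. The pipeline below uses scikit-learn with an invented toy corpus; it is not the authors' implementation.

```python
# Hedged sketch: learn topic categories from labeled news articles, then
# apply the classifier to short tweets. Corpus and labels are toy stand-ins
# for the major-newspaper sections used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

news_texts = ["parliament passed the budget bill today",
              "the striker scored twice in the final"]
news_labels = ["politics", "sports"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # bag-of-words features per article
    ("svm", LinearSVC()),           # linear classifier over topic categories
])
clf.fit(news_texts, news_labels)

# The paper additionally clusters and filters words with no topical content
# before classifying tweets; that step is omitted in this sketch.
print(clf.predict(["the election results surprised analysts"]))
```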


... The results revealed that the keyword-extraction technique using cross-domain filtering can be applied to documents, but the study did not evaluate its effectiveness or performance. Cho et al. [19] investigated topic category analysis in news articles, excluded special characters and numbers, and applied stemming for morpheme analysis; in the end, only nouns were used for topic categorization. ...
Article
Full-text available
Natural language processing refers to the ability of computers to understand text and spoken words much as humans do. Recently, various machine learning techniques have been used to encode large amounts of text and successfully decode textual feature vectors. However, understanding low-resource languages is still at an early stage of research. In particular, Korean, an agglutinative language, needs sophisticated preprocessing steps such as morphological analysis. Since morphological analysis in preprocessing significantly influences classification results, ideal and optimized morphological analyzers must be used. This study explored five state-of-the-art morphological analyzers for Korean news articles and categorized their topics into seven classes using term frequency–inverse document frequency and light gradient boosting machine frameworks. A morphological analyzer based on unsupervised learning processed 500,899 tokens in 6 s, 72 times faster than the slowest analyzer (432 s). In addition, a morphological analyzer using dynamic programming achieved a topic categorization accuracy of 82.5%, which is 9.4% higher than that achieved with the hidden Markov model (73.1%) and 13.4% higher than the baseline (69.1%) with no morphological analyzer. This study provides insight into how each morphological analyzer extracts morphemes from sentences and how this affects topic categorization in news articles.
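As a loose illustration of the TF-IDF plus LightGBM setup this abstract describes, the sketch below uses scikit-learn and the lightgbm package on an invented English toy corpus; the study itself works on morphologically analyzed Korean news.

```python
# Hedged sketch of the TF-IDF + LightGBM topic categorization pipeline;
# the toy English corpus stands in for the ~500k-token Korean news data,
# where morphemes (not raw words) would be the tokens.
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks rallied on strong earnings",
        "the team won the championship game",
        "lawmakers debated the new budget bill",
        "the film opened to strong reviews"]
labels = ["economy", "sports", "politics", "culture"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

clf = lgb.LGBMClassifier(min_child_samples=1)  # relaxed for this tiny corpus
clf.fit(X, labels)
print(clf.predict(vec.transform(["a late goal decided the match"])))
```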
... accessed on 12 December 2022). In the past, several authors investigated eWoM for the first two of these social platforms [7][8][9][10][11][12], whereas eWoM for Reddit has been little investigated [13][14][15][16]. ...
Article
Full-text available
Electronic Word of Mouth (eWoM) has been largely studied for social platforms, such as Yelp and TripAdvisor, which are highly investigated in the context of digital marketing. However, it can also have interesting applications in other contexts. Therefore, it can be challenging to investigate this phenomenon on generic social platforms, such as Facebook, Twitter, and Reddit. In the past literature, many authors analyzed eWoM on Facebook and Twitter, whereas it was little considered in Reddit. In this paper, we focused exactly on this last platform. In particular, we first propose a model for representing and evaluating the eWoM Power of Reddit posts. Then, we illustrate two possible applications, namely the definition of lifespan templates and the construction of profiles for Reddit posts. Lifespan templates and profiles are ultimately orthogonal to each other and can be jointly employed in several applications.
... The paper entitled "Topic category analysis on Twitter via cross-media strategy," by Cho et al. [5], proposes a cross-media approach to define the nature of South Korean Tweets by inferring the topic category distribution through short-text categorization. To do that, they select newspapers as cross-media, examine the categorization of news articles from major newspapers, and then train the classifier based on the features from each topic category. ...
... Thus, they train a Naive Bayes classifier using NLTK [3], handling the class-imbalance problem in two ways: shrinking the larger classes down to the size of the smallest class by removing random posts, and growing the smaller classes up to the size of the larger ones by repeating posts from those classes. Recent work has used external textual resources to support topic categorization in microblogs, as in [6], which uses a news database. Those news articles are labeled with their respective categories and used to classify South Korean tweets into the same categories by means of an SVM classifier. ...
Conference Paper
Full-text available
Nowadays, social networks have become major objects of study, since they expose a wide range of information about the tastes, interests, desires, and opinions of their users. Identifying reputation dimensions is a digital reputation management task that consists of segmenting a set of opinions about an entity into reputation dimensions; these dimensions reflect affective and cognitive perceptions of the entity by different groups of people. Supervised techniques have been proposed for this task, but they become impractical in real applications because they depend on training examples that are usually labeled manually. This work presents a new unsupervised method, based on topic modeling, for identifying reputation dimensions in microblogs. Since most topic modeling algorithms perform poorly at identifying these dimensions in short texts, the proposed approach aims to improve that performance. The method was evaluated on the RepLab 2014 challenge dataset and outperformed the winner of that challenge, a supervised method, while requiring no manually labeled training examples.
Article
Full-text available
This paper examines communication patterns of key Twitter users by considering the socially and politically controversial Sejong City issue in South Korea. The network and message data were drawn from twtkr.com. Social network-based indicators and visualization methods were used to analyze political discourse among key Twitter users over time and illustrate various types of Tweets by these users and the interconnection between these key users. In addition, the study examines general Twitter users' participation in the discussion on the issue. The results indicate that some Twitter profiles of media outlets tend to be very dominant in terms of their message output, whereas their Tweets are not likely to be circulated by other users. Noteworthy is that Twitter profiles of individuals who are geographically affiliated with the issue are likely to play an important role in the flow of communication.
Article
Full-text available
Social networks such as Facebook, LinkedIn, and Twitter have been a crucial source of information for a wide spectrum of users. In Twitter, popular information that is deemed important by the community propagates through the network. Studying the characteristics of content in the messages becomes important for a number of tasks, such as breaking news detection, personalized message recommendation, friends recommendation, sentiment analysis and others. While many researchers wish to use standard text mining tools to understand messages on Twitter, the restricted length of those messages prevents them from being employed to their full potential. We address the problem of using standard topic models in micro-blogging environments by studying how the models can be trained on the dataset. We propose several schemes to train a standard topic model and compare their quality and effectiveness through a set of carefully designed experiments from both qualitative and quantitative perspectives. We show that by training a topic model on aggregated messages we can obtain a higher quality of learned model which results in significantly better performance in two real-world classification problems. We also discuss how the state-of-the-art Author-Topic model fails to model hierarchical relationships between entities in Social Media.
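The aggregation scheme studied here can be approximated with gensim by pooling each author's tweets into one pseudo-document before fitting a standard LDA model, as in the hedged sketch below (toy data, arbitrary topic count).

```python
# Rough illustration of message aggregation before topic modeling: pool each
# user's tweets into one pseudo-document, then train standard LDA.
from collections import defaultdict
from gensim import corpora, models

tweets = [("alice", "election polls open today"),
          ("alice", "senate vote expected soon"),
          ("bob",   "great match and a late goal")]

by_user = defaultdict(list)
for user, text in tweets:
    by_user[user].extend(text.split())        # one token list per author

texts = list(by_user.values())
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```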
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
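For intuition, LDA's generative story can be simulated directly; the snippet below draws per-document topic mixtures and words in numpy, with toy dimensions and priors, and is unrelated to the paper's variational inference code.

```python
# Toy simulation of LDA's generative process: a Dirichlet topic mixture per
# document, a topic drawn per word position, then a word drawn per topic.
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha, beta = 3, 10, 0.5, 0.1          # topics, vocab size, priors
phi = rng.dirichlet([beta] * V, size=K)      # word distribution per topic

def generate_doc(n_words):
    theta = rng.dirichlet([alpha] * K)       # topic mixture for this document
    topics = rng.choice(K, size=n_words, p=theta)
    return [rng.choice(V, p=phi[z]) for z in topics]

print(generate_doc(8))                       # word ids for one synthetic doc
```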
Conference Paper
Full-text available
Twitter as a microblogging platform has vast potential to become a collective source of intelligence that can be used to obtain opinions, ideas, facts, and sentiments. This paper addresses the issue of collective intelligence retrieval with activated knowledge-base decision making. Our methodology differs from the existing literature in that we analyze Twitter microblog messages, as opposed to the traditional blog analysis in the literature, which deals with the conventional 'blogosphere'. Another key difference is that we apply visualization techniques in conjunction with artificial intelligence-based data mining methods to classify messages dealing with the trending topic. Our methodology also analyzes the demographics of the authors of such Twitter messages and attempts to map a Twitter trend onto what is going on in the real world. Our findings reveal a pattern behind trends on Twitter, enabling us to see through visualization methods how a trend 'ticks' and evolves. They also help us understand the underlying characteristics of the 'trend setters', providing a new perspective on the contributors to a trend.
Conference Paper
Full-text available
Twitter as a new form of social media can potentially contain much useful information, but content analysis on Twitter has not been well studied. In particular, it is not clear whether, as an information source, Twitter can simply be regarded as a faster news feed that covers mostly the same information as traditional news media. In this paper we empirically compare the content of Twitter with a traditional news medium, the New York Times, using unsupervised topic modeling. We use a Twitter-LDA model to discover topics from a representative sample of all of Twitter. We then use text mining techniques to compare these Twitter topics with topics from the New York Times, taking into consideration topic categories and types. We also study the relation between the proportions of opinionated tweets and retweets and topic categories and types. Our comparisons show interesting and useful findings for downstream IR or DM applications.
Conference Paper
Full-text available
This paper presents a general framework for building classifiers that deal with short and sparse text & Web segments by making the most of hidden topics discovered from large-scale data collections. The main motivation of this work is that many classification tasks working with short segments of text & Web, such as search snippets, forum & chat messages, blog & news feeds, product reviews, and book & movie summaries, fail to achieve high accuracy due to the data sparseness. We, therefore, come up with an idea of gaining external knowledge to make the data more related as well as expand the coverage of classifiers to handle future data better. The underlying idea of the framework is that for each classification task, we collect a large-scale external data collection called "universal dataset", and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that data collection. The framework is general enough to be applied to different data domains and genres ranging from Web search results to medical text. We did a careful evaluation on several hundred megabytes of Wikipedia (30M words) and MEDLINE (18M words) with two tasks: "Web search domain disambiguation" and "disease categorization for medical text", and achieved significant quality enhancement.
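The framework's core move, inferring hidden topics from a large external corpus and appending them as features for a small labeled set, might look like the following gensim sketch; the "universal" corpus here is a toy placeholder.

```python
# Hedged sketch: fit LDA on a large external corpus, then use the inferred
# topic distribution of each short text as extra (dense) features alongside
# its sparse word features.
import numpy as np
from gensim import corpora, models

universal = [["election", "vote", "senate"], ["goal", "match", "league"],
             ["vote", "ballot", "election"], ["league", "season", "goal"]]
dic = corpora.Dictionary(universal)
lda = models.LdaModel([dic.doc2bow(d) for d in universal],
                      num_topics=2, id2word=dic)

def topic_features(tokens):
    dist = lda.get_document_topics(dic.doc2bow(tokens), minimum_probability=0)
    return np.array([p for _, p in dist])     # dense K-dim topic vector

print(topic_features(["election", "goal"]))   # appended to word features
```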
Conference Paper
Full-text available
This paper presents online topic model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the latent Dirichlet allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data, with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA discovered interesting patterns by analyzing just a fraction of the data at a time. Our tests also prove the ability of OLDA to align topics across epochs, capturing their evolution over time. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
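Gensim's LdaModel exposes an incremental update that loosely mirrors this online setting, as in the hedged sketch below; the paper's OLDA additionally maintains a dynamic vocabulary and epoch alignment, which gensim does not provide out of the box.

```python
# Incremental LDA updates over arriving batches; note the vocabulary here is
# fixed up front, unlike the dynamic vocabulary in the paper.
from gensim import corpora, models

batch1 = [["budget", "vote", "senate"], ["match", "goal", "league"]]
batch2 = [["quake", "rescue", "relief"]]     # a later epoch with a new theme

dictionary = corpora.Dictionary(batch1 + batch2)
lda = models.LdaModel([dictionary.doc2bow(d) for d in batch1],
                      num_topics=2, id2word=dictionary)
lda.update([dictionary.doc2bow(d) for d in batch2])  # no re-pass over batch1
print(lda.print_topics())
```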
Conference Paper
Full-text available
With the increasing popularity of microblogging sites, we are in the era of information explosion. As of June 2011, about 200 million tweets are being generated everyday. Although Twitter provides a list of most popular topics people tweet about known as Trending Topics in real time, it is often hard to understand what these trending topics are about. Therefore, it is important and necessary to classify these topics into general categories with high accuracy for better information retrieval. To address this problem, we classify Twitter Trending Topics into 18 general categories such as sports, politics, technology, etc. We experiment with 2 approaches for topic classification, (i) the well-known Bag-of-Words approach for text classification and (ii) network-based classification. In text-based classification method, we construct word vectors with trending topic definition and tweets, and the commonly used tf-idf weights are used to classify the topics using a Naive Bayes Multinomial classifier. In network-based classification method, we identify top 5 similar topics for a given topic based on the number of common influential users. The categories of the similar topics and the number of common influential users between the given topic and its similar topics are used to classify the given topic using a C5.0 decision tree learner. Experiments on a database of randomly selected 768 trending topics (over 18 classes) show that classification accuracy of up to 65% and 70% can be achieved using text-based and network-based classification modeling respectively.
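The paper's text-based approach reduces to a tf-idf bag-of-words fed into a multinomial naive Bayes classifier; a bare-bones scikit-learn version, with the 18 categories shrunk to a toy example, is sketched below.

```python
# Bare-bones version of the text-based classification scheme: tf-idf word
# vectors over trending-topic text, multinomial naive Bayes for categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

topics = ["world cup final tonight", "new smartphone release rumors",
          "presidential debate highlights"]
categories = ["sports", "technology", "politics"]

model = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
model.fit(topics, categories)
print(model.predict(["chip maker unveils new processor"]))
```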
Article
Full-text available
In this letter we discuss a least squares version for support vector machine (SVM) classifiers. Due to equality type constraints in the formulation, the solution follows from solving a set of linear equations, instead of quadratic programming for classical SVMs. The approach is illustrated on a two-spiral benchmark classification problem.
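The key point, that equality constraints turn training into one linear system, can be shown in a few lines of numpy; the data, linear kernel, and regularization constant below are illustrative choices, not the paper's.

```python
# LS-SVM sketch: equality constraints reduce training to one linear system
# instead of a QP. Toy linearly separable data, linear kernel.
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
gamma = 10.0                                   # regularization constant

K = X @ X.T                                    # linear kernel matrix
Omega = (y[:, None] * y[None, :]) * K
n = len(y)

# Block system: [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]
A = np.block([[np.zeros((1, 1)), y[None, :]],
              [y[:, None], Omega + np.eye(n) / gamma]])
b_alpha = np.linalg.solve(A, np.concatenate(([0.0], np.ones(n))))
b, alpha = b_alpha[0], b_alpha[1:]

decision = lambda x: np.sign(alpha * y @ (X @ x) + b)
print(decision(np.array([0.9, 0.2])))          # expect the +1 side
```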
Article
Full-text available
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Article
Full-text available
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
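A toy EM loop for the aspect model behind probabilistic LSI is sketched below in numpy; dimensions, data, and iteration count are arbitrary, and real implementations add log-likelihood monitoring and tempering.

```python
# Toy EM for the pLSI aspect model: alternate posterior P(z|d,w) with
# re-estimation of P(z|d) and P(w|z) from expected counts.
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 3, size=(6, 8)).astype(float)   # doc-word counts
D, W = n_dw.shape
K = 2                                                  # latent aspects

p_z_given_d = rng.dirichlet(np.ones(K), size=D)        # P(z|d)
p_w_given_z = rng.dirichlet(np.ones(W), size=K)        # P(w|z)

for _ in range(50):
    # E-step: P(z|d,w) proportional to P(z|d) * P(w|z)
    post = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]   # D x K x W
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate both distributions from expected counts
    counts = n_dw[:, None, :] * post                           # D x K x W
    p_w_given_z = counts.sum(axis=0)
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = counts.sum(axis=2)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + 1e-12

print(np.round(p_w_given_z, 2))
```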
Article
Full-text available
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the widescale application of this concept to real world problems in statistical estimation and pattern recognition. In this paper we describe a method for statistical modeling based on maximum entropy. We present a maximum-likelihood approach for automatically constructing maximum entropy models and describe how to implement this approach efficiently, using as examples several problems in natural language processing.
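For classification, a maximum entropy model coincides with multinomial logistic regression, so a quick hedged illustration can lean on scikit-learn; the paper itself develops its own feature induction and training procedure.

```python
# Maxent-as-logistic-regression over bag-of-words features; corpus and
# labels are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["rates rose again", "the match ended in a draw", "votes were counted"]
labels = ["economy", "sports", "politics"]

maxent = Pipeline([("bow", CountVectorizer()),
                   ("logreg", LogisticRegression(max_iter=1000))])
maxent.fit(texts, labels)
print(maxent.predict(["another match next week"]))
```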
Article
With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has become pervasive in many areas, and much research now focuses on short text mining. Classifying short text is challenging because of its inherent characteristics: sparseness, large scale, immediacy, and non-standard language. Traditional methods struggle with short text classification mainly because the limited words in a short text cannot adequately represent the feature space or the relationships between words and documents. Although several studies and surveys of text classification have appeared recently, only a few focus on short text classification. This paper discusses the characteristics of short text and the difficulties of classifying it, then reviews popular short-text classifiers and models, including classification using semantic analysis, semi-supervised classification, ensemble classification, and real-time classification. We also analyze evaluation methods for short text classification. Finally, we summarize existing classification technology and discuss likely directions for future development.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Conference Paper
We present a novel topic modelling-based methodology to track emerging events in microblogs such as Twitter. Our topic model has an in-built update mechanism based on time slices and implements a dynamic vocabulary. We first show that the method is robust in detecting events using a range of datasets with injected novel events, and then demonstrate its application in identifying trending topics in Twitter.
Conference Paper
The growing popularity of social network services has led to many studies of various phenomena in this area. However, most of this research has been conducted using English language data, and relatively little has considered Korean. In this paper, we demonstrate a systematic analysis framework using Korean Twitter data to mine temporal and spatial trends of brand images. A publicly available Korean morpheme analyzer is used to analyze Korean tweets grammatically, and we construct Korean polarity dictionaries containing a noun, adjective, verb, and/or root to automatically analyze the sentiment of each tweet message. Sentiment classification is performed by a support vector machine and multinomial naïve Bayes classifier. In particular, our own feature selection step improves the support vector machine sentiment classification accuracy to 80%. Based on this result, we visualize the temporal and spatial distribution of brand images, and present the temporal changes of brand-related keyword networks. Our analysis enables trends in brand awareness to be systematically traced and evaluated. This allows various other analyses, such as advantages and disadvantages of the brand, and a comparison with its competitors.
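A minimal sketch of the dictionary side of such a pipeline, assuming the publicly available KoNLPy analyzer (Okt) and a toy polarity lexicon, is shown below; the paper instead trains SVM and multinomial naive Bayes classifiers with its own feature selection.

```python
# Hedged sketch: Korean morpheme analysis plus a toy polarity dictionary.
# KoNLPy's Okt is one public morpheme analyzer; the paper's may differ.
from konlpy.tag import Okt

okt = Okt()
polarity = {"좋다": 1, "나쁘다": -1}     # toy noun/adjective/verb lexicon

def tweet_sentiment(text):
    morphs = okt.morphs(text, stem=True)          # grammatical analysis
    score = sum(polarity.get(m, 0) for m in morphs)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(tweet_sentiment("이 브랜드 정말 좋다"))       # expected: positive
```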
Article
This paper examines the Twitter networking pattern of “following” and “mention” relationships between South Korean politicians. The data were obtained from the Twitter profiles of Korea’s national assemblymen and the most influential political figures. We applied social network analysis techniques, including an exponential random graph model and regression methods. The results suggest that these politicians employ two different strategies to establish relationships with other politicians on Twitter. One is “following” other politicians as a social ritual based on dyadic reciprocity, and the other is “mentioning” other politicians as asymmetric political support based on the public popularity of their peers on Twitter.
Article
In this paper, we examine the results of applying Term Frequency Inverse Document Frequency (TF-IDF) to determine what words in a corpus of documents might be more favorable to use in a query. As the term implies, TF-IDF calculates values for each word in a document through an inverse proportion of the frequency of the word in a particular document to the percentage of documents the word appears in. Words with high TF-IDF numbers imply a strong relationship with the document they appear in, suggesting that if that word were to appear in a query, the document could be of interest to the user. We provide evidence that this simple algorithm efficiently categorizes relevant words that can enhance query retrieval.
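Written out directly, the weighting takes the following form; log base and smoothing conventions vary across implementations, and this toy function is only one common variant.

```python
# One common TF-IDF variant: normalized term frequency times the log of the
# inverse document frequency.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                  # term frequency
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse df
    return tf * idf

corpus = [["apple", "banana"], ["apple", "cherry"], ["durian"]]
print(tf_idf("apple", corpus[0], corpus))   # common term, moderate weight
```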
Conference Paper
We present TwitterMonitor, a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. 'trends') on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own description for each trend. We discuss the motivation for trend detection over social media streams and the challenges that lie therein. We then describe our approach to trend detection, as well as the architecture of TwitterMonitor. Finally, we lay out our demonstration scenario.
Conference Paper
Understanding the rapidly growing short text is very important. Short text is different from traditional documents in its shortness and sparsity, which hinders the application of conventional machine learning and text mining algorithms. Two major approaches have been exploited to enrich the representation of short text. One is to fetch contextual information of a short text to directly add more text; the other is to derive latent topics from an existing large corpus, which are used as features to enrich the representation of short text. The latter approach is elegant and efficient in most cases. The major trend along this direction is to derive latent topics of a certain granularity through well-known topic models such as latent Dirichlet allocation (LDA). However, topics of a single granularity are usually not sufficient to set up effective feature spaces. In this paper, we move forward along this direction by proposing a method that leverages topics at multiple levels of granularity, which can model short text more precisely. Taking short text classification as an example, we compared our proposed method with the state-of-the-art baseline over one open data set. Our method reduced the classification error by 20.25% and 16.68% respectively on two classifiers.
Conference Paper
Twitter, a microblogging service less than three years old, commands more than 41 million users as of July 2009 and is growing fast. Twitter users tweet about any topic within the 140-character limit and follow others to receive their tweets. The goal of this paper is to study the topological characteristics of Twitter and its power as a new medium of information sharing. We have crawled the entire Twitter site and obtained 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets. In its follower-following topology analysis we have found a non-power-law follower distribution, a short effective diameter, and low reciprocity, which all mark a deviation from known characteristics of human social networks [28]. In order to identify influentials on Twitter, we have ranked users by the number of followers and by PageRank and found the two rankings to be similar. Ranking by retweets differs from the previous two rankings, indicating a gap between influence inferred from the number of followers and that from the popularity of one's tweets. We have analyzed the tweets of top trending topics and reported on their temporal behavior and user participation. We have classified the trending topics based on the active period and the tweets and show that the majority (over 85%) of topics are headline news or persistent news in nature. A closer look at retweets reveals that any retweeted tweet reaches an average of 1,000 users no matter what the number of followers of the original tweet is. Once retweeted, a tweet gets retweeted almost instantly on next hops, signifying fast diffusion of information after the first retweet. To the best of our knowledge this work is the first quantitative study on the entire Twittersphere and information diffusion on it.
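The two ranking signals compared here, follower count and PageRank, can be reproduced on a toy follower graph with networkx, as sketched below; the study computes them over the full 2009 crawl of 41.7 million profiles.

```python
# Toy version of the two influence rankings: in-degree (follower count)
# versus PageRank on a follower graph (edge u -> v means "u follows v").
import networkx as nx

g = nx.DiGraph([("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")])
by_followers = sorted(g.nodes, key=g.in_degree, reverse=True)
pr = nx.pagerank(g)                     # influence flowing along follow edges
by_pagerank = sorted(pr, key=pr.get, reverse=True)
print(by_followers, by_pagerank)
```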
Article
Thesis (Ph.D. in Computer Science)--University of California, Berkeley, Fall 2004. Includes bibliographical references (leaves 106-113).
Article
The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also suggests improvements which lead to a probabilistic variant of the Rocchio classifier. The Rocchio classifier, its probabilistic variant, and a naive Bayes classifier are compared on six text categorization tasks. The results show that the probabilistic algorithms are preferable to the heuristic Rocchio classifier not only because they are more well-founded, but also because they achieve better performance.
Article
A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant, and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier. Keywords: text categorization, relevance feedback, naive Bayes classifier, information retrieval, vector space retrieval model, machine learning
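A minimal Rocchio-style centroid classifier over tf-idf vectors, in the spirit of the algorithm analyzed in these two papers, might look like the following scikit-learn sketch with an invented toy corpus.

```python
# Hypothetical Rocchio-style centroid classifier: represent documents as
# tf-idf vectors, average them per class, and assign new text to the class
# with the most similar centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["budget vote passed", "senate debate today",
         "home team won", "striker scored a goal"]
labels = np.array(["politics", "politics", "sports", "sports"])

vec = TfidfVectorizer()
X = vec.fit_transform(texts).toarray()
centroids = {c: X[labels == c].mean(axis=0) for c in set(labels)}

def rocchio_predict(text):
    v = vec.transform([text]).toarray()[0]
    # cosine similarity against each class centroid (epsilon avoids 0/0)
    sim = lambda c: v @ centroids[c] / (
        np.linalg.norm(v) * np.linalg.norm(centroids[c]) + 1e-12)
    return max(centroids, key=sim)

print(rocchio_predict("the goal decided the game"))
```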