Network-Based Pooling for Topic Modeling on
Microblog Content
Anaïs Ollagnier [0000-0002-4349-5678] and Hywel Williams [0000-0002-5927-3367]
Computer Science, University of Exeter, Exeter EX4 4QE, UK
Abstract. Topic modeling with tweets is difficult due to the short and informal
nature of the texts. Tweet-pooling (aggregation of tweets into longer documents
prior to training) has been shown to improve model outputs, but performance
varies depending on the pooling scheme and data set used. Here we investigate
a new tweet-pooling method based on network structures associated with Twitter
content. Using a standard formulation of the well-known Latent Dirichlet Alloca-
tion (LDA) topic model, we trained various models using different tweet-pooling
schemes on three diverse Twitter datasets. Tweet-pooling schemes were created
based on mention/reply relationships between tweets and Twitter users, with sev-
eral (non-networked) established methods also tested as a comparison. Results
show that pooling tweets using network information gives better topic coher-
ence and clustering performance than other pooling schemes, on the majority
of datasets tested. Our findings contribute to an improved methodology for topic
modeling with Twitter content.
Keywords: Microblogs · LDA · Information Retrieval · Aggregation · User networks
1 Introduction
Micro-blogging platforms such as Twitter have witnessed a rapid and impressive expansion, creating a popular new mode of public communication. Currently, around 6,000 tweets are posted every second on average1. Twitter has become a significant source of information for a broad variety of applications, but the volume of data
makes human analysis intractable. There is therefore considerable interest in adaptation
of computational techniques for large-scale analyses, such as opinion mining, machine
translation, and social information retrieval, among others. Application of topic model-
ing techniques to Twitter content is non-trivial due to the noisy and short texts associated
with individual tweets. In the literature, topic models such as Latent Dirichlet Alloca-
tion (LDA) [1] or the Author Topic Model (ATM) [2] have proved their success in
several applications (e.g. news articles, academic abstracts). However, results are more
mixed when applied to short texts due to the data sparsity of each individual document.
1 Date of access: 28th Jul
Several approaches have been proposed to design longer pseudo-documents by ag-
gregating multiple short texts (tweets). Each document results from a pooling strategy
applied in a pre-processing stage. In [3], an author-based tweet pooling scheme is used
which builds documents by combining all tweets posted by the same author. A hashtag-
based tweet pooling method is proposed by [4], which creates documents consisting of
all tweets containing the same hashtag. The main goal behind these approaches is to
improve topic model performance by training on the pooled documents, with efficacy
measured against similar topic models trained on the unpooled tweets. Empirical stud-
ies with these approaches highlight inconsistencies in the homogeneity of generated
topics. To overcome this problem, [5] propose a conversation-based pooling technique
which aggregates tweets occurring in the same user-to-user conversation. This approach
outperforms other pooling methods in terms of clustering quality and document re-
trieval. More recently, [6] propose to prune irrelevant tweets through a pooling strategy
based on information retrieval (IR) in order to place related tweets in the same cluster.
This method provides an interesting improvement in a variety of measures for topic
coherence, in comparison to an unmodified LDA baseline and a variety of other pooling schemes.
Several IR applications in the context of microblogs use network representations [7]
(e.g. document retrieval, document content). Here, we evaluate a novel network-based
tweet pooling method that aggregates tweets based on user interactions around each
item of content. Our intuition behind this method is to expose connections between
users and their interest in a given topic; by pooling tweets based on relational informa-
tion (user interactions) we hope to create an improved training corpus. To evaluate this
method, we perform a comprehensive empirical comparison against four state-of-the-
art pooling techniques chosen after a literature survey. Across three Twitter datasets, we
evaluate the pooling techniques in terms of topic coherence and clustering quality. The
experimental results show that the proposed technique yields superior performance for
all metrics on the majority of datasets and takes considerably less time to train.
2 Tweet Pooling Schemes
Tweet texts are qualitatively different to conventional texts, being typically short (280
characters2) with a messy structure including platform-specific objects (e.g. hashtags,
shortened urls, user names, emoticons/emojis). In this context, tweet-pooling has been
developed to better capture reliable document-level word co-occurrence patterns. Here,
we evaluate four existing unsupervised tweet pooling schemes alongside our proposed
network-based scheme:
Unpooled scheme: The default approach used as a baseline in which each tweet is
considered as a single document.
Author pooling: Each tweet authored by a single user is aggregated as a single
document, so the number of documents is the same as the number of unique users. This
approach outperforms the unpooled scheme [9].
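As a sketch, author pooling reduces to a group-by over the tweet stream. The (author, text) pair representation below is illustrative, not taken from the paper:

```python
from collections import defaultdict

def author_pool(tweets):
    """Aggregate all tweets by the same user into one pseudo-document.

    `tweets` is a list of (author, text) pairs (illustrative field layout).
    """
    pools = defaultdict(list)
    for author, text in tweets:
        pools[author].append(text)
    # one concatenated document per unique user
    return {author: " ".join(texts) for author, texts in pools.items()}

docs = author_pool([("alice", "tweet one"), ("bob", "hello"), ("alice", "tweet two")])
```

The number of pooled documents equals the number of unique users, as stated above.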
2 In September 2017, Twitter expanded the original 140-character limit to 280 characters. See: tweetingmadeeasier.html. Date of access: 11th Feb 2019.
Hashtag pooling: Tweets using similar hashtags are aggregated as a single docu-
ment. The number of documents is equal to the number of unique hashtags, but a tweet
can appear in several documents if it contains multiple hashtags. Tweets without hash-
tags are considered as individual documents. This method was shown [5] to outperform
unpooled schemes. (Note that [4] showed improved performance by assigning hashtag
labels to tweets without hashtags, but this technique adds computational cost and was
not used here.)
Conversation pooling: Each document consists of all tweets in the corpus that be-
long to the conversation tree for a chosen seed tweet. The conversation tree includes
tweets written in reply to an original tweet, as well as replies to those replies, and so
on. Tweets without replies are considered as individual documents. In [5], conversation
pooling outperforms alternative pooling schemes.
Fig. 1. Network-based tweet pooling. Each document is initialised with a seed tweet. In Step 1,
the first layer of direct replies to the seed tweet are added. In Step 2, all tweets by users mentioned
in the set of tweets resulting from Step 1 are also added.
Fig. 2. Example content of a document created by network-based tweet pooling.
Network-based pooling: In this novel scheme, each document is aggregated from
all tweets within the corpus that are associated with the seed tweet by a simple network
structure (Figure 1 and Figure 2). In Step 1, tweets are aggregated that were written in
reply to the seed tweet. In Step 2, we identify all mentioned users in the set of tweets
from Step 1 (i.e. all users that are referenced in tweet text using the @ symbol). We
then aggregate to the document all other tweets in the corpus that are authored by this
user set.
This scheme differs from conversation pooling in two aspects. First, only direct replies are aggregated, i.e. the first layer of replies from the conversation tree. Manual inspection of full tweet conversation trees showed that the conversation thread can shift in topic as the tree increases in depth; use of the full tree can thereby capture topics which are no longer related to those of the seed tweet. To identify reply tweets, we used the in_reply_to_status_id field returned by the Twitter API for each tweet.
Second, exploiting tweets of all mentioned users allows the network-based pooling to
access additional content from users interested in the topics of the original seed tweet.
Leveraging this information, we construct a network based on both interactions and
connections between users.
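The two steps above can be sketched as follows. Tweet records are shown as dicts with illustrative field names (only in_reply_to_status_id comes from the Twitter API), and the assumption that every non-reply tweet acts as a seed is ours, since the paper does not spell out seed selection:

```python
from collections import defaultdict
import re

MENTION = re.compile(r"@(\w+)")

def network_pool(tweets):
    """Network-based pooling sketch.

    Each tweet is a dict with fields `id`, `author`, `text`, and
    `in_reply_to_status_id` (None for non-replies). For each seed tweet,
    Step 1 adds its direct replies; Step 2 adds all corpus tweets authored
    by users @-mentioned in the Step-1 set.
    """
    by_reply = defaultdict(list)   # parent tweet id -> direct replies
    by_author = defaultdict(list)  # user -> all their tweets in the corpus
    for t in tweets:
        by_author[t["author"]].append(t)
        if t["in_reply_to_status_id"] is not None:
            by_reply[t["in_reply_to_status_id"]].append(t)

    documents = []
    for seed in tweets:
        if seed["in_reply_to_status_id"] is not None:
            continue  # assumption: only non-reply tweets act as seeds
        doc = [seed] + by_reply[seed["id"]]  # Step 1: direct replies only
        mentioned = {u for t in doc for u in MENTION.findall(t["text"])}
        seen = {t["id"] for t in doc}
        for user in sorted(mentioned):       # Step 2: tweets by mentioned users
            for t in by_author[user]:
                if t["id"] not in seen:
                    seen.add(t["id"])
                    doc.append(t)
        documents.append(doc)
    return documents

tweets = [
    {"id": 1, "author": "alice", "text": "big storm heading our way", "in_reply_to_status_id": None},
    {"id": 2, "author": "bob", "text": "@carol stay safe out there", "in_reply_to_status_id": 1},
    {"id": 3, "author": "carol", "text": "roads already flooded here", "in_reply_to_status_id": None},
]
docs = network_pool(tweets)  # carol's unlinked tweet joins the pool seeded by tweet 1
```

Note how tweet 3, which is neither a reply nor linked to tweet 1, is pulled into its document because carol is mentioned in a reply — exactly the extra content conversation pooling misses.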
Table 1. Distribution of latent categories in the datasets (labelled by search theme)

Dataset    No. of tweets   Category / % of documents
Generic    658,492         Music/24.4 - Business/10.2 - Movie/18.5 - Health/14.7 - Family/7.4 - Sport/24.8
Specific   445,852         Arts&entertainment/9.7 - Business/12.4 - Law Enforcement&Armed Forces/6.2 - Science&technology/36.8 - Healthcare&medicine/25.5 - Service/9.4
Events     188,000         Natural disasters/37.1 - Transport/15.4 - Industrial/10.2 - Health/9.7 - Terrorism/27.6
3 Datasets
To evaluate the portability of different pooling schemes we collected three tweet datasets with different levels of underlying thematic/topical heterogeneity. Data was collected using the public Twitter Search API3 during 2018 and 2019. Each collection
was created with a different list of API keywords and included tweets collected on
different themes. For each chosen theme a list of terms was manually created. All tweets
returned were collated in a single corpus, labelled by the theme. The three datasets
collected were:
Generic. A wide range of themes. Tweets from 11 Dec’18 to 30 Jan’19 collected
using keywords related to a range of themes (‘music’, ‘business’, ‘movies’, ‘health’,
‘family’, ‘sports’).
Event. Tweets from 23 Mar’18 to 22 Jan’19 associated with various events (‘natural
disasters’, ‘transport’, ‘industrial’, ‘health’, ‘terrorism’). Search terms were manually
collated based on reading a sample of posts about disaster events.
Specific. Tweets from 21 Feb’18 to 11 Feb’19 associated with job adverts for dif-
ferent industries (‘arts & entertainment’, ‘business’, ‘law enforcement & armed forces’,
‘science & technology’, ‘healthcare & medicine’, ‘service’). Search terms manually
collated based on reading a sample of posts about job advertisements.
For each dataset, tweets retrieved by more than one query were removed in order to preserve the uniqueness of tweet labels. Table 1 shows the distribution of latent categories in each dataset. Each retrieved tweet was labeled according to the category corresponding to the query submitted. We leverage these labels to evaluate the topics produced by each model in terms of clustering quality.
3 Date of access: 19th Feb 2019.
4 Evaluation Metrics
Following the metrics used in previous studies [4,5,6], we evaluate models both in terms of clustering quality (purity and normalized mutual information (NMI)) and semantic topic coherence (pointwise mutual information (PMI)).
Formally, let T_i be the set of tweets assigned to topic i and let T = {T_1, ..., T_|T|} be the set of topic clusters arising from an LDA model that produces |T| topics. Then let L_j be the set of tweets with ground-truth topic j and let L = {L_1, ..., L_|L|} be the set of ground-truth topic labels, with |L| labels in total. Our clustering-based metrics are defined as follows:
Purity: The purity score measures the fraction of tweets in each assigned LDA topic cluster that carry the 'true' label for that cluster, where the 'true' label is defined as the most frequent ground-truth label found in that cluster. Formally:

Purity(T, L) = (1/N) Σ_k max_j |T_k ∩ L_j|

where N is the total number of tweets. Higher purity scores indicate better reconstruction of the original 'true' topic assignments by the model.
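A minimal implementation of the purity score, assuming parallel lists of assigned cluster ids and ground-truth labels:

```python
from collections import Counter, defaultdict

def purity(assigned, truth):
    """Purity of a clustering: for each cluster, count the tweets carrying
    that cluster's majority ground-truth label, then divide by N."""
    clusters = defaultdict(list)
    for topic, label in zip(assigned, truth):
        clusters[topic].append(label)
    majority_hits = sum(Counter(labels).most_common(1)[0][1]
                        for labels in clusters.values())
    return majority_hits / len(truth)

# cluster 0 is pure (2/2 music); cluster 1's majority label covers 2/3
score = purity([0, 0, 1, 1, 1], ["music", "music", "sport", "music", "sport"])
```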
Normalized Mutual Information (NMI): The NMI score estimates how much information is shared between the assigned topics T and the ground-truth labeling L. NMI is defined as follows:

NMI(T, L) = 2 I(T, L) / (H(T) + H(L))

where I(·,·) is mutual information and H(·) is entropy, as defined in [8]. NMI lies between 0 and 1; a score of 1 indicates an exact match between the clustering and the ground-truth labeling.
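NMI can be computed directly from joint and marginal label counts; the sketch below uses natural logarithms (the base cancels in the ratio):

```python
from collections import Counter
from math import log

def nmi(assigned, truth):
    """NMI between assigned clusters and ground-truth labels, from counts."""
    n = len(truth)
    joint = Counter(zip(assigned, truth))
    ca, cl = Counter(assigned), Counter(truth)
    # mutual information I(T, L)
    mi = sum(c / n * log(c * n / (ca[a] * cl[t])) for (a, t), c in joint.items())
    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())
    denom = entropy(ca) + entropy(cl)
    return 2 * mi / denom if denom else 1.0

perfect = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])  # identical partitions
```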
Pointwise Mutual Information (PMI): The PMI score [10] evaluates the quality of inferred topics based on the top-10 words associated with each modeled topic. PMI for a given pair of words u and v is computed as

PMI(u, v) = log( p(u, v) / (p(u) p(v)) )

The probability p(x) is derived empirically as the frequency of word x in the whole tweet corpus, while p(x, y) is the likelihood of observing both x and y in the same tweet. The coherence of a topic k is computed as the average PMI over all possible pairs of the ten highest-probability words for topic k (i.e. W_k = {w_1, ..., w_10}). Formally:

PMI-Score(k) = (2 / (|W_k| (|W_k| - 1))) Σ_{i<j} PMI(w_i, w_j)

where w_i, w_j ∈ W_k. The coherence of a whole topic model is then calculated as the average PMI-Score over all topics generated by the model.
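A sketch of the PMI-based coherence for one topic, treating each tweet as a bag of space-separated words. Handling of word pairs that never co-occur is not specified in the paper; here such pairs are simply skipped:

```python
from collections import Counter
from itertools import combinations
from math import log

def pmi_score(top_words, tweets):
    """Average PMI over pairs of a topic's top words. p(x) is the fraction
    of tweets containing word x; p(x, y) the fraction containing both.
    Pairs that never co-occur are skipped (one common convention)."""
    n = len(tweets)
    sets = [set(t.split()) for t in tweets]
    single = Counter(w for s in sets for w in s if w in top_words)
    scores = []
    for u, v in combinations(sorted(top_words), 2):
        joint = sum(1 for s in sets if u in s and v in s)
        if joint:
            # log((joint/n) / ((single[u]/n) * (single[v]/n))) simplified:
            scores.append(log(joint * n / (single[u] * single[v])))
    return sum(scores) / len(scores) if scores else 0.0

corpus = ["storm flood", "storm flood", "storm wind", "calm day"]
coherence = pmi_score({"storm", "flood"}, corpus)
```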
5 Results
For each combination of the three datasets (Section 3) and five pooling schemes (Sec-
tion 2), we calculated three evaluation metrics (purity scores, NMI scores and PMI
scores; Section 4) by training LDA models with 10 topics.
Table 2 presents various statistics of the training sets obtained by applying the differ-
ent pooling schemes. We filtered the datasets to keep only tweets written in English and
those with more than three tokens. Tweets were converted to lowercase and all URLs,
mentions (except with the network pooling scheme) and stop-words were removed.
After the tokenization process, all tokens based only on non-alphanumeric characters
(emoticons) and all short tokens (fewer than 3 characters) were also deleted. Test sets (30%) were randomly extracted from each dataset, preserving the distribution of tweet categories. Each topic model was evaluated over five cross-validation runs.
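The filtering steps described above can be sketched as follows (the stop-word list shown is a small illustrative subset, not the one used in the study):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is"}  # illustrative subset

def preprocess(text, keep_mentions=False):
    """Cleaning steps described above: lowercase, strip URLs and (optionally)
    @-mentions, then drop stop-words, purely non-alphanumeric tokens, and
    tokens shorter than 3 characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    if not keep_mentions:  # mentions are kept only for the network scheme
        text = re.sub(r"@\w+", " ", text)
    tokens = re.findall(r"\S+", text)
    return [t for t in tokens
            if t not in STOPWORDS
            and len(t) >= 3
            and any(ch.isalnum() for ch in t)]

tokens = preprocess("The storm is coming @alice https://t.co/xyz soon!!")
```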
Table 2. Corpus statistics.

                       No. of documents                 No. of tokens
Scheme                 general   specific  event        general  specific  event
Unpooled               658492    445852    188000       18991    14794     9454
Author Pooling         504253    340826    157377       18339    14091     9222
Conversation Pooling   649389    440682    185737       19301    15061     9668
Hashtag Pooling        585171    387522    174501       19868    15185     9348
Network Pooling        585171    402687    171266       19868    20065     13051
Table 3. Clustering metrics and coherence scores for different schemes and datasets.

                       Purity                     NMI                        PMI Score
Scheme                 general  specific  event   general  specific  event   general  specific  event
Unpooled               0.396    0.316     0.220   0.176    0.108     0.058   0.131    0.224     0.307
Author Pooling         0.377    0.399     0.326   0.181    0.176     0.124   0.892    0.116     0.338
Conversation Pooling   0.341    0.359     0.310   0.136    0.141     0.110   0.131    0.062     0.131
Hashtag Pooling        0.337    0.250     0.245   0.145    0.045     0.071   0.293    0.347     0.851
Network Pooling        0.418    0.503     0.362   0.173    0.228     0.155   0.912    0.582     0.794
Table 3 summarises the average results obtained with each pooling scheme and
dataset. According to the clustering evaluation metrics (purity and NMI), Network Pool-
ing produced the best model performance on all datasets, with the exception of NMI
scores on the General dataset, where it was narrowly outperformed by Unpooled and
Author Pooling.
Results for other pooling schemes vary by metric and dataset. Author Pooling is the
second-ranked scheme for most metrics/datasets, with Conversation Pooling also outperforming the Unpooled scheme in most cases. It is interesting to note that Hashtag
Pooling is mostly ineffective and gives performance worse than the baseline in most
cases. This finding can perhaps be explained by the observation that hashtags are typi-
cally present in a minority of tweets (e.g. 19.6% of tweets have hashtags in the Specific
dataset). Concerning the measure of the topic interpretability, coherence scores show
that the Network Pooling scheme gives the best performance on all datasets, with the exception of the Event dataset, where it was narrowly outperformed by Hashtag Pooling.
6 Conclusion
Methods for aggregating tweets to form longer documents more amenable to topic mod-
eling have been shown here and elsewhere to improve model performance. Here we
have proposed a new network-based pooling scheme for topic modeling with Twit-
ter data, that takes into account the network of users that engage with a particular
tweet. Our approach improves topic extraction despite different levels of underlying the-
matic/topical heterogeneity of each dataset. While similar to conversation-based pool-
ing in its use of reply tweets, the network approach includes otherwise un-linked con-
tent from users who authored replies. Experimental results showed that for the tests
performed in this study, the network-based pooling scheme considerably outperformed
other methods and was portable between datasets. Model outputs were improved on
both clustering metrics (purity and NMI) and topic coherence (PMI).
The experiments presented here were conducted on corpora collected over specific time intervals, which reduces the shifting of conversation threads, especially when collecting documents authored by a mentioned user in response to the seed tweet. On a
larger scale, topic shifting might be handled by adding conditions on document times-
tamps or topic correlation. In addition, the experimental findings suggest that network-
based approaches might offer a useful technique for topic modeling with Twitter data,
subject to further testing and validation with other datasets.
Acknowledgements This work was supported by the Institute of Coding which re-
ceived funding from the Office for Students (OfS) in the United Kingdom.
References
1. Blei, D. M., Ng, A. Y., & Jordan, M. I.: Latent Dirichlet allocation. In: Journal of Machine
Learning Research, vol. 3, pp. 993–1022. (2003).
2. Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P.: The author-topic model for authors
and documents. In: Proceedings of the 20th conference on Uncertainty in artificial intelli-
gence. AUAI Press. pp. 487–494. (2004).
3. Hong, L., & Davison, B. D.: Empirical study of topic modeling in twitter. In: Proceedings of
the 1st workshop on Social Media Analytics. ACM. pp. 80–88. (2010).
4. Mehrotra, R., Sanner, S., Buntine, W., & Xie, L.: Improving lda topic models for microblogs
via tweet pooling and automatic labeling. In: Proceedings of the 36th international ACM SI-
GIR conference on Research and Development in Information Retrieval. ACM. pp. 889–892. (2013).
5. Alvarez-Melis, D., & Saveski, M.: Topic modeling in twitter: Aggregating tweets by conver-
sations. In: Proceedings of the 10th international AAAI conference on Web and Social Media.
pp. 519–522. (2016).
6. Hajjem, M., & Latiri, C.: Combining IR and LDA topic modeling for filtering microblogs. In:
Procedia Computer Science, vol. 112, pp. 761–770. (2017).
7. Ahmad, W., & Ali, R.: Information retrieval from social networks: A survey. In: Proceedings
of the 3rd international conference on Recent Advances in Information Technology (RAIT).
IEEE. pp. 631–635. (2016).
8. Manning, C., Raghavan, P., & Schütze, H.: Introduction to information retrieval. In: Natural
Language Engineering, vol. 16, no 1, pp. 100–103. (2010).
9. Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E. P., Yan, H., & Li, X.: Comparing twitter
and traditional media using topic models. In: European conference on Information Retrieval.
Springer, pp. 338–349. (2011).
10. Lau, J. H., Newman, D., & Baldwin, T.: Machine reading tea leaves: Automatically evalu-
ating topic coherence and topic model quality. In: Proceedings of the 14th Conference of the
European Chapter of the Association for Computational Linguistics. pp. 530–539. (2014).