A Biterm Topic Model for Short Texts
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng
Institute of Computing Technology, CAS
Beijing, China 100190
yanxiaohui@software.ict.ac.cn, {guojiafeng, lanyanyan, cxq}@ict.ac.cn
ABSTRACT
Uncovering the topics within short texts, such as tweets and
instant messages, has become an important task for many
content analysis applications. However, directly applying
conventional topic models (e.g. LDA and PLSA) on such
short texts may not work well. The fundamental reason
lies in that conventional topic models implicitly capture the
document-level word co-occurrence patterns to reveal topics,
and thus suffer from the severe data sparsity in short documents. In this paper, we propose a novel way for modeling topics in short texts, referred to as the biterm topic model (BTM).
Specifically, in BTM we learn the topics by directly modeling
the generation of word co-occurrence patterns (i.e. biterms)
in the whole corpus. The major advantages of BTM are
that 1) BTM explicitly models the word co-occurrence pat-
terns to enhance the topic learning; and 2) BTM uses the
aggregated patterns in the whole corpus for learning topics
to solve the problem of sparse word co-occurrence patterns
at document-level. We carry out extensive experiments on
real-world short text collections. The results demonstrate
that our approach can discover more prominent and coher-
ent topics, and significantly outperform baseline methods on
several evaluation metrics. Furthermore, we find that BTM
can outperform LDA even on normal texts, showing the po-
tential generality and wider usage of the new topic model.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information
Search and Retrieval; I.5.3 [Pattern Recognition]: Clus-
tering
Keywords
Short Text, Topic Model, Biterm, Content Analysis, Document Clustering
1. INTRODUCTION
Short texts are prevalent on the Web, whether on traditional Web sites, e.g. Web page titles, text advertisements and image captions, or in emerging social media, e.g. tweets, status messages, and questions in Q&A websites. Uncovering the topics of such short texts is crucial for a wide range of content analysis tasks, such as content characterizing [26, 35, 14], user interest profiling [32], emerging topic detection [20], and so on. However, unlike traditional normal documents (e.g. news articles and academic papers), the lack of rich context in short texts makes topic modeling a challenging problem.
Conventional topic models, like PLSA [16] and LDA [3],
are widely used for uncovering the hidden topics from text
corpus. In general, documents are modeled as mixtures
of topics, where a topic is a probability distribution over
words. Statistical techniques are then utilized to learn the
topic components and mixture coefficients of each document.
In essence, the conventional topic models reveal the latent
topics within the text corpus by implicitly capturing the
document-level word co-occurrence patterns [5, 30]. Therefore, directly applying these models to short texts will suffer from the severe data sparsity problem (i.e. the sparse word co-occurrence patterns in each short document) [17]. More specifically, 1) the occurrences of words in a short document play a less discriminative role than in lengthy documents, where the model has enough word counts to learn how words are related [17]; 2) the limited contexts make it more difficult for topic models to identify the senses of ambiguous words in short documents.
One simple but popular way to alleviate the sparsity problem is to aggregate short texts into lengthy pseudo-documents before training a standard topic model. For example, Weng et al. [32] aggregated the tweets published by each individual user into one document before training LDA. Besides user-based aggregation, Hong et al. [17] also aggregated the tweets containing the same word, and showed that topic models trained on these aggregated messages work better than regular LDA. However, such heuristic data aggregation methods are highly data-dependent. For example, user information is not always available in some datasets, like collections of Web page titles or advertisements. Even when user information is available, e.g. in tweet data, most users have only a few tweets, which makes the aggregation less effective.
Another way to deal with the problem is to make stronger assumptions on the data. A typical one is to assume that a short document covers only a single topic. For example, Zhao et al. [35] modeled each tweet following the mixture of unigrams model [23]. A similar approach can be found in [12], where the words in each sentence are assumed to be drawn from the same topic. Compared to LDA and PLSA, the simplified data generation process may help alleviate the sparsity problem in short texts. However, it loses the flexibility to capture different topic ingredients in one document, and suffers from overfitting issues due to the peaked posteriors of topics P(z|d) [3].

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media.
WWW 2013, May 13–17, 2013, Rio de Janeiro, Brazil.
ACM 978-1-4503-2035-1/13/05.
Unlike these approaches, in this paper we propose a novel topic model for short texts to tackle the sparsity problem. The main idea comes from the answers to the following two questions. 1) Since topics are basically groups of correlated words, and the correlation is revealed by word co-occurrence patterns in documents, why not explicitly model the word co-occurrence for topic learning? 2) Since topic models on short texts suffer from severely sparse patterns in single short documents, why not use the rich global word co-occurrence patterns to better reveal topics?
Specifically, we propose a generative biterm topic model (BTM), which learns topics over short texts by directly modeling the generation of biterms in the whole corpus. Here, a biterm is an unordered word pair co-occurring in a short context. The data generation process under BTM is that the corpus consists of a mixture of topics, and each biterm is drawn from a specific topic. Compared with conventional topic models, the major differences and advantages of BTM lie in that 1) BTM explicitly models the word co-occurrence patterns (i.e. biterms), rather than documents, to enhance the topic learning; and 2) BTM uses the aggregated patterns in the whole corpus for learning topics to solve the problem of sparse patterns at the document level. By learning BTM, we can obtain the topic components and a global topic distribution of the corpus, but not the topic distribution of each individual document, since BTM does not model the document generation process. However, we show that the topic distribution of each document can be naturally derived based on the learned model.
We conduct extensive experiments on two real-world short
text collections, i.e. the datasets from Twitter and a Q&A
website. Experimental results show that BTM can discover
more prominent and coherent topics than the baseline meth-
ods. Quantitative evaluations confirm the superiority of
BTM on several evaluation metrics. Additionally, we also test our approach on a normal text collection, i.e. 20Newsgroups. Surprisingly, we find that BTM can outperform LDA even on normal texts, showing the potential generality and wider usage of the new topic model.
The rest of the paper is organized as follows: Section 2 gives a brief review of related work. Section 3 introduces our model for short text topic modeling, and Section 4 discusses its implementation. Experimental results are presented in Section 5. Finally, conclusions are drawn in the last section.
2. RELATED WORKS
In this section, we briefly summarize related work from two perspectives: topic models on normal texts, and topic models on short texts.
2.1 Topic Models on Normal Texts
Topic models have been proposed to uncover the latent semantic structure of a text corpus. The effort of mining semantic structure in a text collection dates back to latent semantic analysis (LSA) [9], which utilizes the singular value decomposition of the document-term matrix to reveal the major associative word patterns. Probabilistic latent semantic analysis (PLSA) [16] improves LSA with a sounder probabilistic model based on a mixture decomposition derived from a latent class model. In PLSA, a document is represented as a mixture of topics, while a topic is a probability distribution over words. Extending PLSA, Latent Dirichlet Allocation (LDA) [3] adds Dirichlet priors on the topic distributions, resulting in a more complete generative model. Due to its nice generalization ability and extensibility, LDA has achieved huge success in the text mining domain.
In the last decade, topic models have been extensively studied. Many more complicated variants and extensions of LDA and PLSA have been proposed, such as the author-topic model [27], Bayesian nonparametric topic models [29], and supervised topic models [2]. Among them, two works close to ours are the recently proposed regularized topic model [22] and the generalized Pólya model [21], which also employ word co-occurrence statistics to enhance topic learning. However, both of them utilize word co-occurrences as structural priors for the topic-word distributions, rather than directly modeling their generation process. Moreover, almost all the models mentioned above deal with normal text without considering the specificity of short texts.
2.2 Topic Models on Short Texts
Early studies mainly focused on exploiting external knowledge to enrich the representation of short texts. For example, Sahami et al. [28] suggested a search-snippet-based similarity measure for short texts. Phan et al. [24] learned hidden topics from large external resources to enrich the representation of short texts. Jin et al. [19] learned topics on short texts via transfer learning from auxiliary long text data. These approaches may be helpful in some specific domains, but they are not general, since a favorable external dataset may not always be available. Additionally, these approaches and ours are complementary rather than competitive.
With the emergence of social media in recent years, topic models have been utilized for social media content analysis in various tasks, such as content characterizing [26, 35], event tracking [20], content recommendation [25, 8], and influential user prediction [32]. However, due to the lack of topic models specific to short texts, some researchers directly applied conventional (or slightly modified) topic models for analysis [26, 31]. Others tried to aggregate short texts into lengthy pseudo-documents based on additional information, and then trained conventional topic models [32, 35]. Hong et al. [17] made a comprehensive empirical study of topic modeling on Twitter, and suggested that new topic models for short texts are in demand.
In our previous work, we developed methods based on non-negative matrix factorization for short text clustering [34] and topic learning [33] by exploiting global word co-occurrence information. This work extends them by proposing a more principled approach to model topics over short texts. To the best of our knowledge, the proposed topic model is the first one focusing on general-domain short texts that does not exploit any external knowledge.
3. OUR APPROACH
Conventional topic models learn topics based on document-level word co-occurrence patterns, whose effectiveness is highly impaired in the short text scenario, where the word co-occurrence patterns become very sparse in each document. To tackle this problem, we propose a novel biterm topic model, which learns topics over short texts by directly modeling the generation of all the biterms (i.e. word co-occurrence patterns) in the whole corpus.
Figure 1: Graphical representation of (a) LDA, (b) mixture of unigrams, and (c) BTM. Different from LDA
and mixture of unigrams, BTM models the generation procedure of biterms in a collection, rather than
documents. For clarity, the fixed hyperparameters α, β are not presented.
3.1 Biterm Extraction
Without loss of generality, topics are represented as groups of correlated words in topic models, while the correlation is revealed by word co-occurrence patterns in documents. For example, if the words “apple”, “iphone”, “ipad” and “app” frequently co-occur with each other in the same contexts, we can identify that they belong to the same topic (i.e. the Apple company and its products). Conventional topic models implicitly capture such word co-occurrence patterns by modeling word generation at the document level. Different from those approaches, our BTM directly models the word co-occurrence patterns based on biterms. A biterm denotes an unordered word pair co-occurring in a short context (i.e. an instance of a word co-occurrence pattern). Here the short context refers to a proper text window containing meaningful word co-occurrences. Since short text documents are usually short and specific, we simply take each document as an individual context unit, and extract any two distinct words in a short text document as a biterm. For example, in the short text document “I visit apple store.”, if we ignore the stop word “I”, there are three biterms, i.e. “visit apple”, “visit store”, “apple store”. The biterms extracted from all the documents in the collection compose the training data of BTM.
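The extraction step described above can be sketched in a few lines (an illustrative sketch only; the tokenizer and stop-word handling are assumed to have been applied beforehand, as in the paper's example):

```python
from itertools import combinations

def extract_biterms(doc_tokens):
    """Return all unordered word pairs (biterms) from one short document."""
    # Every pair of token positions yields one biterm; sorting normalizes
    # the pair so that ("visit", "apple") and ("apple", "visit") coincide.
    return [tuple(sorted(pair)) for pair in combinations(doc_tokens, 2)]

# The paper's example: "I visit apple store." with the stop word "I" removed.
tokens = ["visit", "apple", "store"]
print(extract_biterms(tokens))
# three biterms: (apple, visit), (store, visit), (apple, store)
```

A document of length l thus produces l(l − 1)/2 biterms, which matters for the complexity analysis later.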
3.2 Biterm Topic Model
The key idea of BTM is to learn topics over short texts based on the aggregated biterms in the whole corpus, so as to tackle the sparsity problem at the single-document level. Specifically, we consider the whole corpus as a mixture of topics, where each biterm is drawn from a specific topic independently¹. The probability that a biterm is drawn from a specific topic is further captured by the chances that both words in the biterm are drawn from that topic. Suppose α and β are the Dirichlet priors. The generative process of the corpus in BTM can be described as follows:

1. For each topic z
   (a) draw a topic-specific word distribution φ_z ∼ Dir(β)
2. Draw a topic distribution θ ∼ Dir(α) for the whole collection
3. For each biterm b in the biterm set B
   (a) draw a topic assignment z ∼ Multi(θ)
   (b) draw two words: w_i, w_j ∼ Multi(φ_z)

¹Strictly speaking, two biterms in a document sharing the same word occurrence are not independent. This simplified assumption facilitates the computation by treating BTM as a model built upon a biterm set.
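The generative process can be written as a small forward simulation (a sketch only; the topic number, vocabulary size, and hyperparameter values below are arbitrary, not those used in the paper):

```python
import random

random.seed(7)
K, M, N_B = 3, 10, 5        # topics, vocabulary size, number of biterms
alpha, beta = 1.0, 0.1      # symmetric Dirichlet hyperparameters

def dirichlet(dim, conc):
    # Sample a symmetric Dirichlet via normalized Gamma draws.
    g = [random.gammavariate(conc, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def draw(dist):
    # Draw an index from a discrete distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

phi = [dirichlet(M, beta) for _ in range(K)]  # 1. phi_z ~ Dir(beta) per topic
theta = dirichlet(K, alpha)                   # 2. theta ~ Dir(alpha), corpus-level
biterms = []
for _ in range(N_B):                          # 3. per biterm: topic z, then two words
    z = draw(theta)
    biterms.append((draw(phi[z]), draw(phi[z])))
print(biterms)
```

Both words of a biterm are drawn from the same φ_z, which is exactly what couples the two words of a co-occurrence pattern to one topic.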
Following the above procedure, the joint probability of a biterm b = (w_i, w_j) can be written as:

    P(b) = Σ_z P(z) P(w_i|z) P(w_j|z) = Σ_z θ_z φ_{i|z} φ_{j|z}.    (1)

Thus the likelihood of the whole corpus is:

    P(B) = Π_{(i,j)} Σ_z θ_z φ_{i|z} φ_{j|z}.    (2)
We can see that here we directly model the word co-occurrence pattern, rather than a single word, as the unit conveying the semantics of topics. The co-occurrence of a pair of words reveals topics much better than the occurrence of a single word, and thus enhances the learning of topics. Moreover, all the biterms from the whole corpus, rather than from a single document, are aggregated together for topic learning. Therefore, we can fully leverage the rich global word co-occurrence patterns to better reveal the latent topics.
To better understand the uniqueness of BTM relative to conventional topic models, we compare BTM with two typical models for topic learning, i.e. LDA and mixture of unigrams. Figure 1 illustrates the graphical representations of the three models. We can see that in LDA each document is generated by first drawing a document-level topic distribution θ_d, and then iteratively sampling a topic assignment z for each word w in the document. LDA implicitly captures the document-level word co-occurrence patterns, since the topic assignment variable z of each word depends on the other words in the same document through sharing the same document-level topic distribution θ_d. Hence, when documents are short, LDA suffers from the sparsity problem due to its excessive reliance on local observations for the inference of the word topic assignment z, which in turn hurts the learning of the topics φ.
Different from LDA, mixture of unigrams draws the topic assignment z for each document from a corpus-level topic distribution θ. By leveraging the information of the whole corpus, it alleviates the sparsity problem in topic inference, which in turn helps the learning of the topic components φ. However, mixture of unigrams assumes that all the words in a document are sampled from the same topic. This assumption is so strong that it prevents the model from capturing fine-grained topics in documents. As we can see, even in short texts, there might be multiple topics in one document.
BTM, shown in Figure 1(c), overcomes the data sparsity problem of LDA by drawing the topic assignment z from the corpus-level topic distribution θ, as mixture of unigrams does. Meanwhile, it also surmounts the disadvantage of mixture of unigrams by breaking documents into biterms. In this way, BTM not only keeps the correlation between words, but also captures multiple topic ingredients in a document, since the topic assignments of different biterms in a document are independent.
3.3 Inferring Topics in a Document
A major difference between BTM and conventional topic models is that BTM does not model the document generation process. Therefore, we cannot directly obtain the topic proportions of documents during the topic learning process. To infer the topics in a document, we assume that the topic proportions of a document equal the expectation of the topic proportions of the biterms generated from the document:

    P(z|d) = Σ_b P(z|b) P(b|d).    (3)
In Eq.(3), P(z|b) can be calculated via Bayes' formula based on the parameters estimated in BTM:

    P(z|b) = P(z) P(w_i|z) P(w_j|z) / Σ_{z'} P(z') P(w_i|z') P(w_j|z'),

where P(z) = θ_z and P(w_i|z) = φ_{i|z}. The remaining problem is then how to obtain P(b|d). Here we simply take the empirical distribution of biterms in the document as the estimate:

    P(b|d) = n_d(b) / Σ_b n_d(b),

where n_d(b) is the frequency of the biterm b in the document d. In short texts, P(b|d) is nearly a uniform distribution over all the biterms in the document d. Despite its simplicity, we find this estimate always obtains good results in practice. More sophisticated ways may be studied in future work.
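Given estimated parameters θ and φ, the document-level inference of Eq.(3) reduces to a few lines (a sketch; the two-topic toy parameters below are invented for illustration):

```python
from collections import Counter
from itertools import combinations

def infer_doc_topics(doc_tokens, theta, phi):
    """P(z|d) = sum_b P(z|b) P(b|d), with P(b|d) the empirical biterm frequency."""
    K = len(theta)
    counts = Counter(tuple(sorted(p)) for p in combinations(doc_tokens, 2))
    total = sum(counts.values())
    p_zd = [0.0] * K
    for (wi, wj), n in counts.items():
        # P(z|b) via Bayes' rule: proportional to theta_z * phi[z][wi] * phi[z][wj]
        joint = [theta[z] * phi[z][wi] * phi[z][wj] for z in range(K)]
        norm = sum(joint)
        for z in range(K):
            p_zd[z] += (joint[z] / norm) * (n / total)
    return p_zd

# Toy parameters: 2 topics over a 3-word vocabulary {0, 1, 2}.
theta = [0.5, 0.5]
phi = [[0.8, 0.1, 0.1],   # topic 0 favors word 0
       [0.1, 0.1, 0.8]]   # topic 1 favors word 2
print(infer_doc_topics([0, 0, 2], theta, phi))
```

For the toy document [0, 0, 2], which is dominated by word 0, the inferred proportion of topic 0 exceeds that of topic 1, as expected.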
4. PARAMETERS INFERENCE
In this section, we describe the algorithm for inferring the parameters {φ, θ} in BTM, and compare its complexity with that of LDA.
4.1 Inference by Gibbs Sampling
Similar to LDA, exact inference is intractable in BTM. Hence, we adopt Gibbs sampling to perform approximate inference. Gibbs sampling is a simple and widely applicable Markov chain Monte Carlo algorithm. Compared to other inference methods for latent variable models, like variational inference and maximum a posteriori estimation, Gibbs sampling has two advantages. First, it is in principle more accurate, since it asymptotically approaches the correct distribution. Second, it is more memory-efficient, since it only requires maintaining the counters and state variables, making it preferable for large-scale datasets. A more detailed comparison of these methods can be found in [1].

Algorithm 1: Gibbs sampling algorithm for BTM
  Input: the number of topics K, hyperparameters α, β, biterm set B
  Output: multinomial parameters φ and θ
  initialize topic assignments randomly for all the biterms
  for iter = 1 to N_iter do
      for each biterm b in B do
          draw z_b from P(z|z_{-b}, B)
          update n_z, n_{w_i|z}, and n_{w_j|z}
  compute the parameters φ by Eq.(5) and θ by Eq.(6)
The basic idea of Gibbs sampling is to estimate the parameters alternately, by replacing the value of one variable with a value drawn from the distribution of that variable conditioned on the values of the remaining variables. In BTM, we need to sample all three types of latent variables: z, φ, and θ. However, with the technique of collapsed Gibbs sampling [10], φ and θ can be integrated out owing to the conjugate priors α and β. Consequently, we only have to sample the topic assignment for each biterm from its conditional distribution given the remaining variables.
To perform Gibbs sampling, we first choose initial states for the Markov chain randomly. Then we calculate the conditional distribution P(z|z_{-b}, B) for each biterm b = (w_i, w_j), where z_{-b} denotes the topic assignments for all biterms except b, and B is the global biterm set. By applying the chain rule to the joint probability of the whole data, we can obtain the conditional probability conveniently:

    P(z|z_{-b}, B) ∝ (n_z + α) (n_{w_i|z} + β)(n_{w_j|z} + β) / (Σ_w n_{w|z} + Mβ)²,    (4)
where n_z is the number of biterms assigned to topic z, n_{w|z} is the number of times the word w is assigned to topic z, and M is the size of the vocabulary. Following the conventions of LDA, here we use symmetric Dirichlet priors α and β. Note that once a biterm b is assigned to the topic z, the two words w_i and w_j in it are assigned to that topic simultaneously.
Finally, with the counters of the topic assignments of biterms and word occurrences, we can easily estimate the topic-word distributions φ and the global topic distribution θ as:

    φ_{w|z} = (n_{w|z} + β) / (Σ_w n_{w|z} + Mβ),    (5)

    θ_z = (n_z + α) / (|B| + Kα),    (6)

where |B| is the total number of biterms.
An overview of the Gibbs sampling procedure is shown in Algorithm 1. Due to space limitations, we omit its detailed derivation.
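Algorithm 1 can be sketched compactly as follows (a toy, unoptimized implementation for illustration only; the vocabulary, data, and iteration count are invented):

```python
import random

def btm_gibbs(biterms, K, M, alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for BTM; returns (phi, theta) estimates."""
    random.seed(seed)
    n_z = [0] * K                       # number of biterms assigned to each topic
    n_wz = [[0] * M for _ in range(K)]  # word-topic counters
    z_b = []
    for wi, wj in biterms:              # random initial topic assignments
        z = random.randrange(K)
        z_b.append(z)
        n_z[z] += 1
        n_wz[z][wi] += 1
        n_wz[z][wj] += 1
    for _ in range(n_iter):
        for idx, (wi, wj) in enumerate(biterms):
            z = z_b[idx]                # remove the current assignment
            n_z[z] -= 1; n_wz[z][wi] -= 1; n_wz[z][wj] -= 1
            # Eq.(4): P(z|z_-b,B) ∝ (n_z+α)(n_wi|z+β)(n_wj|z+β)/(Σ_w n_w|z + Mβ)²
            p = []
            for k in range(K):
                denom = sum(n_wz[k]) + M * beta
                p.append((n_z[k] + alpha) * (n_wz[k][wi] + beta)
                         * (n_wz[k][wj] + beta) / (denom * denom))
            r, acc, z = random.random() * sum(p), 0.0, K - 1
            for k in range(K):
                acc += p[k]
                if r < acc:
                    z = k
                    break
            z_b[idx] = z                # add the new assignment back
            n_z[z] += 1; n_wz[z][wi] += 1; n_wz[z][wj] += 1
    B = len(biterms)
    phi = [[(n_wz[k][w] + beta) / (sum(n_wz[k]) + M * beta) for w in range(M)]
           for k in range(K)]                                    # Eq.(5)
    theta = [(n_z[k] + alpha) / (B + K * alpha) for k in range(K)]  # Eq.(6)
    return phi, theta

# Tiny corpus: words {0,1} always co-occur, and so do words {2,3}.
data = [(0, 1), (0, 1), (2, 3), (2, 3), (0, 1), (2, 3)]
phi, theta = btm_gibbs(data, K=2, M=4, alpha=1.0, beta=0.1, n_iter=50)
print(theta)
```

Note that assigning a biterm to a topic increments the counters of both of its words at once, mirroring the remark after Eq.(4).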
4.2 Complexity Analysis
The major time-consuming part of the Gibbs sampling procedure of BTM is evaluating the conditional probability in Eq.(4) for all the biterms, with time complexity O(K|B|). During the entire process, we need to keep the counters n_z and n_{w|z}, and the topic assignment z for each biterm, in total (K + MK + |B|) variables in memory. Note that in LDA, we need to draw a topic assignment for every word occurrence in the documents, which costs time O(K|D|l̄), where l̄ = Σ_i l_i/|D| is the average length of documents in the collection. As for memory cost, LDA has to maintain the counters n_{z|d} and n_{w|z}, and the topic assignment z for each word occurrence [15], in total (|D|K + MK + |D|l̄) variables. Table 1 lists the time complexity and the variables required to be maintained in the Gibbs sampling procedures of LDA and BTM.

Table 1: Time complexity and the number of variables to be maintained in the Gibbs sampling implementations of LDA and BTM.
  method   time complexity   #variables
  LDA      O(K|D|l̄)          |D|K + MK + |D|l̄
  BTM      O(K|B|)           K + MK + |B|

Table 2: Time cost (seconds) per iteration of BTM and LDA on the Tweets2011 collection.
  K      50       100      150      200      250
  LDA    38.07    74.38    108.13   143.47   178.66
  BTM    128.64   250.07   362.27   476.19   591.24
To compare the time and memory costs of BTM and LDA, we approximately rewrite |B| as²:

    |B| ≈ |D| l̄(l̄ − 1)/2.

We can see that the time complexity of BTM is about (l̄ − 1)/2 times that of LDA. In short texts, the average length of documents is very small, e.g. l̄ = 5.21 in the Tweets2011 collection, so the run-time of BTM is still comparable with that of LDA. However, for a very large dataset and a large topic number K, LDA is susceptible to memory problems owing to the huge value of |D|K.
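The approximation can be checked numerically with the Tweets2011 statistics quoted in the text (l̄ = 5.21, |D| = 4,230,578):

```python
# Approximate biterm count |B| and the BTM/LDA time-complexity ratio
# using the Tweets2011 statistics given in the text.
D = 4_230_578            # number of documents
l_bar = 5.21             # average document length
B = D * l_bar * (l_bar - 1) / 2
print(round(B))          # approximate |B|
print((l_bar - 1) / 2)   # BTM time is about this many times LDA's: 2.105
```

The predicted ratio of about 2.1 is of the same order as the factor of roughly 3 observed per iteration in Table 2; the gap is plausibly due to constant-factor differences between the implementations.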
Table 2 shows the average run-time (per iteration) of BTM and LDA in our experiments on the Tweets2011 collection. We find the run-time of BTM is consistently about 3 times that of LDA for different topic numbers K. Table 3 shows the overall memory cost of BTM and LDA on the same collection. We find that the memory required by LDA increases rapidly as the topic number K grows, costing more than 10 times the memory of BTM when K is larger than 200. In contrast, the memory required by BTM grows very slowly. On further investigation, we find that the major part of the memory in BTM is used to store the biterms of the training dataset. Therefore, BTM is the better choice for a large dataset and a large topic number K, when memory cost is a bottleneck.
5. EXPERIMENTS
In this section, we conduct experiments on real-world short text collections to demonstrate the effectiveness of our proposed approach. We take two typical topic models as our baseline methods, namely LDA and mixture of unigrams. All the experiments were carried out on a Linux server with an Intel Xeon 2.33 GHz CPU and 16 GB of memory. Both BTM and mixture of unigrams were implemented in C++³. For LDA, we used the open-source implementation GibbsLDA++⁴. Parameters were tuned via grid search: for LDA, α = 0.05 on the short text collections and α = 50/K on the normal text collection, with β = 0.01; for BTM and mixture of unigrams, α = 50/K and β = 0.01. In all the methods, Gibbs sampling was run for 1,000 iterations. The results reported are averages over 10 runs.

²For a document of length l, we generate l(l − 1)/2 biterms. Here we simply take all the documents as having the same length, since the variance of the length of short documents is not large.

Table 3: Memory cost (MB) per iteration of BTM and LDA on the Tweets2011 collection.
  K      50     100    150    200     250
  LDA    3177   5524   7890   10218   12561
  BTM    927    946    964    984     1002
One typical way to evaluate topic models is to compare the perplexity or marginal likelihood on a held-out test set [3, 11, 12]. However, since BTM does not model the generation process of documents, these measures are not applicable to it. Moreover, such measures do not directly reflect topic quality [6]. Therefore, we evaluate the topic modeling performance of BTM with other, task-dependent metrics.
5.1 Evaluation on Tweets2011 Collection
To verify the effectiveness of BTM on short texts, we carried out experiments on a standard short text collection, namely Tweets2011⁵. It was published in the TREC 2011 microblog track, and provides approximately 16 million tweets sampled between January 23rd and February 8th, 2011. Besides the complete content of the tweets, it also includes a user id and a timestamp for each tweet. To reduce low-quality tweets, we processed the raw content via the following normalization steps: (a) removing non-Latin characters and stop words; (b) converting letters to lower case; (c) removing words with document frequency less than 10; (d) filtering out tweets with length less than 2; (e) removing duplicate tweets. This left 4,230,578 valid tweets, 98,857 distinct words, and 2,039,877 users. The average document length is 5.21.
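Steps (a)–(e) might look like the following (a rough sketch; the stop-word list, tokenizer, and default thresholds here are placeholders, not the ones used in the paper):

```python
import re
from collections import Counter

STOP_WORDS = {"i", "the", "a", "to"}   # placeholder list, not the paper's

def normalize(raw_docs, min_df=10, min_len=2):
    # (a) keep Latin-letter tokens only and drop stop words; (b) lower-case
    docs = []
    for text in raw_docs:
        tokens = re.findall(r"[a-zA-Z]+", text.lower())
        docs.append([t for t in tokens if t not in STOP_WORDS])
    # (c) remove words with document frequency below min_df
    df = Counter(w for d in docs for w in set(d))
    docs = [[w for w in d if df[w] >= min_df] for d in docs]
    # (d) filter out documents shorter than min_len; (e) drop duplicates
    seen, out = set(), []
    for d in docs:
        key = tuple(d)
        if len(d) >= min_len and key not in seen:
            seen.add(key)
            out.append(d)
    return out

print(normalize(["I visit Apple store!", "i visit apple store", "hi"], min_df=1))
# → [['visit', 'apple', 'store']]
```

The duplicate tweet and the length-1 tweet are both dropped, leaving a single normalized document.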
We compared BTM with three topic modeling methods on this short text collection: (a) standard LDA, which takes each tweet as a document; (b) LDA-U, which aggregates all the tweets from a user into one large pseudo-document before training LDA; (c) mixture of unigrams (denoted as Mix), which assumes each tweet exhibits only a single topic. On this collection, we set the number of topics K = 50 for all the methods.
5.1.1 Quality of Topics
To investigate the quality of the topics discovered by all the test methods, we first sampled some topics for visualization. Following [7], we randomly drew two topics shared by the topic sets discovered by the four methods. The selection process is as follows. First, we collected the top 5 words of each topic into a topical word set for each method individually. Then we randomly chose two terms (i.e., job and snow) from the intersection of the four topical word
sets. For each topic, besides the top 20 words, which are
3
Code of BTM : http://code.google.com/p/btm/
4
http://gibbslda.sourceforge.net/
5
http://trec.nist.gov/data/tweets/
most representative for a topic, we also listed 20 non-top
words (i.e. ranked from 1001 to 1020) ordered by P (w|z).
Ideally, a high quality topic should be coherent as much
as possible. Hence, it is expected that the non-top words
should be relevant to the top words in the same topic.
Table 4 presents the top words (first row) and non-top words (second row) of the topic selected by the word “job”. We find that the two words “job” and “jobs” are ranked highest by all four methods. However, in LDA, some other words, like “web”, “website”, and “google”, are more related to a topic about websites than to jobs. The results of LDA-U and mixture of unigrams seem a little better than those of LDA, but still include a few less relevant words like “website” and “www”. In BTM, by contrast, the top 20 words are more prominent and precise about “job”. Among the non-top words, we find that LDA includes the fewest words about “job”, and they are hard to connect to the top words. On the contrary, BTM includes more relevant words about “job” than the others, suggesting that this topic discovered by BTM is more coherent.
Table 5 presents the top words (first row) and non-top words (second row) of the topic selected by another word, “snow”. In the first row, again we can see that the top words of LDA mix words about two different subjects, “weather” and “car”. The results of LDA-U are similar to those of LDA, but lean more toward “weather”. In contrast, the top words of mixture of unigrams and BTM clearly describe weather. In the second row, both LDA and LDA-U list words that have almost no connection to “snow”, while some of them are related to “car”. For mixture of unigrams, it is hard to explain the topic based on these non-top words. In BTM, there are still many words about “weather”, like “temperature” and “cyclone”. Besides the two topics presented here, we also find a similar phenomenon in the remaining topics, which suggests that the topics discovered by BTM are more prominent and coherent than those of the three baselines.
In order to perform more comprehensive analysis, we uti-
lize an automated metric, namely coherence score, proposed
by Mimno et al [21] for topic quality evaluation. Given a
topic z and its top T words V
(z)
=(v
(z)
1
, ..., v
(z)
T
) ordered by
P (w|z), the coherence score is defined as:
C(z; V
(z)
)=
T
t=2
t
l=1
log
D(v
(z)
m
,v
(z)
l
)+1
D(v
(z)
l
)
,
where D(v) is the document frequency of word v, and D(v, v′) is
the number of documents in which words v and v′ co-occur. The
coherence score is based on the idea that words belonging to
a single concept will tend to co-occur within the same documents.
It has been empirically demonstrated that the coherence
score is highly correlated with human-judged topic coherence.
It must be stressed that the coherence score is only
appropriate for measuring the frequent words in a topic,
because the frequency of rare words is less reliable.
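As an illustration, the coherence score can be computed from document frequencies as follows. This is a minimal sketch, not the authors' code; the function name and the toy three-document corpus are our own.

```python
import math

def coherence(top_words, doc_freq, co_doc_freq):
    """Coherence score of Mimno et al.: for each ordered pair of top words
    (v_l ranked before v_t), add log((D(v_t, v_l) + 1) / D(v_l))."""
    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            v_t, v_l = top_words[t], top_words[l]
            co = co_doc_freq.get(frozenset((v_t, v_l)), 0)  # D(v_t, v_l)
            score += math.log((co + 1) / doc_freq[v_l])
    return score

# Toy corpus: each document is a set of distinct words
docs = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
doc_freq = {w: sum(w in d for d in docs) for w in {"a", "b", "c"}}
co_doc_freq = {frozenset((u, v)): sum(u in d and v in d for d in docs)
               for u in doc_freq for v in doc_freq if u < v}
score = coherence(["a", "b", "c"], doc_freq, co_doc_freq)
```

On this toy corpus the score happens to be exactly 0, since every pair satisfies D(v_t, v_l) + 1 = D(v_l); on real data the score is negative, and less negative means more coherent.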
To evaluate the overall quality of a topic set, we calculated
the average coherence score, namely (1/K) Σ_k C(z_k; V^(z_k)), for
each method. The result is listed in Table 6, where the number
of top words T ranges from 5 to 20. We find the result
agrees with the previous qualitative analysis. BTM receives
the highest coherence score in all settings, and
its superiority is statistically significant (P-value < 0.01 by
t-test). Both LDA-U and mixture of unigrams outperform
LDA slightly, but the differences are not significant.
Table 6: Average coherence score on the top T words
(ordered by P(w|z)) in topics discovered by LDA,
LDA-U, mixture of unigrams, and BTM. A larger
coherence score means the topics are more coherent.
It suggests that BTM outperforms the others significantly
(P-value < 0.01 by t-test).

T          5              10              20
LDA      −55.0 ± 0.4    −236.4 ± 2.0    −1015.7 ± 5.9
LDA-U    −54.2 ± 0.8    −234.8 ± 1.1    −1009.4 ± 4.4
Mix      −53.8 ± 0.1    −233.0 ± 1.4    −1007.6 ± 6.7
BTM      −52.4 ± 0.1    −227.8 ± 0.3     −990.2 ± 3.8
Table 7: Hashtags used for evaluation, not including
the prefix ’#’.
jan25 superbowl sotu wheniwaslittle mobsterworld jobs
agoodboyfriend bieberfact glee lfc rhoa itunes thegame
celebrity tcyasi americanidol cancer socialmedia jerseyshore
photography jp6foot7remix factsaboutboys meatschool
libra android sagittarius thissummer tnfisherman sagawards
ausopen bears weather jaejoongday skins bfgw fashion
pandora realestate teamautism travel nba football marketing
design oscars food dating kindle snow obama
5.1.2 Quality of Topical Representation of Documents
In the Tweets2011 collection, there is no category infor-
mation for tweets. Manual labeling might be difficult due to
the incomplete and informal content of tweets. Fortunately,
some tweets are labeled by their authors with hashtags in
the form of “#keyword”. By investigating the data, we find
three main types of hashtag usage: (a) marking
events or topics; (b) defining the type of content, like
“#ijustsayin” and “#quote”; (c) invoking some specified function,
like “#fb”, which imports the tweet into Facebook
at the same time. In our case, only the first type of hashtag
is useful. Therefore, we manually chose 50 frequent hashtags
of type (a), listed in Table 7.
Since each hashtag in Table 7 denotes a specific topic la-
beled by its author, we organized documents with the same
hashtag into a cluster. The following evaluation is based
on the fact that these clusters should have low intra-cluster
distances and high inter-cluster distances.
Considering topic models as a type of dimension reduction
method, each document can be represented by a vector of
the posterior distribution over topics:

d_i = [p(z_1|d_i), ..., p(z_K|d_i)].   (7)
Then we can measure the distance between two documents by the
Jensen–Shannon divergence:

dis(d_i, d_j) = (1/2) D_KL(d_i || m) + (1/2) D_KL(d_j || m),

where m = (1/2)(d_i + d_j), and D_KL(p||q) = Σ_i p_i ln(p_i / q_i) is the
Kullback–Leibler divergence. Given a set of clusters C =
{C_1, ..., C_K}, we introduce two distance scores:
Average Intra-Cluster Distance:

IntraDis(C) = (1/K) Σ_{k=1}^{K} Σ_{d_i, d_j ∈ C_k, i ≠ j} [ 2 · dis(d_i, d_j) / (|C_k|(|C_k| − 1)) ]
Table 4: Topics selected by the word “job” on the Tweets collection. The first row lists the top 20 words, while the second row lists non-top words ranked from 1001 to 1020 based on P(w|z).
LDA LDA-U Mixture of unigrams BTM
job jobs business web job jobs design manager jobs job business jobs job manager business
website google design online
project web website site marketing social media sales hiring service services
marketing site blog project
business service online web design website project company senior
manager search
company hiring www manager blog project seo engineer management
www company service
support sales services internet sales tips marketing nurse office assistant
sales services post
london blog senior engineer company site hiring center customer development
nonprofit gallery announced expertise unemployed med iii understand rep industrial springfield mlm recruit oil req
presence published converting
host educational fort tags sustainability rankings unemployment processing
select reps requirement mgr
apps assignments labor scholarships stay single campus overview awards recruiters
territory recruiters power
introduction leads github extra cheap 101 vp relationships ict finish entrepreneur comp
involved announce poster
assurance avon manchester beginners colorado compliance assist 1000 alliance locations
larry dynamics feeds bristol
starting automotive table face winning mechanical patent auditor
Table 5: Topics selected by the word “snow” on the Tweets collection. The first row lists the top 20 words, while the second row lists non-top words ranked from 1001 to 1020 based on P(w|z).
LDA LDA-U Mixture of unigrams BTM
snow car weather cold snow weather cold winter snow weather cold storm snow cold weather early
drive storm winter ice
ice storm rain stay winter ice rain warm stay ready ice winter
road bus driving rain
warm due car closed degrees stay sun spring storm hour hours weekend
ride traffic cars safe
coming spring drive traffic safe blizzard coming wind warm late coming spring
closed due warm train
safe sun blizzard city cyclone chicago freezing inches rain tired sun hot
western dmv covering a4 locations sunset drizzle australian thankful station temperature cyclone
push pulling milwaukee
mississippi interstate residents stops groundhogday possibly warmth issued colder
remains pace idiots 95
portland students fireplace cleveland traveling sidewalk mood couch snows pre
commuter buick owner
letting yuck ton counties signal covering predicting ten grass traveling polar outages
cta transmission cyclist
counting blankets pushed meant double affect umbrella filled yawn outage
flurries camping tyre
3pm springfield venture zoo schedule blew causing flurries online gloves speed
Average Inter-Cluster Distance:

InterDis(C) = 1/(K(K − 1)) Σ_{C_k, C_{k′} ∈ C, k ≠ k′} [ Σ_{d_i ∈ C_k} Σ_{d_j ∈ C_{k′}} dis(d_i, d_j) / (|C_k||C_{k′}|) ]
The intuition is that if the average inter-cluster distance
is small compared to the average intra-cluster distance, the
topical representation of documents agrees well with human
labeled clusters (via hashtags). Therefore, we calculate the
following ratio to evaluate the quality of one topical
representation of documents, as in [4, 13]:

H = IntraDis(C) / InterDis(C).

Given a set of different topical representations of documents,
the best one is the one that minimizes the H score.
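These distance measures can be sketched in a few lines of NumPy. This is our own illustrative helper, not the authors' code; we read the intra-cluster sum as running over unordered document pairs (so IntraDis is the mean pairwise distance within a cluster), and every cluster is assumed to hold at least two documents.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # treat 0 * log(0/x) as 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def h_score(clusters):
    """clusters: list of clusters, each a list of topic-distribution vectors.
    Returns (IntraDis, InterDis, H = IntraDis / InterDis)."""
    K = len(clusters)
    # IntraDis: mean pairwise distance within each cluster, averaged over clusters
    intra = 0.0
    for ck in clusters:
        n = len(ck)
        pair_sum = sum(js_divergence(di, dj)
                       for i, di in enumerate(ck) for dj in ck[i + 1:])
        intra += 2 * pair_sum / (n * (n - 1))
    intra /= K
    # InterDis: mean cross-cluster distance, averaged over ordered cluster pairs
    inter, pairs = 0.0, 0
    for a, ca in enumerate(clusters):
        for b, cb in enumerate(clusters):
            if a == b:
                continue
            cross = sum(js_divergence(di, dj) for di in ca for dj in cb)
            inter += cross / (len(ca) * len(cb))
            pairs += 1
    inter /= pairs  # pairs == K * (K - 1)
    return intra, inter, intra / inter
```

For two well-separated clusters of identical documents, IntraDis is 0 and H is 0, matching the intuition that a small H indicates good agreement with the human-labeled clusters.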
Table 8 shows the H score for all the tested methods. From
the results, we can see that BTM performs significantly better
than the other three methods (P-value < 0.001). LDA-U
outperforms LDA slightly, implying that aggregating tweets
for individual users brings a moderate benefit. Although LDA
dominates mixture of unigrams on normal texts, it is somewhat
surprising that mixture of unigrams
outperforms LDA and LDA-U substantially on this short
text collection. This suggests that the data sparsity problem
seriously affects LDA and LDA-U, while it influences mixture
of unigrams and BTM less. However, the H score of mixture
of unigrams is still much worse than that of BTM. With some
further analysis, we find that the average intra-cluster distance of
mixture of unigrams is extremely large, owing to its peaked
posterior distribution P(z|d). In other words, mixture of
Table 8: H score for different methods on the
Tweets2011 collection; a smaller value is better. The
significance levels (P-value by t-test) are denoted as
* (0.1), ** (0.01), and *** (0.001).

Method   H score           Significant differences
LDA      0.576 ± 0.007
LDA-U    0.564 ± 0.011     > LDA*
Mix      0.503 ± 0.008     > LDA-U**, > LDA***
BTM      0.474 ± 0.005     > Mix***, > LDA-U***, > LDA***
unigrams fails to recognize the resemblance of many documents.
From the above results, we find that the improvement of LDA-U
over LDA is not as large as shown in [17]. An explanation
for this difference is that there are fewer tweets posted per
user on average in our dataset than in theirs. Figure 2 shows
the proportions of users who posted a certain number of tweets
in the Tweets2011 collection: we find that 63.3% of users posted
one tweet, and only 2.1% of users posted more than 9 tweets.
Thus it is not strange that aggregating tweets for individual
users has limited effect.
5.2 Evaluation on Question Collection
In order to demonstrate that the effectiveness of our approach
is domain-independent, we evaluated it on another short
text collection, called the Question collection. This collection
includes 648,514 questions crawled from a popular Chinese
Q&A website (http://zhidao.baidu.com). Each question has a category
label assigned by its questioner, making it convenient for
automatic evaluation.

Figure 2: Proportions of users who posted a certain
number of tweets in the Tweets2011 collection
(x-axis: number of tweets posted, from 1 to >9;
y-axis: proportion of users).

For pre-processing, we removed stop words and low-frequency
words (i.e., words with document frequency less than 3). The
final collection contains 189,080 documents, 26,565 distinct
words, and 35 categories. The average length of documents
is 3.94 words. Note that on this collection our baselines do not
include LDA-U, since there are few users who submitted more
than one question.
We performed the evaluation based on document classification.
Considering a topic model as a way of dimensionality
reduction that reduces a document to a fixed set of topical
features P(z|d), we would like to see how accurate and
discriminative the topical representation of documents is for
classification. We randomly split the documents into training
and test subsets with a ratio of 4:1, and classified them
with the linear SVM classifier LIBLINEAR⁷. We report the
accuracy of 5-fold cross-validation in Figure 3.
From the results, we can see that BTM always dominates
the two baselines. Moreover, the advantage of BTM becomes
more notable as the topic number K grows. That is
because when the number of topics is small, the discovered
topics are usually very general. In such cases, a short document
is more likely to belong to a single topic, so the performance
of BTM is close to that of mixture of unigrams. In contrast,
as the topic number K increases, more specific topics are
learned. However, mixture of unigrams is unable to
capture the multiple topics exhibited in a document, so
the difference between BTM and mixture of unigrams
becomes larger. At the same time, a large topic number
aggravates the data sparsity problem of LDA by introducing
more parameters, so the gap between BTM and LDA also
increases. Another important finding is that mixture of
unigrams consistently outperforms LDA, which suggests that LDA
is not a good choice for short texts due to the data sparsity
problem.
One may wonder about the impact of training data size on these
methods. We randomly sampled different proportions of documents,
from 0.2 to 1, to train and test these methods separately.
The results are shown in Figure 4. We can see that
as the size of the training data grows, all the methods
work better. However, both BTM and mixture of unigrams
achieve more improvement than LDA; LDA only gets close
to mixture of unigrams on small training data. This suggests
that increasing the training data will not overcome the data
sparsity problem in LDA, since the documents are still short.
⁷ http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Figure 3: Classification performance of BTM, mix-
ture of unigrams, and LDA on the Question collec-
tion.
Figure 4: Classification performance comparison
with different data proportions on the Questions col-
lection (K=40).
Comparing mixture of unigrams with BTM, we find that BTM
has a stable superiority over mixture of unigrams regardless
of the size of the training data.
5.3 Evaluation on Normal Texts
In previous experiments, we have demonstrated the effectiveness
of BTM on short texts. Although we propose BTM
for the short text scenario, there is no limitation preventing our
model from being applied to normal text collections. Therefore,
it is also interesting to see how effective BTM is on normal
texts. For this purpose, we compared BTM with LDA,
one of the most popular topic models, on a normal text collection.
The experiments were carried out on the 20Newsgroups
collection⁸, a standard corpus including 18,828 messages
harvested from 20 different Usenet newsgroups. Each newsgroup
corresponds to a different topic. Table 9 lists the
names of these newsgroups. For pre-processing, we removed
stop words and words with frequency less than 3, but without
stemming. Finally, 42,697 words are left.
⁸ http://qwone.com/~jason/20Newsgroups/

Figure 5: Clustering performance of BTM with different context range thresholds and LDA on the 20 Newsgroups collection (K = 20).

Table 9: The newsgroup names in the 20 Newsgroups collection.

No. Newsgroup Name              No. Newsgroup Name
1   alt.atheism                 11  rec.sport.hockey
2   comp.graphics               12  sci.crypt
3   comp.os.ms-windows.misc     13  sci.electronics
4   comp.sys.ibm.pc.hardware    14  sci.med
5   comp.sys.mac.hardware       15  sci.space
6   comp.windows.x              16  soc.religion.christian
7   misc.forsale                17  talk.politics.guns
8   rec.autos                   18  talk.politics.mideast
9   rec.motorcycles             19  talk.politics.misc
10  rec.sport.baseball          20  talk.religion.misc

We directly trained LDA on the original documents without
any other processing. Note that in BTM, we need to
extract biterms from the collection. This process is a little
different from that for short texts. Recall that a biterm is
defined as a word pair co-occurring in a short context. It is
not appropriate to view a lengthy document as a single short
context, since it may involve a wide range of topics. In order
to reduce meaningless and noisy biterms, the biterm set
is constructed by extracting, in each document, any two words that
co-occur within a context window whose range is no larger than a
predefined threshold r.
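The biterm construction described above can be sketched as follows. This is a minimal illustration under our reading that two words form a biterm when their positional distance is at most r; the function name is ours, not from the paper.

```python
def extract_biterms(doc_words, r):
    """Return all unordered word pairs (biterms) whose positions in the
    document are at most r apart. For a short text, choosing r >= the
    document length reduces this to all word pairs in the document."""
    biterms = []
    for i, w in enumerate(doc_words):
        for v in doc_words[i + 1:i + r + 1]:  # words within distance r of w
            biterms.append((w, v))
    return biterms

extract_biterms(["a", "b", "c", "d"], 2)
# -> [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

Note how the pair ('a', 'd') is excluded at r = 2; a larger r admits more, but increasingly less relevant, co-occurrence patterns, which matches the trade-off discussed for Figure 5.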
5.3.1 Quantitative Evaluation
For quantitative evaluation, we compare the clustering
performance of BTM and LDA. Document clustering evaluation
is a direct way to measure the effectiveness of a topic
model without depending on any extrinsic methods. For
document clustering, we take each topic as a cluster, and
assign each document d to the topic z with the highest
conditional probability P(z|d). Note that we do not know
the optimal context range threshold r in advance; therefore, we
tested different values of it and report the results together.
We adopt three standard metrics in clustering evaluation,
as follows. Let Ω = {ω_1, ..., ω_K} be the set of output clusters,
and C = {c_1, ..., c_P} be the P labeled classes of the documents.

Purity. Purity assumes that the documents in each cluster
should take the dominant class in that cluster. It is the
accuracy of this assignment, measured by counting the
number of correctly assigned documents and dividing
by the total number of test documents. Formally:

purity(Ω, C) = (1/n) Σ_{i=1}^{K} max_j |ω_i ∩ c_j|.

Note that when all the documents in each cluster have
the same class, purity reaches its highest value of 1.
Conversely, it is close to 0 for a bad clustering.
Normalized Mutual Information (NMI). Let I(Ω; C) denote
the mutual information between the two partitions
Ω and C. NMI normalizes I(Ω; C) by the entropies
H(Ω) and H(C) to avoid biasing the value toward a large
number of clusters. Formally:

NMI(Ω, C) = I(Ω; C) / ([H(Ω) + H(C)] / 2)
= [ Σ_{i,j} (|ω_i ∩ c_j| / n) log( |ω_i||c_j| / (n|ω_i ∩ c_j|) ) ] / ( [ Σ_i (|ω_i| / n) log(|ω_i| / n) + Σ_j (|c_j| / n) log(|c_j| / n) ] / 2 )

Note that NMI is 1 for a perfect match between Ω and
C, and 0 if the clustering is random with respect to
class membership.
Adjusted Rand Index (ARI) [18]. Consider document
clustering as a series of pair-wise decisions. If two documents
are both in the same class and the same cluster, or
both in different classes and different clusters, the decision
is considered correct; otherwise it is false. The Rand index
measures the percentage of decisions that are correct.
The Adjusted Rand index is the corrected-for-chance
version of the Rand index: its expected value is 0, while
its maximum value is also 1 for an exact match. Writing
C(a, 2) = a(a − 1)/2 for the number of pairs:

ARI = [ Σ_{i,j} C(|ω_i ∩ c_j|, 2) − ( Σ_i C(|ω_i|, 2) · Σ_j C(|c_j|, 2) ) / C(n, 2) ] / ( (1/2) [ Σ_i C(|ω_i|, 2) + Σ_j C(|c_j|, 2) ] − ( Σ_i C(|ω_i|, 2) · Σ_j C(|c_j|, 2) ) / C(n, 2) )
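All three metrics can be computed directly from parallel sequences of cluster ids and class labels. The sketch below is our own compact helper (assuming Python 3.8+ for math.comb), not the evaluation code used in the paper.

```python
from collections import Counter
from math import comb, log

def clustering_metrics(clusters, classes):
    """Return (purity, NMI, ARI) for parallel lists of cluster ids and class labels."""
    n = len(clusters)
    cont = Counter(zip(clusters, classes))      # contingency counts |w_i ∩ c_j|
    w, c = Counter(clusters), Counter(classes)  # cluster and class sizes
    # Purity: each cluster votes for its dominant class
    best = Counter()
    for (wi, cj), cnt in cont.items():
        best[wi] = max(best[wi], cnt)
    purity = sum(best.values()) / n
    # NMI: mutual information normalized by the mean of the two entropies
    mi = sum(cnt / n * log(n * cnt / (w[wi] * c[cj])) for (wi, cj), cnt in cont.items())
    h_w = -sum(v / n * log(v / n) for v in w.values())
    h_c = -sum(v / n * log(v / n) for v in c.values())
    nmi = mi / ((h_w + h_c) / 2)
    # ARI: Rand index corrected for chance, via pair counts
    sum_ij = sum(comb(cnt, 2) for cnt in cont.values())
    sum_w = sum(comb(v, 2) for v in w.values())
    sum_c = sum(comb(v, 2) for v in c.values())
    expected = sum_w * sum_c / comb(n, 2)
    ari = (sum_ij - expected) / ((sum_w + sum_c) / 2 - expected)
    return purity, nmi, ari
```

For a clustering that exactly recovers the labeled classes, all three metrics evaluate to 1.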
The results are shown in Figure 5. On the whole, it is
clear that BTM outperforms LDA significantly when the
context range threshold r is between 30 and 60, suggesting
that BTM also performs very well on normal texts. In particular,
we find that when r = 10, LDA works better than BTM,
implying that the context information utilized by BTM is
not enough. As the context range threshold r increases,
more word co-occurrence patterns are included, which improves
the performance of BTM substantially. However, the
improvement slows down when the context range threshold
r increases from 30 to 60. An explanation for this behavior is
that as the distance between two words increases, they
might be less relevant; in that case, the assumption that
the two words in a biterm share the same topic becomes less
credible. Moreover, a larger context range threshold r will
generate many more biterms, which increases the training
cost. Therefore, for both effectiveness and efficiency, the
context range threshold r should be neither too small nor
too large for normal texts in practice.
Table 10: Topics discovered from the 20 Newsgroups collection by BTM and LDA (K=20). “sim” in the last column denotes the cosine similarity of the two topics in a row.

No.  BTM top words, then LDA top words                                                                                    sim
1    ax max g9v b8f a86 1d9 pl 145 3t giz ax max b8f g9v a86 145 1d9 pl 0t 3t                                             0.99
2    god jesus christ church bible people lord christian god jesus bible christian church christ christians paul          0.95
3    key encryption chip clipper keys government system key encryption chip clipper government keys public               0.95
4    window server display widget set application xterm file window server set application sun display problem manager   0.93
5    space earth launch mission orbit shuttle system solar space earth nasa gov time system mission launch               0.91
6    writes article don ca david uk wrote cs org writes article university uk ca cs michael mail brian                   0.90
7    ax 0d cx 145 ah 34u w7 mv scx uw 0d cx ah w7 mv sp 17 uw scx air                                                    0.86
8    people don fbi fire children koresh gun batf people writes gun fbi fire children article koresh                     0.83
9    people don god writes make good point question people writes true don religion evidence question god               0.82
10   people government president don make time american president government people state states rights american        0.80
11   disease medical people patients don time writes good medical health disease drug study drugs men cancer            0.79
12   drive scsi mac bit card apple system monitor problem windows drive dos card mac system apple scsi disk             0.75
13   image jpeg file graphics images files color data format file image program files bit jpeg color output line        0.74
14   mail university information fax internet list email graphics ftp software data mail pub computer                   0.62
15   car don writes cars good ve engine time car cars armenian armenians engine muslims turkish 000                     0.62
16   00 year team 10 game 55 play players games 20 writes year play game good ca insurance scott team games             0.61
17   1993 health men number 10 hiv april study homosexual 10 1993 20 15 00 12 93 11 30                                  0.54
18   windows dos file system files run don os pc program don people ve time good ll make things thing doesn             0.25
19   armenian armenians people war muslims turkish information group list book post questions read subject              0.15
20   file entry output program build line printf char info writes price buy sale problem cost power good interested     0.03
5.3.2 Qualitative Evaluation
Here we study the quality of the topics discovered by the two
topic models. In practice, a topic model that finds topics
with good readability which accurately reflect the topical
structure of the data is preferred. Table 10 presents all the topics
learned by BTM and LDA when the number of topics is set
to 20. The topics from the two methods are matched based
on cosine similarity using a greedy algorithm. For each topic we
list its top words ordered by P(w|z). We can see that
topics 1-16 in BTM and LDA are very similar. Comparing
Table 9 and Table 10, we find it is easy to identify the
corresponding newsgroup of a topic among topics 1-16, except
for topic 1 and topic 7. For example, topic 2 corresponds to
the newsgroup “soc.religion.christian”. This suggests that both
BTM and LDA closely uncover the inherent topical structure of the
collection.
We also note that topics 17-20 in Table 10 are very different
between BTM and LDA. In BTM, we can still identify that
topics 17-20 relate to the newsgroups “sci.med”,
“comp.os.ms-windows.misc”, “talk.politics.mideast”, and
“comp.os.ms-windows.misc”, respectively. But in LDA, topic 17 is about numerals,
topic 18 is a set of common words, while topics 19 and
20 have poor interpretability. In our view, the differences
between the results of the two models are caused by
the following reasons. BTM explicitly models the word
co-occurrences in local contexts, so it captures the short-range
dependencies between words well. Conversely, LDA captures the
long-range dependencies in documents [11], which are less
specific than short-range ones, making the last four topics
more generic but less readable.
6. CONCLUSION & FUTURE WORK
Topic modeling for short texts is an increasingly important
task due to the prevalence of short texts on the Web.
Compared to normal documents, short texts lack word
frequency and context information, causing severe sparsity
problems for conventional topic models. In this paper, we
propose a novel probabilistic topic model for short texts,
namely the biterm topic model (BTM). BTM can capture
the topics within short texts well, as it explicitly models the word
co-occurrence patterns and uses the aggregated patterns in
the whole corpus. We carried out experiments on two real-world
short text collections and one normal text collection.
The results demonstrated that BTM not only learns
higher quality topics, but also captures the topics of documents
more accurately than previous methods. Besides, BTM
is simple, easy to implement, and scales up well. All
these benefits make BTM a practical choice for content
analysis on short texts in a wide range of applications.
To the best of our knowledge, we are the first to propose
a topic model for general short texts. However, there is still
room to improve our work in the future. For example, we
would like to find a more sophisticated way to estimate the
distribution P(b|d), which is uniform in the current work
for simplicity. Moreover, it is also interesting to explore
the usage of our model in various real-world applications,
like content recommendation, event tracking, and short text
retrieval.
7. ACKNOWLEDGEMENTS
This work is funded by the National Natural Science Foundation
of China under Grants No. 61202213, 61203298,
60933005, 61173008, and 61003166, and by the 973 Program of
China under Grant No. 2012CB316303. We would like to
thank the anonymous reviewers for their helpful comments.
8. REFERENCES
[1] A. Asuncion, M. Welling, P. Smyth, and Y. Teh. On
smoothing and inference for topic models. In In
Proceedings of the 25th Conference on UAI, 2009.
[2] D. Blei and J. McAuliffe. Supervised topic models. In
J. Platt, D. Koller, Y. Singer, and S. Roweis, editors,
Advances in Neural Information Processing Systems
20, pages 121–128. MIT Press, Cambridge, MA, 2008.
[3] D. Blei, A. Ng, and M. Jordan. Latent dirichlet
allocation. The Journal of Machine Learning
Research, 3:993–1022, 2003.
[4] I. Bordino, C. Castillo, D. Donato, and A. Gionis.
Query similarity by projecting the query-flow graph.
In SIGIR, pages 515–522. ACM, 2010.
[5] J. Boyd-Graber and D. M. Blei. Syntactic topic
models. Technical Report arXiv:1002.4665, Feb 2010.
[6] J. Boyd-Graber, J. Chang, S. Gerrish, C. Wang, and
D. Blei. Reading tea leaves: How humans interpret
topic models. In NIPS, 2009.
[7] D. Cai, Q. Mei, J. Han, and C. Zhai. Modeling hidden
topics on document manifold. In Proceedings of the
17th ACM conference on Information and knowledge
management, pages 911–920. ACM, 2008.
[8] J. Chen, R. Nairn, L. Nelson, M. Bernstein, and
E. Chi. Short and tweet: experiments on
recommending content from information streams. In
Proceedings of the 28th international conference on
Human factors in computing systems, pages
1185–1194. ACM, 2010.
[9] S. Deerwester, S. Dumais, G. Furnas, T. Landauer,
and R. Harshman. Indexing by latent semantic
analysis. Journal of the American society for
information science, 41(6):391–407, 1990.
[10] T. Griffiths and M. Steyvers. Finding scientific topics.
Proceedings of the National Academy of Sciences of
the United States of America, 101(Suppl 1):5228–5235,
2004.
[11] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum.
Integrating topics and syntax. NIPS, 17:537–544, 2005.
[12] A. Gruber, M. Rosen-Zvi, and Y. Weiss. Hidden topic
markov models. Artificial Intelligence and Statistics
(AISTATS), 2007.
[13] J. Guo, X. Cheng, G. Xu, and X. Zhu. Intent-aware
query similarity. In Proceedings of the 20th ACM
international conference on Information and
knowledge management, pages 259–268. ACM, 2011.
[14] J. Guo, G. Xu, X. Cheng, and H. Li. Named entity
recognition in query. In SIGIR, pages 267–274. ACM,
2009.
[15] G. Heinrich. Parameter estimation for text analysis.
Technical report, 2005.
[16] T. Hofmann. Probabilistic latent semantic indexing. In
SIGIR, pages 50–57. ACM, 1999.
[17] L. Hong and B. Davison. Empirical study of topic
modeling in twitter. In Proceedings of the First
Workshop on Social Media Analytics, pages 80–88.
ACM, 2010.
[18] L. Hubert and P. Arabie. Comparing partitions.
Journal of classification, 2(1):193–218, 1985.
[19] O. Jin, N. Liu, K. Zhao, Y. Yu, and Q. Yang.
Transferring topical knowledge from auxiliary long
texts for short text clustering. In Proceedings of the
20th ACM international conference on Information
and knowledge management, pages 775–784. ACM,
2011.
[20] C. X. Lin, B. Zhao, Q. Mei, and J. Han. Pet: a
statistical model for popular events tracking in social
communities. In Proceedings of the 16th ACM
SIGKDD, pages 929–938. ACM, 2010.
[21] D. Mimno, H. Wallach, E. Talley, M. Leenders, and
A. McCallum. Optimizing semantic coherence in topic
models. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pages
262–272. Association for Computational Linguistics,
2011.
[22] D. Newman, E. V. Bonilla, and W. Buntine.
Improving topic coherence with regularized topic
models. In Advances in Neural Information Processing
Systems 24, pages 496–504. 2011.
[23] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell.
Text classification from labeled and unlabeled
documents using em. Machine learning, 39(2):103–134,
2000.
[24] X. Phan, L. Nguyen, and S. Horiguchi. Learning to
classify short and sparse text & web with hidden
topics from large-scale data collections. In Proceedings
of the 17th international conference on World Wide
Web, pages 91–100. ACM, 2008.
[25] O. Phelan, K. McCarthy, and B. Smyth. Using twitter
to recommend real-time topical news. In Proceedings of
the third ACM conference on Recommender systems,
pages 385–388, New York, NY, USA, 2009. ACM.
[26] D. Ramage, S. Dumais, and D. Liebling.
Characterizing microblogs with topic models. In
International AAAI Conference on Weblogs and Social
Media, volume 5, pages 130–137, 2010.
[27] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and
P. Smyth. The author-topic model for authors and
documents. In UAI, 2004.
[28] M. Sahami and T. Heilman. A web-based kernel
function for measuring the similarity of short text
snippets. In Proceedings of the 15th international
conference on World Wide Web, pages 377–386. ACM,
2006.
[29] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei.
Hierarchical dirichlet processes. Journal of the
American Statistical Association, 101, 2004.
[30] X. Wang and A. McCallum. Topics over time: a
non-markov continuous-time model of topical trends.
In Proceedings of the 12th ACM SIGKDD, pages
424–433, New York, NY, USA, 2006. ACM.
[31] Y. Wang, E. Agichtein, and M. Benzi. Tm-lda:
efficient online modeling of latent topic transitions in
social media. In Proceedings of the 18th ACM
SIGKDD, pages 123–131, New York, NY, USA, 2012.
ACM.
[32] J. Weng, E. Lim, J. Jiang, and Q. He. Twitterrank:
finding topic-sensitive influential twitterers. In
Proceedings of the third ACM international conference
on Web search and data mining, pages 261–270. ACM,
2010.
[33] X. Yan, J. Guo, S. Liu, X. Cheng, and Y. Wang.
Learning topics in short texts by non-negative matrix
factorization on term correlation matrix. In
Proceedings of the SIAM International Conference on
Data Mining. SIAM, 2013.
[34] X. Yan, J. Guo, S. Liu, X.-q. Cheng, and Y. Wang.
Clustering short text using ncut-weighted
non-negative matrix factorization. In Proceedings of
the 20th ACM international conference on
Information and knowledge management, pages
2259–2262, New York, NY, USA, 2012. ACM.
[35] W. Zhao, J. Jiang, J. Weng, J. He, E. Lim, H. Yan,
and X. Li. Comparing twitter and traditional media
using topic models. Advances in Information
Retrieval, pages 338–349, 2011.
... A quantitative study employing convenience sampling through a selfadministered questionnaire was conducted in 2020. Data were collected at various venues during the festival period (6)(7)(8)(9)(10)(11)(12)(13)(14), and 482 questionnaires were included in the analysis. Exploratory Factor Analysis (EFA) and Two-step Cluster Analysis [Schwarz's Bayesian Information Criterion] were applied to the data. ...
... The data will go through pre-processing and sentence segmentation, and first, Topic-modeling using LDA and BTM will be performed to analyze the tourist destination image. LDA (Latent Dirichlet Allocation) is a model for estimating the topic of a given document based on the word distribution for the topic [5], and BTM (Biterm Topic Modeling) is based on the joint probability distribution between words, This is a model for estimating the topic of the document [6]. Through this, the overall image of tourist destinations viewed by domestic and foreign tourists will be identified and the differences will be analyzed. ...
Presentation
Full-text available
Responding to Richter’s call (1983, 317) “to know far more about the tourism policy making process”, this paper seeks to introduce the assumptions, concepts and research methodologies of the Narrative Policy Framework (Jones & McBeth, 2010; Jones et al., 2014, McBeth et al., 2014; Shanahan et al., 2018) and apply them in tourism research in order to investigate the black box of tourism planning and policy processes (Hall, 2008, 15).
... However, our method contributes two technical novelties compared to traditional topic modeling methods. First, we take advantage of strong pre-trained contextualized embeddings to analyze the texts which provide richer information than word co-occurrence or uncontextualized word embeddings, such as those used in LDA (Blei, Ng, and Jordan 2003), biterm (Yan et al. 2013), pLSA (Hofmann 1999), or ETM (Dieng, Ruiz, and Blei 2020). This rich information is particularly helpful for modeling short text datasets such as tweets. ...
... In this experiment, to demonstrate our approach's effectiveness on this problem, we also compare our method with other topic-modeling methods sharing similar properties to ours such as dealing with short texts, using neural embeddings features. To be specific, we considered the following methods: (1) LDA, (2) biterm for short text modeling (Yan et al. 2013) -a method that models the word-pair cooccurrence patterns in the whole corpus aiming to solve the problem of sparse word co-occurrence at document-level, (3) Contextualized Topic Models (CTM) ) -a method adding BERT embeddings to improve Neural Topic Model, and (4) baseline -taking the mode of the training set as predictions. We used 10-folds cross-validation to evaluate each method, where we trained the model on 9 training folds and tested on The CTM and biterm algorithm were implemented using contextualized-topic-models and biterm Python library respectively, which were developed by the original papers' authors. ...
Article
How we perceive our surrounding world impacts how we live in and react to it. In this study, we propose LaBel (Latent Beliefs Model), an alternative to topic modeling that uncovers latent semantic dimensions from transformer-based embeddings and enables their representation as generated phrases rather than word lists. We use LaBel to explore the major beliefs that humans have about the world and other prevalent domains, such as education or parenting. Although human beliefs have been explored in previous works, our proposed model helps automate the exploring process to rely less on human experts, saving time and manual efforts, especially when working with large corpus data. Our approach to LaBel uses a novel modification of autoregressive transformers to effectively generate texts conditioning on a vector input format. Differently from topic modeling methods, our generated texts (e.g. “the world is truly in your favor”) are discourse segments rather than word lists, which helps convey semantics in a more natural manner with full context. We evaluate LaBel dimensions using both an intrusion task as well as a classification task of identifying categories of major beliefs in tweets finding greater accuracies than popular topic modeling approaches.
... The 20 Newsgroups dataset, for example, consists of 15,465 documents and 4,159 words [116]. Tweets have also been used for topic modelling tasks [123][124][125]. Jonsson et al. [123], for example, collected tweets from Twitter to prepare a dataset of 129,530 tweets and used LDA [47], the Biterm Topic Model (BTM) [124] and a variation of the LDA algorithm for topic modelling to compare their performance. ...
... In the case of Twitter-based topic modelling datasets, a tweet is considered a document, though Jonsson et al. [123] aggregate documents to form pseudo-documents and found that this resolves the poor performance of LDA on shorter documents. ...
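The aggregation strategy Jonsson et al. describe can be sketched in a few lines; the function name and the author-keyed input format below are illustrative assumptions, not the exact scheme from their paper.

```python
from collections import defaultdict

def aggregate_by_author(tweets):
    """Merge each author's tweets into one pseudo-document.

    `tweets` is an iterable of (author, text) pairs; the result maps each
    author to a single space-joined pseudo-document, giving LDA longer
    documents and denser word co-occurrence statistics to learn from.
    """
    grouped = defaultdict(list)
    for author, text in tweets:
        grouped[author].append(text)
    return {author: " ".join(texts) for author, texts in grouped.items()}

tweets = [
    ("alice", "lda struggles on short texts"),
    ("bob", "topic models are fun"),
    ("alice", "aggregation into pseudo-documents helps"),
]
docs = aggregate_by_author(tweets)
```

Any grouping key with enough text per group (author, hashtag, conversation) works; the point is simply that the resulting pseudo-documents are long enough for document-level co-occurrence to be informative.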
Preprint
A distinct feature of Hindu religious and philosophical texts is that they come from a library of texts rather than a single source. The Upanishads are known as one of the oldest philosophical texts in the world and form the foundation of Hindu philosophy. The Bhagavad Gita is a core text of Hindu philosophy and is known as a text that summarises the key philosophies of the Upanishads, with a major focus on the philosophy of karma. These texts have been translated into many languages and there exist studies about their prominent themes and topics; however, there is little study of topic modelling using language models powered by deep learning. In this paper, we use advanced language models such as BERT to provide topic modelling of the key texts of the Upanishads and the Bhagavad Gita. We analyse the distinct and overlapping topics among the texts and visualise the links between selected texts of the Upanishads and the Bhagavad Gita. Our results show a very high similarity between the topics of these two texts, with a mean cosine similarity of 73%. We find that out of the fourteen topics extracted from the Bhagavad Gita, nine have a cosine similarity of more than 70% with the topics of the Upanishads. We also found that topics generated by the BERT-based models show very high coherence compared to those of conventional models. Our best performing model gives a coherence score of 73% on the Bhagavad Gita and 69% on the Upanishads. The visualisation of the low-dimensional embeddings of these texts shows very clear overlap among their topics, adding another level of validation to our results.
... From a data protection perspective, the data controller should be able to provide valid reasoning for why a specific approach was chosen from the available options (e.g. Latent Dirichlet Allocation (LDA) (Blei et al., 2003), Biterm Topic Model (BTM) (Yan et al., 2013), Latent Semantic Analysis (LSA) (Deerwester et al., 1990), Non-negative Matrix Factorization (NMF), Parallel Latent Dirichlet Allocation (PLDA), Pachinko Allocation Model (PAM)). Similarly, there are many different approaches for extraction and classification (e.g. ...
Preprint
Sentiment analysis has always been an important driver of political decisions and campaigns across all fields. Novel technologies allow automating the analysis of sentiments at a large scale and hence allegedly provide more accurate outcomes. With user numbers in the billions and their increasingly important role in societal discussions, social media platforms have become an obvious data source for these types of analysis. Due to its public availability, the relative ease of access and the sheer amount of available data, the Twitter API has become a particularly important source for researchers and data analysts alike. Despite the evident value of these data sources, the analysis of such data comes with legal, ethical and societal risks that should be taken into consideration when analysing data from Twitter. This paper describes these risks along the technical processing pipeline and proposes related mitigation measures.
... Several researchers have proposed strategies to explore and analyze text collections. At present, these techniques range from simple methodologies such as frequency counts [21] to more complex Machine Learning (ML) based algorithms [16,25,26]. In particular, Topic Modelling (TM) based strategies have emerged as an impressive paradigm to automatically process the semantic characteristics of large textual databases. ...
Article
Education quality has become an important issue and has received considerable attention around the world, especially due to its relevant repercussions on the socio-economical development of society. In recent years, many nations have realized the need for a highly skilled workforce to thrive in the emerging knowledge-based economy. They have consequently adopted strategies to identify the lines of action to improve the education quality. In response to the government’s efforts to improve the education quality in Colombia, this study examines the current perceptions of the education system from the perspective of key local stakeholders. Therefore, we used a survey that contained open-ended questions to collect information about the limitations and difficulties of the education process for several groups of participants. The collected answers were categorized into a variety of topics using a Latent Dirichlet Allocation based model. Consequently, the students’, teachers’ and parents’ answers were analyzed separately to obtain a general landscape of the perceptions of the education system. Evaluation metrics, such as topic coherence, were quantitatively analyzed to assess the modelling performance. In addition, a methodology for the hyper-parameters setting and the final topic labelling was presented. The results suggest that topic modelling strategies are a viable alternative to identify strategic lines of action and to obtain a macro-perspective of the perceptions of the education system.
... More precisely, HTMM expects all words in a sentence to share the same topic, and consecutive sentences to be more likely to have the same topic. The Biterm Topic Model (BTM) [43] addresses a shortcoming of LDA, its poor performance on short texts, by modeling the generation of word co-occurrence patterns. By taking advantage of co-occurrence at the corpus level, BTM solves the sparsity problem at the document level, which makes it suitable for short texts. ...
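The co-occurrence patterns BTM models are unordered word pairs (biterms) drawn from each short document. A minimal extraction sketch follows; the function name is illustrative, and each whole short text is treated as one context window, as BTM does for short documents.

```python
from itertools import combinations

def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short document.

    BTM pools the biterms from all documents and models their generation
    corpus-wide, which sidesteps the per-document sparsity that hurts LDA.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]

biterms = extract_biterms(["apple", "iphone", "screen"])
```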
Preprint
Networks of documents connected by hyperlinks, such as Wikipedia, are ubiquitous. Hyperlinks are inserted by the authors to enrich the text and facilitate navigation through the network. However, authors tend to insert only a fraction of the relevant hyperlinks, mainly because this is a time-consuming task. In this paper we address an annotation task, which we refer to as anchor prediction. Even though it is conceptually close to link prediction or entity linking, it is a different task that requires developing a specific method to solve it. Given a source document and a target document, this task consists in automatically identifying anchors in the source document, i.e. words or terms that should carry a hyperlink pointing towards the target document. We propose a contextualized relational topic model, CRTM, that models directed links between documents as a function of the local context of the anchor in the source document and the whole content of the target document. The model can be used to predict anchors in a source document, given the target document, without relying on a dictionary of previously seen mentions or titles, nor on any external knowledge graph. Authors can benefit from CRTM by letting it automatically suggest hyperlinks, given a new document and the set of target documents to connect to. It can also benefit readers, by dynamically inserting hyperlinks between the documents they are reading. Experiments conducted on several Wikipedia corpora (in English, Italian and German) highlight the practical usefulness of anchor prediction and demonstrate the relevance of our approach.
Article
Within the product innovation process, companies are required to design their products according to diverse external influence dimensions from the product environment. By analysing these influence dimensions, companies gain insight into the urgency or possibility of innovating their product accordingly. As the majority of data available in the business context is text data, there is a need for a method that enables companies to evaluate the data relevant to product innovation accordingly. As the manual evaluation of this data is not feasible due to the high data volume, especially for small and medium-sized companies, a concept for an automated evaluation method is required. This concept uses approaches from the field of text mining and applies them to innovation management in order to gain insight from diverse texts about innovation potentials. The concept includes the definition of a suitable preprocessing step, a topic modeling approach for this use case and a collection of options for topic exploration. The preprocessing step defines how commonly occurring text documents can be converted into a format that is manageable for the text mining approach. The topic modeling is based on a constrained TF-IDF-weighted Latent Dirichlet Allocation to identify preferably new and unique topics within the considered dataset. Afterwards, the identified topics can be explored in different ways in order to gain better insight into the product environment. The validation of this method on three sample datasets tests its limitations but also indicates the potential of automated text analysis for innovation management.
Article
As one of the prevalent topic mining methods, neural topic modeling has attracted a lot of interest due to its low training costs and strong generalisation abilities. However, existing neural topic models may suffer from the feature sparsity problem when applied to short texts, due to the lack of context in each message. To alleviate this issue, we propose a Context Reinforced Neural Topic Model (CRNTM), whose characteristics can be summarized as follows. First, by assuming that each short text covers only a few salient topics, the proposed CRNTM infers the topic for each word in a narrow range. Second, our model exploits pre-trained word embeddings by treating topics as multivariate Gaussian distributions or Gaussian mixture distributions in the embedding space. Extensive experiments on two benchmark short-text corpora validate the effectiveness of the proposed model on both topic discovery and text classification.
Article
Following the rapid spread of COVID-19 to all the world, most countries decided to temporarily close their educational institutions. Consequently, distance education opportunities have been created for education continuity. The abrupt change presented educational challenges and issues. The aim of this study is to investigate the content of Twitter posts to detect the arising topics regarding the challenges of distance education. We focus on students in Saudi Arabia to identify the problems they faced in their distance education experience. We developed a workflow that integrates unsupervised and supervised machine learning techniques in two phases. An unsupervised topic modeling algorithm was applied on a subset of tweets to detect underlying latent themes related to distance education issues. Then, a multi-class supervised machine learning classification technique was carried out in two levels to classify the tweets under discussion to categories and further to sub-categories. We found that 76,737 tweets revealed five underlying themes: educational issues, social issues, technological issues, health issues, and attitude and ethical issues. This study presents an automated methodology that identifies underlying themes in Twitter content with a minimum human involvement. The results of this work suggest that the proposed model could be utilized for collecting and analyzing social media data to provide insights into students’ educational experience.
Conference Paper
Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.
Article
Latent topic analysis has emerged as one of the most effective methods for classifying, clustering and retrieving textual data. However, existing models such as Latent Dirichlet Allocation (LDA) were developed for static corpora of relatively large documents. In contrast, much of the textual content on the web, and especially social media, is temporally sequenced, and comes in short fragments, including microblog posts on sites such as Twitter and Weibo, status updates on social networking sites such as Facebook and LinkedIn, or comments on content sharing sites such as YouTube. In this paper we propose a novel topic model, Temporal-LDA or TM-LDA, for efficiently mining text streams such as a sequence of posts from the same author, by modeling the topic transitions that naturally arise in these data. TM-LDA learns the transition parameters among topics by minimizing the prediction error on topic distribution in subsequent postings. After training, TM-LDA is thus able to accurately predict the expected topic distribution in future posts. To make these predictions more efficient for a realistic online setting, we develop an efficient updating algorithm to adjust the topic transition parameters, as new documents stream in. Our empirical results, over a corpus of over 30 million microblog posts, show that TM-LDA significantly outperforms state-of-the-art static LDA models for estimating the topic distribution of new documents over time. We also demonstrate that TM-LDA is able to highlight interesting variations of common topic transitions, such as the differences in the work-life rhythm of cities, and factors associated with area-specific problems and complaints.
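TM-LDA's prediction step reduces to multiplying the current post's topic distribution by the learned topic transition matrix. The sketch below uses made-up numbers for a two-topic model; in the actual method the transition parameters are learned by minimizing prediction error, as the abstract describes.

```python
def predict_next_topics(current, transition):
    """Predict the next post's topic distribution.

    `transition[i][j]` is the learned probability of moving from topic i
    to topic j between consecutive posts; the prediction is the
    matrix-vector product current * transition.
    """
    k = len(transition)
    return [sum(current[i] * transition[i][j] for i in range(k))
            for j in range(k)]

transition = [[0.8, 0.2],   # topic 0 mostly persists
              [0.3, 0.7]]   # topic 1 persists slightly less
nxt = predict_next_topics([1.0, 0.0], transition)
```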
Article
We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.
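The multi-author mixture described above can be illustrated with a toy computation: marginally, a multi-author document's topic distribution is the uniform average of its authors' topic distributions (the author labels and numbers below are made up).

```python
def doc_topic_mixture(authors, author_topics):
    """Mix the topic distributions of a document's authors uniformly.

    In the author-topic model each word picks one of the authors uniformly
    and then a topic from that author's distribution, so the document's
    marginal topic distribution is this uniform average.
    """
    k = len(next(iter(author_topics.values())))
    return [sum(author_topics[a][j] for a in authors) / len(authors)
            for j in range(k)]

author_topics = {"a1": [0.9, 0.1], "a2": [0.5, 0.5]}
mix = doc_topic_mixture(["a1", "a2"], author_topics)
```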
Article
Social networks such as Facebook, LinkedIn, and Twitter have been a crucial source of information for a wide spectrum of users. In Twitter, popular information that is deemed important by the community propagates through the network. Studying the characteristics of content in the messages becomes important for a number of tasks, such as breaking news detection, personalized message recommendation, friends recommendation, sentiment analysis and others. While many researchers wish to use standard text mining tools to understand messages on Twitter, the restricted length of those messages prevents them from being employed to their full potential. We address the problem of using standard topic models in micro-blogging environments by studying how the models can be trained on the dataset. We propose several schemes to train a standard topic model and compare their quality and effectiveness through a set of carefully designed experiments from both qualitative and quantitative perspectives. We show that by training a topic model on aggregated messages we can obtain a higher quality of learned model which results in significantly better performance in two real-world classification problems. We also discuss how the state-of-the-art Author-Topic model fails to model hierarchical relationships between entities in Social Media.
Conference Paper
This paper presents an LDA-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. Unlike other recent work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. Thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. We present results on nine months of personal email, 17 years of NIPS research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends.
Conference Paper
Non-negative matrix factorization (NMF) has been successfully applied in document clustering. However, experiments on short texts, such as microblogs, Q&A documents and news titles, suggest unsatisfactory performance of NMF. A major reason is that traditional term weighting schemes, like binary weight and tf-idf, cannot adequately capture terms' discriminative power and importance in short texts, due to the sparsity of the data. To tackle this problem, we propose a novel term weighting scheme for NMF, derived from the Normalized Cut (Ncut) problem on the term affinity graph. Different from idf, which emphasizes discriminability at the document level, the Ncut weighting measures terms' discriminability at the term level. Experiments on two data sets show our weighting scheme significantly boosts NMF's performance on short text clustering.
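For contrast with the proposed term-level Ncut weighting, the document-level idf baseline the abstract criticizes can be sketched as follows; the Ncut weights themselves require the term affinity graph and are not reproduced here.

```python
import math

def idf_weights(docs):
    """Document-level inverse document frequency.

    Terms that appear in few documents get high weight; with very short
    documents nearly every term is rare, which is why this document-level
    signal becomes unreliable on sparse short-text collections.
    """
    n = len(docs)
    vocab = {t for d in docs for t in d}
    return {t: math.log(n / sum(1 for d in docs if t in d)) for t in vocab}

docs = [["cat", "dog"], ["cat", "bird"], ["fish"]]
w = idf_weights(docs)
```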
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
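The three-level generative process described above can be sketched directly. The toy topics and hyperparameters below are made up for illustration, and the Dirichlet draw uses the standard normalized-Gamma construction.

```python
import random

def sample_document(alpha, topics, length, rng=None):
    """Draw one document from the LDA generative process.

    Sample a topic mixture theta ~ Dirichlet(alpha), then for each word
    position sample a topic z ~ theta and a word w ~ topics[z].
    """
    rng = rng or random.Random(0)
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]          # document-topic mixture
    doc = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        words, probs = zip(*topics[z].items())
        doc.append(rng.choices(words, weights=probs)[0])
    return doc

topics = {0: {"gene": 0.7, "dna": 0.3}, 1: {"ball": 0.6, "game": 0.4}}
doc = sample_document(alpha=[0.5, 0.5], topics=topics, length=5)
```

Inference (the variational EM of the paper, or Gibbs sampling) inverts this process: given only the documents, it recovers theta and the topic-word distributions.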