Experiments with Nonparametric Topic Models
Wray Buntine
Monash University
Clayton, VIC, Australia
wray.buntine@monash.edu
Swapnil Mishra∗
RSISE, The Australian National University
Canberra, ACT, Australia
swapnil.mishra@anu.edu.au
ABSTRACT
In topic modelling, various alternative priors have been developed, for instance asymmetric and symmetric priors for the document-topic and topic-word matrices respectively, the hierarchical Dirichlet process prior for the document-topic matrix and the hierarchical Pitman-Yor process prior for the topic-word matrix. For information retrieval, language models exhibiting word burstiness are important. Indeed, this burstiness effect has been shown to help topic models as well, and this requires additional word probability vectors for each document. Here we show how to combine these ideas to develop high-performing nonparametric topic models exhibiting burstiness based on standard Gibbs sampling. Experiments are done to explore the behavior of the models under different conditions and to compare the algorithms with previously published ones. The full nonparametric topic models with burstiness are only a small factor slower than standard Gibbs sampling for LDA and require double the memory, making them very competitive. We look at the comparative behaviour of different models and present some experimental insights.
Categories and Subject Descriptors
I.7 [Document and Text Processing]: Miscellaneous;
I.2.6 [Artificial Intelligence]: Learning
Keywords
topic modelling; experimental results; nonparametric prior;
text
1. INTRODUCTION
Topic models are now a recognised genre in the suite of exploratory software available for text data mining and other semi-structured tasks. Moreover, it is also recognised that
∗Part of this author’s contribution was done while at
NICTA, Canberra.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s).
KDD'14, August 24–27, 2014, New York, NY, USA.
ACM 978-1-4503-2956-9/14/08.
http://dx.doi.org/10.1145/2623330.2623691.
a broad class of variants can be developed based around extended graphical models that capture some additional domain requirements. Examples of these extensions include author-topic modelling [19], document segmentation [8] and word-sense disambiguation [1], though the list is extensive. Here we consider the task of improving the vanilla topic model, but in the back of our mind is the requirement that the techniques need to be easily transferred to the host of variants that make up a significant part of use in applications.
While the standard model is Latent Dirichlet Allocation (LDA), and many techniques exist to scale this up significantly, better quality topic models are available. Two clear and related innovations are available here. The first is the use of more sophisticated priors for the probability vectors rather than the simple symmetric Dirichlet [25], and the second is the use of nonparametric methods, best characterised by the HDP-LDA model [24], again to improve the priors, the standard being the hierarchical Dirichlet process. These have the goals of better estimating prior topic or word proportions, and also estimating the "right" number of topics. Note these are related in that a technique to estimate the proportions for different topics can also make some topics insignificant, thus effectively changing the number of topics.
Another innovation, not so well known, is to model burstiness [5], which is the idea that once we see a word, we can expect to see it again. Consider the following news snippet:

Despite their separation, Charles and Diana stayed close to their boys William and Harry. Here, they accompany the boys for 13-year-old William's first day of school at Eton College on Sept. 6, 1995, with housemaster Dr. Andrew Gayley looking on.

We see here that two words, "boys" and "William", appear twice. In the information retrieval literature, a related phenomenon is the notion of eliteness [12] whereby words are said to have different "levels" of occurrence, and this influenced the development of the dominant relevance paradigm in information retrieval [18].
These innovations, nonparametric priors and burstiness, have so far been bogged down by computationally intensive techniques that prevent their wider use. The original implementation of DCM-LDA for burstiness [5] could only be applied to small data sets of less than 1000 documents. A theme of research in recent machine learning conferences has been a variety of alternative algorithms for inference with HDP-LDA [22, 27, 2]. One technique of note is the use of stochastic variational methods that allow application to streaming data [27, 13]. An excellent theoretical and empirical comparison of a variety of sampling and variational methods can be found in [20].
These newer algorithms, however, are usually based on the standard stick-breaking formulation for Dirichlet processes and variational methods for these and the simpler Dirichlet distribution. Recently, an alternative sampling scheme for the more general Pitman-Yor process has been developed [4], called table indicator sampling. It uses pre-built tables of second order generalised Stirling numbers, and the scheme has seen use on problems such as document segmentation [9] and topic models on structured text [7]. Its key advantages are that it requires no dynamic memory for implementation, and that the convergence is usually significantly faster and of better quality (it is a collapsed Gibbs sampler) [6].
This table indicator sampling allows easy development of nonparametric topic models including HDP-LDA, its extension to the Pitman-Yor process (PYP) and power-law models [21] which place the PYP on the topic-word matrix. What is interesting, however, and is our first major contribution, is that we can easily extend this broader variety of nonparametric topic models by adding a burstiness component on the front end of the model. Moreover, an implementation trick lets this be done with little additional memory/time overhead. These models are available in our recently MLOSS-released open source multi-core software hca1.
The resulting nonparametric LDA algorithms turn out to be fast. Generally, they are a factor of two or so (in memory and time) slower than standard Gibbs implementations of vanilla LDA and typically 5 times faster than comparable variational HDP-LDA implementations but with the same memory requirements. However, we also show that Mallet's [16] asymmetric-symmetric LDA is a form of truncated HDP-LDA, and it is an order of magnitude faster again. The experimental results show improvement over all the existing methods in perplexity, including Mallet. We also conduct a number of experiments to explore the nature of the new algorithms. The full experiments with burstiness are the first done at moderate scale with this sort of model, and the experimental insights are our second major contribution. Our implementation runs bursty HDP-LDA with K=1000 topics on 800k news articles at 10 minutes per major Gibbs cycle using a standard 8-core CPU.
We first discuss, at a general level, how we use Pitman-Yor processes and the nature of the inference with them, in Section 2. Section 3 presents the different models used here. Because our inference schemes are standard block Gibbs samplers in the table indicator framework, we do not detail the algorithms here other than describing how we implement burstiness. Section 4 then presents our experimental setup, and a sequence of empirical investigations follows in Section 5.
2. THE HIERARCHICAL PITMAN-YOR PROCESS
Here we briefly review the methods used for inference on the hierarchical Pitman-Yor processes that one can see embedded in the model we use [3]. Those not needing to understand the fundamentals of the sampling methods can skip this section.
1See http://mloss.org/software/view/527/
All our samplers use standard block table indicator Gibbs samplers for the network of Pitman-Yor processes [4, 9] and adaptive rejection sampling [10] for the many hyperparameters. Slice sampling is usually of similar performance but can suffer with the extremely peaked posteriors for the concentration parameter of a Pitman-Yor process.
We use the Pitman-Yor process as a distribution on a probability vector. The distribution has a mean (i.e., another probability vector), a variance parameter represented as a concentration (usually given as b_X when on vector X), and a third parameter called discount (usually given as a_X when on vector X), so one has p ∼ PYP(a_p, b_p, θ). The Pitman-Yor process, when used in this way, can be used hierarchically to form distributions on a network of probability vectors.

The inference on a network of probability vectors is based on a basic property of species sampling schemes [14] that is best understood using the framework of message passing over networks. Figure 1a shows the context of a probability vector p having a Pitman-Yor process with base distribution θ. Two multinomial style likelihood messages are passed up to p with counts n and m. The standard message then passed on from p to θ using the multinomial-Dirichlet distribution [3] is a complex set of gamma functions obtained using the normalising term for a Dirichlet distribution, represented in the figure as l^Dir_p(θ) (assuming a concentration at the node of b_p):

    l^Dir_p(θ) = Γ(b_p)/Γ(b_p + N + M) ∏_k Γ(b_p θ_k + n_k + m_k)/Γ(b_p θ_k),    (1)

where the total statistics are N = ∑_k n_k and M = ∑_k m_k. This functional complexity on θ prevents any further network inference.

Figure 1b shows the alternative after marginalising out the vector p using Pitman-Yor process theory [3]. One however must introduce a new latent count vector t that represents the fraction of the data n + m that passes up in a message to θ. These are called table multiplicities and they correspond to the number of tables in the corresponding Chinese restaurant process (CRP) [24]. The multiplicities have a bounding constraint t_k ≤ n_k + m_k and moreover t_k ≡ 0 if and only if n_k + m_k ≡ 0. Thus at the expense of introducing a latent count vector (t) one gets a simple multinomial likelihood passed up the hierarchy, albeit with a complex looking but O(1) normalising constant, in the form

    l^PYP_p(θ) = (b_p|a_p)_T Γ(b_p)/Γ(b_p + N + M) ∏_k S^{n_k+m_k}_{t_k,a_p} θ_k^{t_k},    (2)

where the total T = ∑_k t_k, (x|y)_T denotes the Pochhammer symbol, (x|y)_T = x(x + y)···(x + (T − 1)y), and S^n_{t,a} is a generalized second-order Stirling number [23]. Libraries2 exist for efficiently dealing with this, making it in most cases an O(1) computation.

The counts of t then contribute to the data at the parent node θ and thus its posterior probability. Thus network inference is feasible, and moreover no dynamic memory is required, unlike CRP methods, because t is the same dimension as n + m. Sampling the t directly, however, leads

2See https://mloss.org/software/view/528 and https://mloss.org/software/view/424/
Figure 1: Computation with species sampling models: (a) embedded probability vector with messages; (b) embedded count vectors after marginalising.
to a poor algorithm because they may have a large range (above, 0 ≤ t_k ≤ n_k + m_k) and the impact of data at the node p on the node θ is buffered by t, leading to poor mixing.

Table indicators are introduced by [4] to allow the sampling of the table multiplicity vectors t to be done incrementally and thus allow more rapid mixing and simpler sampling. Note the table indicators are Boolean values indicating if the current data item increments the table multiplicity at its node and thus the data item contributes to the message to the parent. The assignment of indicators to data can be done because of the above constraint t_k ≤ n_k + m_k. So the data item contributes a +1 to n_k + m_k and the matched table indicator contributes either 0 or +1 to t_k, which is the change in the message to the parent. If there is a grandparent node, then a corresponding table indicator in the parent node might also propagate a +1 up to the grandparent.
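This incremental indicator sampling can be sketched as follows; a toy illustration (our own helper names, cached Stirling numbers instead of hca's pre-built tables) of choosing one Boolean indicator r_l from the two cases of Formula (4), whose shared denominator cancels:

```python
import random
from functools import lru_cache

def make_stirling(a):
    """Generalised second-order Stirling numbers S^n_{t,a} (toy, cached)."""
    @lru_cache(maxsize=None)
    def S(n, t):
        if t == 0:
            return 1.0 if n == 0 else 0.0
        if t > n:
            return 0.0
        return S(n - 1, t - 1) + (n - 1 - t * a) * S(n - 1, t)
    return S

def sample_indicator(n_k, t_k, N, T, a, b, S, rng=random):
    """Sample the Boolean table indicator r_l for a new data point in
    outcome k at a PYP node with discount a and concentration b.
    n_k, t_k are the count and table multiplicity before the increment;
    N and T are their totals. The two weights follow Formula (4); the
    shared denominator (n_k+1)(b+N)S(n_k, t_k) cancels out."""
    if n_k == 0:                    # t_k is 0 too: the indicator is forced on
        return 1
    up = (t_k + 1) * (b + T * a) * S(n_k + 1, t_k + 1)   # r_l = 1
    stay = (n_k - t_k + 1) * S(n_k + 1, t_k)             # r_l = 0
    return 1 if rng.random() < up / (up + stay) else 0
```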
For inference on a network of such vectors, each probability vector node contributes a factor to the posterior probability. For the above example with table indicators this is given by Formula (3) [3]:

    (b_p|a_p)_T Γ(b_p)/Γ(b_p + N + M) ∏_k S^{n_k+m_k}_{t_k,a_p} (n_k+m_k choose t_k)^{-1},    (3)

where the addition of the (n_k+m_k choose t_k)^{-1} term over Equation (2) simply divides by the number of choices there are for picking the t_k boolean table indicators to be on out of a possible n_k + m_k.

In sampling, a data point coming from the node source for n contributes a +1 to n_k (for some k), and either contributes a +1 or a 0 to t_k depending on the value of the table indicator. If n_k = t_k = 0 initially, then it must contribute a +1 to t_k, so there is no choice. The change in posterior probability of Formula (3) due to the new data point at this node is, given the Boolean indicator r_l,

    (t_k + 1)(b_p + T a_p) S^{n_k+m_k+1}_{t_k+1,a_p} / ((n_k + m_k + 1)(b_p + N + M) S^{n_k+m_k}_{t_k,a_p})    when r_l ≡ 1,
    (n_k + m_k − t_k + 1) S^{n_k+m_k+1}_{t_k,a_p} / ((n_k + m_k + 1)(b_p + N + M) S^{n_k+m_k}_{t_k,a_p})    when r_l ≡ 0,    (4)

depending on the value of the table indicator r_l for the data point. In a network, one has to jointly sample table indicators for all reachable ancestor nodes in the network, and standard discrete graphical model inference is done in closed form. Examples are given by [7, 8].

For estimation, one requires the expected probabilities E_{n,m,t,θ}[p] at a node. In the table indicator framework this is harder to compute. Fortunately, with a trivial change of latent variables (drop the table indicators and reintroduce the table occupancies for the CRP) we get the usual estimation formula for the CRP [23] given by Equation (5):

    E_{n,m,t,θ}[p] = (b_p + T a_p)/(b_p + N + M) θ + (n + m − a_p t)/(b_p + N + M).    (5)

Since this does not involve knowing the table occupancies for the CRP, no additional sampling is needed to compute the formula, just the existing counts (i.e., n, m, t) are used. Moreover, we know the estimates normalise correctly.
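The estimate of Equation (5) is cheap given the counts; a minimal sketch (our own helper, using numpy):

```python
import numpy as np

def expected_p(theta, n, m, t, a_p, b_p):
    """Posterior mean of the probability vector p at a PYP node, per
    Equation (5): a mixture of the parent mean theta and the discounted
    counts. Requires 0 <= t_k <= n_k + m_k elementwise."""
    theta, n, m, t = (np.asarray(x, dtype=float) for x in (theta, n, m, t))
    N, M, T = n.sum(), m.sum(), t.sum()
    denom = b_p + N + M
    return (b_p + T * a_p) / denom * theta + (n + m - a_p * t) / denom
```

Because the two parts contribute total mass (b_p + T a_p) + (N + M − a_p T) = b_p + N + M, the result is automatically normalised whenever theta is.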
3. MODELS

3.1 Basic Models

The basic nonparametric topic model we consider is given in Figure 2. Here, the document-topic proportions θ_i (for i running over documents) have a PYP with mean α and the topic-word proportions φ_k (for k running over topics) have a PYP with mean β. The mean vectors α and β correspond to the asymmetric priors of [25].

Figure 2: Nonparametric topic model.

While we show α and β having a GEM prior [15, 24] in the figure, allowing different priors covers a range of LDA styles, as shown in Table 1. For instance, when α is finite and the discount for the PYP on θ, a_θ, is zero, then θ ∼ Dirichlet(b_θ α). Thus the two PYPs in the figure can be configured to be Dirichlets, giving the standard LDA setup for θ and likewise for φ_k. The GEM is equivalent to the stick-breaking prior that is at the core of a DP or PYP, so
Table 1: Family of LDA Models. "tr" abbreviates "truncated" and "symm" abbreviates "symmetric".

    α prior      a_α   a_θ   β prior     a_φ   Model
    finite u     0     0     finite u    0     LDA
    tr. GEM      0     0     finite u    0     tr. HDP-LDA
    symm. Dir.   0     0     finite u    0     tr. HDP-LDA
    tr. GEM      –     0     tr. GEM     –     tr. NP-LDA
using this with α and a truncated K, and setting φ_k up to be Dirichlet distributed as just shown, we have truncated HDP-LDA. Notice there are different ways of providing a truncated prior to ensure a fixed dimensional α. The truncated GEM is used in various versions of truncated HDP-LDA [22, 27], and the simpler truncation, just using a Dirichlet, is implicit in the asymmetric priors of [25]. That is, the asymmetric-symmetric (AS) variant of LDA [25] is equivalent to a truncated HDP-LDA. This means that Mallet [16] has implemented a truncated HDP-LDA (via AS-LDA) since 2008, and it is indeed both one of the fastest and the best performing.

Thus we reproduce several alternative variants of LDA [25], as well as truncated versions of HDP-LDA, HPYP-LDA and a fully nonparametric asymmetric version (with the truncated GEM prior on both α and β) we refer to as NP-LDA. Sampling algorithms for dealing with the HPYP-LDA case are from earlier work [4], and the other cases are similar.
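For concreteness, a truncated GEM(a, b) prior can be drawn by Pitman-Yor stick-breaking with the last stick absorbing the remainder. This sketch is ours and is illustrative only, assuming the stick proportions v_k ∼ Beta(1 − a, b + (k+1)a):

```python
import numpy as np

def truncated_gem(a, b, K, rng):
    """Draw a K-dimensional probability vector from a truncated GEM(a, b)
    via Pitman-Yor stick-breaking: break off a fraction v_k of the
    remaining stick for each topic; the truncation gives the remainder
    to topic K. With a = 0 this is the DP (truncated HDP-LDA) case."""
    p = np.empty(K)
    rest = 1.0
    for k in range(K - 1):
        v = rng.beta(1.0 - a, b + (k + 1) * a)
        p[k] = v * rest
        rest *= 1.0 - v
    p[K - 1] = rest
    return p
```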
3.2 Bursty Models
The extension with burstiness we consider [5] is given in Figure 3. Here, each topic φ_k is specialised to a variant specific to each document i, ψ_{k,i}. Thus

    ψ_{k,i} ∼ PYP(a_ψ, b_{ψ,k}, φ_k).

Figure 3: Model with topic burstiness.

On the surface one would think introducing potentially K ∗ W (number of words by number of topics) new parameters for each document, for the ψ_k, seems statistically impractical. In practice, the ψ_k are marginalised out during inference and bookkeeping only requires a small number of additional latent variables. Note that each topic k has its own concentration parameter b_{ψ,k}. This feature will be illustrated in Subsection 5.5.
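The burstiness effect shows up directly in the collapsed predictive distribution of such a document-level PYP: a word repeats with probability proportional to its discounted count, and otherwise the draw escapes to the topic's word distribution, opening a new "table" that contributes to the parent. A toy simulation (ours, not the hca code):

```python
import random

def draw_bursty_word(counts, tables, phi, a, b, rng=random):
    """One draw from the collapsed predictive of psi ~ PYP(a, b, phi):
    repeat word w with probability proportional to (n_w - a*t_w), or
    with probability proportional to (b + a*T) escape and draw from the
    topic's word distribution phi, opening a new table. Mutates the
    per-document dicts counts (n_w) and tables (t_w)."""
    n = sum(counts.values())
    u = rng.random() * (b + n)
    for w, n_w in counts.items():
        u -= n_w - a * tables[w]
        if u <= 0.0:
            counts[w] += 1                      # a repeat: burstiness
            return w
    w = rng.choices(range(len(phi)), weights=phi)[0]
    counts[w] = counts.get(w, 0) + 1
    tables[w] = tables.get(w, 0) + 1            # passes up to the parent
    return w
```

Repeated calls for one document make already-used words increasingly likely, which is exactly the "once we see a word, expect to see it again" behaviour of Section 1.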
3.3 Inference with Burstiness

In LDA style topic modelling using our approach, we get a formula for sampling a new topic z for a word w in position l in a document d. Suppose all the other data and the rest of the document is D^{−(d,l)} and this is some model M (maybe NP-LDA or LDA, etc.) with hyperparameters. Then denote this Gibbs sampling formula as p(z | w, D^{−(d,l)}, M). For LDA, this is just the standard collapsed Gibbs sampling formula [11]. It also forms the first step of the block Gibbs sampler we use for HDP-LDA [4]: first we sample the topic z, and then we sample the various table indicators given z in the model.

The burstiness model built on M, denote it M^B, is sampled using p(z | w, D^{−(d,l)}, M^B), which is computed using p(z | w, D^{−(d,l)}, M). Thus we say the burstiness model M^B is a front end to the Gibbs sampler. At the position l in a document we have a word of type w and wish to resample its topic z = k. Let n_{w,k} be the number of other existing words of the type w already in topic k for the current document, and let s_{w,k} be the corresponding table multiplicities. They are statistics for the parameters ψ_k in the burstiness model. Note by keeping track of which words in a document are unique, one knows that n_{w,k} = 0 for those words, thus computation can be substantially simplified. Let N_{·,k} and S_{·,k} be the corresponding totals for the topic k in the document (i.e., summed over words). The matrices of counts n_{w,k} and s_{w,k} and vectors N_{·,k} and S_{·,k} can be recomputed as each document is processed in time proportional to the length of the document.

The Gibbs sampling probability for choosing z = k at position l for the burstiness model is obtained using Equation (4):

    p(z = k | w, D^{−(d,l)}, M^B) ∝
        p(z | w, D^{−(d,l)}, M) · (b_{ψ,k} + a_ψ S_{·,k})/(b_{ψ,k} + N_{·,k}) · (s_{w,k} + 1)/(n_{w,k} + 1) · S^{n_{w,k}+1}_{s_{w,k}+1,a_ψ} / S^{n_{w,k}}_{s_{w,k},a_ψ}
        + 1/(b_{ψ,k} + N_{·,k}) · (n_{w,k} − s_{w,k} + 1)/(n_{w,k} + 1) · S^{n_{w,k}+1}_{s_{w,k},a_ψ} / S^{n_{w,k}}_{s_{w,k},a_ψ}.    (6)

This has a special case when s_{w,k} = n_{w,k} = 0 of

    p(z = k | w, D^{−(d,l)}, M^B) ∝ p(z | w, D^{−(d,l)}, M) · (b_{ψ,k} + a_ψ S_{·,k})/(b_{ψ,k} + N_{·,k}).    (7)

Once topic z = k is sampled, the second term of Equation (6) is proportional to the probability that the table indicator for word w in the ψ_k PYP is zero: it does not contribute data to the parent node φ_k, i.e., the original model M will ignore this data point. The first term of Equation (6) is proportional to the probability that the table indicator is one, so it does contribute data to the parent node φ_k, i.e., back to the original model M. This table indicator is sampled according to the two terms and the n_{w,k}, s_{w,k}, N_{·,k}, S_{·,k} are all updated. If the table indicator is one then the original model M processes the data point in the manner it usually would.

Thus Equation (6) is used to filter words, so we refer to it as the burstiness front-end. Only words with table indicators of one are allowed to pass through to the regular model M and contribute to its statistics for φ_k and, for instance, any further PYP vectors in the model.
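To make the front-end concrete, here is a sketch (ours, not hca's C code) of the per-topic weight of Equation (6) together with the probability that the table indicator is one. It assumes a precomputed table S of generalised Stirling numbers S^n_{t,a_ψ}, as discussed in Section 2:

```python
def frontend_weight(p_base_k, n_wk, s_wk, N_k, S_k, a, b_k, S):
    """Return (w_k, q_k): the unnormalised Gibbs weight for z = k under
    the bursty model M^B (Equation (6), or (7) when the word is so far
    unseen in the document), and the probability q_k that the word's
    table indicator is one, i.e. that it passes through to the base
    model M. S[n][t] holds the generalised Stirling numbers S^n_{t,a}."""
    if n_wk == 0:                    # Equation (7): indicator forced on
        return p_base_k * (b_k + a * S_k) / (b_k + N_k), 1.0
    denom = (b_k + N_k) * (n_wk + 1) * S[n_wk][s_wk]
    term1 = (p_base_k * (b_k + a * S_k) * (s_wk + 1)
             * S[n_wk + 1][s_wk + 1] / denom)                # indicator = 1
    term0 = (n_wk - s_wk + 1) * S[n_wk + 1][s_wk] / denom    # indicator = 0
    return term1 + term0, term1 / (term1 + term0)
```

After a topic k is drawn with weight w_k, a Bernoulli(q_k) draw decides whether the word is handed on to M, matching the filtering description above.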
4. EXPERIMENTAL SETUP

4.1 Implementation

The publicly available hca suite used in these experiments is coded in C using 16 and 32 bit integers where needed for saving space. All data preparation is done using the DCA-Bags3 package, a set of scripts, and input data can be handled in a number of formats including the LdaC format. All algorithms are run on a desktop with an Intel(R) Core(TM) i7 8-core CPU (3.4GHz) using a single core.
The algorithms use no dynamic memory, so we set the maximum number of topics K ahead of time. This is like the truncation level in variational implementations of HDP-LDA. Moreover, initialisation is done by setting the number of topics to this maximum and randomly assigning words to topics. Other authors [22] report that initialising to the maximum number of topics, rather than 1, leads to substantially better results, an experimental finding with which we agree.

Note, inference and learning for burstiness requires the word by topic counts n_{w,k} and word by topic multiplicities s_{w,k} be maintained for each document, as well as their totals. There is an implementation trick used to achieve space efficiency here. First one computes, for each document, which words appear more than once in the document (i.e., those for which n_{w,k} can become greater than 1). These words require special handling, the full Equation (6), and lists of these are stored in preset variable length arrays. Words that occur only once in a document are easy to deal with since their sampling is governed by Equation (7) and no sampling of the table indicator is needed. Second, the count and multiplicity statistics (the n_{w,k} and s_{w,k}, which are statistics for ψ_{i,k}) are not stored but recomputed as each document is about to be processed. Moreover, this only needs to be done for words appearing more than once in the document (hence why lists of these are pre-stored). All one needs to recompute these statistics is the Boolean table indicators and the topic assignments. The statistics n_{w,k}, s_{w,k} can be recomputed in time proportional to the length of the document.
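The recomputation step is a single pass over the document's stored topic assignments and table indicators; a sketch in Python (hca itself does this in C over its packed arrays):

```python
from collections import defaultdict

def recompute_doc_stats(words, topics, indicators):
    """Rebuild the per-document burstiness statistics from stored latent
    variables: n[(w,k)] counts words of type w assigned topic k, s[(w,k)]
    counts those whose table indicator is on, with per-topic totals N[k]
    and S[k]. Runs in time proportional to the document length."""
    n, s = defaultdict(int), defaultdict(int)
    N, S = defaultdict(int), defaultdict(int)
    for w, k, r in zip(words, topics, indicators):
        n[(w, k)] += 1
        N[k] += 1
        if r:
            s[(w, k)] += 1
            S[k] += 1
    return n, s, N, S
```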
4.2 Data

We have used several datasets for our experiments: the PN, MLT, RML, TNG, NIPS and LAT datasets. Not all data sets were used in all comparisons.

The PN dataset is taken from 805K news articles (Reuters RCV1) using the query "person", excluding stop words and words appearing <5 times. The MLT dataset is abstracts from the JMLR volumes 1–11, the ICML years 2007–2011, and IEEE Trans. of PAMI 2006–2011. Stop words were discarded along with words appearing <5 or >2900 times. The RML dataset is the Reuters-21578 collection, made up using the standard ModLewis split. The TNG dataset is the 20 newsgroups dataset using the standard split. For both, stop words were discarded along with words appearing <5 times.
The LAT dataset is the LA Times articles from TREC disk 4. Stop words were discarded along with words appearing <10 times. Only words made up entirely of alphabetic characters or dashes were allowed. Roweis' NIPS dataset4 was left as is.

3http://mloss.org/software/view/522/

Table 2: Characteristic Sizes of Datasets

        PN      MLT    RML     TNG     LAT      NIPS
    W   26037   4662   16994   35287   78953    13649
    D   8616    2691   19813   18846   131896   1740
    T   1000    306    6188    7532    0        348
    N   1.76M   224k   1.27M   1.87M   34.5M    23.0M
Characteristics of these six datasets are given in Table 2,
where dictionary size is W, number of documents (including
test) is D, number of test documents is T and total number
of words in the collection is N.
4.3 Evaluation
The algorithms are evaluated on two different measures, test sample perplexity and pointwise mutual information (PMI). Perplexity is calculated over test data using document completion [26], known to be unbiased and easy to implement for a broad class of models. The document completion estimate is averaged over 40 cycles per document done at the end of the training run and uses an 80–20% split, so every fifth word is used to evaluate perplexity and the remaining words are used to estimate latent variables for the document. Topic comprehensibility can be measured in terms of PMI [17]. It is done by measuring average word association between all pairs of words in the top-10 topic words (using the English Wikipedia articles). Here the PMI reported is averaged across all topics. PMI files are prepared with the DCA-Bags package using linkCoco and projected onto the specific datasets using cooc2pmi.pl in the hca suite.
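The PMI computation itself is straightforward once co-occurrence statistics are extracted from the reference corpus; a sketch (our own helper, working from raw document (co-)occurrence counts rather than hca's prepared PMI files):

```python
import itertools
import math

def topic_pmi(top_words, doc_freq, pair_freq, n_docs):
    """Average PMI over all pairs of a topic's top words, where doc_freq
    and pair_freq are document (co-)occurrence counts in a reference
    corpus of n_docs documents: PMI(x, y) = log( p(x,y) / (p(x) p(y)) ).
    Unseen pairs contribute zero."""
    pairs = list(itertools.combinations(top_words, 2))
    total = 0.0
    for x, y in pairs:
        c_xy = pair_freq.get((x, y), pair_freq.get((y, x), 0))
        if c_xy > 0:
            p_xy = c_xy / n_docs
            p_x = doc_freq[x] / n_docs
            p_y = doc_freq[y] / n_docs
            total += math.log(p_xy / (p_x * p_y))
    return total / len(pairs)
```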
We also compare results with two other systems, online
hdp [27] is a stochastic variational algorithm for HDPLDA
coded in Python from C. Wang5, and HDP a Matlab+C com
bination doing Gibbs sampling from Y.W. Teh. To do the
comparisons, at various timepoints we take a snapshot of the
? α vector and the?φk vectors. This is already supported in
onlinehdp, and C. Chen provided the support for this task
with HDP. We then load these values along with the hyperpa
rameter settings into hca and use its document completion
and PMI evaluation options“V p hdoc,5.” In this way, all
algorithms are compared using identical software.
5. EXPERIMENTS
5.1 Runtime Comparisons
To see how the algorithms work at scale, we consider the cycle times and memory requirements of the different versions running on the full LAT data set. These are given in Table 3. Cycle times in minutes are for a full pass through all documents and memory requirements are given in megabytes.
LDA, HDP-LDA (where a_θ = 0) and NP-LDA are as described in Section 3. The right half of the table gives performance for the burstiness model of Figure 3. Note only a portion of the computation is linear in K so, for instance, NP-LDA with burstiness using K = 2000 topics on the same dataset takes roughly 90 minutes a cycle and 2.43GB memory.

4http://www.cs.nyu.edu/~roweis/data.html
5Some C++ versions also exist.

Table 3: Cycle times and memory requirements on the LA Times TREC 4 data using K = 500 topics. "Burst" is the burstiness version.

                 w/out Burst       with Burst
    Alg.        mins.   Mb        mins.   Mb
    LDA          11     630        20     690
    HDP-LDA      20     760        30     850
    NP-LDA       35     840        45     930
    onlinehdp   236    1800         –      –

Moreover, given it is coded in Python using inefficient allocation, onlinehdp has comparable memory requirements to hca. In subsequent experiments, we also saw HDP required 5–7 times more memory than hca.
Experiments show that the convergence rates (in cycle counts, not time) are similar for the various Gibbs algorithms (LDA, Burst LDA, Burst NP-LDA, etc.). Gibbs for full nonparametric LDA with the burstiness front end gives substantial improvements over vanilla Gibbs LDA while requiring only 50% more memory and 3 times greater computation time. Note that the table indicator samplers have previously been reported to give 12% improvement in perplexity over Chinese restaurant samplers [4], which in turn retain a substantial improvement over earlier variational algorithms for HDP-LDA [22].

We find sampling hyperparameters (for instance, discounts and concentrations of Pitman-Yor processes) to be important for performance. A substantial part of the time for topic burstiness is hyperparameter sampling, something that is usually less than 5% for the other models. This is because the model has a different concentration parameter for every topic, thus much more of the inefficient adaptive rejection sampling is done for the bursty models versus the others.
5.2 General Results

A subset of the results is presented in Table 4 and some informative plots are given in Figure 4 and Figure 5. These represent the average values computed over 4 independent runs. Note that the differences between hca's "Burst HDP" and "Burst NP-LDA" in the table are not significant at the 5% level, but are only mildly significant.
LDA reaches an earlier minimum for perplexity and then it usually increases, though PMI does increase as well. Models like HDP-LDA and NP-LDA usually keep on improving in PMI as the number of topics increases, and holdout perplexity often wavers about, gradually increasing after a later minimum. For instance, for the small MLT data set they reach a minimum perplexity at about K=20. All the while, PMI keeps improving. For data sets like Reuters-21578 a much larger number of topics can be supported, for instance K > 500 easily. The eventual increase in perplexity for larger K seems counterintuitive given the nonparametric slogan of "estimating the right dimension of the model from the data". However, remember, we have initialization artifacts to deal with. Initializing with substantially too many topics leads to fragmentation/duplication of the topics not subsequently handled by simple Gibbs sampling. To deal with this sort of effect, we need something like split-merge operators in the sampler [2].
Figure 4: Perplexity and PMI on the RML data for LDA, Burst LDA, NP-LDA and Burst NP-LDA.
Figure 5: Perplexity as the number of topics (K) changes for different algorithms on the MLT data.
With the burstiness models, however, the change in performance is dramatic. Burstiness almost always improves PMI, sometimes substantially, and the drop in perplexity is always dramatic. Burstiness makes perplexity reach its minimum much earlier w.r.t. the number of topics, but for the nonparametric models the subsequent rise in perplexity thereafter is mild. The nonparametric models cope better with the challenges of sampling behind the bursty front-end. However, the best perplexity is reached for a low number of topics on MLT, where all the different models (LDA, HDP-LDA, NP-LDA) have similar perplexity. For the larger RML data set, LDA's perplexity again bottoms out earlier but NP-LDA's keeps improving for the number of topics considered.
5.3 Performance Comparisons

This section compares hca performance with previous algorithms.
5.3.1 Comparison with onlinehdp and HDP

In order to compare the different systems, hca versus onlinehdp and HDP, we use the RML and TNG data. We used a fixed set of hyperparameters with no sampling, so all discount parameters are set to 0, the relevant concentration parameters (b_α, b_θ) are set to 1, and a symmetric β = (0.01)·1 is used. For onlinehdp we did a large number of runs varying τ = 1,4,16,64, κ = 0.5,0.8, K = 150,300 and batchsize = 250,1000. Note τ = 64, κ = 0.8 are recommended in [27]. Only the fastest and best converging result is given for onlinehdp. We did one run of both hca and HDP with these settings, noting that the differences are way outside the range of typical statistical variation between individual runs. Plots of the runs over time are given in Figures 6 and 7. The final PMI scores for the 3 algorithms are given in Table 5.

Table 4: Document completion perplexity and PMI for hca variants. Data is presented as "Perplexity/PMI". "HDP" is short for "HDP-LDA".

    Data (K)   LDA            Burst LDA      HDP            Burst HDP      NP-LDA         Burst NP-LDA
    MLT(10)    1493.62/2.33   915.46/2.47    1480.85/2.61   904.29/2.59    1480.20/2.38   907.74/2.70
    MLT(50)    1504.63/2.94   1008.68/3.26   1389.29/3.70   940.69/3.63    1375/3.47      932.88/3.93
    RML(50)    1472.87/2.07   915.65/2.61    1427.28/2.25   891.07/2.73    1431.29/2.10   882.06/2.89
    RML(110)   1441.55/2.43   965.56/2.99    1308.83/3.05   889.42/3.31    1297.08/2.96   880.22/3.32
    PN(160)    4232.08/3.69   2988.69/4.18   3801.42/4.50   2689.19/4.62   3785.05/4.39   2657.78/4.70
    PN(240)    4306.63/4.07   3081.19/4.45   3726.05/4.75   2720.98/4.76   3676.35/4.72   2734.66/4.78

Figure 6: Comparative perplexity for one run on the RML data.
Table 5: PMI scores for the comparative runs.

            onlinehdp   hca     HDP
    RML     2.607       3.47    4.452
    TNG     4.042       4.017   4.887
Table 6: Effective Number of Topics for the comparative runs.

      onlinehdp  hca    HDP
RML   37.0       155    149
TNG   7.1        92.7   89.6
The improvement in perplexity of hca over HDP is not that surprising, because comparative experiments on even simple models show the significant improvement of table indicator methods over CRP methods [6], and Sato et al. [20] also report substantial differences between different formulations for variational HDP-LDA. However, the poor performance of onlinehdp needs some explanation. On looking at the topics discovered by onlinehdp, we see there are many duplicates. Moreover, the topic proportions given by the α vector show extreme bias towards earlier topics. It is known, for instance, that variational methods applied to the Dirichlet make the probability estimates more extreme. In this model one is working with a tree of Betas, so it seems the effect is compounded. A useful diagnostic here is the "Effective Number of Topics", which is given by exponentiating the entropy of the estimated α vector, shown in Table 6. One can see hca and HDP are similar here but onlinehdp has a dramatically reduced number of topics. The non-duplicated topics in the onlinehdp result, however, look good in terms of comprehensibility, so the online stochastic variational method is clearly a good way to get a smaller number of topics from a very large data set.

Figure 7: Comparative perplexity for one run on the TNG data (perplexity versus seconds elapsed for OnlineHDP, NPLDA and HDP (Teh)).
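The effective number of topics reported in Table 6 is simply the exponentiated Shannon entropy of the estimated topic-proportion vector. A minimal sketch of the computation:

```python
import math

def effective_num_topics(alpha):
    """Exponentiated entropy of a (possibly unnormalised) proportion vector.
    Equals K for a uniform distribution over K topics, and approaches 1
    as all the mass concentrates on a single topic."""
    total = sum(alpha)
    probs = [a / total for a in alpha if a > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A uniform vector over 150 topics yields 150, while a vector heavily biased towards a few early topics, as observed for onlinehdp, yields a much smaller value.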
5.3.2 Comparison with Mallet

Mallet supports asymmetric-symmetric LDA, which is a form of truncated HDP-LDA using a finite symmetric Dirichlet to truncate a GEM. We compare the implementations of HDP-LDA in Mallet and hca. Results are reported for the "RML" and "TNG" datasets with 300 topics as previously, and also some from Table 4. As suggested in [16] we run Mallet for 2000 iterations, and optimise the hyperparameters every 10 major Gibbs cycles after an initial burn-in period of 50 cycles, to get the best results. Table 7 presents the comparative results. We can see that hca generally produces better results. Note that the results produced by the full asymmetric version NPLDA are even better, an option not implemented in Mallet.

Table 7: Comparative results for Mallet.

Dataset(K)   Mallet      hca (HDP-LDA)   hca (NPLDA)
RML(300)     1404 ± 8    1280 ± 2        1145 ± 2
TNG(300)     4081 ± 27   3999 ± 10       3586 ± 8
MLT(50)      1357 ± 14   1389 ± na       1375 ± na
PN(240)      3844 ± 24   3726 ± na       3676 ± na

Table 8: Comparative results for PCVB0.

K     PCVB0       hca (HDP-LDA)   hca (NPLDA)
200   1285 ± 10   1267 ± 5        1193 ± 5
300   1275 ± 10   1223 ± 5        1151 ± 5
5.3.3 Comparison with PCVB0

We also sought to compare hca with the variants of PCVB0 reported in [20]. These are a family of simplified variational algorithms, though the different variants seem to perform similarly. Without details of the document preprocessing, it was difficult to reproduce comparable datasets. Thus only results for their "KOS blog corpus," available preprocessed from the UCI Machine Learning Repository, were used in producing the comparisons presented in Table 8. We note the difference in perplexity here is small enough that better hyperparameter estimation with PCVB0 could well make the algorithms more equal. Interestingly, Sato et al. report little difference between the symmetric and asymmetric priors on the Dirichlet on φk. In contrast, our corresponding asymmetric version NPLDA shows significant improvements.
5.3.4 Comparison on the NIPS 1988-2000 Dataset

A split-merge variant of HDP-LDA has been developed [2] that was compared with online and batch variational algorithms. For the NIPS data they made runs with K = 300 and estimated all hyperparameters. They use an 80-20% split for document completion, and we replicated the experiment with the same dataset, parameter settings and sampling. The results are shown in Figure 8 and should be compared with [2, Figure 2(b)]. Their results show plots for 40 hours whereas we ran for 4.5 hours, so our algorithm is approximately 4 times faster per document. Our Gibbs implementation of HDP-LDA substantially beats all other non-split-merge algorithms. Not surprisingly, the sophisticated split-merge sampler eventually reaches the performance of ours. Note the NPLDA model is superior to HDP-LDA on this data, and the bursty versions are clearly superior to all others.
Figure 8: Convergence on Roweis’ NIPS data for K = 300.
5.4 Effect of Hyperparameters on the Number of Topics
Standard reporting of experiments using HDP-LDA usually sets the β parameter which governs the symmetric prior for the φk. For instance, some authors [13] call this η and it is set to 0.01. Here we explore what happens when we vary this parameter for the RML data. Note we have done this experiment on most of the data sets and the results are comparable. We train HDP-LDA for 1000 Gibbs cycles and then record the evaluation measures. This takes 60 minutes on the desktop for each value of β. We also do a run where β is sampled. For each of the curves, the stopping point on the right gives the number of full topics used by the algorithms (ignoring trivially populated topics with 1-2 words). So the lowest perplexity is achieved by HDP-LDA with β = 0.001, where roughly K = 2,400 topics are used. Sampling β roughly tracks the lowest perplexity achieved for each number of topics.

Figure 9: Perplexity and PMI for the RML data when varying β in the symmetric prior for HDP-LDA (curves shown for β = 0.001, 0.01, 0.1, 0.5 and sampled β).
The PMI results also indicate that for larger β one obtains more comprehensible topics, though fewer of them. Thus there is a trade-off: if you want fewer but more comprehensible topics, for instance a coarser summary of the content, then make β larger. If you want a better fit to the data, or more finely grained topics, then estimate β properly.
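One way to see why sampled β tracks the best fixed value is to look at the collapsed Dirichlet-multinomial evidence of the topic-word counts as a function of β. The sketch below is only illustrative (hca's sampler uses adaptive rejection sampling [10] rather than a grid, and the function names here are our own): sparse, peaked count rows favour small β, while flat rows favour larger β.

```python
import math

def log_evidence(beta, topic_word_counts, W):
    """Collapsed log evidence of a topic-word count matrix under a
    symmetric Dirichlet(beta) prior on each topic's word distribution."""
    lp = 0.0
    for counts in topic_word_counts:  # one row of word counts per topic
        n_k = sum(counts)
        # Dirichlet-multinomial marginal likelihood for this topic.
        lp += math.lgamma(W * beta) - math.lgamma(W * beta + n_k)
        lp += sum(math.lgamma(beta + n) - math.lgamma(beta) for n in counts)
    return lp

def best_beta(topic_word_counts, W, grid=(0.001, 0.01, 0.1, 0.5, 1.0)):
    """Pick the grid value of beta maximising the collapsed evidence."""
    return max(grid, key=lambda b: log_evidence(b, topic_word_counts, W))
```

For example, concentrated counts such as [[10,0,0,0],[0,10,0,0]] select the smallest β on the grid, whereas a flat row like [[3,3,3,3]] selects the largest, mirroring the trade-off discussed above.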
Table 9: Low proportion topics (proportion below 0.001) with lower variance factor for LAT data when K = 500.

PMI    topic words
0.31   Zsa gabor capos slapping avert anhalt enright rolls Royce cop-slapping Hensley judgeship Leona
2.32   herald tribune examiner dailies gannet batten numeric press-telegram petersburg sentinel
4.02   Baker PT evangelist bakers Tammy Faye swagged evangelists televangelists defrocked
Thus we can see that the number of topics found by HDP-LDA is significantly affected by the hyperparameter β, and thus it is probably inadvisable to fix it without careful experimentation, consideration or sampling. Moreover, the number of topics on RML, with roughly 20,000 documents, is up to 2,000. Inspection shows a good number of these are comprehensible. With larger collections we claim it would be impractical to attempt to "estimate" the right number of topics. For larger collections, one could be estimating tens of thousands of topics. Is this large number of topics even useful?

5.5 Topic Specific Concentrations

For the topic burstiness model of Figure 3 we had topic specific concentrations to the PYP, bφ,k. Now the concentration and discount together control the variance. So for document i and topic k, the variance of a word probability ψi,k,w from its mean φk,w will be ((1 − aφ)/(1 + bφ,k)) φk,w [3]. We call the ratio the variance factor. If it is close to one then the word proportions ψi,k for the topic have little relationship to their mean φk. If close to zero they are similar. Figure 10 considers 500 topics from a model built on the LAT data with K = 500 using PYP-LDA and topic burstiness. About 15% of the topics have low values for concentration that make the topics effectively random, and thus not properly used. Examples of topics with low proportions but variance factor below 0.4, so the topics are still usable, are given in Table 9. The first topic is actually about two issues: the first is the Zsa Zsa Gabor slapping incident, and the second is about Orange County Dist. Attys. Avert and Enright.

Figure 10: Topic proportions versus the variance factor for LAT data when K = 500.
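The variance factor above is straightforward to compute from the fitted discount and per-topic concentrations. A small sketch (function and parameter names are ours, not hca's) that also flags topics whose word vectors are effectively random:

```python
def variance_factor(discount, concentration):
    """Variance factor (1 - a_phi) / (1 + b_phi_k) for a PYP with
    discount a_phi and topic-specific concentration b_phi_k.
    Near 1: document word vectors psi_{i,k} bear little relation to
    their mean phi_k; near 0: they stay close to phi_k."""
    return (1.0 - discount) / (1.0 + concentration)

def effectively_random_topics(discount, concentrations, threshold=0.4):
    """Indices of topics whose variance factor exceeds the threshold,
    i.e. candidates for topics not properly used by the model."""
    return [k for k, b in enumerate(concentrations)
            if variance_factor(discount, b) > threshold]
```

With discount 0.5, a topic with concentration near 0 has variance factor 0.5 and is flagged, while one with concentration 10 has factor about 0.045 and is kept.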
6. CONCLUSION

We have shown that an implementation of the HDP-LDA nonparametric topic model and the related nonparametric extension NPLDA using block table indicator Gibbs sampling methods [4] are significantly superior to recent variational methods in terms of perplexity of results. NPLDA is also significantly superior in perplexity to the Mallet implementation of truncated HDP-LDA (masquerading as asymmetric-symmetric LDA). Taking account of the different implementation languages, the newer Gibbs samplers and variational methods also have the same memory footprint. Mallet is substantially faster, however, and performs well for HDP-LDA.
We note that these nonparametric methods have two goals: (A) better estimating prior topic or word proportions, and (B) estimating the "right" number of topics. The nonparametric methods seem superior to the parametric equivalents at the first goal (A). Given that the estimated number of topics grows substantially with the collection size, it is not clear how important goal (B) can be. Arguably, goal (A) is the more important one.
Moreover, we have developed a Gibbs theory of burstiness that:
• is implemented as a front-end, so it can in principle readily be applied to most variants of a topic model that use a Gibbs sampler;
• is only a factor of 1.5-2 slower per major Gibbs cycle.
This will allow the wide variety of topic-model variants to easily take advantage of the burstiness model.
Through the experiments, we have illustrated some characterizations of the models, for instance:
• Our asymmetric-asymmetric NPLDA model is about 75% slower than HDP-LDA but generally performs better than HDP-LDA, a different result to published results [25, 20] due to the different algorithms.
• The topic comprehensibility (as measured using PMI)
is substantially improved by the burstiness version, as
reported in the original work [5].
• The topic concentration parameter in the burstiness
model goes very low when the topic is insignificant.
We can use this to estimate which topics have become
inactive in the model.
• The concentration parameter for the topic-word vectors significantly affects results, so care should be taken in experiments using these models.
7. ACKNOWLEDGEMENTS
Both authors were funded partly by NICTA. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. Thanks to Changyou Chen and Kar Wai Lim for their feedback and Changyou for running the HDP experiments.
8. REFERENCES
[1] J. BoydGraber, D. Blei, and X. Zhu. A topic model
for word sense disambiguation. In EMNLPCoNLL,
pages 1024–1033, 2007.
[2] M. Bryant and E. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2708–2716. 2012.
[3] W. Buntine and M. Hutter. A Bayesian view of the Poisson-Dirichlet process. Technical Report arXiv:1007.0296 [math.ST], arXiv, Feb. 2012.
[4] C. Chen, L. Du, and W. Buntine. Sampling table configurations for the hierarchical Poisson-Dirichlet process. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD, pages 296–311. Springer, 2011.
[5] G. Doyle and C. Elkan. Accounting for burstiness in
topic models. In Proc. of the 26th Annual Int. Conf.
on Machine Learning, ICML ’09, pages 281–288, 2009.
[6] L. Du. Nonparametric Bayesian Methods for
Structured Topic Models A Mixture Distribution
Approach. PhD thesis, School of Computer Science,
the Australian National University, Canberra,
Australia, 2011.
[7] L. Du, W. Buntine, and H. Jin. Modelling sequential
text with an adaptive topic model. In Proc. of the
2012 Joint Conf. on EMNLP and CoNLL, pages
535–545. ACM, 2012.
[8] L. Du, W. Buntine, and M. Johnson. Topic
segmentation with a structured topic model. In
HLTNAACL, pages 190–200. The Association for
Computational Linguistics, 2013.
[9] L. Du, W. Buntine, and M. Johnson. Topic
segmentation with a structured topic model. In
Proceedings of NAACLHLT, pages 190–200, 2013.
[10] W. R. Gilks and P. Wild. Adaptive rejection sampling
for Gibbs sampling. Applied Statistics, pages 337–348,
1992.
[11] T. Griffiths and M. Steyvers. Finding scientific topics.
PNAS Colloquium, 2004.
[12] S. Harter. A probabilistic approach to automatic
keyword indexing. Part II. An algorithm for
probabilistic indexing. Jnl. of the American Society
for Information Science, 26(5):280–289, 1975.
[13] M. Hoffman, D. Blei, C. Wang, and J. Paisley.
Stochastic variational inference. Journal of Machine
Learning Research, 14:1303–1347, 2013.
[14] H. Ishwaran and L. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13:1211–1235, 2003.
[15] H. Ishwaran and L. James. Gibbs sampling methods
for stickbreaking priors. Journal of ASA,
96(453):161–173, 2001.
[16] A. K. McCallum. Mallet: A machine learning for
language toolkit. http://mallet.cs.umass.edu, 2002.
[17] D. Newman, J. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In Proc. of the 2010 Annual Conf. of the NAACL, pages 100–108, 2010.
[18] S. Robertson and H. Zaragoza. The probabilistic
relevance framework: BM25 and beyond. Found.
Trends Inf. Retr., 3(4):333–389, Apr. 2009.
[19] M. RosenZvi, T. Griffiths, M. Steyvers, and P. Smyth.
The authortopic model for authors and documents. In
Proc. of the 20th Annual Conf. on Uncertainty in
Artificial Intelligence (UAI04), pages 487–49, 2004.
[20] I. Sato, K. Kurihara, and H. Nakagawa. Practical
collapsed variational Bayes inference for hierarchical
Dirichlet process. In Proc. of the 18th ACM SIGKDD
international conf. on Knowledge discovery and data
mining, pages 105–113. ACM, 2012.
[21] I. Sato and H. Nakagawa. Topic models with power-law using Pitman-Yor process. KDD '10, pages 673–682. ACM, 2010.
[22] Y. Teh, K. Kurihara, and M. Welling. Collapsed
variational inference for HDP. In NIPS ’07. 2007.
[23] Y. W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore, 2006.
[24] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei.
Hierarchical Dirichlet processes. Journal of the ASA,
101(476):1566–1581, 2006.
[25] H. Wallach, D. Mimno, and A. McCallum. Rethinking
LDA: Why priors matter. In Advances in Neural
Information Processing Systems 19, 2009.
[26] H. Wallach, I. Murray, R. Salakhutdinov, and
D. Mimno. Evaluation methods for topic models. In
ICML ’09, pages 672–679. 2009.
[27] C. Wang, J. Paisley, and D. Blei. Online variational
inference for the hierarchical Dirichlet process. In
AISTATS ’11. 2011.