Experiments with Non-parametric Topic Models
Clayton, VIC, Australia
RSISE, The Australian National University
Canberra, ACT, Australia
In topic modelling, various alternative priors have been de-
veloped, for instance asymmetric and symmetric priors for
the document-topic and topic-word matrices respectively,
the hierarchical Dirichlet process prior for the document-
topic matrix and the hierarchical Pitman-Yor process prior
for the topic-word matrix. For information retrieval,
language models exhibiting word burstiness are important.
Indeed, this burstiness effect has been shown to help topic
models as well, and this requires additional word probability
vectors for each document. Here we show how to combine
these ideas to develop high-performing non-parametric topic
models exhibiting burstiness based on standard Gibbs
sampling. Experiments are done to explore the behavior of the
models under different conditions and to compare the
algorithms with previously published ones. The full non-parametric
topic models with burstiness are only a small factor slower
than standard Gibbs sampling for LDA and require double
the memory, making them very competitive. We look at the
comparative behaviour of different models and present some
Categories and Subject Descriptors
I.7 [Document and Text Processing]: Miscellaneous;
I.2.6 [Artificial Intelligence]: Learning
topic modelling; experimental results; non-parametric prior;
Topic models are now a recognised genre in the suite of ex-
ploratory software available for text data mining and other
semi-structured tasks. Moreover, it is also recognised that
∗Part of this author’s contribution was done while at
KDD’14, August 24–27, 2014, New York, NY, USA.
a broad class of variants can be developed based around
extended graphical models that capture some additional
domain requirements. Examples of these extensions include
author-topic modelling, document segmentation and
word-sense disambiguation; however, the list is extensive.
Here we consider the task of improving the vanilla topic
model, but in the back of our mind is the requirement that
the techniques need to transfer easily to the host of
variants that make up a significant part of use in applications.
While the standard model is Latent Dirichlet Allocation
(LDA), and many techniques exist to scale this up signifi-
cantly, better quality topic models are available. Two clear
and related innovations are available here. The first is the
use of more sophisticated priors for the probability vectors
rather than the simple symmetric Dirichlet , and the sec-
ond is the use of non-parametric methods, best characterised
by the HDP-LDA model , again to improve the priors,
the standard being the hierarchical Dirichlet process. These
have the goals of better estimating prior topic or word pro-
portions, and also estimating the “right” number of topics.
Note these are related in that a technique to estimate the
proportions for different topics can also make some topics
insignificant, thus effectively changing the number of topics.
Another innovation, not so well known, is to model bursti-
ness , which is the idea that once we see a word, we can
expect to see it again. Consider the following news snippet:
Despite their separation, Charles and Diana stayed
close to their boys William and Harry. Here, they
accompany the boys for 13-year-old William’s
first day school at Eton College on Sept. 6, 1995,
with housemaster Dr. Andrew Gayley looking on.
We see here that two words, “boys” and “William”, appear
twice. In the information retrieval literature, a related
phenomenon is the notion of eliteness, whereby words are
said to have different “levels” of occurrence, and this
influenced the development of the dominant relevance paradigm
in information retrieval.
These innovations, non-parametric priors and burstiness,
have so far been bogged down by computationally intensive
techniques that prevent their wider use. The original
implementation of DCMLDA for burstiness could only be
applied to small data sets of fewer than 1000 documents.
A theme of research in recent machine learning conferences
has been a variety of alternative algorithms for inference
with HDP-LDA [22, 27, 2]. One technique of note is the use
of stochastic variational methods that allow application to
streaming data [27, 13]. An excellent theoretical and
empirical comparison of a variety of sampling and variational
methods can be found in .
These newer algorithms, however, are usually based on the
standard stick-breaking formulation for Dirichlet processes
and variational methods for these and the simpler Dirichlet
distribution. Recently, an alternative sampling scheme for
the more general Pitman-Yor process has been developed
 called table indicator sampling. It uses prebuilt tables of
second order generalised Stirling numbers, and the scheme
has seen use on problems such as document segmentation 
and topic models on structured text . Its key advantages
are that it requires no dynamic memory for implementation,
and that the convergence is usually significantly faster and
better quality (it is a collapsed Gibbs sampler) .
This table indicator sampling allows easy development of
non-parametric topic models including HDP-LDA, its
extension to the Pitman-Yor process (PYP) and power-law
models which place the PYP on the topic-word matrix.
What is interesting, however, and this is our first major
contribution, is that we can easily extend this broader
variety of non-parametric topic models by adding a burstiness
component on the front end of the model. Moreover, an
implementation trick lets this be done with little additional
memory/time overhead. These models are available in our
recently MLOSS-released open source multi-core software
The resulting non-parametric LDA algorithms turn out
to be fast. Generally, they are a factor or two (in memory
and time) slower than standard Gibbs implementations of
vanilla LDA and typically 5 times faster than comparable
variational HDP-LDA implementations but with the same
memory requirements. However, we also show that
Mallet's asymmetric-symmetric LDA is a form of truncated
HDP-LDA, and it is an order of magnitude faster again. The
experimental results show improvement over all the existing
methods in perplexity, including Mallet. We also conduct a
number of experiments to explore the nature of the new
algorithms. The full experiments with burstiness are the first
done at moderate scale with this sort of model, and the
experimental insights are our second major contribution. Our
implementation runs bursty HDP-LDA with K=1000 topics
on 800k news articles at 10 minutes per major Gibbs cycle
using a standard 8-core CPU.
We first discuss, at a general level, how we use Pitman-Yor
processes and the nature of the inference with them, in Sec-
tion 2. Section 3 presents the different models used here.
Because our inference schemes are standard block Gibbs
samplers in the table indicator framework, we do not detail
the algorithms here other than describing how we imple-
ment burstiness. Section 4 then presents our experimental
setup, and a sequence of empirical investigations follow in
2. THE HIERARCHICAL PITMAN-YOR
Here we briefly review the methods used for inference on
the hierarchical Pitman-Yor processes that one can see em-
bedded in the model we use . Those not needing to under-
stand the fundamentals of the sampling methods can skip
All our samplers use standard block table indicator Gibbs
samplers for the network of Pitman-Yor processes [4, 9] and
adaptive rejection sampling  for the many hyperparam-
eters. Slice sampling is usually of similar performance but
can suffer with the extremely peaked posteriors for the con-
centration parameter of a Pitman-Yor process.
We use the Pitman-Yor process as a distribution on a
probability vector. The distribution has a mean (i.e.,
another probability vector), a variance parameter represented
as a concentration (usually given as $b_X$ when on vector $\vec{X}$),
and a third parameter called discount (usually given as $a_X$
when on vector $\vec{X}$), so one has
$\vec{p} \sim \mathrm{PYP}(a_p, b_p, \vec{\theta})$. The
Pitman-Yor process, when used in this way, can be used
hierarchically to form distributions on a network of probability
vectors.
The inference on a network of probability vectors is based
on a basic property of species sampling schemes  that
is best understood using the framework of message passing
over networks. Figure 1a shows the context of a probability
vector $\vec{p}$ having a Pitman-Yor process with base distribution
$\vec{\theta}$. Two multinomial style likelihood messages are passed up
to $\vec{p}$ with counts $\vec{n}$ and $\vec{m}$. The standard message then
passed on from $\vec{p}$ to $\vec{\theta}$ using the multinomial-Dirichlet
distribution is a complex set of gamma functions obtained
using the normalising term for a Dirichlet distribution,
represented in the figure as $l_{Dir}(\vec{\theta})$ (assuming a concentration
at the node of $b_p$)

$$\frac{\Gamma(b_p)}{\Gamma(b_p + N + M)} \prod_k \frac{\Gamma(b_p\theta_k + n_k + m_k)}{\Gamma(b_p\theta_k)} \qquad (1)$$

where the total statistics are $N = \sum_k n_k$ and $M = \sum_k m_k$.
This functional complexity on $\vec{\theta}$ prevents any further network
inference.

Figure 1b shows the alternative after marginalising out the
vector $\vec{p}$ using Pitman-Yor process theory. One however
must introduce a new latent count vector $\vec{t}$ that represents
the fraction of the data $\vec{n} + \vec{m}$ that passes up in a message to
$\vec{\theta}$. These are called table multiplicities and they correspond
to the number of tables in the corresponding Chinese restaurant
process (CRP). The multiplicities have a bounding
constraint $t_k \le n_k + m_k$ and moreover $t_k \equiv 0$ if and only if
$n_k + m_k \equiv 0$. Thus at the expense of introducing a latent
count vector ($\vec{t}$) one gets a simple multinomial likelihood
passed up the hierarchy, albeit with a complex looking but
O(1) normalising constant, in the form

$$\frac{\Gamma(b_p)}{\Gamma(b_p + N + M)} \, (b_p|a_p)_T \prod_k S^{n_k+m_k}_{t_k,a_p} \, \theta_k^{t_k} \qquad (2)$$

where the total $T = \sum_k t_k$ and $(x|y)_T$ denotes the Pochhammer
symbol, $(x|y)_T = x(x + y)\cdots(x + (T - 1)y)$. $S^c_{t,a}$ is a
generalized second-order Stirling number. Libraries exist
for efficiently dealing with this making it in most cases
an O(1) computation.

Figure 1: Computation with species sampling models.
(a) Embedded probability vector with messages.
(b) Embedded count vectors after marginalising.

The counts of $\vec{t}$ then contribute to the data at the parent
node $\vec{\theta}$ and thus its posterior probability. Thus network
inference is feasible, and moreover no dynamic memory is
required, unlike CRP methods, because $\vec{t}$ is the same
dimension as $\vec{n} + \vec{m}$. Sampling the $\vec{t}$ directly, however, leads
to a poor algorithm because they may have a large range
(above, $0 \le t_k \le n_k + m_k$) and the impact of data at the
node $\vec{p}$ on the node $\vec{\theta}$ is buffered by $\vec{t}$, leading to poor mixing.
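The generalised Stirling numbers $S^n_{t,a}$ appearing in Equation (2) can be tabulated once and then looked up, which is what makes the normalising constant cheap to evaluate. The following is only an illustrative Python sketch, not the hca implementation: the function name `log_stirling_table` is hypothetical, and it assumes the standard recursion $S^{n+1}_{t,a} = S^n_{t-1,a} + (n - ta)\,S^n_{t,a}$ with $S^0_{0,a} = 1$, stored in log space to avoid overflow.

```python
import math


def log_stirling_table(n_max, a):
    """Return table[n][t] = log S^n_{t,a} for 0 <= t <= n <= n_max.

    Entries where the Stirling number is zero hold -inf.
    Assumes a discount 0 <= a < 1.
    """
    neg_inf = float("-inf")
    table = [[neg_inf] * (n_max + 1) for _ in range(n_max + 1)]
    table[0][0] = 0.0  # S^0_{0,a} = 1
    for n in range(n_max):
        for t in range(1, n + 2):
            left = table[n][t - 1]   # log S^n_{t-1,a}
            right = table[n][t]      # log S^n_{t,a}, -inf when t = n + 1
            if right != neg_inf:
                right += math.log(n - t * a)
            hi = max(left, right)
            if hi == neg_inf:
                continue
            # log-sum-exp of the two terms of the recursion
            table[n + 1][t] = hi + math.log(
                math.exp(left - hi) + math.exp(right - hi)
            )
    return table
```

With $a = 0$ the table reduces to the unsigned Stirling numbers of the first kind, which gives a quick sanity check on the recursion.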
Table indicators are introduced by  to allow the sampling
of the table multiplicity vectors $\vec{t}$ to be done incrementally
and thus allow more rapid mixing and simpler sampling.
Note the table indicators are Boolean values indicating
if the current data item increments the table multiplicity
at its node and thus the data item contributes to the
message to the parent. The assignment of indicators to data can
be done because of the above constraint $t_k \le n_k + m_k$. So
the data item contributes a +1 to $n_k + m_k$ and the matched
table indicator contributes either 0 or +1 to $t_k$, which is the
change in the message to the parent. If there is a grandparent
node, then a corresponding table indicator in the parent
node might also propagate a +1 up to the grandparent.
For inference on a network of such vectors, each probability
vector node contributes a factor to the posterior probability.
For the above example with table indicators this is
given by Formula (3)

$$\frac{\Gamma(b_p)}{\Gamma(b_p + N + M)} \, (b_p|a_p)_T \prod_k S^{n_k+m_k}_{t_k,a_p} \, \theta_k^{t_k} \binom{n_k+m_k}{t_k}^{-1} \qquad (3)$$

where the addition of the $\binom{n_k+m_k}{t_k}^{-1}$ term over Equation (2)
simply divides by the number of choices there are for picking
the $t_k$ Boolean table indicators to be on out of a possible
$n_k + m_k$.

In sampling, a data point coming from the node source
for $\vec{n}$ contributes a +1 to $n_k$ (for some $k$), and either
contributes a +1 or a 0 to $t_k$ depending on the value of the table
indicator. If $n_k = t_k = 0$ initially, then it must contribute
a +1 to $t_k$, so there is no choice. The change in posterior
probability of Formula (3) due to the new data point at this
node is, given the Boolean indicator $r_l$,

$$\begin{cases}
\dfrac{(t_k + 1)(b_p + T a_p)}{(n_k + m_k + 1)(b_p + N + M)} \, \dfrac{S^{n_k+m_k+1}_{t_k+1,a_p}}{S^{n_k+m_k}_{t_k,a_p}} \, \theta_k & r_l = 1 \\[2ex]
\dfrac{n_k + m_k - t_k + 1}{(n_k + m_k + 1)(b_p + N + M)} \, \dfrac{S^{n_k+m_k+1}_{t_k,a_p}}{S^{n_k+m_k}_{t_k,a_p}} & r_l = 0
\end{cases} \qquad (4)$$

depending on the value of the table indicator $r_l$ for the data
point. In a network, one has to jointly sample table indicators
for all reachable ancestor nodes in the network, and
standard discrete graphical model inference is done in closed
form. Examples are given by [7, 8].

For estimation, one requires the expected probabilities
$\mathbb{E}_{\vec{n},\vec{m},\vec{t},\vec{\theta}}[\vec{p}]$ at a node. In the table indicator framework this
is harder to compute. Fortunately, with a trivial change of
latent variables (drop the table indicators and reintroduce
the table occupancies for the CRP) we get the usual estimation
formula for the CRP given by Equation (5)

$$\mathbb{E}_{\vec{n},\vec{m},\vec{t},\vec{\theta}}[\vec{p}] = \frac{b_p + T a_p}{b_p + N + M} \, \vec{\theta} + \frac{\vec{n} + \vec{m} - a_p \vec{t}}{b_p + N + M}. \qquad (5)$$

Since this does not involve knowing the table occupancies
for the CRP, no additional sampling is needed to compute
the formula, just the existing counts (i.e., $\vec{n}$, $\vec{m}$, $\vec{t}$) are used.
Moreover, we know the estimates normalise correctly.
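The estimation formula of Equation (5) needs only the stored counts, and the claim that it normalises can be checked numerically. The sketch below is a minimal Python illustration; the function name `expected_probability` and the argument names are hypothetical and are not taken from the hca suite.

```python
def expected_probability(theta, n, m, t, a_p, b_p):
    """Posterior mean of p at a node, following Equation (5).

    theta   : parent probability vector (sums to one)
    n, m    : the two count vectors arriving at the node
    t       : table multiplicities, with t[k] <= n[k] + m[k]
    a_p, b_p: discount and concentration of the PYP at the node
    """
    total = sum(n) + sum(m)          # N + M
    tables = sum(t)                  # T
    denom = b_p + total
    parent_weight = (b_p + tables * a_p) / denom
    return [
        parent_weight * th + (nk + mk - a_p * tk) / denom
        for th, nk, mk, tk in zip(theta, n, m, t)
    ]
```

The mass moved to the parent, $(b_p + T a_p)/(b_p + N + M)$, is exactly what the discounted counts give up, which is why the entries sum to one whenever $\vec{\theta}$ does.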
The basic non-parametric topic model we consider is given
in Figure 2. Here, the document-topic proportions $\vec{\theta}_i$ (for $i$
running over documents) have a PYP with mean $\vec{\alpha}$ and the
topic-word proportions $\vec{\phi}_k$ (for $k$ running over topics) have
a PYP with mean $\vec{\beta}$. The mean vectors $\vec{\alpha}$ and $\vec{\beta}$ correspond
to the asymmetric priors of .

Figure 2: Non-parametric topic model.

While we show $\vec{\alpha}$ and $\vec{\beta}$ having a GEM prior [15, 24] in
the figure, allowing different priors covers a range of LDA
styles, as shown in Table 1. For instance, when $\vec{\alpha}$ is
finite and the discount for the PYP on $\vec{\theta}$, $a_\theta$, is zero, then
$\vec{\theta} \sim \mathrm{Dirichlet}(b_\theta \vec{\alpha})$. Thus the two PYPs in the figure can be
configured to be Dirichlets, giving the standard LDA set-up
for $\vec{\theta}$ and likewise for $\vec{\phi}_k$. The GEM is equivalent to the
stick-breaking prior that is at the core of a DP or PYP, so
using this with $\vec{\alpha}$ and a truncated $K$, and setting $\vec{\phi}_k$ up to be
Dirichlet distributed as just shown, we have truncated HDP-LDA.

Table 1: Family of LDA Models. The “tr” abbreviates “truncated”
and “symm” abbreviates “symmetric”.
[Table body not recovered. Surviving cells: column header “$\vec{\alpha}$ prior”; four entries “finite $\vec{u}$”.]

Notice there are different ways of providing a truncated
prior to ensure a fixed dimensional $\vec{\alpha}$. The truncated GEM
is used in various versions of truncated HDP-LDA [22, 27],
and the simpler truncation, just using a Dirichlet, is implicit
in the asymmetric priors of . That is, the asymmetric-symmetric
(AS) variant of LDA is equivalent to a truncated
HDP-LDA. This means that Mallet has implemented
a truncated HDP-LDA (via AS-LDA) since 2008,
and it is indeed both one of the fastest and the best per-
Thus we reproduce several alternative variants of LDA,
as well as truncated versions of HDP-LDA, HPYP-LDA
and a fully non-parametric asymmetric version (with the
truncated GEM prior on both $\vec{\alpha}$ and $\vec{\beta}$) we refer to as
NP-LDA. Sampling algorithms for dealing with the HPYP-LDA
case are from earlier work, and the other cases are similar.
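To make the truncated GEM prior concrete, the sketch below draws a $K$-dimensional vector by stick-breaking. It is an illustration under stated assumptions rather than the construction used in hca: the function name `truncated_gem` is hypothetical, the sticks follow the usual two-parameter form $V_k \sim \mathrm{Beta}(1 - a, b + ka)$, and the truncation simply gives all leftover mass to the last component (other truncations exist). It assumes $0 \le a < 1$ and $b > 0$.

```python
import random


def truncated_gem(K, a, b, rng):
    """Draw a K-dimensional probability vector by truncated stick-breaking.

    a is the discount, b the concentration; rng is a random.Random instance.
    """
    remaining = 1.0          # length of stick not yet assigned
    weights = []
    for k in range(1, K):
        v = rng.betavariate(1.0 - a, b + k * a)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # truncation: last component takes the rest
    return weights
```

With $a = 0$ this is the stick-breaking prior of the Dirichlet process, so the same routine covers the truncated HDP-LDA setting as a special case.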
3.2 Bursty Models
The extension with burstiness we consider is given in
Figure 3. Here, each topic $\vec{\phi}_k$ is specialised to a variant
specific to each document $i$, $\vec{\psi}_{k,i}$. Thus
$\vec{\psi}_{k,i} \sim \mathrm{PYP}(a_\psi, b_{\psi,k}, \vec{\phi}_k)$.

Figure 3: Model with topic burstiness.

On the surface, introducing potentially $K * W$
(number of words by number of topics) new parameters
for each document, for the $\vec{\psi}_k$, seems statistically impractical.
In practice, the $\vec{\psi}_k$ are marginalised out during inference
and book-keeping only requires a small number of additional
latent variables. Note that each topic $k$ has its own
concentration parameter $b_{\psi,k}$. This feature will be illustrated in
3.3 Inference with Burstiness
In LDA style topic modelling using our approach, we get a
formula for sampling a new topic $z$ for a word $w$ in position $l$
in a document $d$. Suppose all the other data and the rest of
the document is $D^{-(d,l)}$ and this is some model $M$ (maybe
NP-LDA or LDA, etc.) with hyperparameters. Then denote
this Gibbs sampling formula as $p(z \mid w, D^{-(d,l)}, M)$.
For LDA, this is just the standard collapsed Gibbs sampling
formula. It also forms the first step of the block Gibbs
sampler we use for HDP-LDA: first we sample the topic
$z$, and then we sample the various table indicators given $z$.

The burstiness model built on $M$, denote it M-B, is sampled
using $p(z \mid w, D^{-(d,l)}, \text{M-B})$, which is computed using
$p(z \mid w, D^{-(d,l)}, M)$. Thus we say the burstiness model M-B
is a front end to the Gibbs sampler. At the position $l$ in a
document we have a word of type $w$ and wish to resample
its topic $z = k$. Let $n_{w,k}$ be the number of other existing
words of the type $w$ already in topic $k$ for the current document,
and let $s_{w,k}$ be the corresponding table multiplicities.
They are statistics for the parameters $\vec{\psi}_k$ in the burstiness
model. Note by keeping track of which words in a document
are unique, one knows that $n_{w,k} = 0$ for those words, thus
computation can be substantially simplified. Let $N_{\cdot,k}$ and
$S_{\cdot k}$ be the corresponding totals for the topic $k$ in the
document (i.e., summed over words). The matrices of counts
$n_{w,k}$ and $s_{w,k}$ and vectors $N_{\cdot,k}$ and $S_{\cdot k}$ can be recomputed
as each document is processed in time proportional to the
length of the document.

The Gibbs sampling probability for choosing $z = k$ at
position $l$ for the burstiness model is obtained using Equation (4):

$$\begin{aligned}
p(z = k \mid w, D^{-(d,l)}, \text{M-B}) \;\propto\;
& p(z \mid w, D^{-(d,l)}, M) \, \frac{b_{\psi,k} + a_\psi S_{\cdot k}}{b_{\psi,k} + N_{\cdot,k}} \, \frac{s_{w,k} + 1}{n_{w,k} + 1} \, \frac{S^{n_{w,k}+1}_{s_{w,k}+1,a_\psi}}{S^{n_{w,k}}_{s_{w,k},a_\psi}} \\
& + \frac{1}{b_{\psi,k} + N_{\cdot,k}} \, \frac{n_{w,k} - s_{w,k} + 1}{n_{w,k} + 1} \, \frac{S^{n_{w,k}+1}_{s_{w,k},a_\psi}}{S^{n_{w,k}}_{s_{w,k},a_\psi}}
\end{aligned} \qquad (6)$$

This has a special case when $s_{w,k} = n_{w,k} = 0$ of

$$p(z \mid w, D^{-(d,l)}, M) \, \frac{b_{\psi,k} + a_\psi S_{\cdot k}}{b_{\psi,k} + N_{\cdot,k}}. \qquad (7)$$
Once topic $z = k$ is sampled, the second term of Equation (6)
is proportional to the probability that the table indicator for
word $w$ in the $\vec{\psi}_k$ PYP is zero, i.e., it does not contribute data
to the parent node $\vec{\phi}_k$, so the original model $M$ will ignore
this data point. The first term of Equation (6) is proportional
to the probability that the table indicator is one, so
it does contribute data to the parent node $\vec{\phi}_k$, i.e., back
to the original model $M$. This table indicator is sampled
according to the two terms and the $n_{w,k}$, $s_{w,k}$, $N_{\cdot,k}$, $S_{\cdot k}$ are
all updated. If the table indicator is one then the original
model M processes the data point in the manner it usually
Thus Equation (6) is used to filter words, so we refer to it
as the burstiness front-end. Only words with table indicators
of one are allowed to pass through to the regular model $M$
and contribute to its statistics for $\vec{\phi}_k$ and, for instance, any
further PYP vectors in the model.
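The front-end step can be sketched in a few lines of Python. This is an illustration only, not the hca code, and it rests on an assumption: that Equation (6) has the same two-term shape as Equation (4), with $p(z \mid w, D^{-(d,l)}, M)$ standing in for the parent probability. All function and argument names (`stirling_table`, `front_end_terms`, `sample_topic_and_indicator`) are hypothetical, and plain floating-point Stirling numbers are used, which is only adequate for small counts.

```python
import random


def stirling_table(n_max, a):
    """S[n][t] = generalised Stirling number S^n_{t,a} for n, t <= n_max + 1."""
    S = [[0.0] * (n_max + 2) for _ in range(n_max + 2)]
    S[0][0] = 1.0
    for n in range(n_max + 1):
        for t in range(1, n + 2):
            S[n + 1][t] = S[n][t - 1] + (n - t * a) * S[n][t]
    return S


def front_end_terms(p_base, n_wk, s_wk, N_k, S_k, a_psi, b_psi_k, S):
    """Return (indicator-on term, indicator-off term) for one topic k."""
    denom = b_psi_k + N_k
    base = S[n_wk][s_wk]
    on = (p_base * (b_psi_k + a_psi * S_k) / denom
          * (s_wk + 1) / (n_wk + 1) * S[n_wk + 1][s_wk + 1] / base)
    off = ((1.0 / denom) * (n_wk - s_wk + 1) / (n_wk + 1)
           * S[n_wk + 1][s_wk] / base)
    return on, off


def sample_topic_and_indicator(p_base, n_w, s_w, N, S_tot, a_psi, b_psi, S, rng):
    """Sample topic k from the summed terms, then the table indicator."""
    terms = [
        front_end_terms(p_base[k], n_w[k], s_w[k], N[k], S_tot[k],
                        a_psi, b_psi[k], S)
        for k in range(len(p_base))
    ]
    weights = [on + off for on, off in terms]
    k = rng.choices(range(len(weights)), weights=weights)[0]
    on, off = terms[k]
    indicator = 1 if rng.random() * (on + off) < on else 0
    return k, indicator
```

When the indicator comes back as one, the word would be handed on to the sampler for the underlying model; when it is zero, only the per-document statistics change. For a word not yet seen in the document the off term is zero, so the indicator is forced to one, matching the special case of Equation (7).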
The publicly available hca suite used in these experiments
is coded in C using 16 and 32 bit integers where
needed for saving space. All data preparation is done using
the DCA-Bags package, a set of scripts, and input data can
be handled in a number of formats including the LdaC format.
All algorithms are run on a desktop with an Intel(R)
Core(TM) i7 8-core CPU (3.4GHz) using a single core.
The algorithms have no dynamic memory, so we set the
maximum number of topics K ahead of time. This is like
the truncation level in variational implementations of HDP-
LDA. Moreover, initialisation is done by setting the number
of topics to this maximum and randomly assigning words to
topics. Other authors  report initialising to the maxi-
mum number of topics, rather than 1, leads to substantially
better results, an experimental finding with which we agree.
Note, inference and learning for burstiness requires the
word by topic counts $n_{w,k}$ and word by topic multiplicities
$s_{w,k}$ be maintained for each document, as well as their totals.
There is an implementation trick used to achieve space
efficiency here. First one computes, for each document, which
words appear more than once in the document (i.e., those
for which $n_{w,k}$ can become greater than 1). These words
require special handling, the full Equation (6), and lists of
these are stored in preset variable length arrays. Words that
occur only once in a document are easy to deal with since
their sampling is governed by Equation (7) and no sampling
of the table indicator is needed. Second, the count and
multiplicity statistics (the $n_{w,k}$ and $s_{w,k}$ which are statistics
for $\vec{\psi}_{i,k}$) are not stored but recomputed as each document is
about to be processed. Moreover, this only needs to be done
for words appearing more than once in the document (hence
why lists of these are prestored). All one needs to recompute
these statistics is the Boolean table indicators and the topic
assignments. The statistics $n_{w,k}$, $s_{w,k}$ can be recomputed in
time proportional to the length of the document.
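The recomputation described above is a single pass over the document. The sketch below shows the idea in Python under simplifying assumptions: the name `recompute_doc_stats` is hypothetical, dictionaries stand in for the preset arrays of the C implementation, and every word is counted rather than only those appearing more than once.

```python
from collections import defaultdict


def recompute_doc_stats(words, topics, indicators):
    """Rebuild per-document burstiness statistics in one pass.

    words      : word type at each position of the document
    topics     : topic assignment z at each position
    indicators : Boolean table indicator at each position
    Returns (n, s, N, S): n[(w, k)] and s[(w, k)] are the counts and
    table multiplicities, N[k] and S[k] their totals per topic.
    """
    n = defaultdict(int)
    s = defaultdict(int)
    N = defaultdict(int)
    S = defaultdict(int)
    for w, k, r in zip(words, topics, indicators):
        n[(w, k)] += 1
        N[k] += 1
        if r:  # indicator on: this word opened a table for (w, k)
            s[(w, k)] += 1
            S[k] += 1
    return n, s, N, S
```

Because only the topic assignments and indicators are read, nothing beyond those two arrays has to be kept in memory between passes.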
We have used several datasets for our experiments, the
PN, MLT, RML, TNG, NIPS and LAT datasets. Not all
data sets were used in all comparisons.
The PN dataset is taken from 805K News articles (Reuters
RCV1) using the query “person”, excluding stop words and
words appearing fewer than 5 times. The MLT dataset is abstracts
from the JMLR volumes 1-11, the ICML years 2007-2011,
and IEEE Trans. of PAMI 2006-2011. Stop words were
discarded along with words appearing fewer than 5 or more than
2900 times. The RML dataset is the Reuters-21578 collection,
made up using the standard ModLewis split. The TNG dataset
is the 20-newsgroup dataset using the standard split. For both,
stop words were discarded along with words appearing fewer
than 5 times.
The LAT dataset is the LA Times articles from TREC disk
4. Stop words were discarded along with words appearing
fewer than 10 times. Only words made up entirely of alphabetic
characters or dashes were allowed. Roweis' NIPS dataset was
left as is.

Table 2: Characteristic Sizes of Datasets
Characteristics of these six datasets are given in Table 2,
where dictionary size is W, number of documents (including
test) is D, number of test documents is T and total number
of words in the collection is N.
The algorithms are evaluated on two different measures,
test sample perplexity and point-wise mutual information
(PMI). Perplexity is calculated over test data and is done
using document completion , known to be unbiased and
easy to implement for a broad class of models. The
document completion estimate is averaged over 40 cycles per
document done at the end of the training run and uses an
80-20% split, so every fifth word is used to evaluate perplexity
and the remaining words to estimate latent variables for the
document. Topic comprehensibility can be measured in terms of
PMI. It is done by measuring average word association
between all pairs of words in the top-10 topic words (using
the English Wikipedia articles). Here the PMI reported is the
average across all topics. PMI files are prepared with the
DCA-Bags package using linkCoco and projected onto the
specific data-sets using cooc2pmi.pl in the hca suite.
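The document completion measure can be illustrated with a short Python sketch. It is not the hca evaluation code: the names `split_every_fifth` and `perplexity` are hypothetical, the choice of which position in each block of five is held out is an assumption, and the per-document topic proportions are taken as given rather than estimated by sampling on the 80% portion.

```python
import math


def split_every_fifth(doc):
    """Split a document into (estimation words, held-out words), 80-20."""
    held_out = doc[4::5]
    estimation = [w for i, w in enumerate(doc) if i % 5 != 4]
    return estimation, held_out


def perplexity(held_out_docs, thetas, phi):
    """Perplexity of held-out words under theta (per document) and phi.

    thetas[d][k] : topic proportions for document d
    phi[k][w]    : probability of word index w under topic k
    """
    log_lik = 0.0
    count = 0
    for words, theta in zip(held_out_docs, thetas):
        for w in words:
            p = sum(theta[k] * phi[k][w] for k in range(len(theta)))
            log_lik += math.log(p)
            count += 1
    return math.exp(-log_lik / count)
```

As a sanity check, a model that spreads probability uniformly over a vocabulary of $W$ words has perplexity exactly $W$.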
We also compare results with two other systems: onlinehdp,
a stochastic variational algorithm for HDP-LDA
coded in Python from C. Wang⁵, and HDP, a Matlab+C
combination doing Gibbs sampling from Y.W. Teh. To do the
comparisons, at various timepoints we take a snapshot of the
$\vec{\alpha}$ vector and the $\vec{\phi}_k$ vectors. This is already supported in
onlinehdp, and C. Chen provided the support for this task
with HDP. We then load these values along with the
hyperparameter settings into hca and use its document completion
and PMI evaluation options “-V -p -hdoc,5.” In this way, all
algorithms are compared using identical software.
5.1 Runtime Comparisons
To see how the algorithms work at scale, we consider the
cycle times and memory requirements of the different ver-
sions running on the full LAT data set. These are given in
Table 3. Cycle times in minutes are for a full pass through all
documents and memory requirements are given in megabytes.
LDA, HDP-LDA (where aθ = 0) and NP-LDA are as de-
scribed in Section 3. The right half of the table gives per-
formance for the burstiness model of Figure 3. Note only
a portion of the computation is linear in K so, for instance
NP-LDA with burstiness using K = 2000 topics on the same
dataset takes roughly 90 minutes a cycle and 2.43GB memory.
Moreover, given it is coded in Python using inefficient
allocation, onlinehdp has comparable memory requirements
to hca. In subsequent experiments, we also saw HDP required
5-7 times more memory than hca.

⁵ Some C++ versions also exist.

Table 3: Cycle times and memory requirements on the LA
Times TREC 4 data using K = 500 topics. “Burst” is the
Experiments show that the convergence rates (in cycle
counts not time) are similar for the various Gibbs algorithms
(LDA, Burst LDA, Burst NP-LDA, etc.). Gibbs for full non-
parametric LDA with the burstiness front end gives substan-
tial improvements over vanilla Gibbs LDA while requiring
only 50% more memory and 3 times greater computation
time. Note that the table indicator samplers have previ-
ously been reported to give 1-2% improvement in perplexity
over Chinese restaurant samplers , which in turn retain a
substantial improvement over earlier variational algorithms
for HDP-LDA .
We find that sampling hyper-parameters (for instance,
discounts and concentrations of Pitman-Yor processes) is
important for performance. A substantial part of the time
for topic burstiness is hyper-parameter sampling, something
that is usually less than 5% for the other models. This is
because the model has a different concentration parameter
for every topic, thus much more of the inefficient adaptive
rejection sampling is done for the bursty models versus oth-
A subset of the results is presented in Table 4 and some
informative plots are given in Figure 4 and Figure 5. These
represent the average values computed over 4 independent
runs. Note that the differences between hca's “Burst HDP”
and “Burst NP-LDA” in the table are not significant at the
5% level, but are only mildly significant.
LDA reaches an earlier minimum for perplexity and then
it usually increases, though PMI does increase as well.
Models like HDP-LDA and NP-LDA usually keep on improving
in PMI as the number of topics increases and hold-out
perplexity often wavers about, gradually increasing after a later
minimum. For instance, for the small MLT data set they
reach a minimum perplexity at about K=20. All the while,
PMI keeps improving. For data sets like Reuters-21578 a
much larger number of topics can be supported, for instance
K > 500 easily. The eventual increase in perplexity for
larger K seems counter-intuitive given the non-parametric
slogan of “estimating the right dimension of the model from
the data”. However, remember, we have initialization
artifacts to deal with. Initializing with substantially too many
topics leads to fragmentation/duplication of the topics not
subsequently handled by simple Gibbs sampling. To deal
with this sort of effect, we need something like split-merge
operators in the sampler.
Figure 4: Perplexity and PMI on the RML data for LDA,
Burst LDA and NP-LDA, Burst NP-LDA. (Axes: PMI of topics
against number of topics.)

Figure 5: Perplexity as the number of topics (K) changes
for different algorithms on the MLT data. (Horizontal axis:
number of topics.)
With the burstiness models, however, the change in per-
formance is dramatic. Burstiness almost always improves
PMI, sometimes substantially and the drop in perplexity
is always dramatic. Burstiness makes perplexity peak much
earlier w.r.t. the number of topics, but for the non-parametric
models the subsequent rise in perplexity thereafter is mild.
The non-parametric models cope better with the challenges
of sampling behind the bursty front-end. However, the best
perplexity is reached for a low number of topics on MLT where
all the different models (LDA, HDP-LDA, NP-LDA) have
similar perplexity. For the larger RML data set, LDA’s per-
plexity again peaks earlier but NP-LDA’s keeps improving
for the number of topics considered.
This section compares hca performance with previous
algorithms.

Comparison with onlinehdp and HDP

In order to compare the different systems, hca versus
onlinehdp and HDP, we use the RML and TNG data. We used
a fixed set of hyperparameters with no sampling, so all
discount parameters are set to 0 and the relevant concentration
parameters set to 1 ($b_\alpha$, $b_\theta$) and a symmetric $\vec{\beta} = (0.01)\vec{1}$ is
used. For onlinehdp we did a large number of runs varying
$\tau = 1, 4, 16, 64$, $\kappa = 0.5, 0.8$ and $K = 150, 300$ and
batchsize = 250, 1000. Note $\tau = 64$, $\kappa = 0.8$ are recommended
in . Only the fastest and best converging result is given
for onlinehdp. We did one run of both hca and HDP with
these settings noting that the differences are way outside of
the range of typical statistical variation between individual
runs. Plots of the runs over time are given in Figures 6
and 7. The final PMI scores for the 3 algorithms are given
in Table 5.

Table 4: Document completion perplexity and PMI for hca variants. Data is presented as “Perplexity/PMI”. “HDP” is short

Figure 6: Comparative perplexity for one run on the RML

Table 5: PMI scores for the comparative runs.

Table 6: Effective Number of Topics for the comparative
Figure 7: Comparative perplexity for one run on the TNG

The improvement in perplexity of hca over HDP is not that
surprising because comparative experiments on even simple
models show the significant improvement of table indicator
methods over CRP methods, and Sato et al. also
report substantial differences between different formulations
for variational HDP-LDA. However, the poor performance
of onlinehdp needs some explanation. On looking at the
topics discovered by onlinehdp, we see there are many
duplicates. Moreover, the topic proportions given by the $\vec{\alpha}$
vector show extreme bias towards earlier topics. It is known, for
instance, that variational methods applied to the Dirichlet
make the probability estimates more extreme. In this model
one is working with a tree of Betas, so it seems the effect is
confounded. A useful diagnostic here is the “Effective
Number of Topics” which is given by exponentiating the entropy
of the estimated $\vec{\alpha}$ vector, shown in Table 6. One can see hca
and HDP are similar here but onlinehdp has a dramatically
reduced number of topics. The non-duplicated topics in the
onlinehdp result, however, look good in terms of
comprehensibility, so the online stochastic variational method is
clearly a good way to get a smaller number of topics from a
very large data set.
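The “Effective Number of Topics” diagnostic is simple to compute from the estimated topic proportions. The Python sketch below is a minimal illustration; the function name `effective_number_of_topics` is hypothetical, and the input is normalised first in case it does not already sum to one.

```python
import math


def effective_number_of_topics(alpha):
    """Exponential of the entropy of a (possibly unnormalised) vector."""
    total = sum(alpha)
    entropy = 0.0
    for x in alpha:
        if x > 0:  # zero entries contribute nothing to the entropy
            p = x / total
            entropy -= p * math.log(p)
    return math.exp(entropy)
```

A uniform vector over $K$ topics gives a value of $K$, while a vector concentrated on a few early topics gives a value close to the number of topics actually in use, which is the behaviour the diagnostic is meant to expose.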
Comparison with Mallet

Mallet supports asymmetric-symmetric LDA, which is a
form of truncated HDP-LDA using a finite symmetric Dirichlet
to truncate a GEM. We compare the implementation
of HDP-LDA in Mallet and hca. Results are reported for the
“RML” and “TNG” datasets with 300 topics as before,
and also some from Table 4. As suggested in  we run
Mallet for 2000 iterations, and optimise the hyperparameters
every 10 major Gibbs cycles after an initial burn-in
period of 50 cycles, to get the best results. Table 7 presents
the comparative results. We can see that hca generally
produces better results. Note that results produced by the full
asymmetric version NP-LDA are even better, an option not
implemented in Mallet.

Table 7: Comparative Results for Mallet.
1404 ± 8
4081 ± 27
1357 ± 14
3844 ± 24
1280 ± 2
3999 ± 10
1389 ± na
3726 ± na
1145 ± 2
3586 ± 8
1375 ± na
3676 ± na

Comparison with PCVB0

We also sought to compare hca with the variants of PCVB0
reported in . These are a family of simplified variational
algorithms, though the different variants seem to perform
similarly. Without details of the document pre-processing,
it was difficult to reproduce comparable datasets. Thus only
results for their “KOS blog corpus,” available preprocessed
from the UCI Machine Learning Repository, were used in
producing the comparisons presented in Table 8. We note
the smaller difference here in perplexity is such that better
hyper-parameter estimation with PCVB0 could well make
the algorithms more equal. Interestingly, Sato et al. report
little difference between the symmetric or asymmetric priors
on the Dirichlet on $\vec{\phi}_k$. In contrast, our corresponding
asymmetric version NP-LDA shows significant improvements.

Table 8: Comparative Results for PCVB0
1285 ± 10
1275 ± 10
1267 ± 5
1223 ± 5
1193 ± 5
1151 ± 5
Comparison on NIPS 1988-2000 Dataset

A split-merge variant of HDP-LDA has been developed  that was compared with online and batch variational algorithms. For the NIPS data they have made runs with K = 300 and they estimate all hyperparameters. They use an 80-20% split for document completion and we replicated the experiment with the same dataset, parameter settings and sampling. The results are shown in Figure 8 and should be compared with [2, Figure 2(b)]. Their results show plots for 40 hours whereas we ran for 4.5 hours, so our algorithm is approximately 4 times faster per document. Our Gibbs implementation of HDP-LDA substantially beats all other non-split-merge algorithms. Not surprisingly, the sophisticated split-merge sampler eventually reaches the performance of ours. Note the NP-LDA model is superior to HDP-LDA on this data, and the bursty versions are clearly superior to all

Figure 8: Convergence on Roweis’ NIPS data for K = 300.
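The document completion evaluation used in this comparison can be sketched as follows, under stated assumptions. Each document's tokens are split 80-20, topic proportions are estimated from the larger part, and perplexity is the exponentiated negative mean log-probability of the held-out tokens. Taking the first 80% of tokens as the estimation part is an assumption made here for illustration (the split could equally be random), and both helper names are hypothetical.

```python
import math


def split_document(tokens, train_fraction=0.8):
    """Split one document's tokens into an estimation part and a held-out part.

    Assumption: the first `train_fraction` of the tokens is used for
    estimation; the experiments may use a different splitting rule.
    """
    cut = int(len(tokens) * train_fraction)
    return tokens[:cut], tokens[cut:]


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the held-out tokens."""
    if not token_log_probs:
        raise ValueError("need at least one held-out token")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


# Illustrative data: a ten-token document, and held-out tokens that a
# hypothetical model predicts with probability 0.25 each.
estimation_part, held_out_part = split_document(list(range(10)))
example_perplexity = perplexity([math.log(0.25)] * len(held_out_part))
```

In practice the log-probabilities would come from the trained topic model applied to the held-out part; lower perplexity indicates a better fit.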
5.4 Effect of Hyperparameters on the Number of Topics
Standard reporting of experiments using HDP-LDA usually sets the β parameter, which governs the symmetric prior for the $\vec{\phi}_k$. For instance, some authors  call this η and it is set to 0.01. Here we explore what happens when we vary this parameter for the RML data. Note we have done this experiment on most of the data sets and the results are comparable. We train HDP-LDA for 1000 Gibbs cycles and then record the evaluation measures. This takes 60 minutes on the desktop for each value of β. We also do a run where β is sampled. For each of the curves, the stopping point on the right gives the number of full topics used by the algorithms (ignoring trivially populated topics with 1-2 words). So the lowest perplexity is achieved by HDP-LDA with β = 0.001 where roughly K = 2,400 topics are used. Sampling β roughly tracks the lowest achieved for each number of topics.

Figure 9: Perplexity and PMI for the RML data when varying β in the symmetric prior for HDP-LDA.
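The number of full topics read off each curve in Figure 9 ignores trivially populated topics. A minimal sketch of that filter is given below, assuming a list holding the number of word tokens currently assigned to each topic; the function name and the threshold of three words (so that topics with only 1-2 words are dropped) are illustrative choices rather than the hca implementation.

```python
def count_full_topics(words_per_topic, min_words=3):
    """Count topics holding at least `min_words` word tokens.

    Topics with only one or two assigned words (or none) are treated
    as trivially populated and are not counted.
    """
    return sum(1 for n in words_per_topic if n >= min_words)


# Illustrative counts only: three topics carry real mass here.
example_count = count_full_topics([120, 2, 0, 1, 57, 3])
```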
The PMI results also indicate that for larger β one obtains more comprehensible topics, though fewer of them. Thus there is a trade-off: if you want fewer but more comprehensible topics, for instance a coarser summary of the content, then make β larger. If you want a better fit to the data, or more finely grained topics, then estimate β properly.
Table 9: Low proportion topics (proportion below 0.001) with lower variance factor for LAT data when K = 500.
Zsa gabor capos slapping avert anhalt enright rolls-Royce cop-slapping Hensley judgeship Leona
herald tribune examiner dailies gannet batten numeric press-telegram petersburg sentinel
Baker PT evangelist bakers Tammy Faye swagged evangelists televangelists defrocked
Thus we can see that the number of topics found by HDP-LDA is significantly affected by the hyperparameter β, and thus it is probably inadvisable to fix it without careful experimentation, consideration or sampling.
number of topics on RML, with roughly 20,000 documents
is up to 2,000. Inspection shows a good number of these are
comprehensible. With larger collections we claim it would
be impractical to attempt to “estimate” the right number of
topics. For larger collections, one could be estimating tens
of thousands of topics. Is this large number of topics even
5.5 Topic Specific Concentrations
For the topic burstiness model of Figure 3 we had topic specific concentrations to the PYP, $b_{\phi,k}$. Now the concentration and discount together control the variance. So for document $i$ and topic $k$, the variance of a word probability $\psi_{i,k,w}$ from its mean $\phi_{k,w}$ will be

$$\frac{1-a}{1+b_{\phi,k}}\,(1-\phi_{k,w})\,\phi_{k,w},$$

where $a$ denotes the PYP discount. We call the ratio $\frac{1-a}{1+b_{\phi,k}}$ the variance factor. If it is close to one then the word proportions $\vec{\psi}_{i,k}$ for the topic have little relationship to their mean $\vec{\phi}_k$. If close to zero they are similar. Figure 10 considers 500 topics from a model built on the LAT data with K = 500 using PYP-LDA and topic burstiness. About 15% of the topics have low values for concentration that make the topics effectively random, and thus not properly used. Examples of topics with low proportions but variance factor below 0.4, so the topics are still usable, are given in Table 9. The first topic is actually about two issues: the first is the Zsa Zsa Gabor slapping incident, and the second is about Orange County Dist. Attys. Avert and Enright.

Figure 10: Topic proportions versus the variance factor for LAT data when K = 500.
We have shown that an implementation of the HDP-LDA non-parametric topic model and related non-parametric extension NP-LDA using block table indicator Gibbs sampling methods  is significantly superior to recent variational methods in terms of perplexity of results. The NP-LDA is also significantly superior in perplexity to the Mallet implementation of truncated HDP-LDA (masquerading as asymmetric-symmetric LDA). Taking account of the different implementation languages, the newer Gibbs samplers and variational methods also have the same memory footprint. Mallet is substantially faster, however, and performs well for
We note that these have two goals, (A) better estimating prior topic or word proportions, and (B) estimating the “right” number of topics. The non-parametric methods seem superior at the first goal (A) over the parametric equivalents. Given that the estimated number of topics grows substantially with the collection sizes, it is not clear how important goal (B) can be. Arguably, goal (A) is the more important
Moreover, we have developed a Gibbs theory of burstiness
• Is implemented as a front-end so can in principle read-
ily be applied to most variants of a topic model that
use a Gibbs sampler.
• It is a factor of 1.5-2 slower per major Gibbs cycle.
This will allow the wide variety of topic-model variants to
easily take advantage of the burstiness model.
Through the experiments, we have illustrated some char-
acterizations of the models, for instance:
• Our asymmetric-asymmetric NP-LDA model is about
75% slower than HDP-LDA but generally performs
better than HDP-LDA, a different result to published
results [25, 20] due to the different algorithms.
• The topic comprehensibility (as measured using PMI)
is substantially improved by the burstiness version, as
reported in the original work .
• The topic concentration parameter in the burstiness
model goes very low when the topic is insignificant.
We can use this to estimate which topics have become
inactive in the model.
• The concentration parameter for the topic-word vec-
tors significantly affects results, so care should be taken
in experiments using these models.
Both authors were funded partly by NICTA. NICTA is
funded by the Australian Government through the Depart-
ment of Communications and the Australian Research Coun-
cil through the ICT Centre of Excellence Program. Thanks
to Changyou Chen and Kar Wai Lim for their feedback and
Changyou for running the HDP experiments.
[1] J. Boyd-Graber, D. Blei, and X. Zhu. A topic model for word sense disambiguation. In EMNLP-CoNLL, pages 1024–1033, 2007.
[2] M. Bryant and E. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages
[3] W. Buntine and M. Hutter. A Bayesian view of the Poisson-Dirichlet process. Technical Report arXiv:1007.0296 [math.ST], arXiv, Feb. 2012.
[4] C. Chen, L. Du, and W. Buntine. Sampling table configurations for the hierarchical Poisson-Dirichlet process. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD, pages 296–311. Springer, 2011.
[5] G. Doyle and C. Elkan. Accounting for burstiness in topic models. In Proc. of the 26th Annual Int. Conf. on Machine Learning, ICML ’09, pages 281–288, 2009.
[6] L. Du. Non-parametric Bayesian Methods for Structured Topic Models A Mixture Distribution Approach. PhD thesis, School of Computer Science, the Australian National University, Canberra,
[7] L. Du, W. Buntine, and H. Jin. Modelling sequential text with an adaptive topic model. In Proc. of the 2012 Joint Conf. on EMNLP and CoNLL, pages 535–545. ACM, 2012.
[8] L. Du, W. Buntine, and M. Johnson. Topic segmentation with a structured topic model. In HLT-NAACL, pages 190–200. The Association for Computational Linguistics, 2013.
[9] L. Du, W. Buntine, and M. Johnson. Topic segmentation with a structured topic model. In Proceedings of NAACL-HLT, pages 190–200, 2013.
[10] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pages 337–348,
[11] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS Colloquium, 2004.
[12] S. Harter. A probabilistic approach to automatic keyword indexing. Part II. An algorithm for probabilistic indexing. Jnl. of the American Society for Information Science, 26(5):280–289, 1975.
[13] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
[14] H. Ishwaran and L. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13:1211–1235, 2003.
[15] H. Ishwaran and L. James. Gibbs sampling methods for stick-breaking priors. Journal of ASA,
[16] A. K. McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[17] D. Newman, J. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In Proc. of the 2010 Annual Conf. of the NAACL, pages 100–108, 2010.
[18] S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, Apr. 2009.
[19] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proc. of the 20th Annual Conf. on Uncertainty in Artificial Intelligence (UAI-04), pages 487–49, 2004.
[20] I. Sato, K. Kurihara, and H. Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proc. of the 18th ACM SIGKDD international conf. on Knowledge discovery and data mining, pages 105–113. ACM, 2012.
[21] I. Sato and H. Nakagawa. Topic models with power-law using Pitman-Yor process. KDD ’10, pages 673–682. ACM, 2010.
[22] Y. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In NIPS ’07. 2007.
[23] Y. W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore, 2006.
[24] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the ASA,
[25] H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems 19, 2009.
[26] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In ICML ’09, pages 672–679. 2009.
[27] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS ’11. 2011.