Automatic Evaluation of Local Topic Quality
Jeffrey Lund, Piper Armstrong, Wilson Fearn, Stephen Cowley, Courtni Byun, Jordan Boyd-Graber, Kevin Seppi
Computer Science Department
Brigham Young University
{jefflund, piper.armstrong, wfearn,
scowley4, emilyhales, kseppi}@byu.edu
jbg@umiacs.umd.edu
Abstract
Topic models are typically evaluated with re-
spect to the global topic distributions that they
generate, using metrics such as coherence, but
without regard to local (token-level) topic as-
signments. Token-level assignments are im-
portant for downstream tasks such as classifi-
cation. Even recent models, which aim to im-
prove the quality of these token-level topic as-
signments, have been evaluated only with re-
spect to global metrics. We propose a task
designed to elicit human judgments of token-
level topic assignments. We use a variety of
topic model types and parameters and discover
that global metrics agree poorly with human
assignments.
Since human evaluation is expensive, we pro-
pose a variety of automated metrics to evaluate
topic models at a local level. Finally, we cor-
relate our proposed metrics with human judg-
ments from the task on several datasets. We
show that an evaluation based on the percent
of topic switches correlates most strongly with
human judgment of local topic quality. We
suggest that this new metric, which we call
consistency, be adopted alongside global met-
rics such as topic coherence when evaluating
new topic models.
1 Introduction
Topic models such as Latent Dirichlet Allocation
(or LDA) (Blei et al.,2003) aim to automatically
discover topics in a collection of documents, giv-
ing users a glimpse into themes present in the doc-
uments. LDA jointly derives a set of topics (a
distribution over words) and token-topic assign-
ments (a distribution over the topics for each to-
ken). While the topics by themselves are valuable,
the token-topic assignments are also useful as fea-
tures for document classification (Ramage et al.,
2009;Nguyen et al.,2015;Lund et al.,2018) and,
in principle, for topic-based document segmenta-
tion.
Given the number of algorithms available for
topic modeling, the questions of algorithm selec-
tion and model evaluation can be as daunting as
they are important. When the model is used for a
downstream evaluation task (e.g., document clas-
sification), these questions can often be answered
by maximizing downstream task performance. In
most other cases, automated metrics such as topic
coherence (Newman et al.,2010) can help assess
topic model quality. Generally speaking, these
metrics evaluate topic models globally, meaning
that the metrics evaluate characteristics of the top-
ics (word distributions) themselves without regard
to the quality of the topic assignments of individ-
ual tokens.
In the context of human interaction, this means
that models produce global topic-word distribu-
tions that typically make sense to users and serve
to give a good high-level overview of the general
themes and trends in the data. However, the lo-
cal topic assignments can be bewildering. For ex-
ample, Figure 1 shows typical topic assignments
using LDA. Arguably, most, if not all, of the sen-
tence should be assigned to the Music topic since
the sentence is about a music video for a partic-
ular song. However, parts of the sentence are
assigned to other topics including Gaming and
Technology, possibly because other sentences in
the same document are concerned with those top-
ics. Even noun-phrases, such as ‘Mario Winans’
in Figure 1, which presumably should be assigned
to the same topic, are split across topics.
In the context of downstream tasks, global eval-
uation ignores the fact that local topic assignments
are often used as features. If the topic assignments
are inaccurate, the accuracy of the classifier may
suffer.
The literature surrounding this issue has fo-
A dance[1] break[1] by P. Diddy[1] is also featured[2] in both settings[4] of the video[2], intercut[1] with scenes[2] of Mario[3] Winans[1] playing[1] the drums[1].
Topics: 1 = Music, 2 = Film, 3 = Gaming, 4 = Technology
Figure 1: Topic assignments from LDA on a sentence
from a Wikipedia document. Notice that even noun-
phrases are split in a way which is bewildering to users.
A dance[1] break[1] by P. Diddy[1] is also featured[1] in both settings[2] of the video[2], intercut[2] with scenes[2] of Mario Winans playing[2] the drums[2].
Topics: 1 = Music, 2 = Film, 3 = Gaming, 4 = Technology
Figure 2: An example of how topics might be assigned
if done by a human.
cused on improving local topic assignments, but
no metrics that specifically assess the quality of
these assignments have been proposed. Instead the
literature evaluates models with global metrics or
subjective examination.
For example, HMM-LDA (Griffiths et al.,2004)
integrates syntax and topics by allowing words to
be generated from a special syntax-specific topic.
TagLDA (Zhu et al.,2006) adds a tag specific
word distribution for each topic, allowing syn-
tax to impose local topic structure. The syntac-
tic topic model, or STM (Boyd-Graber and Blei,
2009), extends this idea and generates topics us-
ing syntactic information from a parse tree. An al-
ternative approach to improving local topic qual-
ity is by adding a Markov property to topic as-
signments. The hidden topic Markov model (Gru-
ber et al.,2007, HTMM) does this by adding a
switch variable on each token which determines
whether to reuse the previous topic or generate a
new topic. More recently, Balikas et al. (2016a)
proposed SentenceLDA which assigns each sen-
tence to a single topic. CopulaLDA (Balikas et al.,
2016b) supersedes SentenceLDA, and instead uses
copulas to impose topic consistency within each
sentence of a document.
This paper evaluates token-level topic assign-
ment quality to understand which topic models
produce meaningful local topics for individual
documents and proposes metrics that correlate
with human judgment of the quality of these as-
signments.
2 Global Evaluation
Prior work in automated metrics to evaluate topic
model quality primarily deals with global evalu-
ations (i.e. evaluations of the topic-word distri-
butions that represent topics). Early topic models
such as LDA were typically evaluated using held-
out likelihood or perplexity (Blei et al.,2003).
Wallach et al. (2009) give details on how to es-
timate perplexity. Indeed, perplexity is still fre-
quently used to evaluate models, and each of the
models mentioned in the previous section, includ-
ing CopulaLDA, which was designed to improve
local topic quality, uses perplexity to evaluate the
model. However, while held-out perplexity can
be useful to test the generalization of predictive
models, it has been shown to be negatively corre-
lated with human evaluations of global topic qual-
ity (Chang et al.,2009). This result was elicited
using a topic-word intrusion task, in which hu-
man evaluators are shown the top n most proba-
ble words in a topic-word distribution and asked to
identify a randomly chosen ‘intruder’ word which
was injected into the word list. The topic-word in-
trusion task operates under the assumption that if
a topic is semantically coherent, then the intruder
will be easy to identify.
2.1 Coherence
While human evaluation of topic coherence is
useful, automated evaluations are easier to de-
ploy. Consequently, Newman et al. (2010) pro-
pose a variety of automated evaluations of topic
coherence and correlate these metrics with hu-
man evaluations using the topic-word intrusion
task mentioned above. They show that an evalua-
tion based on aggregating pointwise mutual infor-
mation (PMI) scores across the top n most likely
terms in a topic distribution correlates well with
human evaluations. This metric, colloquially re-
ferred to simply as ‘coherence’, is currently the
most popular form of automated topic model eval-
uation. Note that coherence is a measure of global
topic quality, since it considers only the global
topic-word distributions. We follow this pattern of
leveraging human intuition in the development of
our own automated metrics proposed in Section 4.
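To make the computation concrete, the following is a minimal sketch of a PMI-based coherence score in the spirit of Newman et al. (2010); the input structures (document frequencies and co-document frequencies from a reference corpus) are hypothetical, and the exact aggregation and reference corpus vary across implementations.

```python
import itertools
import math

def coherence_pmi(top_words, doc_freq, co_doc_freq, num_docs, eps=1e-12):
    """PMI-based coherence over the top-n words of one topic (a sketch only).
    Hypothetical inputs:
      top_words:   list of the n most probable words in the topic
      doc_freq:    dict word -> number of reference documents containing it
      co_doc_freq: dict frozenset({w1, w2}) -> number of documents with both
      num_docs:    total number of reference documents
    """
    scores = []
    for w1, w2 in itertools.combinations(top_words, 2):
        p1 = doc_freq.get(w1, 0) / num_docs
        p2 = doc_freq.get(w2, 0) / num_docs
        p12 = co_doc_freq.get(frozenset((w1, w2)), 0) / num_docs
        scores.append(math.log((p12 + eps) / (p1 * p2 + eps)))
    return sum(scores) / len(scores)  # aggregate PMI across word pairs
```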
2.2 Significance
For the purpose of user interaction, topics are typ-
ically summarized by their top n most probable
words. However, when topics are used as features
for downstream tasks such as document classifi-
cation, the characteristics of the entire distribution
become more important. With this in mind, con-
sider two topics which rank the words of the vo-
cabulary by probability in the same order. Suppose
that one of these distributions is more uniform than
the other (i.e., has higher entropy). While both
distributions would be equally interpretable to a
human examining them, the topic-word distribu-
tion with lower entropy places more weight on the
high-rank words and is much more specific.
Using this intuition, AlSumait et al. (2009) de-
velop metrics for evaluating topic significance.
While this work was originally used to rank topics,
it has also been used to characterize entire models
by measuring average significance across all top-
ics in a single model (Lund et al.,2017).
Topic significance is evaluated by measuring
the distance between topic distributions and some
background distribution. For example, we can
measure significance with respect to the uniform
distribution (SIGUNI). Alternatively, we can use
the empirical distribution of words in the corpus,
which we call the vacuous distribution, as our
background distribution (SIGVAC).
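As a rough illustration, the sketch below instantiates significance with KL divergence to the background distribution and averages over topics; AlSumait et al. (2009) consider several distance measures, so this is one possible realization rather than their exact formulation, and the numpy inputs are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as numpy arrays."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def sig_uni(phi):
    """Average distance of each topic from the uniform background (SIGUNI).
    phi: K x V matrix of topic-word probabilities, rows summing to one."""
    K, V = phi.shape
    uniform = np.full(V, 1.0 / V)
    return float(np.mean([kl_divergence(topic, uniform) for topic in phi]))

def sig_vac(phi, corpus_word_counts):
    """Average distance from the empirical ("vacuous") distribution (SIGVAC).
    corpus_word_counts: length-V vector of corpus word frequencies."""
    vacuous = corpus_word_counts / corpus_word_counts.sum()
    return float(np.mean([kl_divergence(topic, vacuous) for topic in phi]))
```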
Like coherence, topic significance is a global
measure of topic quality since it considers the
topic-word distributions without regard to local
topic assignments. However, it differs from topic
coherence in that it considers the entire topic dis-
tribution. Lund et al. (2017) found that when top-
ics were used as features for document classifica-
tion, models with similar coherence scores could
perform differently on downstream classification
accuracy, but the models with higher significance
scores obtained better accuracy.
Automated global metrics have proven useful
for evaluating the topics themselves, that is, the
topic-word distributions. However, no metric has
been shown to effectively evaluate local topic
quality. Therefore, we first correlate existing met-
rics with human judgment of local topic quality;
we obtain these judgments through the crowd-
sourcing task described below.
3 Crowdsourcing Task
Following the general design philosophy in de-
veloping the coherence metric in Newman et al.
(2010), we train a variety of models on various
datasets to obtain data with varying token-level
topic quality. We then evaluate these models us-
ing crowdsourcing data on a task designed to elicit
human evaluation of local topic model quality. By
then correlating the human evaluation with exist-
ing, global metrics, we determine that global met-
rics are inadequate, and then propose new metrics
to better measure local topic quality.
3.1 Datasets and Models
We choose three datasets from domains with dif-
ferent writing styles. These datasets include Ama-
zon product reviews,1 the well-known Twenty
Newsgroups dataset,2 and a collection of news ar-
ticles from the New York Times.3 We apply stop-
word removal using a standard list of stopwords,
and we remove any token which does not appear
in at least 100 documents. Statistics for these three
datasets can be found in Table 1.
Dataset Documents Tokens Vocabulary
Amazon 39388 1389171 3406
Newsgroups 18748 1045793 2578
New York Times 9997 2190595 3328
Table 1: Statistics on datasets used in user study and
metric evaluation.
Once again aiming for a wide variety of topic
models for our evaluation, for each of these
datasets, we train three types of topic models.
As a baseline, we train Latent Dirichlet Alloca-
tion (Blei et al.,2003) on each of the three datasets
using the gensim defaults.4 CopulaLDA (Balikas
et al.,2016b) is the most recent and reportedly
the best model with respect to local topic qual-
ity; we use the authors’ implementation and pa-
rameters. Finally, we use the Anchor Words algo-
rithm (Arora et al.,2013), which is a fast and scal-
able alternative to traditional probabilistic topic
models based on non-negative matrix factoriza-
tion. In our implementation of Anchor Words we
only consider words as candidate anchors if they
appear in at least 500 documents, the dimensionality of the reduced space is 1000, and the threshold
for exponentiated gradient descent is 1e-10. By it-
self, Anchor Words only recovers the topic-word
distributions, so we follow Nguyen et al. (2015)
and use variational inference for LDA with fixed
topics to assign each token to a topic.
In addition to varying the datasets and topic
modeling algorithms, we also vary the number of
topics with the hope of increasing the diversity of
observed topic model quality. For both LDA and
Anchor Words, we use 20, 50, 100, 150, and 200
topics. For CopulaLDA, we use 20, 50, and 100
topics.5 We vary the number of topics to produce
models with small numbers of coherent, albeit less
significant, topics as well as models with large
numbers of more significant topics. Since each
model includes some amount of non-determinism,
we train five instances of each dataset, model, and
topic cardinality and average our results.
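As an illustration of this experiment grid, a minimal sketch of the LDA baseline trained with gensim defaults is shown below; `docs` and the helper name are hypothetical, and the other models are trained analogously with their own implementations.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_lda_grid(docs, topic_counts=(20, 50, 100, 150, 200), runs=5):
    """Train the LDA baseline over the topic-count grid with gensim defaults.
    `docs` is a hypothetical list of token lists after stopword removal and
    rare-word filtering."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    models = {}
    for k in topic_counts:
        # five runs per configuration, later averaged to absorb non-determinism
        models[k] = [LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
                     for _ in range(runs)]
    return models
```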
In the interest of reproducibility, the data, the
scripts for importing and preprocessing the data,
and the code for training and evaluating these topic
models are available in an open source repository.6

1 http://jmcauley.ucsd.edu/data/amazon/
2 http://www.ai.mit.edu/people/jrennie/20Newsgroups/
3 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19
4 https://radimrehurek.com/gensim
5 Unfortunately, CopulaLDA does not scale beyond 100 topics. In contrast to LDA and Anchor Words, which run in minutes and seconds respectively, CopulaLDA takes days to run using the original authors' implementation. Our runs with 150 and 200 topics never finished, as they were finally killed due to excessive memory consumption on 32GB systems.
6 Available after blind review.
3.2 Task Design
The goal for our crowdsourcing task is to have hu-
man annotators evaluate local topic quality. Not
only will this task allow us to evaluate and com-
pare topic models themselves, but it will also al-
low us to determine the effectiveness of automated
metrics. Because local topic quality is subjec-
tive, directly asking annotators to judge assign-
ment quality can result in poor inter-annotator
agreement. Instead, we prefer to ask users to per-
form a task which illuminates the underlying qual-
ity indirectly. This parallels the reliance on the
word intrusion task to rate topic coherence (Chang
et al.,2009).
We call this proposed task ‘topic-word match-
ing’. In this task, we show the annotator a short
snippet from the data with a single token under-
lined along with five topic summaries (i.e., the 10
most probable words in the topic-word distribu-
tion). We then ask the user to select the topic
which best fits the underlined token. One of the
five options is the topic that the model actually as-
signs to the underlined token. The intuition is that
the annotator will agree more often with a topic
model which makes accurate local topic assign-
ments. As alternatives to the model-selected topic
for the token, we also include the three most prob-
able topics in the document, excluding the topic
assigned to the underlined token. A model which
gives high quality token-level topic assignments
should consistently choose the best possible topic
for each individual token, even if these topics are
closely related. Finally, we include a randomly se-
lected intruder topic as a fifth option. This fifth op-
tion is included to help distinguish between an in-
stance where the user sees equally reasonable top-
ics for the underlined token (in which case, the in-
truding topic will not be selected), and when there
are no reasonable options for the underlined token
(in which case, all five topics are equally likely to
be chosen). Figure 3 shows an example of this task
shown to annotators.
For each of our 39 trained models (i.e., for each
model type, dataset, and topic cardinality), we ran-
domly select 1,000 tokens to annotate. For each
of the 39,000 selected tokens, we obtain 5 judg-
ments. We aggregate the 5 judgments by select-
ing the contributor response with the highest con-
fidence, with agreement weighted by contributor
trust. Contributor trust is based on accuracy on
test questions.
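A minimal sketch of the kind of trust-weighted aggregation described above is given below; the judgment format and helper name are hypothetical, and the crowdsourcing platform's own confidence computation may differ in detail.

```python
from collections import defaultdict

def aggregate_judgments(judgments):
    """Pick the response with the highest trust-weighted support for one token.
    `judgments` is a hypothetical list of (selected_topic, contributor_trust)
    pairs collected for a single annotated token."""
    support = defaultdict(float)
    for topic, trust in judgments:
        support[topic] += trust
    return max(support, key=support.get)
```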
We deploy this task on a popular crowdsourc-
ing website7and pay contributors $0.12 USD per
page, with 10 annotations per page. For quality
control on this task, each page contains one test
question. The test questions in our initial pilot
study are questions we hand-select for their obvi-
ous nature. For our test questions in the final study,
we use the ones mentioned above in addition to
questions from the pilot studies with both high an-
notator confidence and perfect agreement. We re-
quire that contributors maintain at least a 70% ac-
curacy on test questions throughout the job, and
that they spend at least 30 seconds per page, but
otherwise impose no further constraints on contrib-
utors. We discuss the results from this final study
in Section 5.
7 https://www.figure-eight.com

Figure 3: Example of the topic-word matching task. Users are asked to select the topic which best explains the underlined token (“Olympic”).

3.3 Agreement Results
We first measure inter-annotator agreement using Krippendorff's alpha with a nominal level of measurement (Krippendorff, 2013). Generally speaking, α = 1 indicates perfect reliability, while α < 0 indicates systematic disagreement. Over all the judgments we obtain, we compute a value of α = 0.44, which indicates a moderate level of agreement.
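For reference, nominal-level alpha can be computed as in the following sketch, which assumes the third-party Python package krippendorff is installed; the reliability matrix shown is purely illustrative and not our data.

```python
import numpy as np
import krippendorff  # third-party package, assumed available (pip install krippendorff)

# One row per annotator, one column per annotated item; entries are the chosen
# option coded as an integer, np.nan where an annotator skipped the item.
# These values are purely illustrative.
reliability_data = np.array([
    [0, 1, 2, np.nan, 4],
    [0, 1, 2, 3, 4],
    [0, 2, 2, 3, np.nan],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement='nominal')
print(f"Krippendorff's alpha: {alpha:.2f}")
```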
We note that when using crowdsourcing, par-
ticularly with subjective tasks such as topic-
word matching, we expect somewhat low inter-
annotator agreement. However, previous work in-
dicates that when properly aggregated, we can still
filter out noisy judgments and obtain reasonable
opinions (Nowak and Rüger, 2010).
Figure 4 summarizes the human agreement with
the three different model types. Surprisingly, de-
spite claiming to produce superior local topic qual-
ity, CopulaLDA actually performs slightly worse
than LDA according to our results with the topic-
word matching task. The fact that CopulaLDA
performs poorly despite being designed to im-
prove local topic quality illustrates the need for
effective local topic quality metrics.
We also note that users agree with Anchor
Words more often than LDA by a wide margin,
indicating that Anchor Words achieves superior
token-level topic assignments. However, in terms
of global topic quality, Anchor Words is roughly
similar to LDA (Arora et al.,2013). One possi-
ble explanation for this is that when using Anchor
Words the task of learning the global topic-word
distributions is separate from the problem of pro-
ducing accurate local topic assignments, making
both tasks easier. For many tasks an argument can
be made for a joint-model, so further investigation
into this phenomenon is warranted.
Figure 4: Plot showing human agreement with each
model type. CopulaLDA performs slightly worse than
LDA. Humans preferred topic assignments from An-
chor Words by a wide margin.
3.4 Global Metrics Correlation
For Coherence and Significance, we compute a
least-squares regression for human-model agree-
ment on the topic-word matching task. As seen
in Table 2, we report the coefficient of determina-
tion (r2) for each global metric and dataset. Note
that global metrics do correlate somewhat with hu-
man judgment of local topic quality. However, the
correlation is moderate to poor, especially in the
case of coherence, and we propose new metrics
that will achieve greater correlation with human
evaluations.
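Concretely, each r2 value is obtained from an ordinary least-squares fit; a minimal sketch with hypothetical variable names follows.

```python
from scipy import stats

def r_squared(metric_values, human_agreement):
    """Coefficient of determination from a least-squares fit of human agreement
    on a candidate metric. Both arguments are hypothetical per-model sequences
    (one value for each trained model)."""
    fit = stats.linregress(metric_values, human_agreement)
    return fit.rvalue ** 2
```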
Metric Amazon Newsgroups New York Times
Global
SIGVAC 0.6960 0.6081 0.6063
SIGUNI 0.6310 0.4839 0.4935
COHERENCE 0.4907 0.4463 0.3799
Table 2: Coefficient of determination (r2) between
global metrics and crowdsourced topic-word matching
annotations.
4 Proposed Metrics
We develop an automated methodology for
evaluating local topic model quality. Following
the pattern used by Newman et al. (2010) to
develop coherence, we propose a variety of
potential metrics that reflect greater token-level
topic quality such as that in Figure 2. As with
coherence, we correlate these automated metrics
with human evaluations in order to determine
which automated metric yields the most accurate
estimate of local topic quality as judged by human
annotators.
Topic Switch Percent (SWITCHP) It is a plati-
tude of writing that a sentence expresses one idea,
and by this logic we would expect the topic as-
signments in a sentence or local token cluster to be
fairly consistent. Using this intuition, we propose
our first metric which measures the percentage of
times a topic switch occurs relative to the num-
ber of times a switch could occur. The intuition
behind this is that tokens near each other should
switch infrequently, and thus be consistent in ex-
pressing a single idea. In a corpus with n tokens, with z_i being the topic assignment of the ith token in the corpus, and δ(i, j) being the Kronecker delta function, we measure this consistency with
$$\frac{1}{n-1} \sum_{i=1}^{n-1} \delta(z_i, z_{i+1}). \qquad (1)$$
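A minimal sketch of Equation (1), assuming the corpus-order topic assignments are available as a flat list (a hypothetical input format):

```python
def switch_consistency(token_topics):
    """Equation (1): the fraction of adjacent token pairs that keep the same
    topic, i.e. one minus the topic switch percent. `token_topics` is a
    hypothetical flat list of topic assignments z_1, ..., z_n in corpus order."""
    n = len(token_topics)
    same = sum(1 for i in range(n - 1)
               if token_topics[i] == token_topics[i + 1])
    return same / (n - 1)
```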
Topic Switch Variation of Information
(SWITCHVI) Following from the intuition
from SWITCHP, there are times when a sentence
or local cluster could express multiple ideas,
which would result in frequent natural topic
switching. An example of this is in Figure 2, which
has a noun phrase at the beginning referencing
P. Diddy, but then switches to talking about music
videos. Therefore, this proposed metric still penal-
izes topic switches like SWITCHP, but penalizes
less those models which switch consistently
between the same (presumably related) topics.
This metric uses variation of information (or VI),
which measures the amount of information lost
in changing from one partition to another (Meilă, 2003).
Assuming that our model has K topics, and once again using z_i as the topic assignment for token w_i, we consider two partitions S = {S_1, ..., S_K} and T = {T_1, ..., T_K} of the set of tokens w, such that S_i = {w_j | z_j = i} and T_i = {w_j | z_{j+1} = i}. Variation of information is defined as
$$H(S) + H(T) - 2I(S, T), \qquad (2)$$
where H(·) is entropy and I(S, T) is the mutual information between S and T. In other words, we measure how much information we lose in our topic assignments if we reassign every token to the topic of the token that follows.
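A minimal sketch of this computation, which treats each adjacent-token pair as an element labeled by its two topic assignments (same hypothetical input format as the SWITCHP sketch):

```python
import math
from collections import Counter

def switch_vi(token_topics):
    """Equation (2): variation of information between the partition induced by
    each token's topic (S) and the one induced by the next token's topic (T)."""
    pairs = list(zip(token_topics[:-1], token_topics[1:]))
    n = len(pairs)
    joint = Counter(pairs)
    s_marginal = Counter(s for s, _ in pairs)
    t_marginal = Counter(t for _, t in pairs)

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum((c / n) * math.log((c / n) /
                      ((s_marginal[s] / n) * (t_marginal[t] / n)))
                      for (s, t), c in joint.items())
    return entropy(s_marginal) + entropy(t_marginal) - 2 * mutual_info
```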
Average Rank (AVGRANK) The most common
way of presenting topics to humans is as a set of
related words, namely the most probable words
in the topic-word distributions. Consequently, we
would expect words in the same topic to also occur
close to one another with high frequency. Lever-
aging this intuition, where rank(w_i, z_i) is the rank of the ith word w_i in its assigned topic z_i when sorted by probability, we use the following:
$$\frac{1}{n} \sum_{i=1}^{n} \mathrm{rank}(w_i, z_i). \qquad (3)$$
With this evaluation the lower bound is 1,
although this would require that every token be
assigned to a topic for which its word is the mode.
However, this is only possible if the number of
topics is equal to the vocabulary size.
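A minimal sketch of Equation (3), assuming a K × V topic-word matrix and a corpus given as word strings with their topic assignments (hypothetical input names):

```python
import numpy as np

def avg_rank(phi, tokens, assignments, vocab_index):
    """Equation (3): mean rank of each token's word within its assigned topic
    (rank 1 = most probable word). Hypothetical inputs: phi is a K x V
    topic-word matrix, tokens a list of word strings in corpus order,
    assignments the matching topic ids, vocab_index a word -> column map."""
    K, V = phi.shape
    order = np.argsort(-phi, axis=1)          # columns sorted by probability
    ranks = np.empty((K, V), dtype=int)
    ranks[np.arange(K)[:, None], order] = np.arange(1, V + 1)
    return float(np.mean([ranks[z, vocab_index[w]]
                          for w, z in zip(tokens, assignments)]))
```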
Window Probabilities (WINDOW) Modifying
slightly the intuition behind SWITCHP pertaining
to local tokens having similar topic assignments,
WIN DOW seeks to reward topic models which
have topic assignments which not only explain in-
dividual tokens, but also the tokens within a win-
dow around the assignment. Given a window size,
and once again using φ as the topic-word distributions, we compute the following:
$$\frac{1}{n(2s+1)} \sum_{i}^{n} \sum_{j=i-s}^{i+s} \phi_{z_j, w_i}. \qquad (4)$$
In our experiments, we use a window size of 3
(s = 1), meaning that for each token we consider
its topic assignment, as well as the topic assign-
ments for the tokens immediately preceding and
following the target token. We choose s = 1 be-
cause we want to maintain consistency while al-
lowing for topics to switch mid-sentence in a nat-
ural way.
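A minimal sketch of Equation (4); because the handling of window positions that fall outside the corpus is not specified above, this sketch simply skips them, which is an assumption:

```python
def window_probability(phi, tokens, assignments, vocab_index, s=1):
    """Equation (4): average probability of each token's word under the topics
    assigned within a window of 2s + 1 tokens around it. Inputs are the same
    hypothetical structures as in the AVGRANK sketch."""
    n = len(tokens)
    total = 0.0
    for i in range(n):
        w = vocab_index[tokens[i]]
        for j in range(max(0, i - s), min(n, i + s + 1)):
            total += phi[assignments[j], w]
    return total / (n * (2 * s + 1))
```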
Topic-Word Divergence (WORDDIV) Stepping
away from human intuition about the structure of
sentences and topics, we imagine a statistical ap-
proach that explores how the assignments in a doc-
ument and the actual word-topic distributions are
related. Given this, consider a topic model with
Ktopics, Vtoken types, and Ddocuments with
topic-word distributions given by a K×Vma-
trix φsuch that φi,j is the conditional probability
of word jgiven topic i. Furthermore, let θdbe
the K-dimension document-topic distribution for
the dth document and ψdbe the V-dimensional
distribution of words for document d. This met-
ric measures how well the topic-word probabili-
ties explain the tokens which are assigned to those
topics:
1
D
D
X
d
JSD(θd·φ|| ψd)(5)
where JSD(P|| Q)is the Jensen-Shannon diver-
gence between the distributions Pand Q. This
evaluation rewards individual topic assignments
which use topics that explain the cooccurrences of
an entire document rather than individual tokens.
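A minimal sketch of Equation (5), assuming the topic-word matrix, document-topic distributions, and per-document word counts are available as numpy arrays (hypothetical names):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return float(np.sum(np.where(a > 0, a * np.log((a + eps) / (b + eps)), 0.0)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def word_divergence(phi, thetas, doc_word_counts):
    """Equation (5): mean JSD between each document's model-implied word
    distribution (theta_d . phi) and its empirical word distribution psi_d.
    Hypothetical numpy inputs: phi is K x V, thetas is D x K, and
    doc_word_counts is a D x V matrix of token counts."""
    scores = []
    for theta_d, counts_d in zip(thetas, doc_word_counts):
        predicted = theta_d @ phi              # mixture of topics for document d
        empirical = counts_d / counts_d.sum()  # psi_d
        scores.append(js_divergence(predicted, empirical))
    return float(np.mean(scores))
```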
5 Automated Evaluations
As before, for each of our proposed metrics, we
compute a least-squares regression for both the
proposed metric and the human-model agreement
on the topic-word matching task. As seen in Ta-
ble 3, we report the coefficient of determination
(r2) for each metric and dataset.
Metric Amazon Newsgroups New York Times
Local
SWITCHP 0.9077 0.8737 0.7022
SWITCHVI 0.8485 0.8181 0.6977
AVGRANK 0.5103 0.5089 0.4473
WINDOW 0.4884 0.3024 0.1127
WORDDIV 0.3112 0.2197 0.0836
Global
SIGVAC 0.6960 0.6081 0.6063
SIGUNI 0.6310 0.4839 0.4935
COHERENCE 0.4907 0.4463 0.3799
Table 3: Coefficient of determination (r2) between au-
tomated metrics and crowdsourced topic-word match-
ing annotations. We include metrics measuring both
local topic quality and global topic quality.
Humans agree more often with models trained
on Amazon reviews than on New York Times.
This likely reflects the underlying data, since
Amazon product reviews tend to be highly focused
on specific products and product features, and the
generated topics naturally reflect these products.
In contrast, New York Times data deals with a
much wider array of subjects and treats them with
nuance and detail not typically found in product
reviews. This makes the judgment of topic assign-
ment more difficult and subjective.
Notwithstanding the differences across datasets,
SWITCHP most closely approximates human
judgments of local topic quality, with an r2 which
indicates a strong correlation. This suggests that
when humans examine token-level topic assign-
ments, they are unlikely to expect topic switches
from one token to the next, which fits with what
we observe in Figure 2. As evidenced by the lower
r2 for SWITCHVI, even switching between related
topics does not seem to line up with human judg-
ments of local topic quality.
Again, there is a correlation between coherence
and the topic-word matching task, although the
correlation is only moderate. Similarly, word-
based significance metrics have a moderate corre-
lation with topic-word matching. We maintain that
these global topic metrics are important measures
for topic model quality, but they fail to capture lo-
cal topic quality as SWITCHP does.
6 Discussion
Considering the intuition gained from the motivat-
ing example in Figure 1, it is not surprising that
humans would prefer topic models which are lo-
cally consistent. Thus, our result that SWITCHP
is correlated with human judgments of local topic
quality best parallels that intuition.
We note that our annotators are only shown
the topic assignment for a single token and do
not know what topics have been assigned to the
surrounding tokens. Despite this, our annotators
apparently prefer models which are consistent.
While the result is intuitive, it is surprising that
it is illuminated through a task that asks them to
only identify the topic for a single token.
Given our results, we recommend that topic
switch percent be adopted as an automated met-
ric to measure the quality of token-level topic as-
signments. We would refer to this metric collo-
quially as ‘consistency’ in the same way that PMI
scores on the top n words of a topic are referred
to as coherence. We advocate that future work on
new topic models include validation with respect
to topic consistency, just as recent work has in-
cluded evaluation of topic coherence.
However, we are careful to point out that topic
consistency should not be used to the exclusion of
other measures of topic model quality. After all,
topic consistency is trivially maximized by min-
imizing topic switches without regard to the ap-
propriateness of the topic assignment. Instead, we
advocate that future models be evaluated with re-
spect to global topic quality (e.g., coherence, sig-
nificance, perplexity) as well as local topic qual-
ity (i.e., consistency). These measures, in addition
to evaluation of applicable downstream tasks (e.g.,
classification accuracy), will give practitioners the
information necessary to make informed decisions
about model selection.
7 Conclusion
We develop a novel crowdsourcing task, which we
call topic-word matching, to elicit human judg-
ments of local topic model quality. We apply this
human evaluation to a wide variety of models, and
find that topic switch percent (or SWITCHP) cor-
relates well with this human evaluation. We pro-
pose that this new metric, which we colloquially
refer to as consistency, be adopted alongside eval-
uations of global topic quality for future work with
topic model comparison.
References
Loulwah AlSumait, Daniel Barbará, James Gentle,
and Carlotta Domeniconi. 2009. Topic significance
ranking of LDA generative models. In Proceedings
of European Conference of Machine Learning.
Sanjeev Arora, Rong Ge, Yonatan Halpern, David
Mimno, Ankur Moitra, David Sontag, Yichen Wu,
and Michael Zhu. 2013. A practical algorithm for
topic modeling with provable guarantees. In Pro-
ceedings of the International Conference of Machine
Learning.
Georgios Balikas, Massih-Reza Amini, and Marianne
Clausel. 2016a. On a topic model for sentences. In
Proceedings of the ACM SIGIR Conference on Re-
search and Development in Information Retrieval.
Georgios Balikas, Hesam Amoualian, Marianne
Clausel, Eric Gaussier, and Massih-Reza Amini.
2016b. Modeling topic dependencies in semanti-
cally coherent text spans with copulas. In Proceed-
ings of International Conference on Computational
Linguistics.
David M. Blei, Andrew Ng, and Michael Jordan. 2003.
Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993–1022.
Jordan L Boyd-Graber and David M Blei. 2009. Syn-
tactic topic models. In Proceedings of Advances in
Neural Information Processing Systems, pages 185–
192.
Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L
Boyd-graber, and David M Blei. 2009. Reading
tea leaves: How humans interpret topic models. In
Advances in neural information processing systems,
pages 288–296.
Thomas L. Griffiths, Mark Steyvers, David M. Blei,
and Joshua B. Tenenbaum. 2004. Integrating top-
ics and syntax. In Advances in neural information
processing systems.
Amit Gruber, Yair Weiss, and Michal Rosen-Zvi. 2007.
Hidden topic markov models. In Artificial intelli-
gence and statistics.
Klaus Krippendorff. 2013. Content Analysis: An Intro-
duction to Its Methodology, 3rd edition, pages 221–
250. Thousand Oaks.
Jeffrey Lund, Connor Cook, Kevin Seppi, and Jordan
Boyd-Graber. 2017. Tandem anchoring: A multi-
word anchor approach for interactive topic model-
ing. In Proceedings of the Association for Compu-
tational Linguistics.
Jeffrey Lund, Stephen Cowley, Wilson Fearn, Emily
Hales, and Kevin Seppi. 2018. Labeled anchors and
a scalable, transparent, and interactive classifier. In
Proceedings of Empirical Methods in Natural Lan-
guage Processing.
Marina Meilă. 2003. Comparing clusterings by the
variation of information. In Learning theory and
kernel machines, pages 173–187. Springer.
David Newman, Jey Han Lau, Karl Grieser, and Timo-
thy Baldwin. 2010. Automatic evaluation of topic
coherence. In Proceedings of the Association for
Computational Linguistics.
Thang Nguyen, Jordan Boyd-Graber, Jeffrey Lund,
Kevin Seppi, and Eric Ringger. 2015. Is your anchor
going up or down? Fast and accurate supervised
topic models. In Conference of the North Ameri-
can Chapter of the Association for Computational
Linguistics.
Stefanie Nowak and Stefan Rüger. 2010. How reliable
are annotations via crowdsourcing: a study about
inter-annotator agreement for multi-label image an-
notation. In Proceedings of the international con-
ference on Multimedia information retrieval, pages
557–566. ACM.
Daniel Ramage, David Hall, Ramesh Nallapati, and
Christopher D Manning. 2009. Labeled LDA: A su-
pervised topic model for credit attribution in multi-
labeled corpora. In Proceedings of the 2009 Con-
ference on Empirical Methods in Natural Language
Processing: Volume 1-Volume 1, pages 248–256.
Association for Computational Linguistics.
Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov,
and David Mimno. 2009. Evaluation methods for
topic models. In Proceedings of the International
Conference of Machine Learning. ACM.
Xiaojin Zhu, David Blei, and John Lafferty. 2006.
TagLDA: Bringing document structure knowledge
into topic models. Technical report, Technical Re-
port TR-1553, University of Wisconsin.