Colloquium
Finding scientific topics
Thomas L. Griffiths*†‡ and Mark Steyvers§

*Department of Psychology, Stanford University, Stanford, CA 94305; †Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139-4307; and §Department of Cognitive Sciences, University of California, Irvine, CA 92697
A first step in identifying the content of a document is determining
which topics that document addresses. We describe a generative
model for documents, introduced by Blei, Ng, and Jordan [Blei,
D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3,
993-1022], in which each document is generated by choosing a
distribution over topics and then choosing each word in the
document from a topic selected according to this distribution. We
then present a Markov chain Monte Carlo algorithm for inference
in this model. We use this algorithm to analyze abstracts from
PNAS by using Bayesian model selection to establish the number of
topics. We show that the extracted topics capture meaningful
structure in the data, consistent with the class designations pro-
vided by the authors of the articles, and outline further applica-
tions of this analysis, including identifying ‘‘hot topics’’ by exam-
ining temporal dynamics and tagging abstracts to illustrate
semantic content.
When scientists decide to write a paper, one of the first
things they do is identify an interesting subset of the many
possible topics of scientific investigation. The topics addressed by
a paper are also one of the first pieces of information a person
tries to extract when reading a scientific abstract. Scientific
experts know which topics are pursued in their field, and this
information plays a role in their assessments of whether papers
are relevant to their interests, which research areas are rising or
falling in popularity, and how papers relate to one another. Here,
we present a statistical method for automatically extracting a
representation of documents that provides a first-order approx-
imation to the kind of knowledge available to domain experts.
Our method discovers a set of topics expressed by documents,
providing quantitative measures that can be used to identify the
content of those documents, track changes in content over time,
and express the similarity between documents. We use our
method to discover the topics covered by papers in PNAS in a
purely unsupervised fashion and illustrate how these topics can
be used to gain insight into some of the structure of science.
The statistical model we use in our analysis is a generative
model for documents; it reduces the complex process of pro-
ducing a scientific paper to a small number of simple probabi-
listic steps and thus specifies a probability distribution over all
possible documents. Generative models can be used to postulate
complex latent structures responsible for a set of observations,
making it possible to use statistical inference to recover this
structure. This kind of approach is particularly useful with text,
where the observed data (the words) are explicitly intended to
communicate a latent structure (their meaning). The particular
generative model we use, called Latent Dirichlet Allocation, was
introduced in ref. 1. This generative model postulates a latent
structure consisting of a set of topics; each document is produced
by choosing a distribution over topics, and then generating each
word at random from a topic chosen by using this distribution.
The plan of this article is as follows. In the next section, we
describe Latent Dirichlet Allocation and present a Markov chain
Monte Carlo algorithm for inference in this model, illustrating
the operation of our algorithm on a small dataset. We then apply
our algorithm to a corpus consisting of abstracts from PNAS
from 1991 to 2001, determining the number of topics needed to
account for the information contained in this corpus and ex-
tracting a set of topics. We use these topics to illustrate the
relationships between different scientific disciplines, assessing
trends and ‘‘hot topics’’ by analyzing topic dynamics and using
the assignments of words to topics to highlight the semantic
content of documents.
Documents, Topics, and Statistical Inference
A scientific paper can deal with multiple topics, and the words
that appear in that paper reflect the particular set of topics it
addresses. In statistical natural language processing, one com-
mon way of modeling the contributions of different topics to a
document is to treat each topic as a probability distribution over
words, viewing a document as a probabilistic mixture of these
topics (1–6). If we have T topics, we can write the probability of the ith word in a given document as

\[ P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j), \tag{1} \]

where z_i is a latent variable indicating the topic from which the ith word was drawn and P(w_i | z_i = j) is the probability of the word w_i under the jth topic. P(z_i = j) gives the probability of choosing a word from topic j in the current document, which will vary across different documents.

Intuitively, P(w|z) indicates which words are important to a topic, whereas P(z) is the prevalence of those topics within a
document. For example, in a journal that published only articles
in mathematics or neuroscience, we could express the probability
distribution over words with two topics, one relating to mathe-
matics and the other relating to neuroscience. The content of the
topics would be reflected in P(w|z); the "mathematics" topic
would give high probability to words like theory, space, or
problem, whereas the ‘‘neuroscience’’ topic would give high
probability to words like synaptic, neurons, and hippocampal.
Whether a particular document concerns neuroscience, mathe-
matics, or computational neuroscience would depend on its
distribution over topics, P(z), which determines how these topics
are mixed together in forming documents. The fact that multiple
topics can be responsible for the words occurring in a single
document discriminates this model from a standard Bayesian
classifier, in which it is assumed that all the words in the
document come from a single class. The ‘‘soft classification’’
provided by this model, in which each document is characterized
in terms of the contributions of multiple topics, has applications
in many domains other than text (7).
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "Mapping Knowledge Domains," held May 9–11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.

‡To whom correspondence should be addressed. E-mail: gruffydd@psych.stanford.edu.
Viewing documents as mixtures of probabilistic topics makes
it possible to formulate the problem of discovering the set of
topics that are used in a collection of documents. Given D
documents containing T topics expressed over W unique words, we can represent P(w|z) with a set of T multinomial distributions φ over the W words, such that P(w|z = j) = φ_w^(j), and P(z) with a set of D multinomial distributions θ over the T topics, such that for a word in document d, P(z = j) = θ_j^(d). To discover the set of topics used in a corpus w = {w_1, w_2, ..., w_n}, where each w_i belongs to some document d_i, we want to obtain an estimate of φ that gives high probability to the words that appear in the corpus. One strategy for obtaining such an estimate is to simply attempt to maximize P(w|φ, θ), following from Eq. 1 directly by using the Expectation-Maximization (8) algorithm to find maximum likelihood estimates of φ and θ (2, 3). However, this approach is susceptible to problems involving local maxima and is slow to converge (1, 2), encouraging the development of models that make assumptions about the source of θ.
Latent Dirichlet Allocation (1) is one such model, combining Eq. 1 with a prior probability distribution on θ to provide a complete generative model for documents. This generative model specifies a simple probabilistic procedure by which new documents can be produced given just a set of topics φ, allowing φ to be estimated without requiring the estimation of θ. In Latent Dirichlet Allocation, documents are generated by first picking a distribution over topics θ from a Dirichlet distribution, which determines P(z) for words in that document. The words in the document are then generated by picking a topic j from this distribution and then picking a word from that topic according to P(w|z = j), which is determined by a fixed φ^(j). The estimation problem becomes one of maximizing P(w|φ) = ∫ P(w|φ, θ) P(θ) dθ, where P(θ) is a Dirichlet(α) distribution. The integral in this expression is intractable, and φ is thus usually estimated by using sophisticated approximations, either variational Bayes (1) or expectation propagation (9).
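As a concrete illustration of this generative process, the following sketch (Python/NumPy; the function name and the assumption that the topics φ are supplied as a T × W matrix are ours, not the paper's) draws a single document by first sampling θ from a symmetric Dirichlet(α) and then drawing each word through its topic.

```python
import numpy as np

def generate_document(phi, alpha, n_words, rng=np.random.default_rng(0)):
    """Draw one document from the LDA generative model (illustrative sketch).

    phi   : array of shape (T, W); phi[j] is the word distribution of topic j
    alpha : symmetric Dirichlet hyperparameter for the topic proportions theta
    """
    T, W = phi.shape
    theta = rng.dirichlet(np.full(T, alpha))   # document-specific P(z)
    z = rng.choice(T, size=n_words, p=theta)   # a topic for each word position
    words = np.array([rng.choice(W, p=phi[j]) for j in z])  # each word drawn from its topic
    return words, z, theta
```

With the 10 bar-shaped topics of the graphical example below, α = 1, and n_words = 100, this procedure produces images like those in Fig. 1b.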
Using Gibbs Sampling to Discover Topics
Our strategy for discovering topics differs from previous approaches in not explicitly representing φ or θ as parameters to be estimated, but instead considering the posterior distribution over the assignments of words to topics, P(z|w). We then obtain estimates of φ and θ by examining this posterior distribution. Evaluating P(z|w) requires solving a problem that has been
studied in detail in Bayesian statistics and statistical physics,
computing a probability distribution over a large discrete state
space. We address this problem by using a Monte Carlo proce-
dure, resulting in an algorithm that is easy to implement, requires
little memory, and is competitive in speed and performance with
existing algorithms.
We use the probability model for Latent Dirichlet Allocation, with the addition of a Dirichlet prior on φ. The complete probability model is thus

\[ w_i \mid z_i, \phi^{(z_i)} \sim \mathrm{Discrete}(\phi^{(z_i)}), \qquad \phi \sim \mathrm{Dirichlet}(\beta), \]
\[ z_i \mid \theta^{(d_i)} \sim \mathrm{Discrete}(\theta^{(d_i)}), \qquad \theta \sim \mathrm{Dirichlet}(\alpha). \]
Here, α and β are hyperparameters, specifying the nature of the priors on θ and φ. Although these hyperparameters could be vector-valued as in refs. 1 and 9, for the purposes of this article we assume symmetric Dirichlet priors, with α and β each having a single value. These priors are conjugate to the multinomial distributions θ and φ, allowing us to compute the joint distribution P(w, z) by integrating out θ and φ. Because P(w, z) = P(w|z)P(z) and φ and θ only appear in the first and second terms, respectively, we can perform these integrals separately. Integrating out φ gives the first term

\[ P(\mathbf{w} \mid \mathbf{z}) = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^W} \right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma(n_j^{(w)} + \beta)}{\Gamma(n_j^{(\cdot)} + W\beta)}, \tag{2} \]

in which n_j^(w) is the number of times word w has been assigned to topic j in the vector of assignments z, and Γ(·) is the standard gamma function. The second term results from integrating out θ, to give

\[ P(\mathbf{z}) = \left( \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^T} \right)^{D} \prod_{d=1}^{D} \frac{\prod_{j} \Gamma(n_j^{(d)} + \alpha)}{\Gamma(n_{\cdot}^{(d)} + T\alpha)}, \tag{3} \]

where n_j^(d) is the number of times a word from document d has been assigned to topic j. Our goal is then to evaluate the posterior distribution

\[ P(\mathbf{z} \mid \mathbf{w}) = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}} P(\mathbf{w}, \mathbf{z})}. \tag{4} \]

Unfortunately, this distribution cannot be computed directly, because the sum in the denominator does not factorize and involves T^n terms, where n is the total number of word instances in the corpus.
Computing P(z|w) involves evaluating a probability distribution on a large discrete state space, a problem that arises often in statistical physics. Our setting is similar, in particular, to the Potts model (e.g., ref. 10), with an ensemble of discrete variables z, each of which can take on values in {1, 2, ..., T}, and an energy function given by H(z) ∝ −log P(w, z) = −log P(w|z) − log P(z). Unlike the Potts model, in which the energy function is usually defined in terms of local interactions on a lattice, here the contribution of each z_i depends on all z_{-i} values through the counts n_j^(w) and n_j^(d). Intuitively, this energy function favors ensembles of assignments z that form a good compromise between having few topics per document and having few words per topic, with the terms of this compromise being set by the hyperparameters α and β. The fundamental computational
problems raised by this model remain the same as those of the
Potts model: We can evaluate H(z) for any configuration z, but
the state space is too large to enumerate, and we cannot compute
the partition function that converts this into a probability
distribution (in our case, the denominator of Eq. 4). Conse-
quently, we apply a method that physicists and statisticians have
developed for dealing with these problems, sampling from the
target distribution by using Markov chain Monte Carlo.
In Markov chain Monte Carlo, a Markov chain is constructed
to converge to the target distribution, and samples are then taken
from that Markov chain (see refs. 10–12). Each state of the chain
is an assignment of values to the variables being sampled, in this
case z, and transitions between states follow a simple rule. We
use Gibbs sampling (13), known as the heat bath algorithm in
statistical physics (10), where the next state is reached by
sequentially sampling all variables from their distribution when
conditioned on the current values of all other variables and the
data. To apply this algorithm we need the full conditional
distribution P(z_i | z_{-i}, w). This distribution can be obtained by a probabilistic argument or by cancellation of terms in Eqs. 2 and 3, yielding

\[ P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}, \tag{5} \]

where n_{-i}^{(·)} is a count that does not include the current assignment of z_i. This result is quite intuitive; the first ratio expresses the probability of w_i under topic j, and the second ratio expresses the probability of topic j in document d_i. Critically, these counts are
the only information necessary for computing the full condi-
tional distribution, allowing the algorithm to be implemented
efficiently by caching the relatively small set of nonzero counts.
Having obtained the full conditional distribution, the Monte
Carlo algorithm is then straightforward. The z_i variables are initialized to values in {1, 2, ..., T}, determining the initial state of the Markov chain. We do this with an on-line version of the Gibbs sampler, using Eq. 5 to assign words to topics, but with counts that are computed from the subset of the words seen so far rather than the full data. The chain is then run for a number of iterations, each time finding a new state by sampling each z_i from the distribution specified by Eq. 5. Because the only information needed to apply Eq. 5 is the number of times a word is assigned to a topic and the number of times a topic occurs in a document, the algorithm can be run with minimal memory requirements by caching the sparse set of nonzero counts and updating them whenever a word is reassigned. After enough iterations for the chain to approach the target distribution, the current values of the z_i variables are recorded. Subsequent samples are taken after an appropriate lag to ensure that their autocorrelation is low (10, 11).
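To make the procedure concrete, here is a minimal collapsed Gibbs sampler implementing the update of Eq. 5. This is our own illustrative sketch in Python/NumPy rather than the authors' implementation (which cached sparse nonzero counts and used an on-line initialization); the data layout and names (docs, n_tw, n_dt) are assumptions of the sketch.

```python
import numpy as np

def gibbs_lda(docs, W, T, alpha, beta, n_iter=1000, seed=0):
    """Collapsed Gibbs sampler for LDA, implementing the update of Eq. 5 (sketch).

    docs : list of documents, each a list/array of word indices in {0, ..., W-1}
    W, T : vocabulary size and number of topics
    alpha, beta : symmetric Dirichlet hyperparameters
    """
    rng = np.random.default_rng(seed)
    n_tw = np.zeros((T, W))           # n_j^(w): times word w is assigned to topic j
    n_dt = np.zeros((len(docs), T))   # n_j^(d): times topic j occurs in document d
    n_t = np.zeros(T)                 # n_j^(.): total words assigned to topic j
    z = [np.zeros(len(doc), dtype=int) for doc in docs]

    # Initialize assignments (here uniformly at random; the paper instead uses an
    # on-line pass of Eq. 5 over the words seen so far).
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = rng.integers(T)
            z[d][i] = t
            n_tw[t, w] += 1; n_dt[d, t] += 1; n_t[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment, giving the "-i" counts of Eq. 5.
                n_tw[t, w] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                # Unnormalized full conditional; the document-length denominator of
                # Eq. 5 is constant in j and cancels when we normalize.
                p = (n_tw[:, w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                n_tw[t, w] += 1; n_dt[d, t] += 1; n_t[t] += 1
    return z, n_tw, n_dt, n_t
```

In practice the dense count arrays above would be replaced by the sparse representation described in the text, and samples of z would be recorded only after burn-in and at a suitable lag.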
With a set of samples from the posterior distribution P(z|w), statistics that are independent of the content of individual topics can be computed by integrating across the full set of samples.¶ For any single sample we can estimate φ and θ from the value z by

\[ \hat{\phi}_j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + W\beta} \tag{6} \]

\[ \hat{\theta}_j^{(d)} = \frac{n_j^{(d)} + \alpha}{n_{\cdot}^{(d)} + T\alpha}. \tag{7} \]

These values correspond to the predictive distributions over new words w and new topics z conditioned on w and z.
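In code, Eqs. 6 and 7 are simple normalizations of the count matrices maintained by the sampler; continuing the hypothetical sketch above (so n_tw and n_dt are our names for the topic–word and document–topic counts):

```python
def estimate_phi_theta(n_tw, n_dt, alpha, beta):
    """Point estimates of phi (Eq. 6) and theta (Eq. 7) from a single sample z."""
    T, W = n_tw.shape
    phi = (n_tw + beta) / (n_tw.sum(axis=1, keepdims=True) + W * beta)        # phi[j, w]
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)    # theta[d, j]
    return phi, theta
```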
A Graphical Example
To illustrate the operation of the algorithm and to show that it
runs in time comparable with existing methods of estimating φ, we generated a small dataset in which the output of the algorithm can be shown graphically. The dataset consisted of a set of 2,000 images, each containing 25 pixels in a 5 × 5 grid. The intensity
of any pixel is specified by an integer value between zero and
infinity. This dataset is of exactly the same form as a word-
document cooccurrence matrix constructed from a database of
documents, with each image being a document, with each pixel
being a word, and with the intensity of a pixel being its frequency.
The images were generated by defining a set of 10 topics
corresponding to horizontal and vertical bars, as shown in Fig.
1a, then sampling a multinomial distribution θ for each image from a Dirichlet distribution with α = 1, and sampling 100 pixels
(words) according to Eq. 1. A subset of the images generated in
this fashion are shown in Fig. 1b. Although some images show
evidence of many samples from a single topic, it is difficult to
discern the underlying structure of most images.
We applied our Gibbs sampling algorithm to this dataset,
together with the two algorithms that have previously been used
for inference in Latent Dirichlet Allocation: variational Bayes
(1) and expectation propagation (9). (The implementations of
variational Bayes and expectation propagation were provided by
Tom Minka and are available at www.stat.cmu.edu/~minka/papers/aspect.html.) We divided the dataset into 1,000 training
images and 1,000 test images and ran each algorithm four times,
using the same initial conditions for all three algorithms on a
given run. These initial conditions were found by an online
application of Gibbs sampling, as mentioned above. Variational
Bayes and expectation propagation were run until convergence,
and Gibbs sampling was run for 1,000 iterations. All three
algorithms used a fixed Dirichlet prior on θ, with α = 1. We tracked the number of floating point operations per iteration for each algorithm and computed the test set perplexity for the estimates of φ provided by the algorithms at several points.
Perplexity is a standard measure of performance for statistical
models of natural language (14) and is defined as exp{-log P(w_test)/n_test}, where w_test and n_test indicate the identities and number of words in the test set, respectively. Perplexity indicates the uncertainty in predicting a single word; lower values are better, and chance performance results in a perplexity equal to the size of the vocabulary, which is 25 in this case. The perplexity for all three models was evaluated by using importance sampling as in ref. 9, and the estimates of φ used for evaluating Gibbs sampling were each obtained from a single sample as in Eq. 6.
The results of these computations are shown in Fig. 1c. All three
algorithms are able to recover the underlying topics, and Gibbs
sampling does so more rapidly than either variational Bayes or
expectation propagation. A graphical illustration of the opera-
tion of the Gibbs sampler is shown in Fig. 2. The log-likelihood
stabilizes quickly, in a fashion consistent across multiple runs,
and the topics expressed in the dataset slowly emerge as appro-
priate assignments of words to topics are discovered.
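For reference, the sketch below shows one simple way to compute perplexity from plug-in estimates. It assumes per-document topic mixtures for the test documents have already been obtained somehow (e.g., by folding the test documents into the sampler), which is a cruder evaluation than the importance-sampling scheme of ref. 9 used in the comparison above; the names are ours.

```python
import numpy as np

def perplexity(test_docs, phi, theta_test):
    """exp{-log P(w_test) / n_test} with P(w_i) = sum_j theta[d, j] * phi[j, w_i] (Eq. 1).

    test_docs  : list of documents, each a list/array of word indices
    phi        : (T, W) topic-word distributions
    theta_test : (D_test, T) estimated topic mixtures for the test documents
    """
    log_p, n = 0.0, 0
    for d, doc in enumerate(test_docs):
        word_probs = theta_test[d] @ phi          # P(w) for every vocabulary word, document d
        log_p += np.sum(np.log(word_probs[doc]))  # accumulate log P(w_i) for the observed words
        n += len(doc)
    return np.exp(-log_p / n)
```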
These results show that Gibbs sampling can be competitive in
speed with existing algorithms, although further tests with larger
datasets involving real text are necessary to evaluate the
strengths and weaknesses of the different algorithms. The effects
of including the Dirichlet(β) prior in the model and the use of methods for estimating the hyperparameters α and β need to be assessed as part of this comparison. A variational algorithm for this "smoothed" model is described in ref. 1, which may be more similar to the Gibbs sampling algorithm described here. Ultimately, these different approaches are complementary rather than competitive, providing different means of performing approximate inference that can be selected according to the demands of the problem.

¶These estimates cannot be combined across samples for any analysis that relies on the content of specific topics. This issue arises because of a lack of identifiability. Because mixtures of topics are used to form documents, the probability distribution over words implied by the model is unaffected by permutations of the indices of the topics. Consequently, no correspondence is needed between individual topics across samples; just because two topics have index j in two samples is no reason to expect that similar words were assigned to those topics in those samples. However, statistics insensitive to permutation of the underlying topics can be computed by aggregating across samples.

Fig. 1. (a) Graphical representation of 10 topics, combined to produce "documents" like those shown in b, where each image is the result of 100 samples from a unique mixture of these topics. (c) Performance of three algorithms on this dataset: variational Bayes (VB), expectation propagation (EP), and Gibbs sampling. Lower perplexity indicates better performance, with chance being a perplexity of 25. Estimates of the standard errors are smaller than the plot symbols, which mark 1, 5, 10, 20, 50, 100, 150, 200, 300, and 500 iterations.
Model Selection
The statistical model we have described is conditioned on three
parameters, which we have suppressed in the equations above:
the Dirichlet hyperparameters α and β and the number of topics T. Our algorithm is easily extended to allow α, β, and z to be sampled, but this extension can slow the convergence of the Markov chain. Our strategy in this article is to fix α and β and explore the consequences of varying T. The choice of α and β can have important implications for the results produced by the model. In particular, increasing β can be expected to decrease the number of topics used to describe a dataset, because it reduces the impact of sparsity in Eq. 2. The value of β thus affects the granularity of the model: a corpus of documents can be sensibly factorized into a set of topics at several different scales, and the particular scale assessed by the model will be set by β. With scientific documents, a large value of β would lead the model to find a relatively small number of topics, perhaps at the level of scientific disciplines, whereas smaller values of β will produce more topics that address specific areas of research. Given values of α and β, the problem of choosing the appropriate value for T is a problem of model selection, which
we address by using a standard method from Bayesian statistics
(15). For a Bayesian statistician faced with a choice between a
set of statistical models, the natural response is to compute the
posterior probability of that set of models given the observed
data. The key constituent of this posterior probability will be the
likelihood of the data given the model, integrating over all
parameters in the model. In our case, the data are the words in
the corpus, w, and the model is specified by the number of topics,
T, so we wish to compute the likelihood P(w|T). The complication is that this requires summing over all possible assignments of words to topics z. However, we can approximate P(w|T) by taking the harmonic mean of a set of values of P(w|z, T) when z is sampled from the posterior P(z|w, T) (15). Our Gibbs sampling algorithm provides such samples, and the value of P(w|z, T) can be computed from Eq. 2.
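Under the notation above, log P(w|z, T) follows directly from Eq. 2, and the harmonic mean is most safely computed in log space. The fragment below is our own sketch (using gammaln and logsumexp from SciPy); it assumes one topic-word count matrix n_tw per retained sample of z.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_p_w_given_z(n_tw, beta):
    """log P(w | z, T) from Eq. 2, given topic-word counts n_tw for one sample z."""
    T, W = n_tw.shape
    return (T * (gammaln(W * beta) - W * gammaln(beta))
            + np.sum(gammaln(n_tw + beta))
            - np.sum(gammaln(n_tw.sum(axis=1) + W * beta)))

def log_harmonic_mean_evidence(count_samples, beta):
    """Harmonic-mean approximation to P(w | T) from posterior samples of z (ref. 15)."""
    log_liks = np.array([log_p_w_given_z(n_tw, beta) for n_tw in count_samples])
    # harmonic mean = S / sum_s exp(-log_lik_s), evaluated stably in log space
    return np.log(len(log_liks)) - logsumexp(-log_liks)
```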
The Topics of Science
The algorithm outlined above can be used to find the topics that
account for the words used in a set of documents. We applied this
algorithm to the abstracts of papers published in PNAS from
1991 to 2001, with the aim of discovering some of the topics
addressed by scientific research. We first used Bayesian model
selection to identify the number of topics needed to best account
for the structure of this corpus, and we then conducted a detailed
analysis with the selected number of topics. Our detailed analysis
involved examining the relationship between the topics discov-
ered by our algorithm and the class designations supplied by
PNAS authors, using topic dynamics to identify ‘‘hot topics’’ and
using the topic assignments to highlight the semantic content in
abstracts.
How Many Topics? To evaluate the consequences of changing the
number of topics T, we used the Gibbs sampling algorithm
outlined in the preceding section to obtain samples from the
posterior distribution over z at several choices of T. We used all
28,154 abstracts published in PNAS from 1991 to 2001, with each
of these abstracts constituting a single document in the corpus
(we will use the words abstract and document interchangeably
from this point forward). Any delimiting character, including
hyphens, was used to separate words, and we deleted any words
that occurred in fewer than five abstracts or belonged to a standard
‘‘stop’’ list used in computational linguistics, including numbers,
individual characters, and some function words. This gave us a
vocabulary of 20,551 words, which occurred a total of 3,026,970
times in the corpus.
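A preprocessing pipeline along these lines can be written in a few lines; the sketch below uses our own tokenization rule and a caller-supplied stop list rather than the exact procedure applied to the PNAS corpus.

```python
import re
from collections import Counter

def build_corpus(abstracts, stop_words, min_doc_freq=5):
    """Tokenize on any non-alphabetic delimiter (including hyphens and digits),
    drop stop words, and keep only words appearing in >= min_doc_freq abstracts."""
    tokenized = [[w for w in re.split(r"[^a-z]+", text.lower())
                  if w and w not in stop_words]
                 for text in abstracts]
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    vocab = sorted(w for w, c in doc_freq.items() if c >= min_doc_freq)
    index = {w: i for i, w in enumerate(vocab)}
    docs = [[index[w] for w in doc if w in index] for doc in tokenized]
    return docs, vocab
```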
For all runs of the algorithm, we used β = 0.1 and α = 50/T, keeping constant the sum of the Dirichlet hyperparameters, which can be interpreted as the number of virtual samples contributing to the smoothing of θ. This value of β is relatively small and can be expected to result in a fine-grained decomposition of the corpus into topics that address specific research areas. We computed an estimate of P(w|T) for T values of 50, 100,
200, 300, 400, 500, 600, and 1,000 topics. For all values of T,
except the last, we ran eight Markov chains, discarding the first
1,000 iterations, and then took 10 samples from each chain at a
lag of 100 iterations. In all cases, the log-likelihood values
stabilized within a few hundred iterations, as in Fig. 2. The
simulation with 1,000 topics was more time-consuming, and thus
we used only six chains, taking two samples from each chain after
700 initial iterations, again at a lag of 100 iterations.
Estimates of P(w|T) were computed based on the full set of
samples for each value of Tand are shown in Fig. 3. The results
suggest that the data are best accounted for by a model incor-
porating 300 topics. P(w|T) initially increases as a function of T, reaches a peak at T = 300, and then decreases thereafter. This
kind of profile is often seen when varying the dimensionality of
a statistical model, with the optimal model being rich enough to
fit the information available in the data, yet not so complex as
to begin fitting noise. As mentioned above, the value of Tfound
through this procedure depends on the choice of
and
, and
it will also be affected by specific decisions made in forming the
dataset, such as the use of a stop list and the inclusion of
documents from all PNAS classifications. By using just P(w|T) to choose a value of T, we are assuming very weak prior constraints on the number of topics. P(w|T) is just the likelihood term in the inference to P(T|w), and the prior P(T) might overwhelm this
likelihood if we had a particularly strong preference for a smaller
number of topics.
Fig. 2. Results of running the Gibbs sampling algorithm. The log-likelihood,
shown on the left, stabilizes after a few hundred iterations. Traces of the
log-likelihood are shown for all four runs, illustrating the consistency in values
across runs. Each row of images on the right shows the estimates of the topics
after a certain number of iterations within a single run, matching the points
indicated on the left. These points correspond to 1, 2, 5, 10, 20, 50, 100, 150,
200, 300, and 500 iterations. The topics expressed in the data gradually emerge
as the Markov chain approaches the posterior distribution.
Scientific Topics and Classes. When authors submit a paper to
PNAS, they choose one of three major categories, indicating
whether a paper belongs to the Biological, the Physical, or the
Social Sciences, and one of 33 minor categories, such as Ecology,
Pharmacology, Mathematics, or Economic Sciences. (Anthro-
pology and Psychology can be chosen as a minor category for
papers in both Biological and Social Sciences. We treat these
minor categories as distinct for the purposes of our analysis.)
Having a class designation for each abstract in the corpus
provides two opportunities. First, because the topics recovered
by our algorithm are purely a consequence of the statistical
structure of the data, we can evaluate whether the class desig-
nations pick out differences between abstracts that can be
expressed in terms of this statistical structure. Second, we can
use the class designations to illustrate how the distribution over
topics can reveal relationships between documents and between
document classes.
We used a single sample taken after 2,000 iterations of Gibbs
sampling and computed estimates of θ^(d) by means of Eq. 7. (In this and other analyses, similar results were obtained by examining samples across multiple chains, up to the permutation of topics, and the choice of this particular sample to display the results was arbitrary.) Using these estimates, we computed a mean θ vector for each minor category, considering just the 2,620 abstracts published in 2001. We then found the most diagnostic topic for each minor category, defined to be the topic j for which the ratio of θ_j for that category to the sum of θ_j across all other categories was greatest. The results of this analysis are shown in Fig. 4. The matrix shown in Fig. 4 Upper indicates the mean value of θ for each minor category, restricted to the set of most
diagnostic topics. The strong diagonal is a consequence of our
selection procedure, with diagnostic topics having high proba-
bility within the classes for which they are diagnostic, but low
probability in other classes. The off-diagonal elements illustrate
the relationships between classes, with similar classes showing
similar distributions across topics.
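The diagnostic-topic computation itself is a small calculation on the estimated θ matrix. The sketch below is ours; class_of_doc is a hypothetical array giving the minor category of each abstract, and theta holds the per-document estimates from Eq. 7.

```python
import numpy as np

def diagnostic_topics(theta, class_of_doc):
    """For each class c, the topic j maximizing mean_theta[c, j] / sum over other classes."""
    classes = np.unique(class_of_doc)
    mean_theta = np.vstack([theta[class_of_doc == c].mean(axis=0) for c in classes])
    # ratio of a class's mean theta_j to the summed mean theta_j of all other classes
    ratios = mean_theta / (mean_theta.sum(axis=0, keepdims=True) - mean_theta)
    return dict(zip(classes, ratios.argmax(axis=1)))
```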
The distributions over topics for the different classes illustrate
how this statistical model can capture similarity in the semantic
content of documents. Fig. 4 reveals relationships between
specific minor categories, such as Ecology and Evolution, and
some of the correspondences within major categories; for ex-
ample, the minor categories in the Physical and Social Sciences
show much greater commonality in the topics appearing in their
abstracts than do the Biological Sciences. The results can also be
used to assess how much different disciplines depend on partic-
ular methods. For example, topic 39, relating to mathematical
methods, receives reasonably high probability in Applied Math-
ematics, Applied Physical Sciences, Chemistry, Engineering,
Mathematics, Physics, and Economic Sciences, suggesting that
mathematical theory is particularly relevant to these disciplines.
The content of the diagnostic topics themselves is shown in
Fig. 4 Lower, listing the five words given highest probability by
each topic. In some cases, a single topic was the most diagnostic
for several classes: topic 2, containing words relating to global
climate change, was diagnostic of Ecology, Geology, and Geo-
physics; topic 280, containing words relating to evolution and
natural selection, was diagnostic of both Evolution and Popu-
lation Biology; topic 222, containing words relating to cognitive
neuroscience, was diagnostic of Psychology as both a Biological
and a Social Science; topic 39, containing words relating to
mathematical theory, was diagnostic of both Applied Mathe-
matics and Mathematics; and topic 270, containing words having
to do with spectroscopy, was diagnostic of both Chemistr y and
Physics. The remaining topics were each diagnostic of a single
minor category and, in general, seemed to contain words rele-
vant to enquiry in that discipline. The only exception was topic
109, diagnostic of Economic Sciences, which contains words
generally relevant to scientific research. This may be a conse-
quence of the relatively small number of documents in this class
(only three in 2001), which makes the estimate of θ extremely unreliable. Topic 109 also serves to illustrate that not all of the
topics found by the algorithm correspond to areas of research;
some of the topics picked out scientific words that tend to occur
together for other reasons, like those that are used to describe
data or those that express tentative conclusions.
Finding strong diagnostic topics for almost all of the minor
categories suggests that these categories have differences that
can be expressed in terms of the statistical structure recovered
by our algorithm. The topics discovered by the algorithm are
found in a completely unsupervised fashion, using no informa-
tion except the distribution of the words themselves, implying
that the minor categories capture real differences in the content
of abstracts, at the level of the words used by authors. It also
shows that this algorithm finds genuinely informative structure
in the data, producing topics that connect with our intuitive
understanding of the semantic content of documents.
Hot and Cold Topics. Historians, sociologists, and philosophers of
science and scientists themselves recognize that topics rise and
fall in the amount of scientific interest they generate, although
whether this is the result of social forces or rational scientific
practice is the subject of debate (e.g., refs. 16 and 17). Because
our analysis reduces a corpus of scientific documents to a set of
topics, it is straightforward to analyze the dynamics of these
topics as a means of gaining insight into the dynamics of science.
If understanding these dynamics is the goal of our analysis, we
can formulate more sophisticated generative models that incor-
porate parameters describing the change in the prevalence of
topics over time. Here, we present a basic analysis based on a
post hoc examination of the estimates of θ produced by the
model. Being able to identify the ‘‘hot topics’’ in science at a
particular point is one of the most attractive applications of this
kind of model, providing quantitative measures of the preva-
lence of particular kinds of research that may be useful for
historical purposes and for determination of targets for scientific
funding. Analysis at the level of topics provides the opportunity
to combine information about the occurrences of a set of
semantically related words with cues that come from the content
of the remainder of the document, potentially highlighting trends
that might be less obvious in analyses that consider only the frequencies of single words.

Fig. 3. Model selection results, showing the log-likelihood of the data for different settings of the number of topics, T. The estimated standard errors for each point were smaller than the plot symbols.
To find topics that consistently rose or fell in popularity from
1991 to 2001, we conducted a linear trend analysis on θ_j by year,
using the same single sample as in our previous analyses. We
applied this analysis to the sample used to generate Fig. 4.
Consistent with the idea that science shows strong trends, with
topics rising and falling regularly in popularity, 54 of the topics
showed a statistically significant increasing linear trend, and 50
showed a statistically significant decreasing linear trend, both at
the P < 0.0001 level. The three hottest and coldest topics,
assessed by the size of the linear trend test statistic, are shown
in Fig. 5. The hottest topics discovered through this analysis were
topics 2, 134, and 179, corresponding to global warming and
climate change, gene knockout techniques, and apoptosis (programmed cell death), the subject of the 2002 Nobel Prize in Physiology. The cold topics were not topics that lacked prevalence in the corpus but those that showed a strong decrease in popularity over time. The coldest topics were 37, 289, and 75, corresponding to sequencing and cloning, structural biology, and immunology. All these topics were very popular in about 1991 and fell in popularity over the period of analysis. The Nobel Prizes again provide a good means of validating these trends, with prizes being awarded for work on sequencing in 1993 and immunology in 1989.

Fig. 4. (Upper) Mean values of θ at each of the diagnostic topics for all 33 PNAS minor categories, computed by using all abstracts published in 2001. Higher probabilities are indicated with darker cells. (Lower) The five most probable words in the topics themselves, listed in the same order as on the horizontal axis in Upper.
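A minimal version of the linear trend analysis described above reads directly off the estimated topic proportions. In the sketch below (our code, not the authors'), mean_theta_by_year is assumed to be a years × topics matrix of mean θ_j values, and scipy.stats.linregress supplies the linear trend test.

```python
import numpy as np
from scipy.stats import linregress

def topic_trends(years, mean_theta_by_year, p_cutoff=1e-4):
    """Linear trend of each topic's mean theta over years (hot = rising, cold = falling)."""
    hot, cold = [], []
    for j in range(mean_theta_by_year.shape[1]):
        r = linregress(years, mean_theta_by_year[:, j])
        t_stat = r.slope / r.stderr        # linear trend test statistic
        if r.pvalue < p_cutoff:
            (hot if r.slope > 0 else cold).append((j, t_stat))
    hot.sort(key=lambda x: -x[1])          # strongest positive trends first
    cold.sort(key=lambda x: x[1])          # strongest negative trends first
    return hot, cold
```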
Tagging Abstracts. Each sample produced by our algorithm con-
sists of a set of assignments of words to topics. We can use these
assignments to identify the role that words play in documents. In
particular, we can tag each word with the topic to which it was
assigned and use these assignments to highlight topics that are
particularly informative about the content of a document. The
abstract shown in Fig. 6 is tagged with topic labels as superscripts.
Words without superscripts were not included in the vocabulary
supplied to the model. All assignments come from the same
single sample as used in our previous analyses, illustrating the
kind of words assigned to the evolution topic discussed above
(topic 280).
This kind of tagging is mainly useful for illustrating the content
of individual topics and how individual words are assigned, and
it was used for this purpose in ref. 1. It is also possible to use the
results of our algorithm to highlight conceptual content in other
ways. For example, if we integrate across a set of samples, we can
compute a probability that a particular word is assigned to the
most prevalent topic in a document. This probability provides a
graded measure of the importance of a word that uses informa-
tion from the full set of samples, rather than a discrete measure
computed from a single sample. This form of highlighting is used
to set the contrast of the words shown in Fig. 6 and picks out the
words that determine the topical content of the document. Such
methods might provide a means of increasing the efficiency of
searching large document databases, in particular, because it can
be modified to indicate words belonging to the topics of interest
to the searcher.
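As a sketch of how such a graded measure might be computed (our own code; z_samples is assumed to be an array of topic-assignment vectors for one document's words, one row per retained Gibbs sample):

```python
import numpy as np

def highlight_weights(z_samples):
    """For each word position, the fraction of samples in which that word is assigned
    to whichever topic is most prevalent in the document within that sample."""
    z_samples = np.asarray(z_samples)            # shape: (n_samples, n_words)
    weights = np.zeros(z_samples.shape[1])
    for z in z_samples:
        counts = np.bincount(z)                  # per-sample topic prevalence in this document
        weights += (z == counts.argmax())        # is each word assigned to the top topic?
    return weights / len(z_samples)
```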
Fig. 5. The plots show the dynamics of the three hottest and three coldest topics from 1991 to 2001, defined as those topics that showed the strongest positive and negative linear trends. The 12 most probable words in those topics are shown below the plots.

Fig. 6. A PNAS abstract (18) tagged according to topic assignment. The superscripts indicate the topics to which individual words were assigned in a single sample, whereas the contrast level reflects the probability of a word being assigned to the most prevalent topic in the abstract, computed across samples.

Conclusion
We have presented a statistical inference algorithm for Latent Dirichlet Allocation (1), a generative model for documents in which each document is viewed as a mixture of topics, and have
shown how this algorithm can be used to gain insight into the
content of scientific documents. The topics recovered by our
algorithm pick out meaningful aspects of the structure of science
and reveal some of the relationships between scientific papers in
different disciplines. The results of our algorithm have several
interesting applications that can make it easier for people to
understand the information contained in large knowledge do-
mains, including exploring topic dynamics and indicating the role
that words play in the semantic content of documents.
The results we have presented use the simplest model of this
kind and the simplest algorithm for generating samples. In future
research, we intend to extend this work by exploring both more
complex models and more sophisticated algorithms. Whereas in
this article we have focused on the analysis of scientific docu-
ments, as represented by the articles published in PNAS, the
methods and applications we have presented are relevant to a
variety of other knowledge domains. Latent Dirichlet Allocation
is a statistical model that is appropriate for any collection of
documents, from e-mail records and newsgroups to the entire
World Wide Web. Discovering the topics underlying the struc-
ture of such datasets is the first step to being able to visualize
their content and discover meaningful trends.
We thank Josh Tenenbaum, Dave Blei, and Jun Liu for thoughtful
comments that improved this paper, Kevin Boyack for providing the
PNAS class designations, Shawn Cokus for writing the random number
generator, and Tom Minka for writing the code used for the comparison
of algorithms. Several simulations were performed on the BlueHorizon
supercomputer at the San Diego Supercomputer Center. This work was
supported by funds from the NTT Communication Sciences Laboratory
(Japan) and by a Stanford Graduate Fellowship (to T.L.G.).
1. Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993–1022.
2. Hofmann, T. (2001) Machine Learn. J. 42, 177–196.
3. Cohn, D. & Hofmann, T. (2001) in Advances in Neural Information Processing
Systems 13 (MIT Press, Cambridge, MA), pp. 430–436.
4. Iyer, R. & Ostendorf, M. (1996) in Proceedings of the International Conference
on Spoken Language Processing (Applied Science & Engineering Laboratories,
Alfred I. duPont Inst., Wilmington, DE), Vol. 1, pp. 236–239.
5. Bigi, B., De Mori, R., El-Beze, M. & Spriet, T. (1997) in 1997 IEEE Workshop
on Automatic Speech Recognition and Understanding Proceedings (IEEE,
Piscataway, NJ), pp. 535–542.
6. Ueda, N. & Saito, K. (2003) in Advances in Neural Information Processing
Systems (MIT Press, Cambridge, MA), Vol. 15.
7. Erosheva, E. A. (2003) in Bayesian Statistics (Oxford Univ. Press, Oxford), Vol. 7.
8. Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977) J. R. Stat. Soc. B 39, 1–38.
9. Minka, T. & Lafferty, J. (2002) Expectation-propagation for the generative
aspect model. In Proceedings of the 18th Conference on Uncertainty in Artificial
Intelligence (Elsevier, New York).
10. Newman, M. E. J. & Barkema, G. T. (1999) Monte Carlo Methods in Statistical
Physics (Oxford Univ. Press, Oxford).
11. Gilks, W. R., Richardson, S. & Spiegelhalter, D. J. (1996) Markov Chain Monte
Carlo in Practice (Chapman & Hall, New York).
12. Liu, J. S. (2001) Monte Carlo Strategies in Scientific Computing (Springer, New
York).
13. Geman, S. & Geman, D. (1984) IEEE Trans. Pattern Anal. Machine Intelligence
6, 721–741.
14. Manning, C. D. & Schutze, H. (1999) Foundations of Statistical Natural
Language Processing (MIT Press, Cambridge, MA).
15. Kass, R. E. & Raftery, A. E. (1995) J. Am. Stat. Assoc. 90, 773–795.
16. Kuhn, T. S. (1970) The Structure of Scientific Revolutions (Univ. of Chicago
Press, Chicago), 2nd Ed.
17. Salmon, W. (1990) in Scientific Theories, Minnesota Studies in the Philosophy of
Science, ed. Savage, C. W. (Univ. of Minnesota Press, Minneapolis), Vol. 14.
18. Findlay, C. S. (1991) Proc. Natl. Acad. Sci. USA 88, 4874–4876.
... Thus, determining the quality, coherence or relevance of the results often requires human interpretation. Griffiths and Steyvers (2004) emphasized the role of human judgement in evaluating the quality and interpretability of topics. They suggested that while various statistical measures can provide guidance on the number of topics, the final decision often benefits from human expertise to ensure the topics are coherent, interpretable and aligned with the objectives of the analysis (Griffiths & Steyvers, 2004). ...
... Griffiths and Steyvers (2004) emphasized the role of human judgement in evaluating the quality and interpretability of topics. They suggested that while various statistical measures can provide guidance on the number of topics, the final decision often benefits from human expertise to ensure the topics are coherent, interpretable and aligned with the objectives of the analysis (Griffiths & Steyvers, 2004). Due to the reasons explained above, we evaluated the results obtained using both the elbow method and coherence scores to determine the optimal number of topics. ...
Article
Full-text available
Aim This study is set to determine the main topics of the nursing field and to show the changing perspectives over time by analysing the abstracts of several major nursing research journals using text mining methodology. Design Text mining and network analysis. Methods Text analysis combines automatic and manual operations to identify patterns in unstructured data. Detailed searches covering 1998–2021 were conducted in PubMed archives to collect articles from six nursing journals: Journal of Advanced Nursing , International Journal of Nursing Studies , Western Journal of Nursing Research , Nursing Research , Journal of Nursing Scholarship and Research in Nursing and Health . This study uses a four‐phase text mining and network approach, gathering text data and cleaning, preprocessing, text analysis and advanced analyses. Analyses and data visualization were performed using Endnote, JMP, Microsoft Excel, Tableau and VOSviewer versions. From six journals, 17,581 references in PubMed were combined into one EndNote file. Due to missing abstract information, 2496 references were excluded from the study. The remaining references ( n = 15,085) were used for the text mining analyses. Results Eighteen subjects were determined into two main groups; research method topics and nursing research topics. The most striking topics are qualitative research, concept analysis, advanced practice in the downtrend, and literature search, statistical analysis, randomized control trials, quantitative research, nurse practice environment, risk assessment and nursing science. According to the network analysis results, nursing satisfaction and burnout and nursing practice environment are highly correlated and represent 10% of the total corpus. This study contributes in various ways to the field of nursing research enhanced by text mining. The study findings shed light on researchers becoming more aware of the latest research status, sub‐fields and trends over the years, identifying gaps and planning future research agendas. No patient or public contribution.
... Latent Dirichlet Allocation (LDA) is the major topic of this study. The number of latent themes was calculated using the loglikelihood and perplexity of the data (Arun et al. 2010;Cao et al. 2009;Griffiths and Steyvers 2004). Perplexity is a measure that shows whether 'the model predicts the remaining words in a given subject after witnessing a portion of it, whereas log-likelihood evaluates how well the latent topics represent the observed data' (Guerreiro, Rita, and Trigueiros 2016). ...
... Figure 4 shows the range of possible themes investigated in this study, which ranged from K=2 to K=60. (2014) In the figure, Griffiths and Steyvers (2004) show a sharp rise after 20 topics, which suggests that the model is picking up increasingly detailed and distinct themes in the data as the number of subjects rises. The plateau at about 20 topics indicates that adding more topics does not improve the model's capacity to detect more significant patterns after a certain point. ...
Preprint
There have been many academic investigations on promoting human behaviour in favour of environmental sustainability though education. Yet only a limited number of review paper is found which have summarized the key findings of this vital area. This study uses both bibliometric and text mining approaches to examine the pro-environmental behaviour in education literature for the first time. Through bibliometric analysis, different networks in contemporary literature are highlighted. These networks reveal the influence of social welfare on pro-environmental behaviour, highlighting the value of human innate connection to raise environmental consciousness. Additionally, using posterior probability and Latent Dirichlet allocation (LDA), text mining identifies 12 different topic models by log-likelihood estimation, addressing a variety of topics related to environmental education and behaviour, such as how visitors and sustainability in environmental education affects pro-environmental behaviour, how education is provided in schools and universities directing towards sustainability, how the theory of planned behaviour is applied, how education and pro-environmental behaviour are related, and how sustainable education and travel are explored through different channels.
... The GDCMLDA model involves many hidden parameters, leading to the computation of the posterior distribution, that is not analytically tractable. In such situations, approximation techniques are employed, and one widely adopted method is Gibbs sampling [20]. This method is a Markov chain Monte Carlo (MCMC) approach that iteratively samples from the conditional distributions of the latent variables, allowing us to Specifically, our model contains six unobserved variables: a, b, β, ϕ, θ, and z, categorized into per-document or perword parameters ϕ, θ, and z) and hyperparameters (a, b, and β). ...
Conference Paper
Topic modeling is a powerful tool with wide-ranging applications, and many models have been developed and used over the years. However, most of these models have a common weakness: they do not account for the burstiness of word usage. This phenomenon, where a word is more likely to be used again once it appears in a document, is a key feature of natural language. In this paper, we present a new topic modeling approach that integrated Generalized Dirichlet and Dirichlet Compound Multinomial (DCM) distributions to explicitly model word burstiness while accommodating diverse and adaptable patterns of topic proportions. Our experimental results on various text datasets demonstrate the efficiency of our proposed model compared to existing baseline models, achieving better perplexity and coherence scores.
... To find a "good" number of topics, four different metrics were calculated. Arun et al. (2010) and Cao et al. (2009) which need to be minimized and Griffiths (Griffiths and Steyvers 2004) and Deveaud et al. (2014) which need to be maximized. These measures select the best number of topics using a symmetric Kullback-Leibler divergence of salient distributions which are derived from the factorization of the document-term matrix (Marin Vargas et al. 2021). ...
Article
Full-text available
Background Growing interest in agrobiodiversity and sustainable agricultural practices has stimulated debates on diversifying cropping systems, furthering the potential for the reintroduction of underutilised crops. These crops may support multiple ecosystem services and enhance food security and agricultural value chains. This study used a systematic mapping approach to collate and summarise the state of research literature addresses the research question: What is the evidence for ecosystem service provision and economic value of underutilised crops? We focused on oats, triticale, hull-less barley, narrow-leaved lupin, buckwheat and faba beans due to their limited use in Europe, their broad gene pool, ecological benefits, and nutritional value. Method Three academic databases were used to identify research articles investigating the impacts of using the six underutilised crops of interest on outcomes including breeding, agronomic traits, nutrition and health, and economic values. In addition, current and recently completed European projects were searched to identify ongoing relevant research. After screening for relevance, data was extracted from all included articles and projects and imported into a spreadsheet for cross-tabulation and to produce descriptive statistics. Results From an initial 34,522 articles identified by the searches, 1346 relevant primary research articles containing 2229 studies were included. A total of 38 relevant European projects were identified, with 112 research results or goals relating to the six underutilised crops. Faba bean was the most common crop in both European projects and published literature. No current projects had a focus on hull-less barley. Agronomic traits were the most common primary research topic across the crops (56.39%), with oats and faba bean being well researched. Hull-less barley was the least studied crop across all topics. Within sub-topics related to specific ecosystem services, desirable traits, disease, weed and pest control all ranked highly, whilst invertebrate diversity and nitrogen fixation ranked lowest. Conclusion Primary research varies between crops and topics, with hull-less barley receiving the least interest. Key knowledge gaps were identified in all crops across all topics relating to breeding tools, breeding for desirable traits, agronomic traits of buckwheat, narrow-leaved lupin and hull-less barley, inclusion of the crops in human nutrition and health, and the socioeconomics of these crops. Evidence presented in this map could inform further research areas with these crops and aid future policy making for the inclusion of these crops in rotations and practices that could benefit all stakeholders along the food systems value chain.
... This decision was informed by prior knowledge of the type and quantity of questions the examiners are instructed to ask during the ADOS conversation activities. Hyperparameter estimation was done using the variational expectation-maximization (VEM) algorithm with a starting α value of 50/k (Grün and Hornik, 2011;Griffiths and Steyvers, 2004). ...
Conference Paper
Full-text available
Topic distribution matrices created by topic models are typically used for document classification or as features in a separate machine learning algorithm. Existing methods for evaluating these topic distributions include metrics such as coherence and perplexity; however, there is a lack of statistically grounded evaluation tools. We present a statistical method for investigating group differences in the document-topic distribution vectors created by Latent Dirichlet Allocation (LDA) that uses Aitchison geometry to transform the vectors, multivariate analysis of variance (MANOVA) to compare sample means, and partial eta squared to calculate effect size. Using a corpus of dialogues between Autistic and Typically Developing (TD) children and trained examiners, we found that the topic distributions of Autistic children differed from those of TD children when responding to questions about social difficulties (p = .0083, partial eta squared = .19). Furthermore, the examiners’ topic distributions differed between the Autistic and TD groups when discussing emotions (p = .0035, partial eta squared = .20), social difficulties (p < .001, partial eta squared = .30), and friends (p = .0224, partial eta squared = .17). These results support the use of topic modeling in studying clinically relevant features of social communication such as topic maintenance.
Article
This study analyses the role of the main Spanish political groups in the polarisation of public opinion and the promotion of the culture of disinformation through Twitter (now Platform X). The study carries out an analysis of issues associated with tweets and retweets in Spanish of the total published (n = 33,506 messages out of a total of 49,288 messages), which are contrasted with 2,730 disinformation publications identified by the two most relevant fact-checking projects in Spain (Maldita.es and Newtral.es). Based on the applied methodology, a political-communicative context is observed on Platform X characterised by a high level of self-promotion and polarisation, facilitated by the communication strategy of specific topics, applied by the actors analysed. The results show how these political actors can play an active and differentiated role in the promotion of disinformation content identified by the Maldita.es and Newtral.es data verification projects. This may contribute to the polarisation of Spanish public opinion on Platform X by delegitimising the opinions of their opponents on issues of interest to the public.
Chapter
This chapter introduces the main topics and objectives discussed in subsequent sections, covering various aspects of opinion mining and sentiment analysis. It addresses different challenges and proposes novel methods. Section 3.1 highlights the need to filter out irrelevant information and personal attacks in online discussions to focus on evaluative opinion sentences. It proposes an unsupervised method utilizing natural language processing techniques and machine learning algorithms to automatically filter and classify sentences with evaluative opinions. This section calls for further research to explore more precise and efficient methods for identifying evaluative opinions and their application in sentiment polarity analysis. Section 3.3 presents a novel subproblem in opinion mining, focusing on grouping feature expressions in product reviews. It argues for the necessity of user supervision in practical applications and proposes an EM formulation enhanced with soft constraints to achieve accurate opinion summaries. This section showcases the competence and generality of the proposed method through experimental results from various domains and languages. Section 3.1 introduces the use of topic modeling, specifically the LDA method, for sentiment mining. It extends the LDA method to handle large-scale constraints and proposes two methods for automatically extracting constraints to guide the topic modeling process. The constrained-LDA model and extracted constraints are then applied to group product features, demonstrating superior performance compared to other methods. Section 3.3 addresses the challenge of grouping synonyms in opinion mining, proposing an efficient method based on similarity measurement. Experimental results from different domains validate the effectiveness of the method. Section 3.3 focuses on feature extraction for sentiment classification and compares the impact of different types of features through experimental analysis. This section provides an in-depth study of all feature types and discusses key problems associated with feature extraction algorithms. Section 3.2 explores the use of unsupervised learning methods for sentiment classification, emphasizing their advantages in classifying opinionated texts at different levels and for feature-based opinion mining. This section presents an empirical investigation of unsupervised sentiment classification of Chinese reviews and proposes an algorithm to remove domain-specific sentiment noise words. Section 3.6 introduces the use of substring-group features for sentiment classification through a transductive learning-based algorithm. Experimental results in multiple languages demonstrate the effectiveness of the algorithm and highlight the superiority of the “tfidf-c” approach for term weighting. Therefore, this chapter provides a comprehensive overview of various aspects of opinion mining and sentiment analysis, and proposes several innovative methods to address different challenges. From filtering and classifying evaluative opinion sentences to grouping product features, and utilizing topic modeling and similarity measurement for sentiment mining, this chapter covers a wide range of topics and techniques. The practicality and superiority of the proposed methods are demonstrated through empirical experiments. These studies lay a foundation for further advancements in opinion mining and sentiment analysis, and hold significant value in practical applications.
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
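As a small illustration of the generative process this abstract describes, here is a toy sketch: two hand-set topics over a six-word vocabulary (none of these numbers come from the paper), a Dirichlet-distributed topic mixture per document, and a topic drawn for every word.

```python
# Toy sketch of LDA's generative process: draw a topic mixture per document,
# then draw each word's topic and then the word from that topic's distribution.
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["gene", "dna", "cell", "market", "price", "trade"])
phi = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],    # assumed "biology" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],    # assumed "finance" topic
])
alpha = np.array([0.5, 0.5])           # Dirichlet prior over topic mixtures

def generate_document(n_words):
    theta = rng.dirichlet(alpha)                        # topic mixture for this doc
    z = rng.choice(len(phi), size=n_words, p=theta)     # topic assignment per word
    return [rng.choice(vocab, p=phi[t]) for t in z]     # word drawn from its topic

print(generate_document(8))
```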
Article
Full-text available
This paper presents a novel statistical method for factor analysis of binary and count data which is closely related to a technique known as Latent Semantic Analysis. In contrast to the latter method which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed technique uses a generative latent class model to perform a probabilistic mixture decomposition. This results in a more principled approach with a solid foundation in statistical inference. More precisely, we propose to make use of a temperature controlled version of the Expectation Maximization algorithm for model fitting, which has shown excellent performance in practice. Probabilistic Latent Semantic Analysis has many applications, most prominently in information retrieval, natural language processing, machine learning from text, and in related areas. The paper presents perplexity results for different types of text and linguistic data collections and discusses an application in automated document indexing. The experiments indicate substantial and consistent improvements of the probabilistic method over standard Latent Semantic Analysis.
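The following is a compact sketch of the kind of fitting procedure described, assuming a NumPy implementation of the EM updates with a simple temperature parameter beta; it is an illustrative reconstruction under those assumptions, not Hofmann's code.

```python
# Sketch: probabilistic latent semantic analysis fitted by tempered EM.
import numpy as np

def plsa(n, k, iters=100, beta=1.0, seed=0):
    """n: document-by-word count matrix; k: number of latent classes;
    beta < 1 damps P(w|z) in the E-step (the "temperature" control)."""
    rng = np.random.default_rng(seed)
    D, W = n.shape
    p_z_d = rng.dirichlet(np.ones(k), D)               # P(z | d), shape (D, k)
    p_w_z = rng.dirichlet(np.ones(W), k)               # P(w | z), shape (k, W)
    for _ in range(iters):
        # E-step: responsibilities P(z | d, w)
        joint = p_z_d[:, None, :] * (p_w_z.T[None, :, :] ** beta)
        resp = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from expected counts
        expected = n[:, :, None] * resp                # shape (D, W, k)
        p_w_z = expected.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z

# Toy usage: four documents over a five-word vocabulary, two latent classes.
counts = np.array([[3, 2, 1, 0, 0],
                   [2, 3, 0, 0, 1],
                   [0, 0, 1, 3, 2],
                   [0, 1, 0, 2, 3]])
p_z_d, p_w_z = plsa(counts, k=2, beta=0.9)
print(p_z_d.round(2))
```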
Article
In a 1935 paper and in his book Theory of Probability, Jeffreys developed a methodology for quantifying the evidence in favor of a scientific theory. The centerpiece was a number, now called the Bayes factor, which is the posterior odds of the null hypothesis when the prior probability on the null is one-half. Although there has been much discussion of Bayesian hypothesis testing in the context of criticism of P-values, less attention has been given to the Bayes factor as a practical tool of applied statistics. In this article we review and discuss the uses of Bayes factors in the context of five scientific applications in genetics, sports, ecology, sociology, and psychology. We emphasize the following points:
Article
We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the equivalence between Gibbs distributions and Markov random fields (MRFs), this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states ("annealing"), or, what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel "relaxation" algorithm for MAP estimation. We establish convergence properties of the algorithm and experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.
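To make the annealing idea concrete, here is a toy sketch of restoring a noisy binary image with an Ising-type neighbourhood energy and a Gibbs sampler whose temperature is gradually lowered. The image, the weights beta and lam, and the cooling schedule are all assumptions for illustration, not values from the paper.

```python
# Toy sketch: annealed Gibbs sampling toward a MAP-style binary restoration.
import numpy as np

rng = np.random.default_rng(0)
true = np.zeros((32, 32), dtype=int)
true[8:24, 8:24] = 1                              # a simple square "scene"
noisy = np.where(rng.random(true.shape) < 0.2, 1 - true, true)

beta, lam = 1.0, 1.5                              # smoothness / data-fidelity weights
x = noisy.copy()

def local_energy(img, i, j, value, data):
    # Energy of assigning `value` at (i, j): neighbour disagreement + data mismatch.
    h, w = img.shape
    nb = [img[(i + di) % h, (j + dj) % w]
          for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))]
    return beta * sum(value != v for v in nb) + lam * (value != data[i, j])

for sweep in range(50):
    T = 4.0 * 0.9 ** sweep                        # geometric cooling schedule
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            e = np.array([local_energy(x, i, j, v, noisy) for v in (0, 1)])
            p = np.exp(-(e - e.min()) / T)        # Gibbs update at temperature T
            x[i, j] = rng.choice([0, 1], p=p / p.sum())

print("pixel error rate after annealing:", (x != true).mean())
```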