A Hierarchical Nonparametric Bayesian Approach
to Statistical Language Model Domain Adaptation
Frank Wood and Yee Whye Teh
Gatsby Computational Neuroscience Unit
University College London
London WC1N 3AR, UK
In this paper we present a doubly hierarchi-
cal Pitman-Yor process language model. Its
bottom layer of hierarchy consists of multi-
ple hierarchical Pitman-Yor process language
models, one each for some number of do-
mains. The novel top layer of hierarchy con-
sists of a mechanism to couple together mul-
tiple language models such that they share
statistical strength. Intuitively this sharing
results in the “adaptation” of a latent shared
language model to each domain. We intro-
duce a general formalism capable of describ-
ing the overall model which we call the graph-
ical Pitman-Yor process and explain how to
perform Bayesian inference in it. We present
encouraging language model domain adapta-
tion results that both illustrate the potential
benefits of our new model and suggest new
avenues of inquiry.
Consider the problem of statistical natural language
model domain adaptation. Statistical language mod-
els typically have a very large number of parameters
and thus need a large quantity of training data to pro-
duce good estimates of those parameters. If one re-
quires a domain-specific model, obtaining a sufficient
quantity of domain-specific training data can be both
costly and logistically challenging. It is easy, however,
to obtain non-domain-specific data, e.g. text from the
world wide web. Unfortunately models trained using
such data are often ill-suited for domain-specific ap-
plications (Rosenfeld 2000). The phrase domain adap-
tation describes procedures that take a model trained
on a large amount of non-specific data and adapt it to
work well for a specific domain for which less training
data is available.
This paper describes a way to introduce another level
of hierarchy to the already hierarchical Pitman-Yor
process language model (HPYLM) (Teh 2006) such
that a latent shared, non-domain-specific language
model as well as domain-specific models are esti-
mated together. We call the resulting model the dou-
bly hierarchical Pitman-Yor process language model
(DHPYLM) (Section 3). Intuitively such a model is
the natural hierarchical Bayesian approach to domain
adaptation. Our first contribution is the development
of a sensible construction for it. Our second contribu-
tion is the development of a new class of nonparametric
Bayesian models which we call graphical Pitman-Yor
processes and the derivation of generic inference al-
gorithms for them (Section 4). The DHPYLM is a
member of this class. Section 5 compares the DH-
PYLM to previous language model domain adapta-
tion approaches and Section 6 reports on experiments
showing the effectiveness of the new model. We start
in the next section by reviewing language modeling
and the HPYLM in particular.
In this paper we focus on domain adaptation for
Markovian (or n-gram) language models. Such n-gram
models are characterized by assuming that the joint
probability of a corpus C = [w_1 \cdots w_T] takes a simplified form

P(C) = \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}),

where the probability of word w_t is conditionally dependent on at most the n − 1 preceding words.
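As a concrete illustration of this factorization, the following sketch (ours, not the paper's implementation) scores a corpus under a simple add-alpha smoothed maximum-likelihood n-gram estimator; the smoothing is only a hypothetical stand-in for the HPYLM predictive distribution.

import math
from collections import Counter

def train_ngram_counts(words, n=3):
    """Collect context -> word counts for a maximum-likelihood n-gram model."""
    counts, ctx_totals = Counter(), Counter()
    for t, w in enumerate(words):
        ctx = tuple(words[max(0, t - n + 1):t])   # at most n - 1 preceding words
        counts[(ctx, w)] += 1
        ctx_totals[ctx] += 1
    return counts, ctx_totals

def cond_prob(w, ctx, counts, ctx_totals, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(w | ctx); a crude stand-in for the
    HPYLM predictive distribution."""
    return (counts[(ctx, w)] + alpha) / (ctx_totals[ctx] + alpha * vocab_size)

def corpus_log_prob(words, counts, ctx_totals, vocab_size, n=3):
    """log P(C) = sum_t log P(w_t | w_{t-n+1}, ..., w_{t-1})."""
    return sum(
        math.log(cond_prob(w, tuple(words[max(0, t - n + 1):t]),
                           counts, ctx_totals, vocab_size))
        for t, w in enumerate(words))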
adding more data and another (likely more expensive)
resource cost to acquire more in-domain training data.
While the costs specific to each application domain
are different, Fig. 3 suggests that both adding more
in-domain data and adding more out-of-domain data
monotonically improve test perplexity.
At the top of Fig. 3 we plot the “baseline” test perplexity for
a HPYLM model trained on SOU corpus data alone
(this is the same baseline as was established in Fig. 2).
Tests were run for each combination of Brown and
SOU training corpus sizes shown by the small crosses
and absolute test perplexity improvement was inter-
polated between these points to produce isosurfaces
of test perplexity improvement. An example reading
from this figure indicates that with a SOU training
corpus of 20,000 words, adding one million words of
Brown data will reduce the SOU test corpus perplex-
ity from near 500 to somewhat below 360. Equivalent
test corpus perplexity could be achieved with a SOU-only
HPYLM model by more than doubling the amount
of domain-specific SOU training data. In applica-
tion domains where adding more out-of-domain data
is significantly cheaper than acquiring more in-domain
training data, this could result in substantial savings.
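For reference, test perplexity as reported above is the exponentiated average negative log-probability that a model assigns to the held-out corpus. The sketch below shows that computation generically; it is not the authors' evaluation code, and cond_prob stands for whatever per-word predictive distribution (e.g. an HPYLM's) is being evaluated.

import math

def perplexity(test_words, cond_prob, n=3):
    """Perplexity = exp(-(1/T) * sum_t log P(w_t | preceding n-1 words)),
    where cond_prob(w, context) is any per-word predictive probability."""
    total_log_prob = 0.0
    for t, w in enumerate(test_words):
        context = tuple(test_words[max(0, t - n + 1):t])
        total_log_prob += math.log(cond_prob(w, context))
    return math.exp(-total_log_prob / len(test_words))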
In this paper we have introduced a new approach to
statistical language model domain adaptation. This
approach achieves encouraging domain adaptation re-
sults, results that suggest a more thorough and data-
intensive comparison with other existing domain
adaptation approaches is warranted. Additionally, we defined a
graphical Pitman-Yor process, a generalization of the
hierarchical Dirichlet process, and outlined a so-called
multi-floor Chinese restaurant representation for sam-
pling from such a process. Graphical Pitman-Yor pro-
cesses form a general framework within which to ex-
plore a large variety of language models while retaining
the same inference engine. We intend to undertake a
more detailed and thorough theoretical treatment of
graphical Pitman-Yor processes.
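The multi-floor construction is not reproduced here, but its basic building block is the familiar two-parameter (Pitman-Yor) Chinese restaurant seating scheme. The sketch below simulates a single such restaurant with discount d and concentration theta; base_draw is a hypothetical stand-in for a draw from the restaurant's base distribution.

import random

def pitman_yor_crp_seating(num_customers, d=0.5, theta=1.0, base_draw=random.random):
    """Sequentially seat customers in a two-parameter Chinese restaurant:
    an occupied table with c_k customers is chosen with probability
    proportional to (c_k - d), and a new table with probability
    proportional to (theta + d * K), K being the number of occupied tables.
    Each new table is served a dish drawn from the base distribution."""
    table_counts, dishes = [], []
    for _ in range(num_customers):
        weights = [c - d for c in table_counts] + [theta + d * len(table_counts)]
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_counts):        # open a new table
            table_counts.append(1)
            dishes.append(base_draw())    # hypothetical base-distribution draw
        else:
            table_counts[k] += 1
    return table_counts, dishes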
Lastly, there are a number of generalizations of this
model which we intend to develop and demonstrate.
First, requiring no modification to the model, but po-
tentially further improving test performance, we will
experiment with multiple-domain adaptation by adding
more than two corpora into the DHPYLM. Secondly,
the DHPYLM can be integrated into topic models such
that the bag-of-words assumption can be avoided.
M. Bacchiani, M. Riley, B. Roark, and R. Sproat. MAP
adaptation of stochastic grammars. Computer Speech
and Language, 20:41–68, 2006.
J. R. Bellegarda. Statistical language model adaptation:
review and perspectives. Speech Communication, 42:93–
108, 2004.
J. Carletta. Unleashing the killer corpus: experiences in
creating the multi-everything AMI meeting corpus. Lan-
guage Resources and Evaluation Journal, 41:181–190,
2007.
S. F. Chen and J. T. Goodman. An empirical study of
smoothing techniques for language modeling. Technical
Report TR-10-98, Dept. of Comp. Sci., Harvard, 1998.
S. Della Pietra, V. Della Pietra, R. Mercer, and S. Roukos.
Adaptive language model estimation using minimum
discrimination estimation. In Proceedings of the IEEE
International Conference on Acoustics, Speech, and Sig-
nal Processing, pages 633–636, 1992.
S. Goldwater, T. L. Griffiths, and M. Johnson. Interpolat-
ing between types and tokens by estimating power law
generators. In Advances in Neural Information Process-
ing Systems 19, pages 459–466. MIT Press, 2007.
R. Iyer, M. Ostendorf, and H. Gish. Using out-of-domain
data to improve in-domain language models. IEEE Sig-
nal Processing Letters, 4:221–223, 1997.
R. Kneser and H. Ney. Improved backing-off for m-gram
language modeling. In Proceedings of the IEEE Interna-
tional Conference on Acoustics Speech and Signal Pro-
cessing, volume 1, pages 181–184, 1995.
R. Kneser and V. Steinbiss. On the dynamic adaptation of
stochastic language models. In Proceedings of the IEEE
International Conference on Acoustics, Speech, and Sig-
nal Processing, pages 586–589, 1993.
H. Kucera and W. N. Francis. Computational analysis of
present-day American English. Brown University Press,
Providence, RI, 1967.
R. Rosenfeld. Two decades of statistical language model-
ing: where do we go from here? In Proceedings of the
IEEE, volume 88, pages 1270–1278, 2000.
Y. W. Teh. A hierarchical Bayesian language model based
on Pitman-Yor processes. In Proceedings of the Association
for Computational Linguistics, pages 985–992, 2006.
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hi-
erarchical Dirichlet processes. Journal of the American
Statistical Association, 101(476):1566–1581, 2006.
X. Zhu and R. Rosenfeld. Improving trigram language
modeling with the world wide web. In Proceedings of the
IEEE International Conference on Acoustics, Speech,
and Signal Processing, volume 1, pages 533–536, 2001.