A Hierarchical Nonparametric Bayesian Approach
to Statistical Language Model Domain Adaptation
Frank Wood and Yee Whye Teh
Gatsby Computational Neuroscience Unit
University College London
London WC1N 3AR, UK
{fwood, ywteh}@gatsby.ucl.ac.uk
Abstract
In this paper we present a doubly hierarchi-
cal Pitman-Yor process language model. Its
bottom layer of hierarchy consists of multi-
ple hierarchical Pitman-Yor process language
models, one each for some number of do-
mains. The novel top layer of hierarchy con-
sists of a mechanism to couple together mul-
tiple language models such that they share
statistical strength. Intuitively this sharing
results in the “adaptation” of a latent shared
language model to each domain. We intro-
duce a general formalism capable of describ-
ing the overall model which we call the graph-
ical Pitman-Yor process and explain how to
perform Bayesian inference in it. We present
encouraging language model domain adapta-
tion results that both illustrate the potential
benefits of our new model and suggest new
avenues of inquiry.
1 INTRODUCTION
Consider the problem of statistical natural language
model domain adaptation. Statistical language mod-
els typically have a very large number of parameters
and thus need a large quantity of training data to pro-
duce good estimates of those parameters. If one re-
quires a domain-specific model, obtaining a sufficient
quantity of domain-specific training data can be both
costly and logistically challenging. It is easy, however,
to obtain non-domain-specific data, e.g. text from the
world wide web. Unfortunately models trained using
such data are often ill-suited for domain-specific
applications (Rosenfeld 2000).

[Appearing in Proceedings of the 12th International
Conference on Artificial Intelligence and Statistics
(AISTATS) 2009, Clearwater Beach, Florida, USA.
Volume 5 of JMLR: W&CP 5. Copyright 2009 by the
authors.]

The phrase domain adap-
tation describes procedures that take a model trained
on a large amount of non-specific data and adapt it to
work well for a specific domain for which less training
data is available.
This paper describes a way to introduce another level
of hierarchy to the already hierarchical Pitman-Yor
process language model (HPYLM) (Teh 2006) such
that a latent shared, non-domain-specific language
model as well as domain-specific models are esti-
mated together. We call the resulting model the dou-
bly hierarchical Pitman-Yor process language model
(DHPYLM) (Section 3). Intuitively such a model is
the natural hierarchical Bayesian approach to domain
adaptation. Our first contribution is the development
of a sensible construction for it. Our second contribu-
tion is the development of a new class of nonparametric
Bayesian models which we call graphical Pitman-Yor
processes and the derivation of generic inference al-
gorithms for them (Section 4). The DHPYLM is a
member of this class. Section 5 compares the DH-
PYLM to previous language model domain adapta-
tion approaches and Section 6 reports on experiments
showing the effectiveness of the new model. We start
in the next section by reviewing language modeling
and the HPYLM in particular.
2 LANGUAGE MODELING REVIEW
In this paper we focus on domain adaptation for
Markovian (or n-gram) language models. Such n-gram
models are characterized by assuming that the joint
probability of a corpus C = [w1···wT] takes a simpli-
fied form

    P(C) = ∏_{t=1}^{T} P(w_t | [w_{t−1} ··· w_{t−n+1}])

where the probability of word w_t is conditionally de-
pendent on at most the n − 1 preceding words. The
maximum likelihood estimate of n-gram model param-
eters is likely to overfit, particularly when there are
zero counts, thus regularization of the model through
“smoothing” is usually necessary (Chen and Goodman
1998). Recently the best known n-gram smoothing
approach, interpolated Kneser-Ney (Kneser and Ney
1995, Chen and Goodman 1998), was shown to be
equivalent to approximate inference in the HPYLM
(Teh 2006, Goldwater et al. 2007). Further, full poste-
rior inference in the HPYLM was shown to outperform
interpolated Kneser-Ney. Because of this we chose to
use the HPYLM as a building block in our model.
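As a concrete illustration, the n-gram factorization above can be evaluated in the log domain as follows; `corpus_log_prob` and `cond_prob` are our own illustrative names, and `cond_prob` stands in for any smoothed conditional model (e.g. an HPYLM predictive), not code from the paper:

```python
import math

def corpus_log_prob(corpus, cond_prob, n=3):
    """Log of P(C) = prod_t P(w_t | w_{t-n+1} ... w_{t-1}).

    `cond_prob(word, context)` is an assumed callable returning the
    conditional probability of `word` given a tuple of up to n-1
    preceding words."""
    logp = 0.0
    for t, w in enumerate(corpus):
        context = tuple(corpus[max(0, t - n + 1):t])  # at most n-1 words
        logp += math.log(cond_prob(w, context))
    return logp
```

Summing log-probabilities rather than multiplying probabilities avoids underflow on realistic corpus lengths.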
2.1 HIERARCHICAL PITMAN-YOR
PROCESS LANGUAGE MODEL
The HPYLM is a hierarchical nonparametric Bayesian
language model based on the hierarchical Pitman-Yor
process (HPYP) (Teh 2006, Goldwater et al. 2007).
The standard definition of an n-gram HPYLM assumes
a fixed and finite sized dictionary of L unique words
and has the following generative structure:
    G_[]                      ∼ PY(d_0, α_0, U)
    G_[x1]                    ∼ PY(d_1, α_1, G_[])
    ...
    G_[xj···x1]               ∼ PY(d_j, α_j, G_[x_{j−1}···x1])
    w_t | w_{t−n+1}···w_{t−1} ∼ G_[w_{t−n+1}···w_{t−1}]        (1)
where the w’s are the observed instances of words
(“tokens”) and the x’s range over the unique words
(“types”) in the dictionary.
G ∼ PY(d, α, F) means that G is a random distribution
drawn from a Pitman-Yor process with concentration
parameter α, discount parameter d, and base mea-
sure F. One can think of F as being the mean dis-
tribution on which G is “centered” in the sense that
E[G(x)] = F(x). Lastly U is a uniform distribution
over word types. In (1) and in other equations like it
later in the paper we omit conditioning variables for
reasons of readability. For instance G_[] is conditionally
dependent on d_0, α_0, and U.
Each G_h is a distribution over words following a partic-
ular context h. It also can be thought of as a parame-
ter vector that fully parameterizes a distribution over
words. So, in a slight abuse of notation, we will refer
to G as a parameter (vector) and a distribution in-
terchangeably. The subscripts on the G’s indicate the
preceding context, i.e. G_[the,United,States,of] is the distri-
bution over words following the context “the United
States of.” In this case the most likely next word is
almost certainly “America.”
Starting from the top, (1) says that the distribution
over words given no contextual information G_[] is cen-
tered on the uniform distribution. The remaining lines
of (1) say that each distribution over words that fol-
lows a particular context is centered on a distribution
over words following the same context with one word
dropped. The directed graphical model with one ver-
tex per G and edges to each G from the G’s that ap-
pear in its base distribution forms a suffix tree. In
Figure 1 this is the structure that appears in each of
the schematic’s triangles.
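The suffix-tree structure just described can be sketched as follows; `PYNode` and `get_node` are hypothetical names introduced for illustration, not code from the paper:

```python
class PYNode:
    """One restaurant G_[context]; `parent` supplies its base measure."""
    def __init__(self, context, parent):
        self.context = context   # tuple of words in reading order
        self.parent = parent     # PYNode for the context minus its oldest word
        self.children = {}

def get_node(root, context):
    """Walk (and extend) the suffix tree so that G_[context] exists.

    The parent of G_[x_j ... x_1] is G_[x_{j-1} ... x_1]: the same
    context with the earliest word dropped, as in Eqn (1)."""
    if len(context) == 0:
        return root              # G_[] sits at the root, centered on U
    parent = get_node(root, context[1:])   # drop the oldest word
    word = context[0]
    if word not in parent.children:
        parent.children[word] = PYNode(context, parent)
    return parent.children[word]
```

Each directed edge from a node to its parent corresponds to one arrow of base-measure dependence in (1), so the whole set of nodes forms the suffix tree shown inside each triangle of Figure 1.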
It becomes apparent that this model is an n-gram
smoothing model only when one examines the form of
the posterior predictive distribution for the next word
to appear in a particular context given the entire train-
ing corpus C. If we use h to denote the context vector con-
sisting of n − 1 words then the predictive distribution
of the word w appearing after h under this model is
    P(w|h,C) = E[ Σ_{k=1}^{K} ((c_k − d)/(α + N)) δ(w − φ_k)
                  + ((α + dK)/(α + N)) P(w|h′,C) ]

where K, α, d, [c_k]_{k=1}^{K}, and [φ_k]_{k=1}^{K} are parameters and
variables used in the Chinese restaurant franchise sam-
pler for the HPYP (Teh et al. 2006), N is the number
of times the context h occurs in the training data, h′
is shorthand for removing one word from the context
h, and δ(0) = 1, δ(x) = 0 ∀x ≠ 0 is a standard indica-
tor function. The correspondence of inference in the
HPYLM to historical back-off schemes is established
by considering a single sample approximation to this
expectation (Teh 2006). The first term in the sum on
the right hand side of this expression (sans expecta-
tion) is related to the count of the number of times w
occurs after h in the training corpus. The second term
corresponds to the “back-off” probability of w follow-
ing a shorter-by-one-word context h′. This recursive
form is similar to that of most back-off schemes.
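The single-sample approximation mentioned above can be sketched as follows, assuming one sampled Chinese-restaurant-franchise state per context; all names and the data layout are our own, not the authors' code:

```python
def pyp_predictive(w, h, state, d, alpha, backoff):
    """Single-sample approximation to P(w|h,C).

    `state[h]` is assumed to hold one sampled restaurant state for
    context h: (counts, tables, K, N), where counts[w] is the number
    of customers at tables labelled w, tables[w] the number of such
    tables, K the total occupied tables, and N the total customers.
    Summing delta(w - phi_k) over tables collapses the first term of
    the expression above to (c_w - d * t_w) / (alpha + N)."""
    if h not in state:                 # unseen context: pure back-off
        return backoff(w, h[1:])
    counts, tables, K, N = state[h]
    c_w = counts.get(w, 0)
    t_w = tables.get(w, 0)
    return (max(c_w - d * t_w, 0.0) / (alpha + N)
            + (alpha + d * K) / (alpha + N) * backoff(w, h[1:]))
```

With `backoff` recursing down to a uniform distribution over word types, this reproduces the interpolated back-off recursion described in the text.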
3 DOUBLY HIERARCHICAL
PITMAN-YOR PROCESS
LANGUAGE MODEL
The DHPYLM consists of a collection of HPYLMs,
one for each domain, connected together through
a “latent,” shared HPYLM (see Fig. 1). The intu-
ition behind this model architecture involves imagin-
ing a true, general generative process for text which
is unobservable but affects the actual generation of
all observed domain-specific corpora. Each estimated
domain-specific model is then, as is typical in hierar-
chical Bayesian models, reflective of the general genera-
tive process, but sensitive to specific differences arising
from each domain.
    G^D_[]        ∼ PY(d^D_0, α^D_0, λ^D_0 U + (1 − λ^D_0) G^L_[])
    G^D_[x1]      ∼ PY(d^D_1, α^D_1, λ^D_1 G^D_[] + (1 − λ^D_1) G^L_[x1])
    ...
    G^D_[xj···x1] ∼ PY(d^D_j, α^D_j, λ^D_j G^D_[x_{j−1}···x1] + (1 − λ^D_j) G^L_[xj···x1])
    w^D_t | w^D_{t−n+1}···w^D_{t−1} ∼ G^D_[w^D_{t−n+1}···w^D_{t−1}]        (2)
The specific DHPYLM model structure we propose
starts with a “latent” HPYLM
    G^L_[]        ∼ PY(d^L_0, α^L_0, U)
    G^L_[x1]      ∼ PY(d^L_1, α^L_1, G^L_[])
    ...
    G^L_[xj···x1] ∼ PY(d^L_j, α^L_j, G^L_[x_{j−1}···x1])                   (3)
to which no observations are directly attributed. The
formulae in (3) are exactly the same as those in (1)
except that each variable has a superscript indicating
its membership in the latent language model. In the
graphical model shown in Figure 1 this latent language
model runs down the left column.
Additionally the DHPYLM has an HPYLM for each
domain D (Figure 1, graphical model, right column).
The domain-specific HPYLM’s in our model share the
same HPYP suffix tree structure as the latent model,
but they differ significantly in the way the base distri-
bution for each PYP is specified.
The domain specific generative model is given in
Eqn. 2. All but the last line in Eqn. 2 show a distri-
bution over words following a particular context being
“centered” on a mixture with two parts. This is key
to how and why this model works for natural language
model domain adaptation. The first member of each
mixture is a distribution over words following a context
that is one word shorter from the same domain; the
second member is a distribution over words following
the full context from the latent language model instead.
This implies that the DHPYLM can be understood
as a model which “backs-off” by both dropping words
from the context and by dropping domain specificity
(retaining the full context). Further, as this mixture
base distribution construction is used throughout all
levels of the hierarchy, this “back-off” is recursive.
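A minimal sketch of this two-way back-off follows, assuming hypothetical callables `domain_prob` and `latent_prob` for the two component models' predictives (these names are ours, not the paper's):

```python
def dhpylm_base(w, context, domain_prob, latent_prob, lam):
    """Base-measure probability of w for a domain restaurant with the
    given context, per Eqn (2): with weight lam, back off *within* the
    domain by dropping the oldest context word; with weight (1 - lam),
    fall back to the latent model at the *full* context."""
    if len(context) == 0:
        # at the root the in-domain component is the uniform distribution U,
        # which we signal here with a None context (an assumed convention)
        in_domain = domain_prob(w, None)
    else:
        in_domain = domain_prob(w, context[1:])   # drop the oldest word
    return lam * in_domain + (1.0 - lam) * latent_prob(w, context)
```

Because this mixture is used as the base distribution at every level, applying it recursively interleaves the two back-off directions (shorter context vs. less domain specificity) all the way down to U.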
How to estimate such a model from data remains a
question. Towards explaining this we first introduce
a formalism called the graphical Pitman-Yor process.
Models such as the DHPYLM, HPYLM, and others
can be expressed as instances of graphical Pitman-
Yor processes. Understanding estimation of graphical
Pitman-Yor processes therefore will make clear how to
estimate a DHPYLM from data.
4 GRAPHICAL PITMAN-YOR
PROCESS
A graphical Pitman-Yor process (GPYP) is a directed
acyclic graphical model, where each vertex v ∈ V is
labeled with a random distribution G_v and each has a
PYP prior. An example GPYP is given in Fig. 1. Ev-
ery edge w → v ∈ E in the GPYP has a non-negative
weight λ_{w→v} and the weights are constrained such that
the sum of the weights on all of the incoming edges to
a vertex equals one, i.e. Σ_{w∈Pa(v)} λ_{w→v} = 1 for all
v ∈ V. Pa(v) is the set of parents of v in the DAG.
The base distribution of each G_v is a mixture, with
the edge weights being the mixing proportions and the
G_w’s on the parents w ∈ Pa(v) being the components.
The generative model for such a GPYP is

    G_v ∼ PY(d_v, α_v, Σ_{w∈Pa(v)} λ_{w→v} G_w)    ∀v ∈ V
The parameters of a GPYP are Θ = {d_v, α_v, λ_{w→v} :
v ∈ V, w ∈ Pa(v)}, each with its corresponding prior.
In most modeling situations we cannot directly ob-
serve the random distributions but observe draws from
them instead. For instance, we may observe draws
{x^n_v}_{n=1}^{N_v} ∼ F(φ^n_v) from a likelihood F whose parame-
ter φ^n_v ∼ G_v is a draw from G_v. In some modeling sit-
uations (like our language modeling application) the
φ’s are themselves directly observable.
As is usual in Bayesian modeling we are interested in
generating posterior samples of the random distribu-
tions and GPYP parameters given observations. To do
this we develop a representation of the GPYP in which
the random distributions are integrated out. We call
this representation for the GPYP the multi-floor Chi-
nese restaurant franchise (MFCRF). It builds on the
multi-floor Chinese restaurant process which we intro-
duce next.
Figure 1: Left: a schematic of the DHPYLM. The triangles each surround a HPYLM model depicted in its
Chinese restaurant franchise representation. Each PYP is itself depicted in its Chinese restaurant representation
as a (rounded) rectangle with interior circles (tables). The seating arrangement of customers is not indicated.
Solid edges going upwards from each PYP rectangle indicate which distribution(s) are members of the base
distribution of that PYP. Thick solid lines indicate within-domain base distribution members whereas thin solid
lines indicate out-of-domain base distribution members. Dotted lines indicate which switch variable prior is used
for each of the PYP’s. Right: a tri-gram DHPYLM graphical model. The G’s are distributions over words
following particular contexts (given in the subscript). There is one G in this graphical model for every rectangle
in the schematic on the left. The context [w^D_{t−2} w^D_{t−1}] of each observation w_t is not explicitly noted. Note that
each such observation is attributed to the single corresponding G_[w^D_{t−2} w^D_{t−1}].
4.1 MULTI-FLOOR CHINESE
RESTAURANT PROCESS
Consider a single random distribution G_v in the GPYP
with a mixture base distribution Σ_{w∈Pa(v)} λ_{w→v} G_w,
and consider a sequence of i.i.d. draws φ^n_v ∼ G_v
for n = 1, ..., N_v. The multi-floor Chinese restau-
rant process (MFCRP) is an extension of the Chinese
restaurant process for normal PYPs which captures
both the clustering structure of the draws as well as
their association with components of the mixture base
distribution.
The way that customers are assigned to tables in the
MFCRP is the same as in the CRP. Each draw φ^n_v
is identified with a customer entering the restaurant
and sitting at a table z^n_v (denoting the cluster that
φ^n_v belongs to). Let c^k_v be the number of customers
sitting at the kth table. If n customers have already
seated themselves according to the MFCRP resulting
in K_v tables being occupied then the probabilities of
the next customer sitting at a currently occupied table
or choosing a new table are given by

    P(z^{n+1}_v = k | {z^1_v ··· z^n_v}) ∝ c^k_v − d_v
    P(z^{n+1}_v = K_v + 1 | {z^1_v ··· z^n_v}) ∝ α_v + d_v K_v        (4)
If a new table is created (second line of Eqn. 4) then
K_v is incremented. This means that customers enter-
ing the restaurant care about both how many other
customers are sitting at a table and how many tables
are in the restaurant.
As in the CRP, each table k in the MFCRP is given
a label ψ^k_v which is an i.i.d. draw from the base dis-
tribution. Since this is a mixture we can achieve this
by picking component w with probability λ_{w→v}, and
drawing from the chosen parent distribution G_w. Let
s^k_v be the chosen component. Each table is also la-
belled with this quantity. Metaphorically, this corre-
sponds to table k being located on floor s^k_v of a multi-
floor restaurant.
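A toy simulation of one MFCRP seating step, following Eqn (4) and the floor-selection rule above; the function name and table data layout are our own simplification, not the authors' implementation:

```python
import random

def mfcrp_seat(tables, d, alpha, parent_weights, draw_from_parent):
    """Seat one customer in a multi-floor Chinese restaurant.

    `tables` is a list of dicts {"count", "floor", "label"}. Occupied
    table k is chosen with weight count_k - d; a new table with weight
    alpha + d*K, in which case a floor w is picked with probability
    parent_weights[w] and the table label is drawn from that parent
    via the assumed helper `draw_from_parent(w)`.
    Returns the index of the chosen table."""
    K = len(tables)
    weights = [t["count"] - d for t in tables] + [alpha + d * K]
    r = random.random() * sum(weights)
    for k, wgt in enumerate(weights):
        r -= wgt
        if r < 0:
            break
    if k < K:                       # join an occupied table
        tables[k]["count"] += 1
    else:                           # open a new table on a sampled floor
        floors = list(parent_weights)
        probs = [parent_weights[f] for f in floors]
        floor = random.choices(floors, weights=probs)[0]
        tables.append({"count": 1, "floor": floor,
                       "label": draw_from_parent(floor)})
    return k
```

Running this repeatedly reproduces the rich-get-richer clustering of the CRP while also recording, per table, which parent distribution its label came from.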
4.2 MULTI-FLOOR CHINESE
RESTAURANT FRANCHISE
Returning to the GPYP, we now consider marginal-
izing out all of the random distributions G_v’s in the
graph and replacing them with corresponding MFCRP
representations. The MFCRP representation only re-
quires being able to draw from the PYP base distri-
bution. This means that we can directly use this rep-
resentation for all G_v’s that are leaf vertices in the
graph because we can draw from the base distribu-
tion even if it is a mixture. Accordingly all of the ta-
ble labels in the resulting MFCRP representations of
leaf vertices arise from draws from their correspond-
ing base distributions. The specific parent in the graph
from which they were drawn is indicated by the cor-
responding floor indicator variables. This means that
one can think of the table labels in a leaf node as be-
ing i.i.d. “observations” from the associated parents in
the GPYP. With this insight it becomes apparent that
we can repeat this procedure of switching from Gv to
the MFCRP representation recursively up the graph
because a table in any given restaurant must always
correspond to a customer in one of its parent restau-
rants, the identity of which is determined by the state
of the table-specific floor indicator variable.
The resulting representation, having replaced each G_v
with a MFCRP, stipulates that customers in any given
restaurant (indexed by vertex v) either must be asso-
ciated with direct observations of draws from the un-
derlying Gv, or must have come from a table in a child
restaurant. We call the resulting representation of the
whole GPYP the multi-floor Chinese restaurant fran-
chise (MFCRF). The MFCRF is a generalization of
the Chinese restaurant franchise representation of the
Pitman-Yor process (Teh 2006) to the situation here
where each restaurant can have multiple parent restau-
rants. The distinctive characteristic of the MFCRF
representation is that each table corresponds to a cus-
tomer in one of a set of parent restaurants rather than
a single parent restaurant and that each table main-
tains a label of the identity of the parent restaurant.
4.3 GPYP GIBBS SAMPLER
It is straightforward to derive a Gibbs sampler for the
posterior of a GPYP in the MFCRF representation.
Let X^n_v be the set of observations from all restaurants
that can be traced to customer n in restaurant v. Let
F(X^n_v | ψ) be the probability of observing X^n_v given pa-
rameter ψ. As before, let z^n_v indicate the table at which
customer n sits. The update equations for the indica-
tor variables associated with a single vertex are

    P(z^n_v = k | {z^1_v ··· z^{N_v}_v} \ z^n_v, X^n_v, Θ)
        ∝ max((c^{k−}_v − d_v), 0) F(X^n_v | ψ^k_v)

    P(z^n_v = K^−_v + 1, s^{K^−_v + 1}_v = w | {z^1_v ··· z^{N_v}_v} \ z^n_v, X^n_v, Θ)
        ∝ (α_v + d_v K^−_v) λ_{w→v} ∫ F(X^n_v | φ) G_w(φ) dφ.    (5)

Here the number of customers sitting at each table
(c^{k−}_v) and the total number of occupied tables (K^−_v)
are tallied with the current customer n “unseated”.
Note that the second equation above is a joint dis-
tribution over the parameter and floor indicator vari-
ables and includes the term λ_{w→v}. The value of λ_{w→v}
strongly influences the “floor” on which the table ends
up. The floor variable is implicit in the first equation
since s^k_v is fixed.
These update equations describe how to unseat and
reseat customers in a single restaurant. As in the Chi-
nese restaurant franchise sampler for the HDP (Teh
et al. 2006) and as described in the preceding sec-
tion, the internal state of all of the restaurants must
be consistent. For instance, if in sampling one of
the restaurants in the GPYP a new table is created,
then we know that its label had to have been a draw
from one of its base distributions. This must be re-
flected in the MFCRF representation by recursively
adding a customer (and table if necessary) to the cor-
responding parent restaurant. Conversely, if a table
becomes empty its associated customer (and table if
necessary) must be removed from the chosen parent
restaurant. These updates propagate changes to the
restaurant to the rest of the franchise. The complete
MFCRF sampler then consists of visiting every restau-
rant in the GPYP and unseating and reseating every
customer in all of the restaurants, maintaining consis-
tency throughout the hierarchy by adding or removing
customers from parent restaurants when tables become
occupied or empty in child restaurants. The main dif-
ference between the MFCRF and the Chinese restau-
rant franchise samplers is that floor variables must be
maintained in order to keep track of the parent restau-
rants from which each table came.
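The consistency bookkeeping can be sketched for the removal direction as follows; the restaurant/table data layout (dicts keyed by stable table ids) is our own simplification for illustration, not the authors' implementation:

```python
def remove_customer(restaurants, v, table_id):
    """Unseat one customer from table `table_id` of restaurant `v`.

    `restaurants[v]` is assumed to map table ids to dicts
    {"count", "floor", "parent_table"}. When a table empties it is
    deleted, and the customer it contributed to the parent restaurant
    (on its recorded floor) is removed recursively, mirroring the
    consistency rule of Section 4.3. `floor` is None for tables whose
    label was drawn directly from U (the top of the hierarchy)."""
    table = restaurants[v][table_id]
    table["count"] -= 1
    if table["count"] == 0:
        del restaurants[v][table_id]
        if table["floor"] is not None:
            remove_customer(restaurants, table["floor"],
                            table["parent_table"])
```

Adding a customer would propagate in the opposite direction: opening a new table in v seats one customer in the parent restaurant chosen by the table's floor variable.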
A complete posterior sampler for the GPYP requires
sampling the parameters α_v’s, d_v’s, λ_{w→v}’s, and the
ψ^k_v’s at the top of the GPYP as well. Next we describe
how these variables are sampled in the specific case of
the DHPYLM.
4.4 DHPYLM ESTIMATION
Note that the DHPYLM language model as previously
described is a GPYP with a likelihood of the form
F(x;ψ) = δ(x−ψ) where x is a token (word instance)
and ψ is a type (unique word identity). To be con-
crete we restrict ourselves to describing the specific
model used in our experiments. This is a DHPYLM
model with context length equal to two (correspond-
ing to a trigram model) and with the Λ^D_j’s shared
across restaurants on the same level of the tree (see
the dotted lines in the schematic on the left in Fig. 1).
This clearly is not a restriction imposed by the un-
derlying model; greater depths and different tying of
the back-off mixtures are certainly possible. The prior
we use is S_j = PYP(d^S_j, α^S_j, U_2) where U_2 is the uni-
form distribution over {0,1}. By choosing this prior
we are able to marginalize out the Λ’s and utilize the
general GPYP estimation machinery to sample the
switch variables as well. This is a difference worth
highlighting between our approach to language model
domain adaptation and the prior art. We do not