
A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation

Frank Wood and Yee Whye Teh
Gatsby Computational Neuroscience Unit
University College London
London WC1N 3AR, UK
{fwood, ywteh}@gatsby.ucl.ac.uk

Abstract

In this paper we present a doubly hierarchical Pitman-Yor process language model. Its bottom layer of hierarchy consists of multiple hierarchical Pitman-Yor process language models, one each for some number of domains. The novel top layer of hierarchy consists of a mechanism to couple together multiple language models such that they share statistical strength. Intuitively this sharing results in the “adaptation” of a latent shared language model to each domain. We introduce a general formalism capable of describing the overall model, which we call the graphical Pitman-Yor process, and explain how to perform Bayesian inference in it. We present encouraging language model domain adaptation results that both illustrate the potential benefits of our new model and suggest new avenues of inquiry.

1 INTRODUCTION

Consider the problem of statistical natural language model domain adaptation. Statistical language models typically have a very large number of parameters and thus need a large quantity of training data to produce good estimates of those parameters. If one requires a domain-specific model, obtaining a sufficient quantity of domain-specific training data can be both costly and logistically challenging. It is easy, however, to obtain non-domain-specific data, e.g. text from the world wide web. Unfortunately models trained using such data are often ill-suited for domain-specific applications (Rosenfeld 2000). The phrase domain adaptation describes procedures that take a model trained on a large amount of non-specific data and adapt it to work well for a specific domain for which less training data is available.

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

This paper describes a way to introduce another level of hierarchy to the already hierarchical Pitman-Yor process language model (HPYLM) (Teh 2006) such that a latent shared, non-domain-specific language model as well as domain-specific models are estimated together. We call the resulting model the doubly hierarchical Pitman-Yor process language model (DHPYLM) (Section 3). Intuitively such a model is the natural hierarchical Bayesian approach to domain adaptation. Our first contribution is the development of a sensible construction for it. Our second contribution is the development of a new class of nonparametric Bayesian models which we call graphical Pitman-Yor processes and the derivation of generic inference algorithms for them (Section 4). The DHPYLM is a member of this class. Section 5 compares the DHPYLM to previous language model domain adaptation approaches and Section 6 reports on experiments showing the effectiveness of the new model. We start in the next section by reviewing language modeling and the HPYLM in particular.

2 LANGUAGE MODELING REVIEW

In this paper we focus on domain adaptation for Markovian (or n-gram) language models. Such n-gram models are characterized by assuming that the joint probability of a corpus C = [w_1 ··· w_T] takes a simplified form

P(C) = ∏_{t=1}^{T} P(w_t | [w_{t−1} ··· w_{t−n+1}])

where the probability of word w_t is conditionally dependent on at most the n − 1 preceding words.

The maximum likelihood estimate of n-gram model parameters is likely to overfit, particularly when there are zero counts, thus regularization of the model through “smoothing” is usually necessary (Chen and Goodman 1998). Recently the best known n-gram smoothing approach, interpolated Kneser-Ney (Kneser and Ney 1995, Chen and Goodman 1998), was shown to be equivalent to approximate inference in the HPYLM (Teh 2006, Goldwater et al. 2007). Further, full posterior inference in the HPYLM was shown to outperform interpolated Kneser-Ney. Because of this we chose to use the HPYLM as a building block in our model.
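For concreteness, the n-gram factorization reviewed above can be sketched in a few lines. This is our own illustration, not code from the paper; `cond_prob` stands in for any smoothed conditional model such as the HPYLM predictive distribution.

```python
import math

def ngram_log_prob(corpus, cond_prob, n):
    """Log-probability of a corpus under the n-gram factorization
    P(C) = prod_t P(w_t | w_{t-n+1} ... w_{t-1})."""
    logp = 0.0
    for t, w in enumerate(corpus):
        context = tuple(corpus[max(0, t - n + 1):t])  # at most n-1 preceding words
        logp += math.log(cond_prob(w, context))
    return logp

# Toy stand-in for a smoothed conditional model: uniform over a 3-word vocabulary.
vocab = ["a", "b", "c"]
uniform = lambda w, context: 1.0 / len(vocab)
logp = ngram_log_prob(["a", "b", "a", "c"], uniform, n=3)
```

Any properly normalized `cond_prob` can be dropped in without changing the factorization itself; smoothing only changes how the conditional probabilities are estimated.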

2.1 HIERARCHICAL PITMAN-YOR PROCESS LANGUAGE MODEL

The HPYLM is a hierarchical nonparametric Bayesian language model based on the hierarchical Pitman-Yor process (HPYP) (Teh 2006, Goldwater et al. 2007). The standard definition of an n-gram HPYLM assumes a fixed and finite-sized dictionary of L unique words and has the following generative structure:

G_{[]} ∼ PY(d_0, α_0, U)
G_{[x_1]} ∼ PY(d_1, α_1, G_{[]})
...
G_{[x_j···x_1]} ∼ PY(d_j, α_j, G_{[x_{j−1}···x_1]})
w_t | w_{t−n+1} ··· w_{t−1} ∼ G_{[w_{t−n+1}···w_{t−1}]}    (1)

where the w’s are the observed instances of words (“tokens”) and the x’s range over the unique words (“types”) in the dictionary.

The notation G ∼ PY(d, α, F) means that G is a random distribution drawn from a Pitman-Yor process with concentration parameter α, discount parameter d, and base measure F. One can think of F as being the mean distribution on which G is “centered” in the sense that E[G(x)] = F(x). Lastly U is a uniform distribution over word types. In (1) and in other equations like it later in the paper we omit conditioning variables for reasons of readability. For instance G_{[]} is conditionally dependent on d_0, α_0, and U.

Each G_h is a distribution over words following a particular context h. It can also be thought of as a parameter vector that fully parameterizes a distribution over words. So, in a slight abuse of notation, we will refer to G as a parameter (vector) and a distribution interchangeably. The subscripts on the G’s indicate the preceding context, i.e. G_{[the,United,States,of]} is the distribution over words following the context “the United States of.” In this case the most likely next word is almost certainly “America.”

Starting from the top, (1) says that the distribution over words given no contextual information, G_{[]}, is centered on the uniform distribution. The remaining lines of (1) say that each distribution over words that follows a particular context is centered on a distribution over words following the same context with one word dropped. The directed graphical model with one vertex per G and edges to each G from the G’s that appear in its base distribution forms a suffix tree. In Figure 1 this is the structure that appears in each of the schematic’s triangles.

It becomes apparent that this model is an n-gram smoothing model only when one examines the form of the posterior predictive distribution for the next word to appear in a particular context given the entire training corpus C. If we use h to denote the context vector consisting of n − 1 words then the predictive distribution of the word w appearing after h under this model is

P(w|h,C) = E[ Σ_{k=1}^{K} (c_k − d)/(α + N) δ(w − φ_k) + (α + dK)/(α + N) P(w|h′,C) ]

where K, α, d, [c_k]_{k=1}^{K}, and [φ_k]_{k=1}^{K} are parameters and variables used in the Chinese restaurant franchise sampler for the HPYP (Teh et al. 2006), N is the number of times the context h occurs in the training data, h′ is shorthand for removing one word from the context h, and δ(0) = 1, δ(x) = 0 ∀x ≠ 0 is a standard indicator function. The correspondence of inference in the HPYLM to historical back-off schemes is established by considering a single-sample approximation to this expectation (Teh 2006). The first term in the sum on the right hand side of this expression (sans expectation) is related to the count of the number of times w occurs after h in the training corpus. The second term corresponds to the “back-off” probability of w following a shorter-by-one-word context h′. This recursive form is similar to that of most back-off schemes.
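A single sample of the seating arrangement turns the expectation above into a concrete computation. The sketch below is our illustration (not the authors' code); it evaluates the bracketed quantity for one restaurant, with `backoff` standing in for P(w|h′,C):

```python
def pyp_predictive(w, tables, d, alpha, backoff):
    """Single-seating-arrangement evaluation of
    sum_k (c_k - d)/(alpha + N) [phi_k = w]  +  (alpha + d K)/(alpha + N) backoff(w),
    where tables is a list of (dish phi_k, count c_k) pairs for the restaurant
    of context h and backoff(w) plays the role of P(w | h', C)."""
    N = sum(c for _, c in tables)
    K = len(tables)
    p = sum(c - d for dish, c in tables if dish == w) / (alpha + N)
    return p + (alpha + d * K) / (alpha + N) * backoff(w)

# Toy restaurant for some context h: two occupied tables.
tables = [("America", 3), ("the", 1)]
p = pyp_predictive("America", tables, d=0.5, alpha=1.0,
                   backoff=lambda w: 1.0 / 1000)  # pretend 1000-word uniform back-off
```

Note that the per-table mass sums to (N − dK)/(α + N) over all words, so the remaining (α + dK)/(α + N) is exactly the mass recursively delegated to the shorter context.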

3 DOUBLY HIERARCHICAL PITMAN-YOR PROCESS LANGUAGE MODEL

The DHPYLM consists of a collection of HPYLMs, one for each domain, connected together through a “latent,” shared HPYLM (see Fig. 1). The intuition behind this model architecture involves imagining a true, general generative process for text which is unobservable but affects the actual generation of all observed domain-specific corpora. Each estimated domain-specific model is then, as is typical in hierarchical Bayesian models, reflective of the general generative process, but sensitive to specific differences arising from each domain.


G^D_{[]} ∼ PY(d^D_0, α^D_0, λ^D_0 U + (1 − λ^D_0) G^L_{[]})
G^D_{[x_1]} ∼ PY(d^D_1, α^D_1, λ^D_1 G^D_{[]} + (1 − λ^D_1) G^L_{[x_1]})
...
G^D_{[x_j···x_1]} ∼ PY(d^D_j, α^D_j, λ^D_j G^D_{[x_{j−1}···x_1]} + (1 − λ^D_j) G^L_{[x_j···x_1]})
w^D_t | w^D_{t−n+1} ··· w^D_{t−1} ∼ G^D_{[w^D_{t−n+1}···w^D_{t−1}]}    (2)

The specific DHPYLM model structure we propose starts with a “latent” HPYLM

G^L_{[]} ∼ PY(d^L_0, α^L_0, U)
G^L_{[x_1]} ∼ PY(d^L_1, α^L_1, G^L_{[]})
...
G^L_{[x_j···x_1]} ∼ PY(d^L_j, α^L_j, G^L_{[x_{j−1}···x_1]})    (3)

to which no observations are directly attributed. The formulae in (3) are exactly the same as those in (1) except that each variable has a superscript indicating its membership in the latent language model. In the graphical model shown in Figure 1 this latent language model runs down the left column.

Additionally the DHPYLM has an HPYLM for each domain D (Figure 1, graphical model, right column). The domain-specific HPYLMs in our model share the same HPYP suffix tree structure as the latent model, but they differ significantly in the way the base distribution for each PYP is specified.

The domain-specific generative model is given in Eqn. 2. All but the last line in Eqn. 2 show a distribution over words following a particular context being “centered” on a mixture with two parts. This is key to how and why this model works for natural language model domain adaptation. The first member of each mixture is a distribution over words following a context that is one word shorter, from the same domain; the second member is a distribution over words following the full context, from the latent language model instead. This implies that the DHPYLM can be understood as a model which “backs off” both by dropping words from the context and by dropping domain specificity (retaining the full context). Further, as this mixture base distribution construction is used throughout all levels of the hierarchy, this “back-off” is recursive.
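The two-way back-off mixture can be made concrete with a small sketch. This is our illustration; `domain_prob` and `latent_prob` are hypothetical stand-ins for the relevant predictive distributions, and `lam` plays the role of λ^D_j in Eqn. 2.

```python
def dhpylm_base_mean(word, context, lam, domain_prob, latent_prob):
    """Mean of the base measure of a domain-specific PYP in the DHPYLM:
    a lam-weighted mixture of the same-domain distribution for the context
    shortened by one word and the latent-model distribution for the full
    context (cf. Eqn. 2)."""
    return (lam * domain_prob(word, context[1:]) +     # drop earliest word, stay in domain
            (1 - lam) * latent_prob(word, context))    # keep full context, drop domain

# Toy check with constant stand-in distributions.
p = dhpylm_base_mean("America", ("United", "States", "of"), lam=0.6,
                     domain_prob=lambda w, c: 0.2,
                     latent_prob=lambda w, c: 0.1)
```

Because the same construction is used at every level, evaluating either mixture component recursively triggers the same two-way choice one level up.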

How to estimate such a model from data remains a question. Towards explaining this we first introduce a formalism called the graphical Pitman-Yor process. Models such as the DHPYLM, HPYLM, and others can be expressed as instances of graphical Pitman-Yor processes. Understanding estimation of graphical Pitman-Yor processes therefore will make clear how to estimate a DHPYLM from data.

4 GRAPHICAL PITMAN-YOR PROCESS

A graphical Pitman-Yor process (GPYP) is a directed acyclic graphical model, where each vertex v ∈ V is labeled with a random distribution G_v and each has a PYP prior. An example GPYP is given in Fig. 1. Every edge w → v ∈ E in the GPYP has a non-negative weight λ_{w→v} and the weights are constrained such that the sum of the weights on all of the incoming edges to a vertex equals one, i.e. Σ_{w∈Pa(v)} λ_{w→v} = 1 for all v ∈ V. Pa(v) is the set of parents of v in the DAG. The base distribution of each G_v is a mixture, with the edge weights being the mixing proportions and the G_w’s on the parents w ∈ Pa(v) being the components. The generative model for such a GPYP is

G_v ∼ PY(d_v, α_v, Σ_{w∈Pa(v)} λ_{w→v} G_w)    ∀v ∈ V
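As a data structure, a GPYP is just a DAG carrying per-vertex PYP parameters and normalized incoming edge weights. The container below is a minimal sketch of our own (the class name, fields, and the terminal parent "U" are illustrative, not from the paper):

```python
class GPYP:
    """DAG whose vertices carry PYP parameters (d_v, alpha_v) and whose
    incoming edge weights lambda_{w->v} must sum to one."""

    def __init__(self):
        self.params = {}    # v -> (d_v, alpha_v)
        self.parents = {}   # v -> {w: lambda_{w->v}}

    def add_vertex(self, v, d, alpha, parent_weights=None):
        self.params[v] = (d, alpha)
        self.parents[v] = dict(parent_weights or {})

    def check_weights(self):
        """Enforce the constraint sum_{w in Pa(v)} lambda_{w->v} = 1."""
        for v, pw in self.parents.items():
            if pw and abs(sum(pw.values()) - 1.0) > 1e-12:
                raise ValueError(f"incoming weights of {v} must sum to 1")

g = GPYP()
g.add_vertex("G_L", d=0.5, alpha=1.0)                # latent root, base measure U
g.add_vertex("G_D", d=0.5, alpha=1.0,
             parent_weights={"G_L": 0.3, "U": 0.7})  # domain vertex with mixture base
g.check_weights()
```

Here "U" is treated as a terminal parent (a fixed base measure rather than another vertex), mirroring the role of the uniform distribution at the top of the DHPYLM.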

The parameters of a GPYP are Θ = {d_v, α_v, λ_{w→v} : v ∈ V, w ∈ Pa(v)}, each with its corresponding prior. In most modeling situations we cannot directly observe the random distributions but observe draws from them instead. For instance, we may observe draws {x^n_v}_{n=1}^{N_v} ∼ F(φ^n_v) from a likelihood F whose parameter φ^n_v ∼ G_v is a draw from G_v. In some modeling situations (like our language modeling application) the φ’s are themselves directly observable.

As is usual in Bayesian modeling we are interested in generating posterior samples of the random distributions and GPYP parameters given observations. To do this we develop a representation of the GPYP in which the random distributions are integrated out. We call this representation for the GPYP the multi-floor Chinese restaurant franchise (MFCRF). It builds on the multi-floor Chinese restaurant process which we introduce next.


[Figure 1 graphic omitted: DHPYLM schematic (left) and tri-gram DHPYLM graphical model (right).]

Figure 1: Left: a schematic of the DHPYLM. The triangles each surround a HPYLM model depicted in its Chinese restaurant franchise representation. Each PYP is itself depicted in its Chinese restaurant representation as a (rounded) rectangle with interior circles (tables). The seating arrangement of customers is not indicated. Solid edges going upwards from each PYP rectangle indicate which distribution(s) are members of the base distribution of that PYP. Thick solid lines indicate within-domain base distribution members whereas thin solid lines indicate out-of-domain base distribution members. Dotted lines indicate which switch variable prior is used for each of the PYPs. Right: a tri-gram DHPYLM graphical model. The G’s are distributions over words following particular contexts (given in the subscript). There is one G in this graphical model for every rectangle in the schematic on the left. The context [w^D_{t−2} w^D_{t−1}] of each observation w_t is not explicitly noted. Note that each such observation is attributed to the single corresponding G_{[w^D_{t−2} w^D_{t−1}]}.

4.1 MULTI-FLOOR CHINESE RESTAURANT PROCESS

Consider a single random distribution G_v in the GPYP with a mixture base distribution Σ_{w∈Pa(v)} λ_{w→v} G_w, and consider a sequence of i.i.d. draws φ^n_v ∼ G_v for n = 1,...,N_v. The multi-floor Chinese restaurant process (MFCRP) is an extension of the Chinese restaurant process for normal PYPs which captures both the clustering structure of the draws as well as their association with components of the mixture base distribution.

The way that customers are assigned to tables in the MFCRP is the same as in the CRP. Each draw φ^n_v is identified with a customer entering the restaurant and sitting at a table z^n_v (denoting the cluster that φ^n_v belongs to). Let c^k_v be the number of customers sitting at the kth table. If n customers have already seated themselves according to the MFCRP resulting in K_v tables being occupied then the probabilities of the next customer sitting at a currently occupied table or choosing a new table are given by

P(z^{n+1}_v = k | z^1_v ··· z^n_v) ∝ c^k_v − d_v
P(z^{n+1}_v = K_v + 1 | z^1_v ··· z^n_v) ∝ α_v + d_v K_v    (4)

If a new table is created (second line of Eqn. 4) then K_v is incremented. This means that customers entering the restaurant care about both how many other customers are sitting at a table and how many tables are in the restaurant.

As in the CRP, each table k in the MFCRP is given a label ψ^k_v which is an i.i.d. draw from the base distribution. Since this is a mixture we can achieve this by picking component w with probability λ_{w→v}, and drawing from the chosen parent distribution G_w. Let s^k_v be the chosen component. Each table is also labelled with this quantity. Metaphorically, this corresponds to table k being located on floor s^k_v of a multi-floor restaurant.

4.2 MULTI-FLOOR CHINESE RESTAURANT FRANCHISE

Returning to the GPYP, we now consider marginalizing out all of the random distributions G_v’s in the graph and replacing them with corresponding MFCRP representations. The MFCRP representation only requires being able to draw from the PYP base distribution. This means that we can directly use this representation for all G_v’s that are leaf vertices in the graph because we can draw from the base distribution even if it is a mixture. Accordingly all of the table labels in the resulting MFCRP representations of leaf vertices arise from draws from their corresponding base distributions. The specific parent in the graph from which they were drawn is indicated by the corresponding floor indicator variables. This means that one can think of the table labels in a leaf node as being i.i.d. “observations” from the associated parents in the GPYP. With this insight it becomes apparent that we can repeat this procedure of switching from G_v to the MFCRP representation recursively up the graph because a table in any given restaurant must always correspond to a customer in one of its parent restaurants, the identity of which is determined by the state of the table-specific floor indicator variable.

The resulting representation, having replaced each G_v with a MFCRP, stipulates that customers in any given restaurant (indexed by vertex v) either must be associated with direct observations of draws from the underlying G_v, or must have come from a table in a child restaurant. We call the resulting representation of the whole GPYP the multi-floor Chinese restaurant franchise (MFCRF). The MFCRF is a generalization of the Chinese restaurant franchise representation of the Pitman-Yor process (Teh 2006) to the situation here where each restaurant can have multiple parent restaurants. The distinctive characteristic of the MFCRF representation is that each table corresponds to a customer in one of a set of parent restaurants rather than a single parent restaurant and that each table maintains a label of the identity of the parent restaurant.

4.3 GPYP GIBBS SAMPLER

It is straightforward to derive a Gibbs sampler for the posterior of a GPYP in the MFCRF representation. Let X^n_v be the set of observations from all restaurants that can be traced to customer n in restaurant v. Let F(X^n_v | ψ) be the probability of observing X^n_v given parameter ψ. As before, let z^n_v indicate the table at which customer n sits. The update equations for the indicator variables associated with a single vertex are

P(z^n_v = k | {z^1_v ··· z^{N_v}_v} \ z^n_v, X^n_v, Θ) ∝ max((c^{k−}_v − d_v), 0) F(X^n_v | ψ^k_v)

P(z^n_v = K^−_v + 1, s^{K^−_v+1}_v = w | {z^1_v ··· z^{N_v}_v} \ z^n_v, X^n_v, Θ) ∝ (α_v + d_v K^−_v) λ_{w→v} ∫ F(X^n_v | φ) G_w(φ) dφ.    (5)

Here the number of customers sitting at each table (c^{k−}_v) and the total number of occupied tables (K^−_v) are tallied with the current customer n “unseated”. Note that the second equation above is a joint distribution over the parameter and floor indicator variables and includes the term λ_{w→v}. The value of λ_{w→v} strongly influences the “floor” on which the table ends up. The floor variable is implicit in the first equation since s^k_v is fixed.

These update equations describe how to unseat and reseat customers in a single restaurant. As in the Chinese restaurant franchise sampler for the HDP (Teh et al. 2006) and as described in the preceding section, the internal state of all of the restaurants must be consistent. For instance, if in sampling one of the restaurants in the GPYP a new table is created, then we know that its label had to have been a draw from one of its base distributions. This must be reflected in the MFCRF representation by recursively adding a customer (and table if necessary) to the corresponding parent restaurant. Conversely, if a table becomes empty its associated customer (and table if necessary) must be removed from the chosen parent restaurant. These updates propagate changes to the restaurant to the rest of the franchise. The complete MFCRF sampler then consists of visiting every restaurant in the GPYP and unseating and reseating every customer in all of the restaurants, maintaining consistency throughout the hierarchy by adding or removing customers from parent restaurants when tables become occupied or empty in child restaurants. The main difference between the MFCRF and the Chinese restaurant franchise samplers is that floor variables must be maintained in order to keep track of the parent restaurants from which each table came.
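The recursive bookkeeping can be sketched with a deliberately simplified rule that always opens a new table, just to show how table creation in a child restaurant propagates a customer to a parent. This is a toy illustration of ours; a real sampler would seat customers stochastically according to Eqn. 4 and would also handle removals.

```python
def add_customer(franchise, v):
    """franchise[v] = {'tables': [floor labels], 'parents': [parent names]}.
    Toy rule: every customer opens a new table whose floor label is the
    first parent; each new table then sends one customer upstream."""
    node = franchise[v]
    if not node['parents']:            # root restaurant: label drawn from U
        node['tables'].append('U')
        return
    floor = node['parents'][0]
    node['tables'].append(floor)       # new table located on that floor
    add_customer(franchise, floor)     # the new table is a customer upstream

franchise = {
    'domain': {'tables': [], 'parents': ['latent']},
    'latent': {'tables': [], 'parents': []},
}
add_customer(franchise, 'domain')
```

The mirror-image operation, removing the upstream customer when a table empties, follows the same recursion in reverse.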

A complete posterior sampler for the GPYP requires sampling the parameters α_v’s, d_v’s, λ_{w→v}’s, and the ψ^k_v’s at the top of the GPYP as well. Next we describe how these variables are sampled in the specific case of the DHPYLM.

4.4 DHPYLM ESTIMATION

Note that the DHPYLM language model as previously described is a GPYP with a likelihood of the form F(x; ψ) = δ(x − ψ) where x is a token (word instance) and ψ is a type (unique word identity). To be concrete we restrict ourselves to describing the specific model used in our experiments. This is a DHPYLM model with context length equal to two (corresponding to a trigram model) and with the Λ^D_j’s shared across restaurants on the same level of the tree (see the dotted lines in the schematic on the left in Fig. 1). This clearly is not a restriction imposed by the underlying model; greater depths and different tying of the back-off mixtures are certainly possible. The prior we use is S_j = PYP(d^S_j, α^S_j, U_2) where U_2 is the uniform distribution over {0,1}. By choosing this prior we are able to marginalize out the Λ’s and utilize the general GPYP estimation machinery to sample the switch variables as well. This is a difference worth highlighting between our approach to language model domain adaptation and the prior art. We do not