
Improvements to the Sequence Memoizer

Jan Gasthaus

Gatsby Computational Neuroscience Unit

University College London

London, WC1N 3AR, UK

j.gasthaus@gatsby.ucl.ac.uk

Yee Whye Teh

Gatsby Computational Neuroscience Unit

University College London

London, WC1N 3AR, UK

ywteh@gatsby.ucl.ac.uk

Abstract

The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of

improvements to the model and inference algorithm, including an enlarged range

of hyperparameters, a memory-efficient representation, and inference algorithms

operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary

proof of the “mysterious” coagulation and fragmentation properties used in the

original paper on the sequence memoizer by Wood et al. (2009). We present some

experimental results supporting our improvements.

1 Introduction

The sequence memoizer (SM) is a Bayesian nonparametric model for discrete sequence data producing

state-of-the-art results for language modeling and compression [1, 2]. It models each symbol of a

sequence using a predictive distribution that is conditioned on all previous symbols, and thus can be

understood as a non-Markov sequence model. Given the very large (infinite) number of predictive

distributions needed to model arbitrary sequences, it is essential that statistical strength be shared in

their estimation. To do so, the SM uses a hierarchical Pitman-Yor process prior over the predictive

distributions [3]. One innovation of the SM over [3] is its use of coagulation and fragmentation

properties [4, 5] that allow for efficient representation of the model using a data structure whose size

is linear in the sequence length. However, in order to make use of these properties, all concentration

parameters, which were allowed to vary freely in [3], were fixed to zero.

In this paper we explore a number of further innovations to the SM. Firstly, we propose a more flexible

setting of the hyperparameters with potentially non-zero concentration parameters that still allow the

use of the coagulation/fragmentation properties. In addition to better predictive performance, the

setting also partially mitigates a problem observed in [1], whereby on encountering a long sequence

of the same symbol, the model becomes overly confident that it will continue with the same symbol.

The second innovation addresses memory usage issues in inference algorithms for the SM. In

particular, current algorithms use a Chinese restaurant franchise representation for the HPYP, where

the seating arrangement of customers in each restaurant is represented by a list, each entry being the

number of customers sitting around one table [3]. This is already an improvement over the naïve

Chinese restaurant franchise in [6] which stores pointers from customers to the tables they sit at, but

can still lead to huge memory requirements when restaurants contain many tables. One approach to

mitigate this problem has been explored in [7], which uses a representation that stores a histogram of

table sizes instead of the table sizes themselves. Our proposal is to store even less, namely only the

minimal statistics about each restaurant required to make predictions: the number of customers and

the number of tables occupied by the customers. Inference algorithms will have to be adapted to this

compact representation, and we describe and compare a number of these.


In Section 2 we will give precise definitions of Pitman-Yor processes and Chinese restaurant processes.

These will be used to define the SM model in Section 3, and to derive the results about the extended

hyperparameter setting in Section 4 and the memory-efficient representation in Section 5. As a

side benefit we will also be able to give an elementary proof of the coagulation and fragmentation

properties in Section 4, which was presented as a fait accompli in [1], while the general and rigorous

treatment in the original papers [4, 5] is somewhat inaccessible to a wider audience.

2 Pitman-Yor Processes and Chinese Restaurant Processes

A Pitman-Yor process (PYP) is a particular distribution over distributions over some probability

space Σ [8, 9]. We denote by PY(α,d,G0) a PYP with concentration parameter α > −d, discount

parameter d ∈ [0,1), and base distribution G_0 over Σ. We can describe a Pitman-Yor process using

its associated Chinese restaurant process (CRP). A Chinese restaurant has customers sitting around

tables which serve dishes. If there are c customers we index them with [c] = {1,...,c}. We define a

seating arrangement of the customers as a set of disjoint non-empty subsets partitioning [c]. Each

subset is a table and consists of the customers sitting around it, e.g. {{1,3},{2}} means customers 1

and 3 sit at one table and customer 2 sits at another by itself. Let A_c be the set of seating arrangements of c customers, and A_{ct} those with exactly t tables. The CRP describes a distribution over seating

arrangements as follows: customer 1 sits at a table; for customer c + 1, if A ∈ A_c is the current seating arrangement, then she joins a table a ∈ A with probability (|a| − d)/(α + c) and starts a new table with probability (α + |A|d)/(α + c). We denote the resulting distribution over A_c as CRP_c(α, d). Multiplying the conditional probabilities together,

    P(A) = ( [α + d]^{|A|−1}_d / [α + 1]^{c−1}_1 ) ∏_{a∈A} [1 − d]^{|a|−1}_1    for each A ∈ A_c,    (1)

where [y]^n_δ = ∏_{i=0}^{n−1} (y + iδ) is Kramp's symbol. Note that the denominator is the normalization constant. Fixing the number of tables to be t ≤ c, the distribution, denoted CRP_{ct}(d), becomes:

    P(A) = (1/S_d(c, t)) ∏_{a∈A} [1 − d]^{|a|−1}_1    for each A ∈ A_{ct},    (2)

where the normalization constant S_d(c, t) = Σ_{A∈A_{ct}} ∏_{a∈A} [1 − d]^{|a|−1}_1 is a generalized Stirling number of type (−1, −d, 0) [10]. These can be computed recursively [3] (see also Section 5). Note that, conditioned on a fixed t, the seating arrangement does not depend on α, only on d.
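The recursion for these generalized Stirling numbers follows from (2) by considering where the last customer sits: either alone at a new table, or at one of the t existing tables, giving S_d(c, t) = S_d(c − 1, t − 1) + (c − 1 − td) S_d(c − 1, t) with S_d(1, 1) = 1. A minimal sketch in Python (our own code, not from the papers; computed in the log domain because, as noted in Section 5, these numbers have a very high dynamic range):

```python
import math

def logaddexp(a, b):
    """log(exp(a) + exp(b)), robust to -inf inputs."""
    if a == float('-inf'):
        return b
    if b == float('-inf'):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_stirling_table(c_max, d):
    """Table of log S_d(c, t) for 1 <= t <= c <= c_max, via the recursion
    S_d(c, t) = S_d(c-1, t-1) + (c-1-t*d) * S_d(c-1, t), with S_d(1, 1) = 1.
    Unreachable entries (t > c) stay at -inf, i.e. S_d = 0."""
    NEG_INF = float('-inf')
    S = [[NEG_INF] * (c_max + 1) for _ in range(c_max + 1)]
    S[1][1] = 0.0  # log 1
    for c in range(2, c_max + 1):
        for t in range(1, c + 1):
            new_table = S[c - 1][t - 1]   # customer c alone at a new table
            join = NEG_INF
            if t <= c - 1:                # customer c joins an existing table
                join = S[c - 1][t] + math.log(c - 1 - t * d)
            S[c][t] = logaddexp(new_table, join)
    return S
```

As a sanity check, with d = 0.5 there are three arrangements of 3 customers at exactly 2 tables, each with weight (1 − d), so S_{0.5}(3, 2) = 1.5; and with d = 0 one recovers the unsigned Stirling numbers of the first kind.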

Suppose G ∼ PY(α, d, G_0) and z_1, ..., z_c | G ∼ iid G. The CRP describes the PYP in terms of its effect on z_{1:c} = z_1, ..., z_c. In particular, marginalizing out G, the distribution of z_{1:c} can be described as follows: draw A ∼ CRP_c(α, d), on each table serve a dish which is an iid draw from G_0, and finally let variable z_i take on the value of the dish served at the table that customer i sat at. Now suppose we wish to perform inference given an observation of z_{1:c}. This is equivalent to conditioning on the dishes that each customer is served. Since customers at the same table are served the same dish, the different values among the z_i's split the restaurant into multiple sections, with the customers and tables in each section being served a distinct dish. There can be more than one table in each section, since multiple tables can serve the same dish (if G_0 has atoms). If s ∈ Σ is a dish, let c_s be the number of z_i's with value s (the number of customers served dish s), t_s the number of tables, and A_s ∈ A_{c_s t_s} the seating arrangement of customers around the tables serving dish s (we reindex the c_s customers to be [c_s]). The joint distribution over seating arrangements and observations is then:¹

    P({c_s, t_s, A_s}, z_{1:c}) = ( ∏_{s∈Σ} G_0(s)^{t_s} ) ( [α + d]^{t_·−1}_d / [α + 1]^{c_·−1}_1 ) ∏_{s∈Σ} ∏_{a∈A_s} [1 − d]^{|a|−1}_1,    (3)

where t_· = Σ_{s∈Σ} t_s and similarly for c_·. We can marginalize out {A_s} from (3) using (2):

    P({c_s, t_s}, z_{1:c}) = ( ∏_{s∈Σ} G_0(s)^{t_s} ) ( [α + d]^{t_·−1}_d / [α + 1]^{c_·−1}_1 ) ∏_{s∈Σ} S_d(c_s, t_s).    (4)

Inference then amounts to computing the posterior of either {t_s, A_s} or only {t_s} given z_{1:c} (the c_s are fixed), and can be achieved by Gibbs sampling or other means.

¹We have omitted the set subscript {·}_{s∈Σ}. We will drop these subscripts when they are clear from context.


3 The Sequence Memoizer and its Chinese Restaurant Representation

In this section we review the sequence memoizer (SM) and its representation using Chinese restaurants [3, 11, 1, 2]. Let Σ be the discrete set of symbols making up the sequences to be modeled, and let Σ* be the set of finite sequences of symbols from Σ. The SM models a sequence x_{1:T} = x_1, x_2, ..., x_T ∈ Σ* using a set of conditional distributions:

    P(x_{1:T}) = ∏_{i=1}^T P(x_i | x_{1:i−1}) = ∏_{i=1}^T G_{x_{1:i−1}}(x_i),    (5)

where G_u(s) is the conditional probability of the symbol s ∈ Σ occurring after a context u ∈ Σ* (the sequence of symbols occurring before s). The parameters of the model consist of all the conditional distributions {G_u}_{u∈Σ*}, and are given a hierarchical Pitman-Yor process (HPYP) prior:

    G_ε ∼ PY(α_ε, d_ε, H),    G_u | G_{σ(u)} ∼ PY(α_u, d_u, G_{σ(u)})    for u ∈ Σ*\{ε},    (6)

where ε is the empty sequence, σ(u) is the sequence obtained by dropping the first symbol in u, and H is the overall base distribution over Σ (we take H to be uniform over a finite Σ). Note that we have generalized the model to allow each G_u to have its own concentration and discount parameters, whereas [1, 2] worked with α_u = 0 and d_u = d_{|u|} (i.e. context-length-dependent discounts).

As in previous works, the hierarchy over {G_u} is represented using a Chinese restaurant franchise [6]. Each G_u has a corresponding restaurant indexed by u. Customers in the restaurant are draws from G_u, tables are draws from its base distribution G_{σ(u)}, and dishes are the drawn values from Σ. For each s ∈ Σ and u ∈ Σ*, let c_{us} and t_{us} be the numbers of customers and tables in restaurant u served dish s, and let A_{us} ∈ A_{c_{us} t_{us}} be their seating arrangement. Each observation of x_i in context x_{1:i−1} corresponds to a customer in restaurant x_{1:i−1} who is served dish x_i, and each table in each restaurant u, being a draw from the base distribution G_{σ(u)}, corresponds to a customer in the parent restaurant σ(u). Thus, the numbers of customers and tables have to satisfy the constraints

    c_{us} = c^x_{us} + Σ_{v: σ(v)=u} t_{vs},    (7)

where c^x_{us} = 1 if s = x_i and u = x_{1:i−1} for some i, and 0 otherwise.

The goal of inference is to compute the posterior over the states {c_{us}, t_{us}, A_{us}}_{s∈Σ, u∈Σ*} of the restaurants (and possibly the concentration and discount parameters). The joint distribution can be obtained by multiplying the probabilities of all seating arrangements (3) in all restaurants:

    P({c_{us}, t_{us}, A_{us}}, x_{1:T}) = ( ∏_{s∈Σ} H(s)^{t_{εs}} ) ∏_{u∈Σ*} ( ( [α_u + d_u]^{t_{u·}−1}_{d_u} / [α_u + 1]^{c_{u·}−1}_1 ) ∏_{s∈Σ} ∏_{a∈A_{us}} [1 − d_u]^{|a|−1}_1 ).    (8)

The first parentheses contain the probability of draws from the overall base distribution H, and the second parentheses contain the probability of the seating arrangement in restaurant u. Given a state of the restaurants drawn from the posterior, the predictive probability of symbol s in context v can then be computed recursively (with P*_{σ(ε)}(s) defined to be H(s)):

    P*_v(s) = (c_{vs} − t_{vs} d_v)/(α_v + c_{v·}) + ( (α_v + t_{v·} d_v)/(α_v + c_{v·}) ) P*_{σ(v)}(s).    (9)
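The recursion (9) translates almost line-for-line into code. A hedged sketch (our own data layout, not from the original implementation: restaurants keyed by context tuples, with counts[v][s] = (c_vs, t_vs) and per-context parameter dictionaries):

```python
def predictive_prob(s, v, counts, alpha, discount, H):
    """Recursive predictive probability of Eq. (9).

    counts[v] maps each symbol s to (c_vs, t_vs); v is a context tuple;
    sigma(v) = v[1:] drops the first (most distant) symbol; the recursion
    bottoms out at the base distribution H below the root."""
    if v is None:  # sigma(epsilon): base distribution
        return H(s)
    tab = counts.get(v, {})
    c_vs, t_vs = tab.get(s, (0, 0))
    c_v = sum(c for c, _ in tab.values())
    t_v = sum(t for _, t in tab.values())
    a, d = alpha[v], discount[v]
    parent = v[1:] if len(v) > 0 else None
    p_parent = predictive_prob(s, parent, counts, alpha, discount, H)
    if a + c_v == 0:  # empty restaurant with alpha_v = 0: defer to parent
        return p_parent
    return (c_vs - t_vs * d) / (a + c_v) + (a + t_v * d) / (a + c_v) * p_parent
```

For example, with a single root restaurant containing two customers of symbol 'a' at one table, α_ε = d_ε = 0.5, and H uniform over two symbols, this gives P*_ε(a) = 0.8 and P*_ε(b) = 0.2, which sum to one as they must.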

4 Non-zero Concentration Parameters

In [1] the authors proposed setting all the concentration parameters to zero. Though limiting the flexibility of the model, this allowed them to take advantage of coagulation and fragmentation properties of PYPs [4, 5] to marginalize out all but a linear number (in T) of restaurants from the hierarchy. We propose the following enlarged family of hyperparameter settings: let α_ε = α > 0 be free to vary at the root of the hierarchy, and set α_u = α_{σ(u)} d_u for each u ∈ Σ*\{ε}. The discounts can vary freely.

[Figure 1: Illustration of the relationship between the restaurants A_1, A_2, C and the F_a.]

In addition to more flexible modeling, this also partially mitigates the

overconfidence problem [2]. To see why, notice from (9) that the predictive probability is a weighted average of predictive probabilities given contexts of various lengths. Since α_v > 0, the model gives higher weight to the predictive probabilities of shorter contexts (compared to α_v = 0). These typically give less extreme values, since they include influences not just from the sequence of identical symbols but also from observations of other symbols in other contexts.

Our hyperparameter settings also retain the coagulation and fragmentation properties, which allow us to marginalize out many PYPs in the hierarchy for efficient inference. We provide an elementary proof of these results in terms of CRPs in the following. First we describe the coagulation and fragmentation operations. Let c ≥ 1 and suppose A_2 ∈ A_c and A_1 ∈ A_{|A_2|} are two seating arrangements where the number of customers in A_1 is the same as the number of tables in A_2. Each customer in A_1 is thus in one-to-one correspondence with a table in A_2, and sits at some table in A_1. Now consider re-representing A_1 and A_2. Let C ∈ A_c be the seating arrangement obtained by coagulating (merging) those tables of A_2 whose corresponding customers in A_1 sit at the same table. Further, split A_2 into sections, one for each table a ∈ C, where each section F_a ∈ A_{|a|} contains the |a| customers and the tables merged to make up a. The converse of coagulating tables of A_2 into C is of course to fragment each table a ∈ C into the smaller tables in F_a. Note that there is a one-to-one correspondence between tables in C and tables in A_1, and the number of customers at each table of A_1 equals the number of tables in the corresponding F_a. Thus A_1 and A_2 can be reconstructed from C and {F_a}_{a∈C}.

Theorem 1 ([4, 5]). Suppose A_2 ∈ A_c, A_1 ∈ A_{|A_2|}, C ∈ A_c and F_a ∈ A_{|a|} for each a ∈ C are related as above. Then the following describe equivalent distributions:

(I) A_2 ∼ CRP_c(αd_2, d_2) and A_1 | A_2 ∼ CRP_{|A_2|}(α, d_1).

(II) C ∼ CRP_c(αd_2, d_1 d_2) and F_a | C ∼ CRP_{|a|}(−d_1 d_2, d_2) for each a ∈ C.

Proof. We simply show that the joint distributions are the same. Starting with (I) and using (1),

    P(A_1, A_2) = ( ( [α + d_1]^{|A_1|−1}_{d_1} / [α + 1]^{|A_2|−1}_1 ) ∏_{a∈A_1} [1 − d_1]^{|a|−1}_1 ) ( ( [αd_2 + d_2]^{|A_2|−1}_{d_2} / [αd_2 + 1]^{c−1}_1 ) ∏_{b∈A_2} [1 − d_2]^{|b|−1}_1 )

    = ( [αd_2 + d_1 d_2]^{|A_1|−1}_{d_1 d_2} / [αd_2 + 1]^{c−1}_1 ) ∏_{a∈A_1} [d_2 − d_1 d_2]^{|a|−1}_{d_2} ∏_{b∈A_2} [1 − d_2]^{|b|−1}_1.

We used the identity [βδ + δ]^{n−1}_δ = δ^{n−1} [β + 1]^{n−1}_1 for all β, δ, n. Re-grouping the products and expressing the same quantities in terms of C and {F_a},

    = ( [αd_2 + d_1 d_2]^{|C|−1}_{d_1 d_2} / [αd_2 + 1]^{c−1}_1 ) ∏_{a∈C} ( [d_2 − d_1 d_2]^{|F_a|−1}_{d_2} ∏_{b∈F_a} [1 − d_2]^{|b|−1}_1 ) = P(C, {F_a}_{a∈C}).

We see that, conditioned on C, each F_a ∼ CRP_{|a|}(−d_1 d_2, d_2). Marginalizing out {F_a} using (1),

    P(C) = ( [αd_2 + d_1 d_2]^{|C|−1}_{d_1 d_2} / [αd_2 + 1]^{c−1}_1 ) ∏_{a∈C} [1 − d_1 d_2]^{|a|−1}_1.

So C ∼ CRP_c(αd_2, d_1 d_2) and (I)⇒(II). Reversing the same argument shows that (II)⇒(I).

Statement (I) of the theorem is exactly the Chinese restaurant franchise of the hierarchical model G_1 | G_0 ∼ PY(α, d_1, G_0), G_2 | G_1 ∼ PY(αd_2, d_2, G_1) with c iid draws from G_2. The theorem shows


that the clustering structure of the c customers in the franchise is equivalent to the seating arrangement in a CRP with parameters αd_2, d_1d_2, i.e. G_2 | G_0 ∼ PY(αd_2, d_1d_2, G_0) with G_1 marginalized out. Conversely, the fragmentation operation (II) regains Chinese restaurant representations for both G_2 | G_1 and G_1 | G_0 from one for G_2 | G_0.
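For small c, Theorem 1 can also be verified by exhaustive enumeration: sum the probability P(A_2)P(A_1 | A_2) from statement (I) over all pairs (A_2, A_1) whose coagulation equals a given C, and compare with the CRP_c(αd_2, d_1d_2) probability of C. A brute-force sketch (our own code; all names are ours):

```python
def partitions(elems):
    """Enumerate all set partitions of a list (blocks kept in sorted order)."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in partitions(rest):
        for i in range(len(part)):          # put `first` into an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part              # or into a new block

def kramp(y, n, delta):
    """Kramp's symbol [y]^n_delta = prod_{i=0}^{n-1} (y + i*delta)."""
    p = 1.0
    for i in range(n):
        p *= y + i * delta
    return p

def crp_prob(A, alpha, d):
    """Eq. (1): probability of seating arrangement A under CRP_c(alpha, d)."""
    c = sum(len(a) for a in A)
    p = kramp(alpha + d, len(A) - 1, d) / kramp(alpha + 1, c - 1, 1.0)
    for a in A:
        p *= kramp(1 - d, len(a) - 1, 1.0)
    return p

def coagulated_prob(C, c, alpha, d1, d2):
    """P(C) under (I): A2 ~ CRP_c(alpha*d2, d2), A1 | A2 ~ CRP(alpha, d1),
    summed over all (A2, A1) whose coagulation equals C."""
    canon = sorted(tuple(sorted(a)) for a in C)
    total = 0.0
    for A2 in partitions(list(range(c))):
        # customers of A1 correspond to the tables of A2 (by index)
        for A1 in partitions(list(range(len(A2)))):
            merged = sorted(tuple(sorted(x for i in b for x in A2[i])) for b in A1)
            if merged == canon:
                total += crp_prob(A2, alpha * d2, d2) * crp_prob(A1, alpha, d1)
    return total
```

Checking every partition C of four customers against the closed form CRP_c(αd_2, d_1d_2) confirms the theorem numerically.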

This result can be applied to marginalize out all but a linear number of PYPs from (6) [1]. The resulting model is still an HPYP of the same form as (6), except that it need only be defined over the prefixes of x_{1:T} as well as some subset of their ancestors. In the rest of this paper we refer to (6) and its Chinese restaurant franchise representation (8) with the understanding that we are operating in this reduced hierarchy. Let U denote the reduced set of contexts, and redefine σ(u) to be the parent of u in U. The concentration and discount parameters are modified accordingly.

5 Compact Representation

Current inference algorithms for the SM and hierarchical Pitman-Yor processes operate in the Chinese restaurant franchise representation, and use either Gibbs sampling [3, 11, 1] or particle filtering [2]. To lower memory requirements, instead of storing the precise seating arrangement of each restaurant, the algorithms store only the numbers of customers, the numbers of tables, and the sizes of all tables in the franchise. This is sufficient for sampling and for prediction. However, for large data sets the amount of memory required to store the table sizes can still be very large. We propose algorithms that store only the numbers of customers and tables, but not the table sizes. This compact representation needs to store only two integers (c_{us}, t_{us}) per context/symbol pair, as opposed to t_{us} integers.² These counts are already sufficient for prediction, as (9) does not depend on the table sizes. We will also consider a number of sampling algorithms in this representation.

Our starting point is the joint distribution over the Chinese restaurant franchise (8). Integrating out the seating arrangements {A_{us}} using (2) gives the joint distribution over {c_{us}, t_{us}}:

    P({c_{us}, t_{us}}, x_{1:T}) = ( ∏_{s∈Σ} H(s)^{t_{εs}} ) ∏_{u∈U} ( ( [α_u + d_u]^{t_{u·}−1}_{d_u} / [α_u + 1]^{c_{u·}−1}_1 ) ∏_{s∈Σ} S_{d_u}(c_{us}, t_{us}) ).    (10)

Note that each c_{us} is determined by (7), so the only unobserved variables in (10) are {t_{us}}. With this joint distribution we can now derive various sampling algorithms.
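For concreteness, (10) can be evaluated from the counts alone. A sketch (our own code; counts[u][s] = (c_us, t_us) with the root context the empty tuple; plain-domain Stirling numbers, which are adequate only for small counts):

```python
import math

def stirling(c, t, d):
    """S_d(c, t) by the recursion S_d(c, t) = S_d(c-1, t-1) + (c-1-t*d) S_d(c-1, t).
    Plain domain: fine for small counts; use logs for real data (Section 5.1)."""
    S = [[0.0] * (c + 1) for _ in range(c + 1)]
    S[1][1] = 1.0
    for n in range(2, c + 1):
        for k in range(1, n + 1):
            S[n][k] = S[n - 1][k - 1] + (n - 1 - k * d) * S[n - 1][k]
    return S[c][t]

def log_kramp(y, n, delta):
    """log of Kramp's symbol [y]^n_delta."""
    return sum(math.log(y + i * delta) for i in range(n))

def log_joint(counts, alpha, discount, H):
    """log of Eq. (10) given counts[u][s] = (c_us, t_us)."""
    lp = 0.0
    for s, (_, t) in counts.get((), {}).items():
        lp += t * math.log(H(s))          # prod_s H(s)^{t_eps_s}
    for u, tab in counts.items():
        a, d = alpha[u], discount[u]
        c_u = sum(c for c, _ in tab.values())
        t_u = sum(t for _, t in tab.values())
        lp += log_kramp(a + d, t_u - 1, d) - log_kramp(a + 1, c_u - 1, 1.0)
        for _, (c, t) in tab.items():
            lp += math.log(stirling(c, t, d))
    return lp
```

As a check, for a single root restaurant holding two customers of one symbol (α = d = 0.5, H uniform over two symbols), summing (10) over the two possible table counts t ∈ {1, 2} recovers the marginal probability 1/3 of observing that symbol twice.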

5.1 Sampling Algorithms

Direct Gibbs Sampling of {c_{us}, t_{us}}. It is straightforward to derive a Gibbs sampler from (10). Since each c_{us} is determined by c^x_{us} and the t_{vs} at the child restaurants v, it is sufficient to update each t_{us}, which for t_{us} in the range {1, ..., c_{us}} has conditional distribution

    P(t_{us} | rest) ∝ ( [α_u + d_u]^{t_{u·}−1}_{d_u} / [α_{σ(u)} + 1]^{c_{σ(u)·}−1}_1 ) S_{d_u}(c_{us}, t_{us}) S_{d_{σ(u)}}(c_{σ(u)s}, t_{σ(u)s}),    (11)

where t_{u·}, c_{σ(u)·} and c_{σ(u)s} all depend on t_{us} through the constraints (7). One problem with this sampler is that we need to compute S_{d_u}(c, t) for all 1 ≤ c, t ≤ c_{us}. If d_u is fixed, these can be precomputed and stored, but the resulting memory requirement is again large, since each restaurant typically has its own d_u value. If d_u is updated during sampling, then these need to be recomputed each time as well, costing O(c²_{us}) per iteration. Further, S_d(c, t) typically has a very high dynamic range, so care has to be taken to avoid numerical under-/overflow (e.g. by performing the computations in the log domain, involving many expensive log and exp computations).

Re-instantiating Seating Arrangements. Another strategy is to re-instantiate the seating arrangement by sampling A_{us} ∼ CRP_{c_{us} t_{us}}(d_u) from its conditional distribution given c_{us}, t_{us} (see Section 5.2 below), then performing the original Gibbs sampling of seating arrangements [3, 11]. This produces a new number of tables t_{us}, and the seating arrangement can then be discarded. Note however that when t_{us} changes, this sampler introduces changes to the ancestor restaurants (by adding or removing customers), so these need to have their seating arrangements instantiated as well. To implement this sampler efficiently, we visit restaurants in depth-first order, keeping in memory only the seating arrangements of the restaurants on the path to the current one. The computational cost is O(c_{us} t_{us}), but with a potentially smaller hidden constant (no log/exp computations are required).

²In both representations one may also want to store the total number of customers and tables in each restaurant for efficiency. In practice, where there is additional overhead due to the data structures involved, storage space for the full representation can be reduced by treating context/symbol pairs with only one customer separately.

Original Gibbs Sampling of {c_{us}, t_{us}}. A third strategy is to “imagine” having a seating arrangement and to run the original Gibbs sampler, incrementing t_{us} if a table would have been created, and decrementing t_{us} if a table would have been deleted. Recall that the original Gibbs sampler operates by iterating over customers, treating each as the last customer in the restaurant, removing it, then adding it back into the restaurant. When removing, if the customer were sitting by himself, a table would need to be deleted too, so the probability of decrementing t_{us} is the probability of a customer sitting by himself. From (2), this can be worked out to be

    P(decrement t_{us}) = S_{d_u}(c_{us} − 1, t_{us} − 1) / S_{d_u}(c_{us}, t_{us}).    (12)

The numerator is due to a sum over all seating arrangements in which the other c_{us} − 1 customers sit at the other t_{us} − 1 tables. When adding the customer back, the probability of incrementing the number of tables is the probability that the customer sits at a new table serving the same dish s:

    P(increment t_{us}) = (α_u + d_u t_{u·}) P*_{σ(u)}(s) / ( (α_u + d_u t_{u·}) P*_{σ(u)}(s) + c_{us} − t_{us} d_u ),    (13)

where P*_{σ(u)}(s) is the predictive probability (9) with the current value of t_{us}, and c_{us}, t_{us} are the values with the customer removed. This sampler also requires computation of S_{d_u}(c, t), but only for 1 ≤ t ≤ t_{us}, which can be significantly smaller than c_{us}. The computational cost is O(c_{us} t_{us}) (but again with a larger constant, due to computing the Stirling numbers in a stable way). We did not find a sampling method taking less time than O(c_{us} t_{us}).

Particle Filtering. Equation (13) gives the probability of incrementing t_{us} (and adding a customer to the parent restaurant) when a customer is added to a restaurant. This can be used as the basis for a particle filter, which iterates through the sequence x_{1:T}, adding a customer corresponding to s = x_i in context u = x_{1:i−1} at each step. Since no customer deletion is required, the cost is very small: just O(c_{us}) for the c_{us} customers per s and u (plus the cost of traversing the hierarchy to the current restaurant, which is always necessary). Particle filtering works very well in online settings, e.g. compression [2], and as initialization for Gibbs sampling.
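A toy version of the compact particle filter can be written in a few lines. The sketch below is our own simplification, not the implementation of [1, 2]: it keeps the full context tree rather than the collapsed suffix-tree hierarchy, uses a single fixed discount d at every level with α_u = α_{σ(u)} d_u as in Section 4 (so it assumes α_root > 0), and runs a single particle.

```python
import math
import random

def sm_particle_filter(seq, d=0.5, alpha_root=1.0, seed=0):
    """Single-particle filter in the compact (c_us, t_us) representation.
    Each symbol adds a customer at its full context; with probability (13)
    the customer opens a new table, which sends a customer to the parent."""
    rng = random.Random(seed)
    H = 1.0 / len(set(seq))            # uniform base distribution
    counts = {}                        # context tuple -> {symbol: (c_us, t_us)}

    def alpha_of(u):                   # alpha_u = alpha_{sigma(u)} * d_u
        return alpha_root * d ** len(u)

    def pred(u, s):                    # Eq. (9)
        if u is None:
            return H
        tab = counts.get(u, {})
        c_us, t_us = tab.get(s, (0, 0))
        c_u = sum(v[0] for v in tab.values())
        t_u = sum(v[1] for v in tab.values())
        a = alpha_of(u)
        p_up = pred(u[1:] if u else None, s)
        return (c_us - t_us * d) / (a + c_u) + (a + t_u * d) / (a + c_u) * p_up

    def add_customer(u, s):
        tab = counts.setdefault(u, {})
        c_us, t_us = tab.get(s, (0, 0))
        t_u = sum(v[1] for v in tab.values())
        parent = u[1:] if u else None
        p_new = (alpha_of(u) + d * t_u) * pred(parent, s)   # Eq. (13)
        p_old = c_us - t_us * d
        if rng.random() < p_new / (p_new + p_old):
            t_us += 1
            if parent is not None:     # a new table is a customer upstairs
                add_customer(parent, s)
        tab[s] = (c_us + 1, t_us)

    loglik = 0.0
    for i, s in enumerate(seq):
        u = tuple(seq[:i])             # context x_{1:i-1}
        loglik += math.log(pred(u, s))
        add_customer(u, s)
    return counts, loglik
```

On a short string one can verify the flow constraint (7) exactly: every customer count equals the number of observations in that precise context plus the tables passed up from its children.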

5.2 Re-instantiating A_{us} given c_{us}, t_{us}

To simplify notation, here we let d = d_u, c = c_{us}, t = t_{us} and A = A_{us} ∈ A_{ct}. We use the forward-backward algorithm in an undirected chain to sample A from CRP_{ct}(d) given in (2). First we re-express A using two sets of variables z_1, ..., z_c and y_1, ..., y_c. Label a table a ∈ A by the index of the first customer at the table, i.e. the smallest element of a. Let z_i be the number of tables occupied by the first i customers, and y_i the label of the table that customer i sits at. The variables satisfy the following constraints: z_1 = 1, z_c = t, and either z_i = z_{i−1}, in which case y_i ∈ [i − 1], or z_i = z_{i−1} + 1, in which case y_i = i. This gives a one-to-one correspondence between seating arrangements in A_{ct} and settings of the variables satisfying the above constraints. Consider the following distribution over the variables satisfying the constraints: z_1, ..., z_c is distributed according to a Markov network with z_1 = 1, z_c = t, and edge potentials

    f(z_i, z_{i−1}) = { i − 1 − z_i d   if z_i = z_{i−1};   1   if z_i = z_{i−1} + 1;   0   otherwise }.    (14)

It is easy to see that the normalization constant is simply S_d(c, t) and

    P(z_{1:c}) = ∏_{i: z_i = z_{i−1}} (i − 1 − z_i d) / S_d(c, t).    (15)

Given z_{1:c}, we give each y_i the following distribution conditioned on y_{1:i−1}:

    P(y_i | z_{1:c}, y_{1:i−1}) = { 1   if y_i = i and z_i = z_{i−1} + 1;   ( Σ_{j=1}^{i−1} 1(y_j = y_i) − d ) / (i − 1 − z_i d)   if z_i = z_{i−1} and y_i ∈ [i − 1] }.    (16)


[Figure 2]

Figure 2: (a), (b) Number of context/symbol pairs and total number of tables (counted after particle filter initialization and 10 sampling iterations using the compact original sampler) as a function of input size. Subfigure (a) shows the counts obtained from a byte-level model of the news file in the Calgary corpus, whereas (b) shows the counts for a word-level model of the Brown corpus (training set). The space required for the compact representation is proportional to the number of context/symbol pairs, whereas for the full representation it is proportional to the number of tables. Note also that sampling tends to increase the number of tables over the particle filter initialization. (c) Time per iteration (seconds) as a function of input size for the original Gibbs sampler in the compact representation and the re-instantiating sampler (on the Brown corpus).

Multiplying all the probabilities together, we see that P(z_{1:c}, y_{1:c}) is exactly equal to P(A) in (2). Thus we can sample A by first sampling z_{1:c} from (15), then sampling each y_i conditioned on the previous ones using (16), and finally converting this representation into A. We use a backward-filtering, forward-sampling algorithm to sample z_{1:c}, as this avoids the numerical underflow problems that can arise with forward filtering: backward filtering incorporates the constraint that z_c must equal t into the messages from the beginning.
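The two-stage construction above can be sketched directly (our own code, not the original implementation; the backward messages are row-normalized to keep them in range, which leaves the sampling probabilities unchanged):

```python
import random

def sample_arrangement(c, t, d, seed=0):
    """Sample A ~ CRP_ct(d) (Eq. 2) by backward filtering / forward sampling
    of the table-count chain z_1..z_c (Eqs. 14-15), then the labels y (Eq. 16)."""
    rng = random.Random(seed)
    # backward messages b[i][z]: weight of completing z_i = z to z_c = t
    b = [[0.0] * (t + 2) for _ in range(c + 1)]
    b[c][t] = 1.0
    for i in range(c - 1, 0, -1):
        for z in range(1, min(t, i) + 1):                 # z_i <= i always
            stay = (i - z * d) * b[i + 1][z]              # f(z_{i+1}=z, z_i=z)
            up = b[i + 1][z + 1] if z + 1 <= t else 0.0   # f = 1 for a new table
            b[i][z] = stay + up
        norm = sum(b[i][1:t + 1])                          # avoid underflow
        b[i] = [w / norm for w in b[i]]
    # forward sampling of z_2..z_c given z_1 = 1, z_c = t
    z = [0, 1]
    for i in range(2, c + 1):
        zp = z[i - 1]
        stay = (i - 1 - zp * d) * b[i][zp]
        up = b[i][zp + 1] if zp + 1 <= t else 0.0
        z.append(zp if rng.random() < stay / (stay + up) else zp + 1)
    # forward sampling of y: tables labelled by their first customer
    tables = {}
    for i in range(1, c + 1):
        if i == 1 or z[i] == z[i - 1] + 1:
            tables[i] = [i]                                # new table, y_i = i
        else:  # join a table with probability proportional to (size - d)
            labels = list(tables)
            weights = [len(tables[l]) - d for l in labels]
            r = rng.random() * sum(weights)
            for l, w in zip(labels, weights):
                r -= w
                if r <= 0:
                    tables[l].append(i)
                    break
    return list(tables.values())
```

By construction the result always has exactly c customers at exactly t tables; for c = 2, t = 1 the output is forced to be a single shared table.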

Fragmenting a Restaurant. In particle filtering and in prediction, we often need to re-instantiate a restaurant that was previously marginalized out. We can do so by sampling A_{us} given c_{us}, t_{us} for each s, fragmenting each A_{us} using Theorem 1, counting the resulting numbers of customers and tables, and then forgetting the seating arrangements.

6 Experiments

In order to evaluate the proposed improvements in terms of reduced memory requirements and to

compare the performance of the different sampling schemes, we performed three sets of experiments.³

In the first experiment we evaluated the potential space saving due to the compact representation.

Figure 2 shows the number of context/symbol pairs and the total number of tables as a function

of data set size. While the difference does not seem dramatic, there is still a significant amount of

memory that can be saved by using the compact representation, as there is no additional overhead and

memory fragmentation due to variable-size arrays. The comparison between the byte-level model and

the word-level model in Figure 2 also demonstrates that the compact representation saves more space

when |Σ| is small (which leads to context/symbol pairs having larger c_{us}'s and t_{us}'s). Finally, Figure

2 illustrates another interesting effect: the number of tables is generally larger after a few iterations of

Gibbs sampling have been performed after the initialization using a single-particle particle filter [2].

The second experiment compares the computational cost of the compact original sampler and

the sampler that re-instantiates full seating arrangements. The main computational cost of the

original sampler is computing the ratio (12), while sampling the seating arrangements is the main

computational cost of the re-instantiating sampler. Figure 2(c) shows the time needed for one iteration

of Gibbs sampling as a function of data set size. The re-instantiating sampler is found to be much

more efficient, as it avoids the overhead involved in computing the Stirling numbers in a stable

manner (e.g. log/exp computations). For the original sampler, time can be traded off with space

³All experiments were performed on two data sets: the news file from the Calgary corpus (modeled as a

sequence of 377,109 bytes; |Σ| = 256), and the Brown corpus (preprocessed as in [12]), modeled as a sequence

of words (800,000 words training set; 181,041 words test set; |Σ| = 16383). Following [1], the discount

parameters were fixed to .62,.69,.74,.80 for the first 4 levels and .95 for all subsequent levels of the hierarchy.


    α     Particle Filter only     Gibbs (1 sample)       Gibbs (50 samples averaged)    Online
          Fragment    Parent       Fragment    Parent     Fragment    Parent             PF      Gibbs
    0     8.45        8.41         8.44        8.41       8.43        8.39               8.04    8.04
    1     8.41        8.39         8.40        8.39       8.39        8.38               8.01    8.01
    3     8.37        8.37         8.37        8.37       8.35        8.35               7.98    7.98
    10    8.33        8.34         8.33        8.33       8.32        8.32               7.95    7.94
    20    8.32        8.33         8.32        8.32       8.31        8.31               7.94    7.94
    50    8.32        8.33         8.31        8.32       8.31        8.31               7.95    7.95

Table 1: Average log-loss on the Brown corpus (test set) for different values of α, different inference strategies, and different modes of prediction. Inference is performed by either just using the particle filter or using the particle filter followed by 50 burn-in iterations of Gibbs sampling. Subsequently, either 1 or 50 samples are collected for prediction. Prediction is performed either using fragmentation or by predicting from the parent node. The final two columns, labelled Online, show the results obtained by using the particle filter on the test set as well, after training with either just the particle filter or the particle filter followed by 50 Gibbs iterations. Non-zero values of α can be seen to provide a significant increase in performance, while the gains due to averaging samples or proper fragmentation during prediction are small.

by tabulating all required Stirling numbers along the path down the tree (as was done in these

experiments). However, this leads to an additional memory overhead that mostly undoes any savings

from the compact representation.

The third set of experiments uses the re-instantiating sampler and compares different modes of

prediction and the effect of the non-zero concentration parameter. The results are shown in Table 1.

Predictions with the SM can be made in several different ways. After obtaining one or more samples

from the posterior distribution over customers and tables (either using particle filtering or Gibbs

sampling on the training set) one has a choice of either using particle filtering on the test set as well

(online setting), or making predictions while keeping the model fixed. One also has a choice when

making predictions involving contexts that were marginalized out from the model: one can either

re-instantiate these contexts by fragmentation or simply predict from the parent (or even the child) of

the required node. While one ultimately wants to average predictions over the posterior distribution,

one may consider using just a single sample for computational reasons.

7 Discussion

In this paper we proposed an enlarged set of hyperparameters for the sequence memoizer that retains the coagulation/fragmentation properties important for efficient inference, and we proposed a

new minimal representation of the Chinese restaurant processes to reduce the memory requirement

of the sequence memoizer. We developed novel inference algorithms for the new representation,

and presented experimental results exploring their behaviors. We found that the algorithm which

re-instantiates seating arrangements is significantly more efficient than the other two Gibbs samplers, while particle filtering is most efficient but produces slightly worse predictions. Along the

way, we formalized the metaphorical language often used to describe Chinese restaurant processes

in the machine learning literature, and were able to provide an elementary proof of the coagulation/fragmentation properties. We believe this more precise language will be of use to researchers interested in hierarchical Dirichlet processes and their various generalizations.

We are currently exploring methods to compute or approximate the generalized Stirling numbers, and

efficient methods to optimize the hyperparameters in the sequence memoizer. A parting remark is

that the posterior distribution over {cus,tus} in (10) is in the form of a standard Markov network

with sum constraints (7). Thus other inference algorithms like loopy belief propagation or variational

inference can potentially be applied. There are however two difficulties to be resolved before these

are possible: the large domains of the variables, and the large dynamic ranges of the factors.

Acknowledgments

We would like to thank the Gatsby Charitable Foundation for generous funding.


References

[1] F. Wood, C. Archambeau, J. Gasthaus, L. F. James, and Y. W. Teh. A stochastic memoizer for

sequence data. In Proceedings of the International Conference on Machine Learning, volume 26,

pages 1129–1136, 2009.

[2] J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer.

In James A. Storer and Michael W. Marcellin, editors, Data Compression Conference, pages

337–345, Los Alamitos, CA, USA, 2010. IEEE Computer Society.

[3] Y. W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06,

School of Computing, National University of Singapore, 2006.

[4] J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.

[5] M. W. Ho, L. F. James, and J. W. Lau. Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. http://arxiv.org/abs/math.PR/0601608,

2006.

[6] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal

of the American Statistical Association, 101(476):1566–1581, 2006.

[7] P. Blunsom, T. Cohn, S. Goldwater, and M. Johnson. A note on the implementation of

hierarchical Dirichlet processes. In Proceedings of the ACL-IJCNLP 2009 Conference Short

Papers, pages 337–340, Suntec, Singapore, August 2009. Association for Computational

Linguistics.

[8] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable

subordinator. Annals of Probability, 25:855–900, 1997.

[9] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the

American Statistical Association, 96(453):161–173, 2001.

[10] L. C. Hsu and P. J.-S. Shiue. A unified approach to generalized Stirling numbers. Advances in

Applied Mathematics, 20:366–384, 1998.

[11] Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In

Proceedings of the 21st International Conference on Computational Linguistics and 44th

Annual Meeting of the Association for Computational Linguistics, pages 985–992, 2006.

[12] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model.

Journal of Machine Learning Research, 3:1137–1155, 2003.
