
Indian Buffet Processes with Power-law Behavior

Yee Whye Teh and Dilan Görür

Gatsby Computational Neuroscience Unit, UCL

17 Queen Square, London WC1N 3AR, United Kingdom

{ywteh,dilan}@gatsby.ucl.ac.uk

Abstract

The Indian buffet process (IBP) is an exchangeable distribution over binary ma-

trices used in Bayesian nonparametric featural models. In this paper we propose

a three-parameter generalization of the IBP exhibiting power-law behavior. We

achieve this by generalizing the beta process (the de Finetti measure of the IBP) to

the stable-beta process and deriving the IBP corresponding to it. We find interest-

ing relationships between the stable-beta process and the Pitman-Yor process (an-

other stochastic process used in Bayesian nonparametric models with interesting

power-law properties). We derive a stick-breaking construction for the stable-beta

process, and find that our power-law IBP is a good model for word occurrences in

document corpora.

1 Introduction

The Indian buffet process (IBP) is an infinitely exchangeable distribution over binary matrices with

a finite number of rows and an unbounded number of columns [1, 2]. It has been proposed as a

suitable prior for Bayesian nonparametric featural models, where each object (row) is modeled with

a potentially unbounded number of features (columns). Applications of the IBP include Bayesian

nonparametric models for ICA [3], choice modeling [4], similarity judgements modeling [5], dyadic

data modeling [6] and causal inference [7].

In this paper we propose a three-parameter generalization of the IBP with power-law behavior. Using

the usual analogy of customers entering an Indian buffet restaurant and sequentially choosing dishes

from an infinitely long buffet counter, our generalization with parameters α > 0, c > −σ and

σ ∈ [0,1) is simply as follows:

• Customer 1 tries Poisson(α) dishes.

• Subsequently, customer n + 1:

– tries dish k with probability $\frac{m_k - \sigma}{n + c}$, for each dish that has previously been tried;

– tries Poisson$\left(\alpha \frac{\Gamma(1+c)\,\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\,\Gamma(c+\sigma)}\right)$ new dishes.

where $m_k$ is the number of previous customers who tried dish k. The dishes and the customers

correspond to the columns and the rows of the binary matrix respectively, with an entry of the matrix

being one if the corresponding customer tried the dish (and zero otherwise). The mass parameter α

controls the total number of dishes tried by the customers, the concentration parameter c controls

the number of customers that will try each dish, and the stability exponent σ controls the power-law

behavior of the process. When σ = 0 the process does not exhibit power-law behavior and reduces

to the usual two-parameter IBP [2].
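The generative description above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`new_dish_rate`, `sample_poisson`, `sample_power_law_ibp`) are hypothetical, only the standard library is used, and Poisson variates are drawn by counting unit-rate exponential arrivals, which is simple but slow for large rates.

```python
import math
import random


def new_dish_rate(alpha, c, sigma, n):
    """Poisson rate of new dishes for customer n + 1 (n customers already served)."""
    log_ratio = (math.lgamma(1 + c) + math.lgamma(n + c + sigma)
                 - math.lgamma(n + 1 + c) - math.lgamma(c + sigma))
    return alpha * math.exp(log_ratio)


def sample_poisson(rate, rng):
    """Count the unit-rate exponential arrivals that land in [0, rate)."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < rate:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def sample_power_law_ibp(num_customers, alpha, c, sigma, seed=0):
    """Return (rows, counts): rows[i] is the set of dishes tried by customer i + 1,
    counts[k] is m_k, the number of customers who tried dish k."""
    rng = random.Random(seed)
    counts, rows = [], []
    for n in range(num_customers):  # n customers have already been served
        # Previously tried dishes: dish k is chosen with probability (m_k - sigma) / (n + c).
        tried = {k for k, m_k in enumerate(counts)
                 if rng.random() < (m_k - sigma) / (n + c)}
        num_new = sample_poisson(new_dish_rate(alpha, c, sigma, n), rng)
        for k in tried:
            counts[k] += 1
        # New dishes get fresh indices and start with a count of one.
        for _ in range(num_new):
            tried.add(len(counts))
            counts.append(1)
        rows.append(tried)
    return rows, counts


rows, counts = sample_power_law_ibp(num_customers=200, alpha=5.0, c=1.0, sigma=0.5)
print(len(counts), "dishes in total;", len(rows[-1]), "tried by the last customer")
```

For n = 0 the rate reduces to α, so the first customer tries Poisson(α) dishes, and for σ = 0, c = 1 it reduces to α/(n + 1), the familiar one-parameter IBP rate.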

Many naturally occurring phenomena exhibit power-law behavior, and it has been argued that using

models that can capture this behavior can improve learning [8]. Recent examples where this has led

to significant improvements include unsupervised morphology learning [8], language modeling [9]


and image segmentation [10]. These examples are all based on the Pitman-Yor process [11, 12, 13],

a generalization of the Dirichlet process [14] with power-law properties. Our generalization of the

IBP extends the ability to model power-law behavior to featural models, and we expect it to lead to

a wealth of novel applications not previously well handled by the IBP.

The approach we take in this paper is to first define the underlying de Finetti measure, then to derive

the conditional distributions of Bernoulli process observations with the de Finetti measure integrated

out. This automatically ensures that the resulting power-law IBP is infinitely exchangeable. We call

the de Finetti measure of the power-law IBP the stable-beta process. It is a novel generalization of

the beta process [15] (which is the de Finetti measure of the normal two-parameter IBP [16]) with

characteristics reminiscent of the stable process [17, 11] (in turn related to the Pitman-Yor process).

We will see that the stable-beta process has a number of properties similar to the Pitman-Yor process.

In the following section we first give a brief description of completely random measures, a class of

random measures which includes the stable-beta and the beta processes. In Section 3 we introduce

the stable-beta process, a three parameter generalization of the beta process and derive the power-

law IBP based on the stable-beta process. Based on the proposed model, in Section 4 we construct

a model of word occurrences in a document corpus. We conclude with a discussion in Section 5.

2 Completely Random Measures

In this section we give a brief description of completely random measures [18]. Let Θ be a measure

space with Ω its σ-algebra. A random variable whose values are measures on (Θ,Ω) is referred

to as a random measure. A completely random measure (CRM) µ over (Θ,Ω) is a random measure
such that µ(A) ⊥⊥ µ(B) for all disjoint measurable subsets A,B ∈ Ω. That is, the (random)

masses assigned to disjoint subsets are independent. An important implication of this property is

that the whole distribution over µ is determined (with usually satisfied technical assumptions) once

the distributions of µ(A) are given for all A ∈ Ω.

CRMs can always be decomposed into a sum of three independent parts: a (non-random) measure,

an atomic measure with fixed atoms but random masses, and an atomic measure with random atoms

and masses. CRMs in this paper will only contain the second and third components. In this case we

can write µ in the form,

$$\mu = \sum_{k=1}^{N} u_k \delta_{\phi_k} + \sum_{l=1}^{M} v_l \delta_{\psi_l}, \tag{1}$$

where $u_k, v_l > 0$ are the random masses, $\phi_k \in \Theta$ are the fixed atoms, $\psi_l \in \Theta$ are the random atoms, and $N, M \in \mathbb{N} \cup \{\infty\}$. To describe µ fully it is sufficient to specify N and $\{\phi_k\}$, and to describe the joint distribution over the random variables $\{u_k\}$, $\{v_l\}$, $\{\psi_l\}$ and M. Each $u_k$ has to be independent from everything else and has some distribution $F_k$. The random atoms and their weights $\{v_l, \psi_l\}$ are jointly drawn from a 2D Poisson process over (0,∞] × Θ with some nonatomic rate measure Λ called the Lévy measure. The rate measure Λ has to satisfy a number of technical properties; see [18, 19] for details. If $\int_\Theta \int_{(0,\infty]} \Lambda(du \times d\theta) = M^* < \infty$ then the number of random atoms M in µ is Poisson distributed with mean $M^*$, otherwise there are an infinite number of random atoms. If µ is described by Λ and $\{\phi_k, F_k\}_{k=1}^{N}$ as above, we write,

$$\mu \sim \mathrm{CRM}(\Lambda, \{\phi_k, F_k\}_{k=1}^{N}). \tag{2}$$

3 The Stable-beta Process

In this section we introduce a novel CRM called the stable-beta process (SBP). It has no fixed atoms while its Lévy measure is defined over (0,1) × Θ:

$$\Lambda_0(du \times d\theta) = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)}\, u^{-\sigma-1} (1-u)^{c+\sigma-1}\, du\, H(d\theta) \tag{3}$$

where the parameters are: a mass parameter α > 0, a concentration parameter c > −σ, a stability exponent 0 ≤ σ < 1, and a smooth base distribution H. The mass parameter controls the overall mass of the process and the base distribution gives the distribution over the random atom locations.


The mean of the SBP can be shown to be E[µ(A)] = αH(A) for each A ∈ Ω, while $\mathrm{var}(\mu(A)) = \alpha \frac{1-\sigma}{1+c} H(A)$. Thus the concentration parameter and the stability exponent both affect the variability of the SBP around its mean. The stability exponent also governs the power-law behavior of the SBP. When σ = 0 the SBP does not have power-law behavior and reduces to a normal two-parameter beta process [15, 16]. When c = 1 − σ the stable-beta process describes the random atoms with masses < 1 in a stable process [17, 11]. The SBP is so named as it can be seen as a generalization of both the stable and the beta processes. Both the concentration parameter and the stability exponent can be generalized to functions over Θ though we will not deal with this generalization here.
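The two moments quoted for the SBP can be checked directly. The sketch below assumes Campbell's theorem for Poisson processes, under which the mean and variance of µ(A) are the integrals of u and u² against the Lévy measure (3); both integrals are beta integrals.

```latex
\begin{align*}
\mathbb{E}[\mu(A)]
  &= \int_A \int_0^1 u \,\Lambda_0(du \times d\theta)
   = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)}
     \cdot \frac{\Gamma(1-\sigma)\Gamma(c+\sigma)}{\Gamma(1+c)} \, H(A)
   = \alpha H(A), \\
\operatorname{var}(\mu(A))
  &= \int_A \int_0^1 u^2 \,\Lambda_0(du \times d\theta)
   = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)}
     \cdot \frac{\Gamma(2-\sigma)\Gamma(c+\sigma)}{\Gamma(2+c)} \, H(A)
   = \alpha \frac{1-\sigma}{1+c} \, H(A).
\end{align*}
```

The last step uses Γ(2 − σ) = (1 − σ)Γ(1 − σ) and Γ(2 + c) = (1 + c)Γ(1 + c).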

3.1 Posterior Stable-beta Process

Consider the following hierarchical model:

$$\mu \sim \mathrm{CRM}(\Lambda_0, \{\}), \qquad Z_i \mid \mu \sim \mathrm{BernoulliP}(\mu) \ \text{iid, for } i = 1, \ldots, n. \tag{4}$$

The random measure µ is an SBP with no fixed atoms and with Lévy measure (3), while $Z_i \sim \mathrm{BernoulliP}(\mu)$ is a Bernoulli process with mean µ [16]. This is also a CRM: in a small neighborhood dθ around θ ∈ Θ it has a probability µ(dθ) of having a unit mass atom in dθ; otherwise it does not have an atom in dθ. If µ has an atom at θ the probability of $Z_i$ having an atom at θ as well is µ({θ}). If µ has a smooth component, say $\mu_0$, $Z_i$ will have random atoms drawn from a Poisson process with rate measure $\mu_0$. In typical applications to featural models the atoms in $Z_i$ give the features associated with data item i, while the weights of the atoms in µ give the prior probabilities of the corresponding features occurring in a data item.

We are interested in both the posterior of µ given $Z_1, \ldots, Z_n$, as well as the conditional distribution of $Z_{n+1} \mid Z_1, \ldots, Z_n$ with µ marginalized out. Let $\theta^*_1, \ldots, \theta^*_K$ be the K unique atoms among $Z_1, \ldots, Z_n$ with atom $\theta^*_k$ occurring $m_k$ times. Theorem 3.3 of [20] shows that the posterior of µ given $Z_1, \ldots, Z_n$ is still a CRM, but now including fixed atoms given by $\theta^*_1, \ldots, \theta^*_K$. Its updated Lévy measure and the distribution of the mass at each fixed atom $\theta^*_k$ are,

$$\mu \mid Z_1, \ldots, Z_n \sim \mathrm{CRM}(\Lambda_n, \{\theta^*_k, F_{nk}\}_{k=1}^{K}), \tag{5}$$

where

$$\Lambda_n(du \times d\theta) = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)}\, u^{-\sigma-1} (1-u)^{n+c+\sigma-1}\, du\, H(d\theta), \tag{6a}$$

$$F_{nk}(du) = \frac{\Gamma(n+c)}{\Gamma(m_k-\sigma)\,\Gamma(n-m_k+c+\sigma)}\, u^{m_k-\sigma-1} (1-u)^{n-m_k+c+\sigma-1}\, du. \tag{6b}$$

Intuitively, the posterior is obtained as follows. Firstly, the posterior of µ must be a CRM since both the prior of µ and the likelihood of each $Z_i \mid \mu$ factorize over disjoint subsets of Θ. Secondly, µ must have fixed atoms at each $\theta^*_k$ since otherwise the probability that there will be atoms among $Z_1, \ldots, Z_n$ at precisely $\theta^*_k$ is zero. The posterior mass at $\theta^*_k$ is obtained by multiplying a Bernoulli “likelihood” $u^{m_k}(1-u)^{n-m_k}$ (since there are $m_k$ occurrences of the atom $\theta^*_k$ among $Z_1, \ldots, Z_n$) to the “prior” $\Lambda_0(du \times d\theta^*_k)$ in (3) and normalizing, giving us (6b). Finally, outside of these K atoms there are no other atoms among $Z_1, \ldots, Z_n$. We can think of this as n observations of 0 among n iid Bernoulli variables, so a “likelihood” of $(1-u)^n$ is multiplied into $\Lambda_0$ (without normalization), giving the updated Lévy measure in (6a).

Let us inspect the distributions (6) of the fixed and random atoms in the posterior µ in turn. The random mass at $\theta^*_k$ has a distribution $F_{nk}$ which is simply a beta distribution with parameters $(m_k - \sigma,\ n - m_k + c + \sigma)$. This differs from the usual beta process in the subtraction of σ from $m_k$ and addition of σ to $n - m_k + c$. This is reminiscent of the Pitman-Yor generalization to the Dirichlet process [11, 12, 13], where a discount parameter is subtracted from the number of customers seated around each table, and added to the chance of sitting at a new table. On the other hand, the Lévy measure of the random atoms of µ is still a Lévy measure corresponding to an SBP with updated parameters

$$\alpha' \leftarrow \alpha \frac{\Gamma(1+c)\,\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\,\Gamma(c+\sigma)}, \qquad \sigma' \leftarrow \sigma, \qquad c' \leftarrow c + n, \qquad H' \leftarrow H. \tag{7}$$


Note that the update depends only on n, not on Z1,...,Zn. In summary, the posterior of µ is simply

an independent sum of an SBP with updated parameters and of fixed atoms with beta distributed

masses. Observe that the posterior µ is not itself an SBP. In other words, the SBP is not conjugate

to Bernoulli process observations. This is different from the beta process and again reminiscent

of Pitman-Yor processes, where the posterior is also a sum of a Pitman-Yor process with updated

parameters and fixed atoms with random masses, but not a Pitman-Yor process [11]. Fortunately,

the non-conjugacy of the SBP does not preclude efficient inference. In the next subsections we de-

scribe an Indian buffet process and a stick-breaking construction corresponding to the SBP. Efficient

inference techniques based on both representations for the beta process can be straightforwardly

generalized to the SBP [1, 16, 21].
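As a small numerical sanity check of the posterior update, the sketch below computes the updated SBP parameters (7) and the beta parameters of a fixed atom (6b), then compares the Lévy density with updated parameters against (1 − u)ⁿ times the prior density, as (6a) requires. The function names are hypothetical and the H(dθ) factor is left out, since it is unchanged by the update.

```python
import math


def sbp_levy_density(u, alpha, c, sigma):
    """Density in u of the SBP Levy measure (3), with the H(d theta) factor left out."""
    log_const = math.lgamma(1 + c) - math.lgamma(1 - sigma) - math.lgamma(c + sigma)
    return alpha * math.exp(log_const) * u ** (-sigma - 1) * (1 - u) ** (c + sigma - 1)


def posterior_sbp_parameters(alpha, c, sigma, n):
    """Updated (alpha', c', sigma') of the random-atom part after n observations, as in (7)."""
    log_ratio = (math.lgamma(1 + c) + math.lgamma(n + c + sigma)
                 - math.lgamma(n + 1 + c) - math.lgamma(c + sigma))
    return alpha * math.exp(log_ratio), c + n, sigma


def fixed_atom_beta_parameters(m_k, n, c, sigma):
    """Beta parameters of F_nk in (6b) for an atom observed m_k times out of n."""
    return m_k - sigma, n - m_k + c + sigma


alpha, c, sigma, n = 2.0, 1.0, 0.5, 10
alpha_post, c_post, sigma_post = posterior_sbp_parameters(alpha, c, sigma, n)

# Both sides evaluate the density of Lambda_n at the same point u.
u = 0.3
lhs = sbp_levy_density(u, alpha_post, c_post, sigma_post)
rhs = (1 - u) ** n * sbp_levy_density(u, alpha, c, sigma)
print(lhs, rhs)

a, b = fixed_atom_beta_parameters(m_k=4, n=n, c=c, sigma=sigma)
print("posterior mean mass:", a / (a + b))  # algebraically (m_k - sigma) / (n + c)
```

The posterior mean mass a / (a + b) is the probability with which the next observation includes that atom, which is what the Indian buffet process in the next subsection uses.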

3.2 The Stable-beta Indian Buffet Process

We can derive an Indian buffet process (IBP) corresponding to the SBP by deriving, for each n, the distribution of $Z_{n+1}$ conditioned on $Z_1, \ldots, Z_n$, with µ marginalized out. This derivation is straightforward and follows closely that for the beta process [16]. For each of the atoms $\theta^*_k$ the posterior of $\mu(\theta^*_k)$ given $Z_1, \ldots, Z_n$ is beta distributed with mean $\frac{m_k - \sigma}{n + c}$. Thus

$$p(Z_{n+1}(\theta^*_k) = 1 \mid Z_1, \ldots, Z_n) = E[\mu(\theta^*_k) \mid Z_1, \ldots, Z_n] = \frac{m_k - \sigma}{n + c} \tag{8}$$

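Metaphorically speaking, customer n + 1 tries dish k with probability $\frac{m_k - \sigma}{n + c}$. Now for the random atoms. Let $\theta \in \Theta \setminus \{\theta^*_1, \ldots, \theta^*_K\}$. In a small neighborhood dθ around θ, we have:

$$
\begin{aligned}
p(Z_{n+1}(d\theta) = 1 \mid Z_1, \ldots, Z_n) &= E[\mu(d\theta) \mid Z_1, \ldots, Z_n] = \int_0^1 u\, \Lambda_n(du \times d\theta) \\
&= \int_0^1 u\, \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)}\, u^{-1-\sigma} (1-u)^{n+c+\sigma-1}\, du\, H(d\theta) \\
&= \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)}\, H(d\theta) \int_0^1 u^{-\sigma} (1-u)^{n+c+\sigma-1}\, du \\
&= \alpha \frac{\Gamma(1+c)\,\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\,\Gamma(c+\sigma)}\, H(d\theta)
\end{aligned}
\tag{9}
$$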

Since $Z_{n+1}$ is completely random and H is smooth, the above shows that on $\Theta \setminus \{\theta^*_1, \ldots, \theta^*_K\}$, $Z_{n+1}$ is simply a Poisson process with rate measure $\alpha \frac{\Gamma(1+c)\,\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\,\Gamma(c+\sigma)} H$. In particular, it will have Poisson$\left(\alpha \frac{\Gamma(1+c)\,\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\,\Gamma(c+\sigma)}\right)$ new atoms, each independently and identically distributed according to H. In the IBP metaphor, this corresponds to customer n + 1 trying new dishes, with each dish associated with a new draw from H. The resulting Indian buffet process is as described in the introduction. It is automatically infinitely exchangeable since it was derived from the conditional distributions of the hierarchical model (4).

Multiplying the conditional probabilities of each $Z_n$ given previous ones together, we get the joint probability of $Z_1, \ldots, Z_n$ with µ marginalized out:

$$p(Z_1, \ldots, Z_n) = \exp\left(-\alpha \sum_{i=1}^{n} \frac{\Gamma(1+c)\,\Gamma(i-1+c+\sigma)}{\Gamma(i+c)\,\Gamma(c+\sigma)}\right) \prod_{k=1}^{K} \frac{\Gamma(m_k-\sigma)\,\Gamma(n-m_k+c+\sigma)\,\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)\,\Gamma(n+c)}\, \alpha\, h(\theta^*_k), \tag{10}$$

where there are K atoms (dishes) $\theta^*_1, \ldots, \theta^*_K$ among $Z_1, \ldots, Z_n$ with atom k appearing $m_k$ times, and h is the density of H. (10) is to be contrasted with (4) in [1]. The $K_h!$ terms in [1] are absent as we have to distinguish among these $K_h$ dishes in assigning each of them a distinct atom (this also contributes the $h(\theta^*_k)$ terms). The fact that (10) is invariant to permuting the ordering among $Z_1, \ldots, Z_n$ also indicates the infinite exchangeability of the stable-beta IBP.
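Because (10) depends on the data only through n and the dish counts, it is easy to evaluate in log space. The following is a minimal sketch; the function name `log_joint_probability` is hypothetical, and the base-density factors h(θ*_k) are deliberately left out since they do not involve α, c or σ.

```python
import math


def log_joint_probability(counts, n, alpha, c, sigma):
    """Log of (10) with the base-density factors h(theta*_k) left out.

    counts[k] is m_k, the number of the n customers who tried dish k.
    """
    # Exponent of (10): alpha times the sum over customers of their new-dish rates.
    log_norm = math.lgamma(1 + c) - math.lgamma(c + sigma)
    rate_sum = sum(
        math.exp(log_norm + math.lgamma(i - 1 + c + sigma) - math.lgamma(i + c))
        for i in range(1, n + 1))
    # Factor shared by every dish in the product of (10).
    per_dish_const = (math.log(alpha) + math.lgamma(1 + c) - math.lgamma(1 - sigma)
                      - math.lgamma(c + sigma) - math.lgamma(n + c))
    log_dishes = sum(
        per_dish_const + math.lgamma(m_k - sigma) + math.lgamma(n - m_k + c + sigma)
        for m_k in counts)
    return -alpha * rate_sum + log_dishes


# With a single customer every dish factor reduces to alpha,
# so the log probability is -alpha + K * log(alpha).
print(log_joint_probability([1, 1, 1], n=1, alpha=2.0, c=1.0, sigma=0.5))
print(-2.0 + 3 * math.log(2.0))
```

A function of this kind is what a maximum likelihood fit of α, c and σ would maximize, since the omitted h terms are constant in those parameters.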

3.3 Stick-breaking constructions

In this section we describe stick-breaking constructions for the SBP generalizing those for the beta

process. The first is based on the size-biased ordering of atoms induced by the IBP [16], while


the second is based on the inverse Lévy measure method [22], and produces a sequence of random

atoms of strictly decreasing masses [21].

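The size-biased construction is straightforward: we use the IBP to generate the atoms (dishes) in the SBP; each time a dish is newly generated the atom is drawn from H and its mass from $F_{nk}$. This leads to the following procedure:

$$
\begin{aligned}
&\text{for } n = 1, 2, \ldots: && J_n \sim \mathrm{Poisson}\left(\alpha \frac{\Gamma(1+c)\,\Gamma(n-1+c+\sigma)}{\Gamma(n+c)\,\Gamma(c+\sigma)}\right), \\
&\quad \text{for } k = 1, \ldots, J_n: && v_{nk} \sim \mathrm{Beta}(1-\sigma,\ n-1+c+\sigma), \qquad \psi_{nk} \sim H,
\end{aligned}
\tag{11}
$$

$$\mu = \sum_{n=1}^{\infty} \sum_{k=1}^{J_n} v_{nk}\, \delta_{\psi_{nk}}.$$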
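The size-biased procedure can be sketched as follows. This is an illustration under explicit assumptions rather than a reference implementation: the infinite sum over n is truncated at a hypothetical `num_rounds`, so atoms that would first appear in later rounds are missing, H is taken to be Uniform[0, 1) purely for concreteness, and the Poisson sampler counts unit-rate exponential arrivals.

```python
import math
import random


def sample_poisson(rate, rng):
    """Count the unit-rate exponential arrivals that land in [0, rate)."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < rate:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def size_biased_sbp(num_rounds, alpha, c, sigma, seed=0):
    """Atoms generated in rounds n = 1..num_rounds of (11), as (mass, location) pairs."""
    rng = random.Random(seed)
    atoms = []
    for n in range(1, num_rounds + 1):
        log_ratio = (math.lgamma(1 + c) + math.lgamma(n - 1 + c + sigma)
                     - math.lgamma(n + c) - math.lgamma(c + sigma))
        num_atoms = sample_poisson(alpha * math.exp(log_ratio), rng)  # J_n
        for _ in range(num_atoms):
            mass = rng.betavariate(1 - sigma, n - 1 + c + sigma)  # v_nk
            location = rng.random()  # psi_nk; assumption: H is Uniform[0, 1)
            atoms.append((mass, location))
    return atoms


atoms = size_biased_sbp(num_rounds=50, alpha=3.0, c=1.0, sigma=0.5, seed=1)
print(len(atoms), "atoms; largest mass", max(mass for mass, _ in atoms))
```

Atoms generated in later rounds tend to have smaller masses, because the second beta parameter grows with n.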

The inverse Lévy measure method is a general method of generating from a Poisson process with non-uniform rate measure. It essentially transforms the Poisson process into one with uniform rate, generates a sample, and transforms the sample back. This method is more involved for the SBP because the inverse transform has no analytically tractable form. The Lévy measure $\Lambda_0$ of the SBP factorizes into a product $\Lambda_0(du \times d\theta) = L(du)\,H(d\theta)$ of a σ-finite measure $L(du) = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)}\, u^{-\sigma-1}(1-u)^{c+\sigma-1}\, du$ over (0,1) and a probability measure H over Θ. This implies that we can generate a sample $\{v_l, \psi_l\}_{l=1}^{\infty}$ of the random atoms of µ and their masses by first sampling the masses $\{v_l\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(L)$ from a Poisson process on (0,1) with rate measure L, and associating each $v_l$ with an iid draw $\psi_l \sim H$ [19]. Now consider the mapping T : (0,1) → (0,∞) given by

$$T(u) = \int_u^1 L(du) = \int_u^1 \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)}\, u^{-\sigma-1}(1-u)^{c+\sigma-1}\, du. \tag{12}$$

T is bijective and monotonically decreasing. The Mapping Theorem for Poisson processes [19] shows that $\{v_l\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(L)$ if and only if $\{T(v_l)\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(\mathbb{L})$ where $\mathbb{L}$ is Lebesgue measure on (0,∞). A sample $\{t_l\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(\mathbb{L})$ can be easily drawn by letting $e_l \sim \mathrm{Exponential}(1)$ and setting $t_l = \sum_{i=1}^{l} e_i$ for all l. Transforming back with $v_l = T^{-1}(t_l)$, we have $\{v_l\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(L)$. As $t_1, t_2, \ldots$ is an increasing sequence and T is decreasing, $v_1, v_2, \ldots$ is a decreasing sequence of masses. Deriving the density of $v_l$ given $v_{l-1}$, we get:

$$p(v_l \mid v_{l-1}) = \left|\frac{dt_l}{dv_l}\right| p(t_l \mid t_{l-1}) = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)}\, v_l^{-\sigma-1}(1-v_l)^{c+\sigma-1} \exp\left\{-\int_{v_l}^{v_{l-1}} L(du)\right\}. \tag{13}$$

In general these densities do not simplify and we have to resort to solving for $T^{-1}(t_l)$ numerically. There are two cases for which they do simplify. For c = 1, σ = 0, the density function reduces to $p(v_l \mid v_{l-1}) = \alpha v_l^{\alpha-1} / v_{l-1}^{\alpha}$, leading to the stick-breaking construction of the single parameter IBP [21]. In the stable process case when c = 1 − σ and σ ≠ 0, the density of $v_l$ simplifies to:

$$
\begin{aligned}
p(v_l \mid v_{l-1}) &= \alpha \frac{\Gamma(2-\sigma)}{\Gamma(1-\sigma)\,\Gamma(1)}\, v_l^{-\sigma-1} \times \exp\left\{-\int_{v_l}^{v_{l-1}} \alpha \frac{\Gamma(2-\sigma)}{\Gamma(1-\sigma)\,\Gamma(1)}\, u^{-\sigma-1}\, du\right\} \\
&= \alpha(1-\sigma)\, v_l^{-\sigma-1} \exp\left\{-\frac{\alpha(1-\sigma)}{\sigma}\left(v_l^{-\sigma} - v_{l-1}^{-\sigma}\right)\right\}.
\end{aligned}
\tag{14}
$$

Doing a change of variables to $y_l = v_l^{-\sigma}$, we get:

$$p(y_l \mid y_{l-1}) = \alpha \frac{1-\sigma}{\sigma} \exp\left\{-\alpha \frac{1-\sigma}{\sigma}\,(y_l - y_{l-1})\right\}. \tag{15}$$

That is, each $y_l$ is exponentially distributed with rate $\alpha \frac{1-\sigma}{\sigma}$ and offset by $y_{l-1}$. For general values of the parameters we do not have an analytic stick-breaking form. However note that the weights generated using this method are still going to be strictly decreasing.
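For the stable case c = 1 − σ, the recursion in (15) gives a direct sampler for the decreasing masses. The sketch below is illustrative only: the function name is hypothetical, and the starting value y_0 = 1 is an assumption that follows from taking v_0 = 1, the upper end of the interval (0,1) on which T vanishes.

```python
import random


def stable_case_masses(num_atoms, alpha, sigma, seed=0):
    """Decreasing masses v_1 > v_2 > ... for the case c = 1 - sigma with 0 < sigma < 1."""
    rng = random.Random(seed)
    rate = alpha * (1 - sigma) / sigma
    y = 1.0  # assumption: y_0 = v_0 ** (-sigma) with v_0 = 1
    masses = []
    for _ in range(num_atoms):
        y += rng.expovariate(rate)          # y_l - y_{l-1} ~ Exponential(rate), as in (15)
        masses.append(y ** (-1.0 / sigma))  # invert the change of variables y_l = v_l ** (-sigma)
    return masses


masses = stable_case_masses(num_atoms=20, alpha=1.0, sigma=0.5)
print(masses[:5])
```

For other parameter values no such closed form is available, and each step would instead require inverting (12) numerically, for example by bisection on a quadrature estimate of T.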

3.4 Power-law Properties

The SBP has a number of appealing power-law properties. In this section we shall assume σ > 0

since the case σ = 0 reduces the SBP to the usual beta process with less interesting power-law

properties. Derivations are given in the appendix.


[Figure 1 plots. Left panel: mean number of dishes tried versus number of customers (log-log axes), for α = 1, c = 1 and σ = 0.8, 0.5, 0.2, 0. Right panel: number of dishes versus number of customers trying each dish (log-log axes), for α = 1, c = 1, σ = 0.5.]

Figure 1: Power-law properties of the stable-beta Indian buffet process.

Firstly, the total number of dishes tried by n customers is $O(n^\sigma)$. The left panel of Figure 1 shows this for varying σ. Secondly, the number of customers trying each dish follows Zipf's law [23]. This is shown in the right panel of Figure 1, which plots the number of dishes $K_m$ versus the number of customers m trying each dish (that is, $K_m$ is the number of dishes k for which $m_k = m$). Asymptotically we can show that the proportion of dishes tried by m customers is $O(m^{-1-\sigma})$. Note that these power-laws are similar to those observed for Pitman-Yor processes. One aspect of the SBP which is not power-law is the number of dishes each customer tries. This is simply Poisson(α) distributed. It seems difficult to obtain power-law behavior in this aspect within a CRM framework, because of the fundamental role played by the Poisson process.
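The growth rate of the number of dishes can be checked numerically without simulation, since the expected number of dishes after n customers is the sum of the customers' new-dish Poisson rates. The sketch below uses hypothetical function names and estimates the log-log slope between two values of n; at finite n the slope only approximates σ, and the gap is wider for small σ.

```python
import math


def expected_num_dishes(n, alpha, c, sigma):
    """Sum of the Poisson rates of new dishes over customers 1..n."""
    log_norm = math.lgamma(1 + c) - math.lgamma(c + sigma)
    return alpha * sum(
        math.exp(log_norm + math.lgamma(i - 1 + c + sigma) - math.lgamma(i + c))
        for i in range(1, n + 1))


def log_log_slope(n_small, n_large, alpha, c, sigma):
    """Slope of log E[K] against log n between two numbers of customers."""
    k_small = expected_num_dishes(n_small, alpha, c, sigma)
    k_large = expected_num_dishes(n_large, alpha, c, sigma)
    return (math.log(k_large) - math.log(k_small)) / (math.log(n_large) - math.log(n_small))


for sigma in (0.2, 0.5, 0.8):
    print(sigma, round(log_log_slope(10_000, 100_000, 1.0, 1.0, sigma), 3))
```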

4 Word Occurrence Models with Stable-beta Processes

In this section we use the SBP as a model for word occurrences in document corpora. Let n be

the number of documents in a corpus. Let Zi({θ}) = 1 if word type θ occurs in document i and

0 otherwise, and let µ({θ}) be the occurrence probability of word type θ among the documents

in the corpus. We use the hierarchical model (4) with an SBP prior¹ on µ and with each document

modeled as a conditionally independent Bernoulli process draw. The joint distribution over the word

occurrences Z1,...,Zn, with µ integrated out, is given by the IBP joint probability (10).

We applied the word occurrence model to the 20newsgroups dataset. Following [16], we modeled

the training documents in each of the 20 newsgroups as a separate corpus with a separate SBP. We

use the popularity of each word type across all 20 newsgroups as the base distribution²: for each
word type θ let $n_\theta$ be the number of documents containing θ and let $H(\{\theta\}) \propto n_\theta$.
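The data representation described here is simple to build. The sketch below is a toy illustration with made-up documents and whitespace tokenization, both assumptions for the example only; the function names are hypothetical. Each document becomes the set of word types it contains, and the base distribution is proportional to the number of documents containing each word type across all corpora.

```python
from collections import Counter


def occurrence_sets(documents):
    """Z_i as a set: the word types occurring at least once in document i."""
    return [set(doc.split()) for doc in documents]


def document_frequency_base(corpora):
    """H({theta}) proportional to n_theta, the number of documents containing theta."""
    doc_freq = Counter()
    for documents in corpora:
        for word_types in occurrence_sets(documents):
            doc_freq.update(word_types)  # each document counts a word type at most once
    total = sum(doc_freq.values())
    return {word: count / total for word, count in doc_freq.items()}


# Toy corpora standing in for newsgroups (made-up text).
corpora = [
    ["the cat sat", "the dog sat"],
    ["the stock fell", "stock prices rose"],
]
base = document_frequency_base(corpora)
print(base["the"], base["stock"])
```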

In the first experiment we compared the SBP to the beta process by fitting the parameters α,c and

σ of both models to each newsgroup by maximum likelihood (in the beta process case σ is fixed at
0). We expect the SBP to perform better as it is better able to capture the power-law statistics of
the document corpora (see Figure 2). The ML values of the parameters across classes did not vary
much, taking values α = 142.6 ± 40.0, c = 4.1 ± 0.9 and σ = 0.47 ± 0.1. In comparison, the
parameter values obtained by the beta process are α = 147.3 ± 41.4 and c = 25.9 ± 8.4. Note that

the estimated values for c are significantly larger than for the SBP to allow the beta process to model

the fact that many words occur in a small number of documents (a consequence of the power-law

¹ Words are discrete objects. To get a smooth base distribution we imagine appending each word type with a U[0,1] variate. This does not affect the modelling that follows.

² The appropriate technique, as proposed by [16], would be to use a hierarchical SBP to tie the word occurrence probabilities across the newsgroups. However due to difficulties dealing with atomic base distributions we cannot define a hierarchical SBP easily (see discussion).


[Figure 2 plots. Left panel: cumulative number of words versus number of documents. Right panel: number of words versus number of documents per word (log-log axes). Curves in both panels: BP, SBP, DATA.]

Figure 2: Power-law properties of the 20newsgroups dataset. The faint dashed lines are the distribu-

tions of words in the documents in each class, the solid curve is the mean of these lines. The dashed

lines are the means of the word distributions generated by the ML parameters for the beta process

(pink) and the SBP (green).

Table 1: Classification performance of SBP and beta process (BP). The jth column (denoted 1:j)

shows the cumulative rank j classification accuracy of the test documents. The three numbers after

the models are the percentages of training, validation and test sets respectively.

assigned to classes:   1            1:2          1:3          1:4          1:5
BP - 20/20/60          78.7(±0.5)   87.4(±0.2)   91.3(±0.2)   95.1(±0.2)   96.2(±0.2)
SBP - 20/20/60         79.9(±0.5)   87.6(±0.1)   91.5(±0.2)   93.7(±0.2)   95.1(±0.2)
BP - 60/20/20          85.5(±0.6)   91.6(±0.3)   94.2(±0.3)   95.6(±0.4)   96.6(±0.3)
SBP - 60/20/20         85.5(±0.4)   91.9(±0.4)   94.4(±0.2)   95.6(±0.3)   96.6(±0.3)

statistics of word occurrences; see Figure 2). We also plotted the characteristics of data simulated

from the models using the estimated ML parameters. The SBP has a much better fit than the beta

process to the power-law properties of the corpora.

In the second experiment we tested the two models on categorizing test documents into one of the

20 newsgroups. Since this is a discriminative task, we optimized the parameters in both models to

maximize the cumulative ranked classification performance. The rank j classification performance

is defined to be the percentage of documents where the true label is among the top j predicted classes

(as determined by the IBP conditional probabilities of the documents under each of the 20 newsgroup

classes). As the cost function is not differentiable, we did a grid search over the parameter space,

using 20 values of α,c and σ each, and found the parameters maximizing the objective function on

a validation set separate from the test set. To see the effect of sample size on model performance we

tried splitting the documents in each newsgroup into 20% training, 20% validation and 60% test sets,

and into 60% training, 20% validation and 20% test sets. We repeated the experiment five times with

different random splits of the dataset. The ranked classification rates are shown in Table 1. Figure 3

shows that the SBP model has generally higher classification performances than the beta process.
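The cumulative rank j performance measure defined above can be computed as in the following sketch. The function name and the toy scores are hypothetical; in the setting of this section the score of a class would be the IBP conditional log probability of the document under that class, which is not computed here.

```python
def cumulative_rank_accuracy(scores, true_labels, max_rank):
    """Percentage of documents whose true label is among the top j scored classes,
    for j = 1..max_rank. scores[d][k] is the score of class k for document d."""
    hits = [0] * max_rank
    for doc_scores, label in zip(scores, true_labels):
        ranking = sorted(range(len(doc_scores)), key=lambda k: doc_scores[k], reverse=True)
        position = ranking.index(label)  # 0 means the true class was ranked first
        for j in range(position, max_rank):
            hits[j] += 1
    return [100.0 * h / len(true_labels) for h in hits]


# Made-up scores for four documents and three classes (higher is better).
scores = [
    [-1.0, -2.0, -3.0],
    [-2.0, -1.0, -3.0],
    [-3.0, -2.0, -1.0],
    [-1.0, -3.0, -2.0],
]
labels = [0, 0, 0, 2]
print(cumulative_rank_accuracy(scores, labels, max_rank=3))
```

Because this measure is piecewise constant in the model parameters, a grid search of the kind described in the text is a natural way to optimize it.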

5 Discussion

We have introduced a novel stochastic process called the stable-beta process. The stable-beta process

is a generalization of the beta process, and can be used in nonparametric Bayesian featural models

with an unbounded number of features. As opposed to the beta process, the stable-beta process has

a number of appealing power-law properties. We developed both an Indian buffet process and a

stick-breaking construction for the stable-beta process and applied it to modeling word occurrences

in document corpora. We expect the stable-beta process to find uses modeling a range of natural

phenomena with power-law properties.


[Figure 3 plot: difference in classification rate, SBP−BP (vertical axis, scale ×10⁻³), against class order 1 to 5 (horizontal axis).]

Figure 3: Differences between the classification rates of the SBP and the beta process. The perfor-

mance of the SBP was consistently higher than that of the beta process for each of the five runs.

We derived the stable-beta process as a completely random measure with Lévy measure (3). It

would be interesting and illuminating to try to derive it as an infinite limit of finite models, however

we were not able to do so in our initial attempts. A related question is whether there is a natural

definition of the stable-beta process for non-smooth base distributions. Until this is resolved in the

positive, we are not able to define hierarchical stable-beta processes generalizing the hierarchical

beta processes [16].

Another avenue of research we are currently pursuing is in deriving better stick-breaking construc-

tions for the stable-beta process. The current construction requires inverting the integral (12), which

is expensive as it requires an iterative method which evaluates the integral numerically within each

iteration.

Acknowledgement

We thank the Gatsby Charitable Foundation for funding, Romain Thibaux, Peter Latham and Tom

Griffiths for interesting discussions, and the anonymous reviewers for help and feedback.

A Derivation of Power-law Properties

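We will make large n and K assumptions here, and make use of Stirling's approximation $\Gamma(n+1) \approx \sqrt{2\pi n}\,(n/e)^n$, which is accurate in the large n regime. Customer i + 1 contributes a Poisson number of new dishes with mean $\alpha \frac{\Gamma(1+c)\,\Gamma(i+c+\sigma)}{\Gamma(i+1+c)\,\Gamma(c+\sigma)}$, so the expected number of dishes is,

$$
\begin{aligned}
E[K] &= \sum_{i=0}^{n-1} \alpha \frac{\Gamma(1+c)\,\Gamma(i+c+\sigma)}{\Gamma(i+1+c)\,\Gamma(c+\sigma)}
 \in O\left(\sum_{i=1}^{n} \frac{\sqrt{2\pi(i+c+\sigma-1)}\,\big((i+c+\sigma-1)/e\big)^{i+c+\sigma-1}}{\sqrt{2\pi(i+c)}\,\big((i+c)/e\big)^{i+c}}\right) \\
&= O\left(\sum_{i=1}^{n} e^{-\sigma+1} \left(1 + \frac{\sigma-1}{i+c}\right)^{i+c} (i+c+\sigma-1)^{\sigma-1}\right)
 = O\left(\sum_{i=1}^{n} e^{-\sigma+1} e^{\sigma-1}\, i^{\sigma-1}\right) = O(n^\sigma),
\end{aligned}
\tag{16}
$$

where the i = 0 term of the exact sum equals α and is absorbed into the bound.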

We are interested in the joint distribution of the statistics $(K_1, \ldots, K_n)$, where $K_m$ is the number of dishes tried by exactly m customers and where there are a total of n customers in the restaurant. As there are $\frac{K!}{\prod_{m=1}^{n} K_m!} \prod_{m=1}^{n} \left(\frac{n!}{m!\,(n-m)!}\right)^{K_m}$ configurations of the IBP with the same statistics $(K_1, \ldots, K_n)$, we have (ignoring constant terms and collecting terms in (10) with $m_k = m$),

$$p(K_1, \ldots, K_n \mid n) \propto \frac{K!}{\prod_{m=1}^{n} K_m!} \prod_{m=1}^{n} \left(\frac{n!}{m!\,(n-m)!} \cdot \frac{\Gamma(m-\sigma)\,\Gamma(n-m+c+\sigma)\,\Gamma(1+c)}{\Gamma(1-\sigma)\,\Gamma(c+\sigma)\,\Gamma(n+c)}\right)^{K_m}. \tag{17}$$

Conditioning on $K = \sum_{m=1}^{n} K_m$ as well, we see that $(K_1, \ldots, K_n)$ is multinomial with the probability of a dish having m customers being proportional to the term in large parentheses. For large m (and even larger n), this probability simplifies to,

$$O\left(\frac{\Gamma(m-\sigma)}{\Gamma(m+1)}\right) = O\left(\frac{\sqrt{2\pi(m-1-\sigma)}\,\big((m-1-\sigma)/e\big)^{m-1-\sigma}}{\sqrt{2\pi m}\,(m/e)^{m}}\right) = O\left(m^{-1-\sigma}\right). \tag{18}$$
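The asymptotic rate in (18) is easy to confirm numerically. The short sketch below, with a hypothetical helper `gamma_ratio`, prints the ratio Γ(m − σ)/Γ(m + 1) rescaled by m^(1+σ); if the stated rate is right, the rescaled values should settle near a constant as m grows.

```python
import math


def gamma_ratio(m, sigma):
    """Gamma(m - sigma) / Gamma(m + 1), computed in log space for stability."""
    return math.exp(math.lgamma(m - sigma) - math.lgamma(m + 1))


sigma = 0.5
for m in (10, 100, 1000, 10000):
    # Rescale by m^(1 + sigma) to expose the power-law rate.
    print(m, gamma_ratio(m, sigma) * m ** (1 + sigma))
```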


References

[1] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process.

In Advances in Neural Information Processing Systems, volume 18, 2006.

[2] Z. Ghahramani, T. L. Griffiths, and P. Sollich. Bayesian nonparametric latent feature models

(with discussion and rejoinder). In Bayesian Statistics, volume 8, 2007.

[3] D. Knowles and Z. Ghahramani. Infinite sparse factor analysis and infinite independent com-

ponents analysis. In International Conference on Independent Component Analysis and Signal

Separation, volume 7 of Lecture Notes in Computer Science. Springer, 2007.

[4] D. Görür, F. Jäkel, and C. E. Rasmussen. A choice model with infinitely many latent features.

In Proceedings of the International Conference on Machine Learning, volume 23, 2006.

[5] D. J. Navarro and T. L. Griffiths. Latent features in similarity judgment: A nonparametric

Bayesian approach. Neural Computation, in press 2008.

[6] E. Meeds, Z. Ghahramani, R. M. Neal, and S. T. Roweis. Modeling dyadic data with binary

latent factors. In Advances in Neural Information Processing Systems, volume 19, 2007.

[7] F. Wood, T. L. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring

hidden causes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence,

volume 22, 2006.

[8] S. Goldwater, T.L. Griffiths, and M. Johnson. Interpolating between types and tokens by es-

timating power-law generators. In Advances in Neural Information Processing Systems, vol-

ume 18, 2006.

[9] Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Pro-

ceedings of the 21st International Conference on Computational Linguistics and 44th Annual

Meeting of the Association for Computational Linguistics, pages 985–992, 2006.

[10] E. Sudderth and M. I. Jordan. Shared segmentation of natural scenes using dependent Pitman-

Yor processes. In Advances in Neural Information Processing Systems, volume 21, 2009.

[11] M. Perman, J. Pitman, and M. Yor. Size-biased sampling of Poisson point processes and

excursions. Probability Theory and Related Fields, 92(1):21–39, 1992.

[12] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable

subordinator. Annals of Probability, 25:855–900, 1997.

[13] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of

the American Statistical Association, 96(453):161–173, 2001.

[14] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics,

1(2):209–230, 1973.

[15] N. L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history

data. Annals of Statistics, 18(3):1259–1294, 1990.

[16] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In

Proceedings of the International Workshop on Artificial Intelligence and Statistics, volume 11,

pages 564–571, 2007.

[17] M. Perman. Random Discrete Distributions Derived from Subordinators. PhD thesis, Depart-

ment of Statistics, University of California at Berkeley, 1990.

[18] J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78,

1967.

[19] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1993.

[20] Y. Kim. Nonparametric Bayesian estimators for counting processes. Annals of Statistics,

27(2):562–588, 1999.

[21] Y. W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process.
In Proceedings of the International Conference on Artificial Intelligence and Statistics,

volume 11, 2007.

[22] R. L. Wolpert and K. Ickstadt. Simulations of Lévy random fields. In Practical Nonparametric

and Semiparametric Bayesian Statistics, pages 227–242. Springer-Verlag, 1998.

[23] G. Zipf. Selective Studies and the Principle of Relative Frequency in Language. Harvard

University Press, Cambridge, MA, 1932.
