
Indian Buffet Processes with Power-law Behavior

Yee Whye Teh and Dilan Görür

Gatsby Computational Neuroscience Unit, UCL

17 Queen Square, London WC1N 3AR, United Kingdom

{ywteh,dilan}@gatsby.ucl.ac.uk

Abstract

The Indian buffet process (IBP) is an exchangeable distribution over binary matrices used in Bayesian nonparametric featural models. In this paper we propose a three-parameter generalization of the IBP exhibiting power-law behavior. We achieve this by generalizing the beta process (the de Finetti measure of the IBP) to the stable-beta process and deriving the IBP corresponding to it. We find interesting relationships between the stable-beta process and the Pitman-Yor process (another stochastic process used in Bayesian nonparametric models with interesting power-law properties). We derive a stick-breaking construction for the stable-beta process, and find that our power-law IBP is a good model for word occurrences in document corpora.

1 Introduction

The Indian buffet process (IBP) is an infinitely exchangeable distribution over binary matrices with

a finite number of rows and an unbounded number of columns [1, 2]. It has been proposed as a

suitable prior for Bayesian nonparametric featural models, where each object (row) is modeled with

a potentially unbounded number of features (columns). Applications of the IBP include Bayesian

nonparametric models for ICA [3], choice modeling [4], similarity judgements modeling [5], dyadic

data modeling [6] and causal inference [7].

In this paper we propose a three-parameter generalization of the IBP with power-law behavior. Using

the usual analogy of customers entering an Indian buffet restaurant and sequentially choosing dishes

from an infinitely long buffet counter, our generalization with parameters α > 0, c > −σ and

σ ∈ [0,1) is simply as follows:

• Customer 1 tries Poisson(α) dishes.

• Subsequently, customer n + 1:

– tries dish k with probability $\frac{m_k - \sigma}{n + c}$, for each dish that has previously been tried;

– tries Poisson$\left(\alpha \frac{\Gamma(1+c)\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\Gamma(c+\sigma)}\right)$ new dishes.

where $m_k$ is the number of previous customers who tried dish k. The dishes and the customers

correspond to the columns and the rows of the binary matrix respectively, with an entry of the matrix

being one if the corresponding customer tried the dish (and zero otherwise). The mass parameter α

controls the total number of dishes tried by the customers, the concentration parameter c controls

the number of customers that will try each dish, and the stability exponent σ controls the power-law

behavior of the process. When σ = 0 the process does not exhibit power-law behavior and reduces

to the usual two-parameter IBP [2].
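The buffet procedure described above can be simulated directly. The following is a minimal sketch using only the Python standard library, not the authors' implementation: the function names `new_dish_rate`, `sample_poisson` and `sample_ibp` are hypothetical, and the Poisson sampler (counting unit-rate exponential arrivals) is a simple choice made here for self-containment.

```python
import math
import random


def new_dish_rate(alpha, c, sigma, n):
    """Poisson mean for the number of new dishes tried by customer n + 1.

    Computed in log space: alpha * G(1+c) G(n+c+sigma) / (G(n+1+c) G(c+sigma)).
    For n = 0 this equals alpha, matching the rule for customer 1.
    """
    log_rate = (math.log(alpha) + math.lgamma(1 + c) + math.lgamma(n + c + sigma)
                - math.lgamma(n + 1 + c) - math.lgamma(c + sigma))
    return math.exp(log_rate)


def sample_poisson(rate, rng):
    """Count unit-rate exponential arrivals in [0, rate]; the count is Poisson(rate)."""
    count, t = 0, rng.expovariate(1.0)
    while t <= rate:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_ibp(num_customers, alpha, c, sigma, seed=0):
    """Return (rows, counts): dishes tried by each customer, and m_k for each dish."""
    rng = random.Random(seed)
    counts = []  # counts[k] = m_k, number of customers who tried dish k so far
    rows = []
    for n in range(num_customers):  # n = number of previous customers
        # Previously tried dishes: probability (m_k - sigma) / (n + c).
        row = [k for k, m_k in enumerate(counts)
               if rng.random() < (m_k - sigma) / (n + c)]
        for k in row:
            counts[k] += 1
        # New dishes: Poisson number with the rate given in the text.
        for _ in range(sample_poisson(new_dish_rate(alpha, c, sigma, n), rng)):
            row.append(len(counts))
            counts.append(1)
        rows.append(row)
    return rows, counts
```

Calling `sample_ibp(100, alpha=3.0, c=1.0, sigma=0.5)` returns one list of dish indices per customer; setting `sigma=0.0` gives the two-parameter IBP as a special case.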

Many naturally occurring phenomena exhibit power-law behavior, and it has been argued that using

models that can capture this behavior can improve learning [8]. Recent examples where this has led

to significant improvements include unsupervised morphology learning [8], language modeling [9]


and image segmentation [10]. These examples are all based on the Pitman-Yor process [11, 12, 13],

a generalization of the Dirichlet process [14] with power-law properties. Our generalization of the

IBP extends the ability to model power-law behavior to featural models, and we expect it to lead to

a wealth of novel applications not previously well handled by the IBP.

The approach we take in this paper is to first define the underlying de Finetti measure, then to derive

the conditional distributions of Bernoulli process observations with the de Finetti measure integrated

out. This automatically ensures that the resulting power-law IBP is infinitely exchangeable. We call

the de Finetti measure of the power-law IBP the stable-beta process. It is a novel generalization of

the beta process [15] (which is the de Finetti measure of the normal two-parameter IBP [16]) with

characteristics reminiscent of the stable process [17, 11] (in turn related to the Pitman-Yor process).

We will see that the stable-beta process has a number of properties similar to the Pitman-Yor process.

In the following section we first give a brief description of completely random measures, a class of

random measures which includes the stable-beta and the beta processes. In Section 3 we introduce

the stable-beta process, a three parameter generalization of the beta process and derive the power-

law IBP based on the stable-beta process. Based on the proposed model, in Section 4 we construct

a model of word occurrences in a document corpus. We conclude with a discussion in Section 5.

2 Completely Random Measures

In this section we give a brief description of completely random measures [18]. Let Θ be a measure

space with Ω its σ-algebra. A random variable whose values are measures on (Θ,Ω) is referred

to as a random measure. A completely random measure (CRM) µ over (Θ,Ω) is a random measure such that µ(A) ⊥⊥ µ(B) for all disjoint measurable subsets A,B ∈ Ω. That is, the (random)

masses assigned to disjoint subsets are independent. An important implication of this property is

that the whole distribution over µ is determined (with usually satisfied technical assumptions) once

the distributions of µ(A) are given for all A ∈ Ω.

CRMs can always be decomposed into a sum of three independent parts: a (non-random) measure,

an atomic measure with fixed atoms but random masses, and an atomic measure with random atoms

and masses. CRMs in this paper will only contain the second and third components. In this case we

can write µ in the form,

\[
\mu = \sum_{k=1}^{N} u_k \delta_{\phi_k} + \sum_{l=1}^{M} v_l \delta_{\psi_l}, \tag{1}
\]

where $u_k, v_l > 0$ are the random masses, $\phi_k \in \Theta$ are the fixed atoms, $\psi_l \in \Theta$ are the random atoms, and $N, M \in \mathbb{N} \cup \{\infty\}$. To describe µ fully it is sufficient to specify N and $\{\phi_k\}$, and to describe the joint distribution over the random variables $\{u_k\}$, $\{v_l\}$, $\{\psi_l\}$ and M. Each $u_k$ has to be independent from everything else and has some distribution $F_k$. The random atoms and their weights $\{v_l, \psi_l\}$ are jointly drawn from a 2D Poisson process over (0,∞] × Θ with some nonatomic rate measure Λ called the Lévy measure. The rate measure Λ has to satisfy a number of technical properties; see [18, 19] for details. If $\int_\Theta \int_{(0,\infty]} \Lambda(du \times d\theta) = M^* < \infty$ then the number of random atoms M in µ is Poisson distributed with mean $M^*$, otherwise there are an infinite number of random atoms. If µ is described by Λ and $\{\phi_k, F_k\}_{k=1}^N$ as above, we write,

\[
\mu \sim \mathrm{CRM}(\Lambda, \{\phi_k, F_k\}_{k=1}^N). \tag{2}
\]

3 The Stable-beta Process

In this section we introduce a novel CRM called the stable-beta process (SBP). It has no fixed atoms

while its Lévy measure is defined over (0,1) × Θ:

\[
\Lambda_0(du \times d\theta) = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} u^{-\sigma-1} (1-u)^{c+\sigma-1} \, du \, H(d\theta) \tag{3}
\]

where the parameters are: a mass parameter α > 0, a concentration parameter c > −σ, a stability exponent 0 ≤ σ < 1, and a smooth base distribution H. The mass parameter controls the overall mass of the process and the base distribution gives the distribution over the random atom locations.


The mean of the SBP can be shown to be E[µ(A)] = αH(A) for each A ∈ Ω, while $\mathrm{var}(\mu(A)) = \alpha \frac{1-\sigma}{1+c} H(A)$. Thus the concentration parameter and the stability exponent both affect the variability of the SBP around its mean. The stability exponent also governs the power-law behavior of the SBP. When σ = 0 the SBP does not have power-law behavior and reduces to a normal two-parameter beta process [15, 16]. When c = 1 − σ the stable-beta process describes the random atoms with masses < 1 in a stable process [17, 11]. The SBP is so named as it can be seen as a generalization of both the stable and the beta processes. Both the concentration parameter and the stability exponent can be generalized to functions over Θ though we will not deal with this generalization here.
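As a sketch of where the stated mean and variance come from: assuming the standard first- and second-moment formulas for sums over a Poisson process (Campbell's theorem, which the text does not spell out), both moments reduce to beta integrals against the Lévy measure (3).

\[
\begin{aligned}
E[\mu(A)] &= \int_A \int_0^1 u \, \Lambda_0(du \times d\theta)
 = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} \cdot \frac{\Gamma(1-\sigma)\Gamma(c+\sigma)}{\Gamma(1+c)} H(A) = \alpha H(A), \\
\mathrm{var}(\mu(A)) &= \int_A \int_0^1 u^2 \, \Lambda_0(du \times d\theta)
 = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} \cdot \frac{\Gamma(2-\sigma)\Gamma(c+\sigma)}{\Gamma(2+c)} H(A) = \alpha \frac{1-\sigma}{1+c} H(A).
\end{aligned}
\]

The last equality uses $\Gamma(2-\sigma) = (1-\sigma)\Gamma(1-\sigma)$ and $\Gamma(2+c) = (1+c)\Gamma(1+c)$.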

3.1 Posterior Stable-beta Process

Consider the following hierarchical model:

\[
\mu \sim \mathrm{CRM}(\Lambda_0, \{\}), \qquad Z_i \mid \mu \sim \mathrm{BernoulliP}(\mu) \ \text{iid, for } i = 1, \ldots, n. \tag{4}
\]

The random measure µ is an SBP with no fixed atoms and with Lévy measure (3), while $Z_i \sim \mathrm{BernoulliP}(\mu)$ is a Bernoulli process with mean µ [16]. This is also a CRM: in a small neighborhood dθ around θ ∈ Θ it has a probability µ(dθ) of having a unit mass atom in dθ; otherwise it does not have an atom in dθ. If µ has an atom at θ the probability of $Z_i$ having an atom at θ as well is µ({θ}). If µ has a smooth component, say $\mu_0$, $Z_i$ will have random atoms drawn from a Poisson process with rate measure $\mu_0$. In typical applications to featural models the atoms in $Z_i$ give the features

associated with data item i, while the weights of the atoms in µ give the prior probabilities of the

corresponding features occurring in a data item.
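For the purely atomic part of µ, drawing a Bernoulli process amounts to one independent coin flip per atom. The following is a minimal sketch, assuming µ is represented by finite lists of atom locations and masses; the function name `sample_bernoulli_process` and this list representation are illustrative assumptions, and any smooth component of µ is ignored.

```python
import random


def sample_bernoulli_process(atoms, masses, rng):
    """Return the atoms of one draw Z ~ BernoulliP(mu) for an atomic mu.

    Each atom theta with mass u in [0, 1] is included independently
    with probability u, so the result is the feature set of one data item.
    """
    return [theta for theta, u in zip(atoms, masses) if rng.random() < u]


rng = random.Random(0)
features = sample_bernoulli_process(["a", "b", "c"], [0.9, 0.5, 0.1], rng)
```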

We are interested in both the posterior of µ given $Z_1, \ldots, Z_n$, as well as the conditional distribution of $Z_{n+1} \mid Z_1, \ldots, Z_n$ with µ marginalized out. Let $\theta^*_1, \ldots, \theta^*_K$ be the K unique atoms among $Z_1, \ldots, Z_n$ with atom $\theta^*_k$ occurring $m_k$ times. Theorem 3.3 of [20] shows that the posterior of µ given $Z_1, \ldots, Z_n$ is still a CRM, but now including fixed atoms given by $\theta^*_1, \ldots, \theta^*_K$. Its updated Lévy measure and the distribution of the mass at each fixed atom $\theta^*_k$ are,

\[
\mu \mid Z_1, \ldots, Z_n \sim \mathrm{CRM}(\Lambda_n, \{\theta^*_k, F_{nk}\}_{k=1}^K), \tag{5}
\]

where

\[
\Lambda_n(du \times d\theta) = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} u^{-\sigma-1} (1-u)^{n+c+\sigma-1} \, du \, H(d\theta), \tag{6a}
\]

\[
F_{nk}(du) = \frac{\Gamma(n+c)}{\Gamma(m_k-\sigma)\Gamma(n-m_k+c+\sigma)} u^{m_k-\sigma-1} (1-u)^{n-m_k+c+\sigma-1} \, du. \tag{6b}
\]

Intuitively, the posterior is obtained as follows. Firstly, the posterior of µ must be a CRM since both the prior of µ and the likelihood of each $Z_i \mid \mu$ factorize over disjoint subsets of Θ. Secondly, µ must have fixed atoms at each $\theta^*_k$ since otherwise the probability that there will be atoms among $Z_1, \ldots, Z_n$ at precisely $\theta^*_k$ is zero. The posterior mass at $\theta^*_k$ is obtained by multiplying a Bernoulli “likelihood” $u^{m_k}(1-u)^{n-m_k}$ (since there are $m_k$ occurrences of the atom $\theta^*_k$ among $Z_1, \ldots, Z_n$) to the “prior” $\Lambda_0(du \times d\theta^*_k)$ in (3) and normalizing, giving us (6b). Finally, outside of these K atoms there are no other atoms among $Z_1, \ldots, Z_n$. We can think of this as n observations of 0 among n iid Bernoulli variables, so a “likelihood” of $(1-u)^n$ is multiplied into $\Lambda_0$ (without normalization), giving the updated Lévy measure in (6a).

Let us inspect the distributions (6) of the fixed and random atoms in the posterior µ in turn. The random mass at $\theta^*_k$ has a distribution $F_{nk}$ which is simply a beta distribution with parameters $(m_k - \sigma,\ n - m_k + c + \sigma)$. This differs from the usual beta process in the subtraction of σ from $m_k$ and addition of σ to $n - m_k + c$. This is reminiscent of the Pitman-Yor generalization to the Dirichlet process [11, 12, 13], where a discount parameter is subtracted from the number of customers seated around each table, and added to the chance of sitting at a new table. On the other hand, the Lévy measure of the random atoms of µ is still a Lévy measure corresponding to an SBP with updated parameters

\[
\alpha' \leftarrow \alpha \frac{\Gamma(1+c)\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\Gamma(c+\sigma)}, \qquad
\sigma' \leftarrow \sigma, \qquad
c' \leftarrow c + n, \qquad
H' \leftarrow H. \tag{7}
\]


Note that the update depends only on n, not on Z1,...,Zn. In summary, the posterior of µ is simply

an independent sum of an SBP with updated parameters and of fixed atoms with beta distributed

masses. Observe that the posterior µ is not itself a SBP. In other words, the SBP is not conjugate

to Bernoulli process observations. This is different from the beta process and again reminiscent

of Pitman-Yor processes, where the posterior is also a sum of a Pitman-Yor process with updated

parameters and fixed atoms with random masses, but not a Pitman-Yor process [11]. Fortunately,

the non-conjugacy of the SBP does not preclude efficient inference. In the next subsections we describe
an Indian buffet process and a stick-breaking construction corresponding to the SBP. Efficient

inference techniques based on both representations for the beta process can be straightforwardly

generalized to the SBP [1, 16, 21].
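The posterior update of this subsection is simple to compute. The following minimal sketch evaluates the updated SBP parameters of (7) and the beta parameters of (6b) for each fixed atom; the function name `posterior_sbp` and the dictionary and tuple return format are hypothetical choices made for illustration.

```python
import math


def posterior_sbp(alpha, c, sigma, n, counts):
    """Posterior of an SBP after n Bernoulli process draws.

    counts[k] = m_k is the number of draws containing fixed atom k.
    Returns the updated SBP parameters for the random atoms (eq. 7) and a
    list of Beta(a, b) parameters for the fixed-atom masses (eq. 6b).
    """
    log_alpha_new = (math.log(alpha) + math.lgamma(1 + c) + math.lgamma(n + c + sigma)
                     - math.lgamma(n + 1 + c) - math.lgamma(c + sigma))
    random_part = {"alpha": math.exp(log_alpha_new), "c": c + n, "sigma": sigma}
    fixed_part = [(m_k - sigma, n - m_k + c + sigma) for m_k in counts]
    return random_part, fixed_part


random_part, fixed_part = posterior_sbp(alpha=2.0, c=1.0, sigma=0.5, n=3, counts=[2, 1])
```

The mean a / (a + b) of each returned beta distribution equals $(m_k - \sigma)/(n + c)$, which is the probability with which the next customer tries dish k.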

3.2 The Stable-beta Indian Buffet Process

We can derive an Indian buffet process (IBP) corresponding to the SBP by deriving, for each n, the distribution of $Z_{n+1}$ conditioned on $Z_1, \ldots, Z_n$, with µ marginalized out. This derivation is straightforward and follows closely that for the beta process [16]. For each of the atoms $\theta^*_k$ the posterior of $\mu(\theta^*_k)$ given $Z_1, \ldots, Z_n$ is beta distributed with mean $\frac{m_k - \sigma}{n + c}$. Thus

\[
p(Z_{n+1}(\theta^*_k) = 1 \mid Z_1, \ldots, Z_n) = E[\mu(\theta^*_k) \mid Z_1, \ldots, Z_n] = \frac{m_k - \sigma}{n + c} \tag{8}
\]

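Metaphorically speaking, customer n + 1 tries dish k with probability $\frac{m_k - \sigma}{n + c}$. Now for the random atoms. Let $\theta \in \Theta \setminus \{\theta^*_1, \ldots, \theta^*_K\}$. In a small neighborhood dθ around θ, we have:

\[
\begin{aligned}
p(Z_{n+1}(d\theta) = 1 \mid Z_1, \ldots, Z_n) &= E[\mu(d\theta) \mid Z_1, \ldots, Z_n] = \int_0^1 u \, \Lambda_n(du \times d\theta) \\
&= \int_0^1 u \, \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} u^{-1-\sigma} (1-u)^{n+c+\sigma-1} \, du \, H(d\theta) \\
&= \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} H(d\theta) \int_0^1 u^{-\sigma} (1-u)^{n+c+\sigma-1} \, du \\
&= \alpha \frac{\Gamma(1+c)\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\Gamma(c+\sigma)} H(d\theta)
\end{aligned}
\tag{9}
\]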
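The final step of (9) is an instance of the beta integral, written out here for completeness:

\[
\int_0^1 u^{-\sigma} (1-u)^{n+c+\sigma-1} \, du = B(1-\sigma,\ n+c+\sigma) = \frac{\Gamma(1-\sigma)\Gamma(n+c+\sigma)}{\Gamma(n+1+c)},
\]

so the factor $\Gamma(1-\sigma)$ cancels against the normalizer of the Lévy measure.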

Since $Z_{n+1}$ is completely random and H is smooth, the above shows that on $\Theta \setminus \{\theta^*_1, \ldots, \theta^*_K\}$ $Z_{n+1}$ is simply a Poisson process with rate measure $\alpha \frac{\Gamma(1+c)\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\Gamma(c+\sigma)} H$. In particular, it will have Poisson$\left(\alpha \frac{\Gamma(1+c)\Gamma(n+c+\sigma)}{\Gamma(n+1+c)\Gamma(c+\sigma)}\right)$ new atoms, each independently and identically distributed according to H. In the IBP metaphor, this corresponds to customer n + 1 trying new dishes, with each dish associated with a new draw from H. The resulting Indian buffet process is as described in the introduction. It is automatically infinitely exchangeable since it was derived from the conditional distributions of the hierarchical model (4).

Multiplying the conditional probabilities of each $Z_n$ given previous ones together, we get the joint probability of $Z_1, \ldots, Z_n$ with µ marginalized out:

\[
p(Z_1, \ldots, Z_n) = \exp\left( -\alpha \sum_{i=1}^{n} \frac{\Gamma(1+c)\Gamma(i-1+c+\sigma)}{\Gamma(i+c)\Gamma(c+\sigma)} \right) \prod_{k=1}^{K} \frac{\Gamma(m_k-\sigma)\Gamma(n-m_k+c+\sigma)\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)\Gamma(n+c)} \, \alpha \, h(\theta^*_k), \tag{10}
\]

where there are K atoms (dishes) $\theta^*_1, \ldots, \theta^*_K$ among $Z_1, \ldots, Z_n$ with atom k appearing $m_k$ times, and h is the density of H. (10) is to be contrasted with (4) in [1]. The $K_h!$ terms in [1] are absent as we have to distinguish among these $K_h$ dishes in assigning each of them a distinct atom (this also contributes the $h(\theta^*_k)$ terms). The fact that (10) is invariant to permuting the ordering among $Z_1, \ldots, Z_n$ also indicates the infinite exchangeability of the stable-beta IBP.
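The agreement between the joint probability (10) and the product of the sequential conditionals can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the binary matrix is given as a list of rows, every column is assumed to contain at least one 1, and the density terms $h(\theta^*_k)$ are taken to be 1 (for example H uniform on the unit interval).

```python
import math


def new_dish_rate(alpha, c, sigma, n):
    """alpha * G(1+c) G(n+c+sigma) / (G(n+1+c) G(c+sigma)), via log-gamma."""
    return math.exp(math.log(alpha) + math.lgamma(1 + c) + math.lgamma(n + c + sigma)
                    - math.lgamma(n + 1 + c) - math.lgamma(c + sigma))


def log_joint_closed_form(Z, alpha, c, sigma):
    """Log of eq. (10) with h = 1."""
    n, K = len(Z), len(Z[0])
    total = -sum(new_dish_rate(alpha, c, sigma, i) for i in range(n))
    for k in range(K):
        m_k = sum(row[k] for row in Z)
        total += (math.lgamma(m_k - sigma) + math.lgamma(n - m_k + c + sigma)
                  + math.lgamma(1 + c) - math.lgamma(1 - sigma)
                  - math.lgamma(c + sigma) - math.lgamma(n + c) + math.log(alpha))
    return total


def log_joint_sequential(Z, alpha, c, sigma):
    """Sum of log conditionals of each row given the previous rows."""
    K = len(Z[0])
    counts = [0] * K
    total = 0.0
    for n, row in enumerate(Z):  # n = number of previous customers
        rate = new_dish_rate(alpha, c, sigma, n)
        total -= rate  # Poisson process term exp(-rate) for new dishes
        for k in range(K):
            if counts[k] == 0:
                if row[k]:
                    total += math.log(rate)  # intensity of a new dish
            else:
                p = (counts[k] - sigma) / (n + c)
                total += math.log(p if row[k] else 1.0 - p)
            counts[k] += row[k]
    return total


Z = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 0, 0]]
gap = abs(log_joint_closed_form(Z, 2.0, 1.0, 0.5) - log_joint_sequential(Z, 2.0, 1.0, 0.5))
```

Reordering the rows of `Z` changes the order in which dishes are introduced but leaves both quantities unchanged, which is the exchangeability property noted in the text.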

3.3 Stick-breaking constructions

In this section we describe stick-breaking constructions for the SBP generalizing those for the beta

process. The first is based on the size-biased ordering of atoms induced by the IBP [16], while


the second is based on the inverse Lévy measure method [22], and produces a sequence of random

atoms of strictly decreasing masses [21].

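The size-biased construction is straightforward: we use the IBP to generate the atoms (dishes) in the SBP; each time a dish is newly generated the atom is drawn from H and its mass from $F_{nk}$. This leads to the following procedure:

\[
\begin{aligned}
&\text{for } n = 1, 2, \ldots: && J_n \sim \mathrm{Poisson}\left( \alpha \frac{\Gamma(1+c)\Gamma(n-1+c+\sigma)}{\Gamma(n+c)\Gamma(c+\sigma)} \right), \\
&\quad \text{for } k = 1, \ldots, J_n: && v_{nk} \sim \mathrm{Beta}(1-\sigma,\ n-1+c+\sigma), \qquad \psi_{nk} \sim H, \\
&\mu = \sum_{n=1}^{\infty} \sum_{k=1}^{J_n} v_{nk} \, \delta_{\psi_{nk}}.
\end{aligned}
\tag{11}
\]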
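Procedure (11) can be sketched as follows, truncated after a finite number of rounds. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the base distribution H is taken to be Uniform(0, 1) purely for concreteness, and the Poisson sampler counts unit-rate exponential arrivals.

```python
import math
import random


def sample_poisson(rate, rng):
    """Count unit-rate exponential arrivals in [0, rate]; the count is Poisson(rate)."""
    count, t = 0, rng.expovariate(1.0)
    while t <= rate:
        count += 1
        t += rng.expovariate(1.0)
    return count


def size_biased_sbp(alpha, c, sigma, num_rounds, seed=0):
    """Return a list of (mass, location) atoms from rounds n = 1..num_rounds of (11)."""
    rng = random.Random(seed)
    atoms = []
    for n in range(1, num_rounds + 1):
        # Poisson mean: alpha * G(1+c) G(n-1+c+sigma) / (G(n+c) G(c+sigma)).
        log_rate = (math.log(alpha) + math.lgamma(1 + c)
                    + math.lgamma(n - 1 + c + sigma)
                    - math.lgamma(n + c) - math.lgamma(c + sigma))
        for _ in range(sample_poisson(math.exp(log_rate), rng)):
            v = rng.betavariate(1 - sigma, n - 1 + c + sigma)  # mass v_nk
            psi = rng.random()  # location psi_nk; assumes H = Uniform(0, 1)
            atoms.append((v, psi))
    return atoms


atoms = size_biased_sbp(alpha=5.0, c=1.0, sigma=0.5, num_rounds=50, seed=3)
```

Truncating at `num_rounds` drops the atoms that would be generated in later rounds, so the result only approximates a draw of µ.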

The inverse Lévy measure is a general method of generating from a Poisson process with non-uniform rate measure. It essentially transforms the Poisson process into one with uniform rate, generates a sample, and transforms the sample back. This method is more involved for the SBP because the inverse transform has no analytically tractable form. The Lévy measure $\Lambda_0$ of the SBP factorizes into a product $\Lambda_0(du \times d\theta) = L(du) H(d\theta)$ of a σ-finite measure $L(du) = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} u^{-\sigma-1} (1-u)^{c+\sigma-1} \, du$ over (0,1) and a probability measure H over Θ. This implies that we can generate a sample $\{v_l, \psi_l\}_{l=1}^{\infty}$ of the random atoms of µ and their masses by first sampling the masses $\{v_l\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(L)$ from a Poisson process on (0,1) with rate measure L, and associating each $v_l$ with an iid draw $\psi_l \sim H$ [19]. Now consider the mapping T : (0,1) → (0,∞) given by

\[
T(u) = \int_u^1 L(du) = \int_u^1 \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} u^{-\sigma-1} (1-u)^{c+\sigma-1} \, du. \tag{12}
\]

T is bijective and monotonically decreasing. The Mapping Theorem for Poisson processes [19] shows that $\{v_l\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(L)$ if and only if $\{T(v_l)\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(\mathbb{L})$ where $\mathbb{L}$ is Lebesgue measure on (0,∞). A sample $\{t_l\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(\mathbb{L})$ can be easily drawn by letting $e_l \sim \mathrm{Exponential}(1)$ and setting $t_l = \sum_{i=1}^{l} e_i$ for all l. Transforming back with $v_l = T^{-1}(t_l)$, we have $\{v_l\}_{l=1}^{\infty} \sim \mathrm{PoissonP}(L)$. As $t_1, t_2, \ldots$ is an increasing sequence and T is decreasing, $v_1, v_2, \ldots$ is a decreasing sequence of masses. Deriving the density of $v_l$ given $v_{l-1}$, we get:

\[
p(v_l \mid v_{l-1}) = \left| \frac{dt_l}{dv_l} \right| p(t_l \mid t_{l-1}) = \alpha \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} v_l^{-\sigma-1} (1-v_l)^{c+\sigma-1} \exp\left( -\int_{v_l}^{v_{l-1}} L(du) \right). \tag{13}
\]

In general these densities do not simplify and we have to resort to solving for $T^{-1}(t_l)$ numerically. There are two cases for which they do simplify. For c = 1, σ = 0, the density function reduces to $p(v_l \mid v_{l-1}) = \alpha v_l^{\alpha-1} / v_{l-1}^{\alpha}$, leading to the stick-breaking construction of the single parameter IBP [21]. In the stable process case when c = 1 − σ and σ ≠ 0, the density of $v_l$ simplifies to:

\[
\begin{aligned}
p(v_l \mid v_{l-1}) &= \alpha \frac{\Gamma(2-\sigma)}{\Gamma(1-\sigma)\Gamma(1)} v_l^{-\sigma-1} \times \exp\left( -\int_{v_l}^{v_{l-1}} \alpha \frac{\Gamma(2-\sigma)}{\Gamma(1-\sigma)\Gamma(1)} u^{-\sigma-1} \, du \right) \\
&= \alpha (1-\sigma) v_l^{-\sigma-1} \exp\left( -\frac{\alpha(1-\sigma)}{\sigma} \left( v_l^{-\sigma} - v_{l-1}^{-\sigma} \right) \right).
\end{aligned}
\tag{14}
\]

Doing a change of variables to $y_l = v_l^{-\sigma}$, we get:

\[
p(y_l \mid y_{l-1}) = \alpha \frac{1-\sigma}{\sigma} \exp\left( -\alpha \frac{1-\sigma}{\sigma} (y_l - y_{l-1}) \right). \tag{15}
\]

That is, each $y_l$ is exponentially distributed with rate $\alpha \frac{1-\sigma}{\sigma}$ and offset by $y_{l-1}$. For general values of the parameters we do not have an analytic stick-breaking form. However note that the weights generated using this method are still going to be strictly decreasing.
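In the stable process case, (15) gives a direct sampler for the decreasing masses. The following is a minimal sketch, assuming c = 1 − σ with σ > 0; the function name is hypothetical, and the starting value $y_0 = 1$ is an assumption made here, corresponding to $v_0 = 1$ because T(1) = 0 in (12).

```python
import random


def sample_stable_case_masses(alpha, sigma, num_atoms, seed=0):
    """Draw the num_atoms largest masses v_1 > v_2 > ... for c = 1 - sigma.

    Uses y_l = y_{l-1} + Exponential(rate) with rate = alpha (1 - sigma) / sigma,
    as in eq. (15), then maps back with v_l = y_l ** (-1 / sigma).
    """
    rng = random.Random(seed)
    rate = alpha * (1.0 - sigma) / sigma
    y = 1.0  # assumed start: y_0 = v_0 ** (-sigma) with v_0 = 1
    masses = []
    for _ in range(num_atoms):
        y += rng.expovariate(rate)
        masses.append(y ** (-1.0 / sigma))
    return masses


masses = sample_stable_case_masses(alpha=2.0, sigma=0.5, num_atoms=50, seed=1)
```

Each mass would then be paired with an independent draw from H to obtain the atoms of µ, as described in the text.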

3.4 Power-law Properties

The SBP has a number of appealing power-law properties. In this section we shall assume σ > 0

since the case σ = 0 reduces the SBP to the usual beta process with less interesting power-law

properties. Derivations are given in the appendix.
