University of Warwick institutional repository: http://go.warwick.ac.uk/wrap

Author(s): C Yau, O Papaspiliopoulos, GO Roberts and C Holmes

Article Title: Bayesian nonparametric hidden Markov models with application to the analysis of copy-number-variation in mammalian genomes

Year of publication: 2009

Link to published article: http://www2.warwick.ac.uk/fac/sci/statistics/crism/research/2009/paper09-12

Publisher statement: none

CRiSM Paper No. 09-12, www.warwick.ac.uk/go/crism

Bayesian Nonparametric Hidden Markov Models with application to the analysis of copy-number-variation in mammalian genomes

C. Yau∗, O. Papaspiliopoulos†, G. O. Roberts‡ and C. Holmes∗

Abstract

We consider the development of Bayesian Nonparametric methods for product partition models such as Hidden Markov Models and change point models. Our approach uses a Mixture of Dirichlet Process (MDP) model for the unknown sampling distribution (likelihood) for the observations arising in each state and a computationally efficient data augmentation scheme to aid inference. The method uses novel MCMC methodology which combines recent retrospective sampling methods with the use of slice sampler variables. The methodology is computationally efficient, both in terms of MCMC mixing properties, and robustness to the length of the time series being investigated. Moreover, the method is easy to implement requiring little or no user-interaction. We apply our methodology to the analysis of genomic copy number variation.

Keywords : Retrospective sampling, block Gibbs sampler, local/global clustering, partition models,

partial exchangeability

1 Introduction

Hidden Markov Models and other conditional Product Partition Models such as change point models

or spatial tessellation processes form an important class of statistical regression methods dating back

to Baum (1966); Barry & Hartigan (1992). Here we consider Bayesian nonparametric extensions

where the sampling density (likelihood) within a state or partition is given by a Mixture of Dirichlet

Process (Antoniak, 1974; Escobar, 1988).

Conventional constructions of the MDP make inference extremely challenging computationally due to the joint dependence structure induced on the observations. We develop a data augmentation scheme based on the retrospective simulation work (Papaspiliopoulos & Roberts, 2008) which alleviates this problem and facilitates computationally efficient inference by inducing partial exchangeability of observations within states. This allows, for example, forward-backward sampling and marginal likelihood sampling of state transition paths in an HMM.

∗Department of Statistics, University of Oxford, yau@stats.ox.ac.uk, cholmes@stats.ox.ac.uk

†Department of Economics, UPF, omiros.papaspiliopoulos@upf.edu

‡Department of Statistics, Warwick University, Gareth.O.Roberts@warwick.ac.uk

Our work here is motivated by the problem of analysing genomic copy number variation in mammalian genomes (Colella et al., 2007). This is a challenging and important scientific problem in genetics, typified by series of observations of length $O(10^5)$. In developing our methodology, therefore, we have paid close attention to ensuring that the methods scale well with the size of the data. Moreover, our approach gives good MCMC mixing properties and needs little or no algorithm tuning.

Research on Bayesian semi-parametric modelling using Dirichlet mixtures is now widespread throughout the statistical literature (Müller et al., 1996; Gelfand & Kottas, 2003; Müller et al., 2005; Quintana & Iglesias, 2003; Burr & Doss, 2005; Teh et al., 2006; Griffin & Steel, 2007, 2004; Rodriguez et al., 2008; Dunson, 2005; Dunson et al., 2007). Inference for Dirichlet mixture models has been feasible since the seminal development of Gibbs sampling techniques in Escobar (1988). This work constructed a marginal algorithm where the DP itself is analytically integrated out (see also Liu, 1996; Green & Richardson, 2001; Jain & Neal, 2004). The marginal method is more complicated to implement for non-conjugate models (though see MacEachern & Müller, 1998; Neal, 2000).

The alternative (and in principle more flexible) methodology is the conditional method, which does not require analytical integration of the DP. This approach was suggested in Ishwaran & Zarepour (2000); Ishwaran & James (2001, 2003), where finite-dimensional truncations are employed to circumvent the impossible task of storing the entire Dirichlet process state (which would require infinite storage capacity). In addition to its flexibility, a major advantage of the conditional approach is that in principle it allows inference for the latent random measure P. The requirement to use finite truncations of the DP was removed in recent work (Papaspiliopoulos & Roberts, 2008). In this paper we essentially generalise the approach of that work to our HMM-MDP context. In addition, we introduce a further innovation using the slice sampler construction of Walker (2007).

The paper is structured as follows. The motivating genetic problem is introduced in detail in Subsection 1.1. The HMM-MDP model is defined in Section 2, while the corresponding computational methodology is described in Section 3. The different models and methods are tested and compared in Section 4 on various simulated data sets. The genomic copy number variation analysis is presented in Section 5, and brief conclusions are given in Section 6.

1.1 Motivating Application

The development of the Bayesian nonparametric HMM reported here was motivated by on-going work by two of the authors in the analysis of genomic copy number variation (CNV) (see Colella et al. (2007)). Copy number variants are regions of the genome that can occur at variable copy number in the population. In diploid organisms, such as humans, somatic cells normally contain two copies of each gene, one inherited from each parent. However, abnormalities during the process of DNA replication and synthesis can lead to the loss or gain of DNA fragments, leading to variable gene copy numbers that may initiate or promote disease conditions. For example, the loss or gain of a number of tumor suppressor genes and oncogenes is known to promote the initiation and growth of cancers.

Figure 1: Example Array CGH dataset (log ratio plotted against probe number, with one region labelled Duplication and one labelled Deletion). This data set shows a copy number gain (duplication) and a copy number loss (deletion), which are characterised by relative upward and downward shifts in the log intensity ratio respectively.

Microarray technology has enabled copy number variation across the genome to be routinely profiled using array comparative genomic hybridisation (aCGH) methods. These technologies allow DNA copy number to be measured at millions of genomic locations simultaneously, allowing copy number variants to be mapped with high resolution. Copy number variation discovery, as a statistical problem, essentially amounts to detecting segmental changes in the mean levels of the DNA hybridisation intensity along the genome (see Figure 1). However, these measurements are extremely sensitive to variations in DNA quality, DNA quantity and instrumental noise, and this has led to the development of a number of statistical methods for data analysis.

One popular approach for tackling this problem utilises Hidden Markov Models where the hidden

states correspond to the unobserved copy number states at each probe location, and the observed

data are the hybridisation intensity measurements from the microarrays (see Shah et al. (2006);

Marioni et al. (2006); Colella et al. (2007); Stjernqvist et al. (2007); Andersson et al. (2008)).

Typically the distributions of the observations are assumed to be Gaussian or, in order to add

robustness, a mixture of two Gaussians or a Gaussian and uniform distribution, where the second

mixture component acts to capture outliers such as in Shah et al. (2006) and Colella et al. (2007).

However, many data sets contain non-Gaussian noise distributions on the measurements, as pointed out in Hu et al. (2007), particularly if the experimental conditions are not ideal. As a consequence, existing methods can be extremely sensitive to outliers, skewness or heavy tails in the actual noise process, which might lead to large numbers of false copy number variants being detected. As genomic technologies evolve from being pure research tools to diagnostic devices, more robust techniques are required. Bayesian nonparametrics offers an attractive solution to these problems and led us to investigate the models we describe here.

2 HMM-MDP model formulation

The observed data will be a realization of a stochastic process $\{y_t\}_{t=1}^T$. The marginal distribution and the dependence structure in the process are specified hierarchically and semi-parametrically. Let $f(y \mid m, z)$ be a density with parameters $m$ and $z$; $\{s_t\}_{t=1}^T$ be a Markov chain with discrete state-space $S = \{1,\dots,n\}$, transition matrix $\Pi = [\pi_{i,j}]_{i,j \in S}$ and initial distribution $\pi_0$; $H_\theta$ be a distribution indexed by some parameters $\theta$, and $\alpha > 0$. Then, the model is specified hierarchically as follows:

\begin{align*}
y_t \mid s_t, k_t, m, z &\sim f(y_t \mid m_{s_t}, z_{k_t}), \quad t = 1,\dots,T \\
P(s_t = i \mid s_{t-1} = j) &= \pi_{i,j}, \quad i,j \in S \\
p(k_t, u_t \mid w) &= \sum_{j : w_j > u_t} \delta_j(\cdot) = \sum_{j=1}^{\infty} 1[u_t < w_j]\, \delta_j(\cdot) \tag{1} \\
z_j \mid \theta &\sim H_\theta, \quad j \ge 1 \\
w_1 = v_1, \quad w_j &= v_j \prod_{i=1}^{j-1} (1 - v_i), \quad j \ge 2 \\
v_j &\sim \mathrm{Be}(1, \alpha), \quad j \ge 1,
\end{align*}

where $m = \{m_j, j \in S\}$, $s = (s_1,\dots,s_T)$, $y = (y_1,\dots,y_T)$, $u = (u_1,\dots,u_T)$, $k = (k_1,\dots,k_T)$, $w = (w_1, w_2, \dots)$, $v = (v_1, v_2, \dots)$, $z = (z_1, z_2, \dots)$ and $\delta_x(\cdot)$ denotes the Dirac delta measure centred at $x$.
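To make the hierarchy concrete, the following Python sketch simulates from it. It is only an illustration of the generative model under stated assumptions, not the inference methodology of this paper. It uses the fact that integrating $u_t$ out of (1) gives $P(k_t = j \mid w) = w_j$, and that conditionally on $k_t$ the variable $u_t$ is uniform on $(0, w_{k_t})$. Stick-breaking weights are created only when a draw requires them. The function name `simulate_hmm_mdp`, the callbacks `draw_z` (a sampler for $H_\theta$) and `emit` (a sampler for $f$), the 0-based state labels and the Gaussian choices in the usage lines are all hypothetical and are not specified by the text.

```python
import random

def simulate_hmm_mdp(T, pi0, Pi, m, alpha, draw_z, emit, seed=0):
    """Draw (s, k, u, y) and the instantiated (v, w, z) from the hierarchy.

    Pi[i][j] = P(s_t = i | s_{t-1} = j), as in the text; states are 0-based here.
    draw_z(rng) samples z_j from H_theta and emit(m_s, z_k, rng) samples from
    f(. | m_s, z_k); both are caller-supplied assumptions.
    """
    rng = random.Random(seed)
    n = len(pi0)
    v, w, z = [], [], []
    remaining = 1.0  # product of (1 - v_i) so far: stick mass not yet broken off

    def extend():
        nonlocal remaining
        vj = rng.betavariate(1.0, alpha)   # v_j ~ Be(1, alpha)
        v.append(vj)
        w.append(vj * remaining)           # w_j = v_j * prod_{i<j} (1 - v_i)
        z.append(draw_z(rng))              # z_j ~ H_theta
        remaining *= 1.0 - vj

    s, k, u, y = [], [], [], []
    for t in range(T):
        # Hidden Markov chain: initial distribution, then column s_{t-1} of Pi.
        probs = pi0 if t == 0 else [Pi[i][s[-1]] for i in range(n)]
        s_t = rng.choices(range(n), weights=probs)[0]

        # Draw k_t with P(k_t = j | w) = w_j by inverse CDF, creating
        # new weights only when the draw reaches past those already stored.
        x, j, cum = rng.random(), 0, 0.0
        while True:
            if j == len(w):
                extend()
            cum += w[j]
            if x < cum or remaining < 1e-12:  # second test is a numerical guard
                break
            j += 1

        # u_t | k_t, w ~ Uniform(0, w_{k_t}), so the pair (k_t, u_t) follows (1).
        u_t = rng.random() * w[j]

        s.append(s_t)
        k.append(j)
        u.append(u_t)
        y.append(emit(m[s_t], z[j], rng))
    return {"s": s, "k": k, "u": u, "y": y, "v": v, "w": w, "z": z}

# Hypothetical usage: two states, Gaussian base measure and Gaussian noise.
out = simulate_hmm_mdp(
    T=500, pi0=[0.5, 0.5], Pi=[[0.99, 0.01], [0.01, 0.99]],
    m=[0.0, 1.0], alpha=1.0,
    draw_z=lambda rng: rng.gauss(0.0, 0.2),
    emit=lambda m_s, z_k, rng: m_s + z_k + rng.gauss(0.0, 0.05),
)
```

Every simulated pair satisfies $u_t < w_{k_t}$, so only the finitely many components with $w_j > u_t$ are candidates for observation $t$; this is the property of the auxiliary variables that makes the infinite mixture tractable.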

The model has two characterising features: structural changes in time and a flexible sampling distribution at each regime. The structural changes are induced by the hidden Markov model (HMM), $\{m_{s_t}\}_{t=1}^T$. The conditional distribution of $y$ given the HMM state is specified as a mixture model in which $f(y \mid m, z)$ is mixed with respect to a random discrete probability measure $P(dz)$. The last four lines in the hierarchy identify $P$ with the Dirichlet process prior (DPP) with base measure $H_\theta$ and concentration parameter $\alpha$. Such mixture models are known as mixtures of Dirichlet process (MDP).

We have chosen a particular representation for the Dirichlet process prior (DPP) in terms of the allocation variables $k$, the stick-breaking weights $v$, the mixture parameters $z$ and the auxiliary