
A Bayesian model of shape and appearance for subcortical brain segmentation

Brian Patenaudea,b, Stephen M. Smitha, David N. Kennedyc, and Mark Jenkinsona,*

aFMRIB Centre, Department of Clinical Neurology, University of Oxford, UK

bDepartment of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, USA

cDepartment of Psychiatry, University of Massachusetts Medical School, USA

Abstract

Automatic segmentation of subcortical structures in human brain MR images is an important but

difficult task due to poor and variable intensity contrast. Clear, well-defined intensity features are

absent in many places along typical structure boundaries and so extra information is required to

achieve successful segmentation. A method is proposed here that uses manually labelled image

data to provide anatomical training information. It utilises the principles of the Active Shape and

Appearance Models but places them within a Bayesian framework, allowing probabilistic

relationships between shape and intensity to be fully exploited. The model is trained for 15

different subcortical structures using 336 manually-labelled T1-weighted MR images. Using the

Bayesian approach, conditional probabilities can be calculated easily and efficiently, avoiding

technical problems of ill-conditioned covariance matrices, even with weak priors, and eliminating

the need for fitting extra empirical scaling parameters, as is required in standard Active

Appearance Models. Furthermore, differences in boundary vertex locations provide a direct,

purely local measure of geometric change in structure between groups that, unlike voxel-based

morphometry, is not dependent on tissue classification methods or arbitrary smoothing. In this

paper the fully-automated segmentation method is presented and assessed both quantitatively,

using Leave-One-Out testing on the 336 training images, and qualitatively, using an independent

clinical dataset involving Alzheimer’s disease. Median Dice overlaps between 0.7 and 0.9 are

obtained with this method, which is comparable to or better than other automated methods. An

implementation of this method, called FIRST, is currently distributed with the freely-available

FSL package.

Keywords

Segmentation; Classification; Bayesian; Subcortical structures; Shape model

Introduction

It is important that medical image segmentation methods are accurate and robust in order to

sensitively study both normal and pathological brains. Achieving this in the subcortical areas

of the brain, given the typical low contrast-to-noise, is a great challenge for automated

methods. When trained human specialists perform manual segmentations they draw on prior

knowledge of shape, image intensities and shape-to-shape relationships. We present here a

© 2011 Elsevier Inc. All rights reserved.

*Corresponding author. Fax: +44 1865 222717. mark@fmrib.ox.ac.uk (M. Jenkinson).

NIH Public Access

Author Manuscript

Neuroimage. Author manuscript; available in PMC 2012 August 13.

Published in final edited form as:

Neuroimage. 2011 June 1; 56(3): 907–922. doi:10.1016/j.neuroimage.2011.02.046.


formulation of a computationally efficient shape and appearance model based on a Bayesian framework that incorporates both intra- and inter-structure variability information, while also

taking account of the limited size of the training set with respect to the dimensionality of the

data. The method is capable of performing segmentations of individual or multiple

subcortical structures as well as analysing differences in shape between different groups,

showing the location of changes in these structures, rather than just changes in the overall

volume.

The Active Shape Model (ASM) is an automated segmentation method that has been widely

used in the field of machine vision and medical image segmentation over the past decade

(Cootes et al., 1995). Standard ASMs model the distribution of corresponding anatomical

points (vertices/control points) and then parameterize the mean shape and most likely

variations of this shape across a training set. Images are segmented using the model built

from the training data, which specifies the range of likely shapes. In the original

formulation, if the dimensionality of the shape representation exceeds the size of the training

data then the only permissible shapes are linear combinations of the original training data,

although some methods for generalising this have been presented in the literature (Heimann

and Meinzer, 2009).

Intensity models are also useful in segmentation, and the Active Appearance Model (AAM)

is an extension of the ASM framework that incorporates such intensity information (Cootes

et al., 1998). As with the standard shape model, the intensity distribution is modelled as a

multivariate Gaussian and is parameterized by its mean and eigenvectors (modes of

variation). The AAM relates the shape and intensity models to each other with a weighting

matrix estimated from the training set. Fitting shapes to new images is done by minimising

the squared difference between the predicted intensities, given a shape deformation, and the

observed image intensities. Again, many modifications of this basic formulation have also

been proposed (Heimann and Meinzer, 2009).

In addition to the ASM and AAM methods there are many other approaches taken by fully-automated segmentation methods for subcortical structures. Some of these methods are

specific to particular structures (e.g. hippocampus), others can be applied to general

structures and still others can be applied to multiple structures simultaneously. The

approaches can be surface-based, volumetric-based or both, and utilise methods such as:

region competition (Chupin et al., 2007, 2009); homotopic region deformation (Lehéricy et

al., 2009); level-sets within a Bayesian framework (Cremers et al., 2006) or with local

distribution models (Yan et al., 2004); 4D shape priors (Kohlberger et al., 2006);

probabilistic boosting trees (Wels et al., 2008); label, or classifier, fusion (Heckemann et al.,

2006); label fusion with templates (Collins and Pruessner, 2010); label fusion with graph

cuts (Wolz et al., 2010); wavelets with ASM (Davatzikos et al., 2003); multivariate

discriminant methods (Arzhaeva et al., 2006); medial representations or deformable M-reps

(Levy et al., 2007; Styner et al., 2003); probabilistic boosting trees (Tu et al., 2008); large

diffeomorphic mapping (Lee et al., 2009b); and non-linear registration combined with AAM

(Babalola et al., 2007).

The most common volumetric-based approaches to segmentation are based on non-linear

warping of an atlas, or atlases, to new data (Collins and Evans, 1997; Fischl et al., 2002;

Pohl et al., 2006). Traditionally, a single average atlas has been used to define the structure

segmentations (as in (Collins and Evans, 1997; Gouttard et al., 2007)) whereas recent

methods (Gousias et al., 2008; Heckemann et al., 2006) propagate information from multiple

atlases and fuse the results. Additional information such as voxel-wise intensity and shape

priors can also be utilised (Fischl et al., 2002; Khan et al., 2008). When using a single atlas,

only a very limited amount of information on shape variation from the training data can be


retained. In place of this shape information, registration methods define the likelihood of a

given shape via the space of allowable transformations and regularisation-based penalization

applied to them. This potentially biases the segmented shapes to favour smooth variations

about the average template. Alternatively, methods that use multiple atlases or additional

voxel-wise shape priors are able to retain more variational information from the training

data.

Surface-based methods, on the other hand, tend to explicitly use learned shape variation as a

prior in the segmentation (Colliot et al., 2006; Pitiot et al., 2004; Tsai et al., 2004). In brain

image segmentation various ways of representing shapes and relationships have been

proposed, including fuzzy models (Colliot et al., 2006), level-sets (Tsai et al., 2004), and

simplex meshes (Pitiot et al., 2004). In addition, an array of different approaches has been

taken to couple the intensities in the image to the shape, usually in the form of energies and/or forces, which often require arbitrary weighting parameters to be set.

Our approach takes the deformable-model-based AAM and poses it in a Bayesian

framework. This framework is advantageous as it naturally allows probability relationships

between shapes of different structures and between shape and intensity to be utilised and

investigated, while also accounting for the limited amount of training data in a natural way.

It is still based on using a deformable model that restricts the topology (unlike level-sets or

voxel-wise priors), which is advantageous since the brain structures we are interested in

have a fixed topology, as confirmed by our training data. Another benefit of the deformable

model is that point correspondence between structures is maintained. This allows vertex-wise structural changes to be detected between groups of subjects, facilitating investigations

of normal and pathological variations in the brain. Moreover, this type of analysis is purely

local, based directly on the geometry/location of the structure boundary and is not dependent

on tissue-type classification or smoothing extents, unlike voxel-based morphometry

methods.

One difficulty of working with standard shape and appearance models is the limited amount

and quality of training data (Heimann and Meinzer, 2009). This means that the models

cannot represent variations in shape and intensity that are not explicitly present in the training data, which leads to restrictions in permissible shapes and difficulties in

establishing robust shape–intensity relationships. The problem is particularly acute when the

number of training sets is substantially less than the dimensionality of the model (number of

vertices times number of intensity samples per vertex) which is certainly the case in this

application (e.g., we have 336 training sets, but models with 10,000 or more parameters).

Although a number of approaches have been proposed to alleviate these problems, we find

that both of these problems are dealt with automatically by formulating the model in a

Bayesian framework. For example, one approach for removing shape restrictions that has

been proposed previously (Cremers et al., 2002) requires the addition of a regularisation

term in the shape covariance matrix, and we find that this same term arises naturally in our

Bayesian formulation.

Using the AAM in a Bayesian framework also eliminates the need for arbitrary empirical

weightings between intensity and shape. This is due to the use of conditional probabilities

(e.g., probability of shape conditional on intensity), which underpin the method and can be

calculated extremely efficiently, without any additional regularisation required. These

conditional probabilities also allow the expected intensity distribution to change with the

proposed shape; see Fig. 3 for an example of why this is important. Furthermore, this

conditional probability formulation is very general and can be used to relate any subparts of

the model (e.g., different shapes). Therefore, the method proposed in this paper can not only


be used to model and segment each structure independently; it can also be used in more flexible ways that incorporate joint shape information.

The following sections of this paper explain the details of the Bayesian Appearance Model

(BAM), including our training set, provide validation experiments, and give an example

application of vertex analysis for finding structural changes between disease and control

cohorts.

Initial model construction

Training data

The training data used in this work consists of 336 pairs of images: the original T1-weighted

MR images of the brain and their corresponding manually-labelled counterparts. This

dataset comprises six distinct groups of data and spans both normal and pathological brains

(including cases of schizophrenia and Alzheimer’s disease). The size, age, and resolution for each group are given in Table 1, and the T1-weighted image and manual labels

of a single subject from the training set are depicted in Fig. 1. From the set of labelled

structures we chose to model 15 structures: brainstem, the left/right amygdala, caudate

nucleus, hippocampus, nucleus accumbens, putamen, pallidum and thalamus. Note that the

brainstem model includes the fourth ventricle as it becomes very thin and excessively

variable in the training data.

Linear subcortical registration

Prior to building the shape and intensity model, all data is registered to a common space

based on the non-linear MNI152 template with 1×1×1 mm resolution (this is a template

created by iteratively registering 152 subjects together with a non-linear registration method;

Fonov et al. (2011)). The same registration procedure must be applied to both training

images and any new images (to be segmented) so that they correctly align the new image

and the model. Note that the shape variance we are modelling is the residual shape variance

that exists after registration of all structures (jointly) to this template. Therefore the model is

specific to the registration procedure, and in this case it removes the joint pose whilst

retaining the relative pose of the individual structures.

A two-stage linear registration is performed in order to achieve a more robust and accurate

pre-alignment of the subcortical structures. The first stage is an affine registration of the

whole-head to the nonlinear MNI152 template using 12 degrees of freedom (DOF). The

second stage, initialized by the result of the first stage, uses a subcortical mask or weighting

image, defined in MNI space, to achieve a more accurate and robust 12 DOF registration to

the MNI152 template. This subcortical mask contains a binary value in each voxel, and is

used to determine whether that voxel is included or excluded from the calculation of the

similarity function (correlation ratio) within this second stage registration. The mask itself

was generated from the (filled) average surface shape of the 15 structures being modelled

(using a 128-subject subset of the training subjects and the first stage affine registration

alignments). The purpose of this mask is to exclude regions outside of the subcortical

structures, allowing the registration to concentrate only on the subcortical alignment.

All linear registrations were performed using FLIRT (Jenkinson et al., 2002). For a new

image, the two stage registration is first performed to get the image in alignment with the

MNI152 template. Following this, the inverse transformation is applied to the model in order

to bring it into the native space of this new image. This is advantageous as it allows the

subsequent segmentation steps to be performed in the native space with the original (non-

interpolated) voxel intensities.
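The inverse-transform step described above amounts to applying the inverted 4×4 affine to the model's vertex coordinates (in homogeneous form). A minimal numpy sketch; the helper name and example matrix are ours for illustration, not FLIRT's API:

```python
import numpy as np

def transform_vertices(verts, affine):
    """Apply a 4x4 affine to an (n, 3) array of vertex coordinates."""
    homog = np.c_[verts, np.ones(len(verts))]   # append homogeneous 1s
    return (homog @ affine.T)[:, :3]

# Illustrative native->MNI affine (scaling plus translation).
T = np.array([[1.1, 0.0, 0.0, 5.0],
              [0.0, 0.9, 0.0, -2.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
mni_verts = np.array([[10.0, 20.0, 30.0], [0.0, 0.0, 0.0]])

# The model mesh, defined in MNI space, is carried into native space
# by the inverse transform.
native_verts = transform_vertices(mni_verts, np.linalg.inv(T))
```

Applying `T` to `native_verts` recovers the original MNI coordinates, which is a convenient round-trip check.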


Fitting the deformable model to the training labels

In our Bayesian appearance model, the individual shapes are modelled by deformable

meshes that consist of sets of vertices connected by edges, and which are each topologically

equivalent to a tessellated sphere. To build the model, a mesh is fit to each shape separately

in each image of the training set, and the variation is modelled by a multivariate Gaussian

distribution of the concatenated vector of vertex coordinates and intensity samples (more

details in the Probabilistic model Section).

The first step in building the model from the training data is to generate the mesh

parameterizations of all the manually labelled T1-weighted images (training label images)

while retaining point correspondence between meshes. Initially, each T1-weighted image in

the dataset (training intensity image) is registered to the non-linear MNI152 (1 mm isotropic

resolution) standard template using the linear registration procedure described previously.

Then, for each structure and for each subject, a 3D mesh is fit to the binary image

representing the manual label for that structure. The 3D mesh is initialized to the most

typical shape (across subjects) and then deformed to fit the individual binary image. The

desired cross-subject vertex correspondence is optimised by within-surface motion

constraints and minimal smoothing within the 3D deformable model (Kass et al., 1988;

Lobregt and Viergever, 1995; Smith, 2002).

The deformation process iteratively updates vertex locations according to a weighted sum of

displacements: one image displacement and two regularisation displacements. The image

displacement, dn, is in the direction of the surface normal, sn, while the regularisation

displacements, dt and dAmax, are both within the surface, along the directions st and sAmax.

These directions are defined as:

(1a) s = ((1/N) Σ_{i=1}^{N} vi) − v0

(1b) sn = (s · n̂) n̂

(1c) st = s − sn

where n̂ is the local surface normal (unit vector) for the vertex v0, N is the number of neighbouring vertices, vi is the ith neighbouring vertex, s is the difference vector between the current vertex position and the average of its neighbours (in a regular grid there would be no difference), sn and st are the normal and tangential components, respectively, of this difference vector, and sAmax is the vector that bisects the largest adjacent triangle (see Fig. 2).
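The direction decomposition described above can be sketched in numpy. This is an illustration, not the FIRST implementation; in particular the sign convention (neighbour average minus current vertex) and the function name are our assumptions:

```python
import numpy as np

def surface_directions(v0, neighbours, n_hat):
    """Decompose the neighbour-difference vector at a mesh vertex.

    v0         : (3,) current vertex position
    neighbours : (N, 3) positions of the N connected vertices
    n_hat      : (3,) local unit surface normal at v0

    Returns (s, s_n, s_t): the difference vector between the neighbour
    average and the vertex, plus its normal and tangential components.
    """
    s = neighbours.mean(axis=0) - v0     # zero on a regular grid
    s_n = np.dot(s, n_hat) * n_hat       # component along the normal
    s_t = s - s_n                        # within-surface component
    return s, s_n, s_t

# Tiny example: a vertex raised above the centroid of four planar
# neighbours; s is then purely normal and the tangential part vanishes.
v0 = np.array([0.0, 0.0, 1.0])
nbrs = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]], float)
n_hat = np.array([0.0, 0.0, 1.0])
s, s_n, s_t = surface_directions(v0, nbrs, n_hat)
```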

The external image displacement is along the inward/outward vertex normal and is given by,

(2) dn = (1 − 5·label) sn

where label is the binary image value, sn is the surface normal as defined by Eq. (1b) and the

coefficients −5 and 1 were chosen based on empirical stability in order to fit the surface

closely to the labelled edge. Note that the choice of form for the displacement here is

specific to the training set and structures modelled, and only affects this initial model-

building phase.


The first regularisation displacement is tangential to the surface and is applied in order to

favour an even point density across the surface — Eq. (3). This regularisation displacement

is identical to that used by BET (Smith, 2002), except that its weighting is initially set to

zero and adaptively updated, in order to try to retain better vertex correspondence across

subjects. If the surface self-intersects during the deformation process, the process is reset

and the regularisation displacement is given more weight. Self-intersection is detected by

testing for triangle–triangle intersection between any pair of triangles within the surface.

This iterative update aims to apply a minimal amount of within-surface motion, since

within-surface motion is likely to reduce the cross-subject correspondence of vertices,

although for large changes in shape some within-surface motion is still necessary. Since the

true correspondence within the surface is unknown we assume that a mapping with minimal

within-surface change gives a better approximation to the true correspondence.

A second regularisation displacement is also applied in order to increase the stability of the

deformation process. This displacement moves the vertices so as to favour equal area

triangular patches and it acts along the line that bisects the largest neighbouring triangle —

Eq. (4). These regularisation displacements are defined as:

(3) dt = wt st

(4) dAmax = wAmax Amax sAmax

where wt and wAmax are the (adaptively updated) regularisation weights, st is the surface tangent vector (Eq. (1c)), sAmax is the bisector of the largest neighbouring triangle (Fig. 2), and Amax is the largest area of the adjacent triangular patches.

Note that the deformable model displacements described here are only required when fitting

surfaces to the manual labels as part of the model building process. Once the models are

built, the fitting of structures in new images is done by optimising the probabilities based on

the intensities and the model (see later). Therefore these displacements are not used in any

way when performing the subsequent segmentations except insofar as they define the model.

Furthermore, the different stages in building the model were all carefully manually inspected

to ensure that the surface lay within one voxel of the manually defined boundary, with the

magnitude of the displacements tuned as necessary to obtain good fits to the individual

datasets. This manual tuning was only occasionally necessary and, in general, the weights

used for the displacements had little effect on the model fit compared to the inter-subject

variation in the training data. Therefore, because the displacements and weights were only

used to create the model, and this is dominated by the inter-subject variation, the influence

of the weights on the final segmentation results was not considered further.

Finally, appearance is modelled using normalised intensities that are sampled, for each

subject, from the corresponding training intensity image along the surface normal at each

vertex of the training mesh (both inside and outside the surface). There are 13 samples taken

per vertex, sampled at an interval of half the standard-space voxel resolution (i.e., 0.5 mm);

this was chosen based on the empirically observed performance of the Bayesian Appearance

Model, although we have not found it to be highly sensitive to these parameters. The

normalisation of the intensities is done in two stages: (1) by applying a global scaling and

offset to the intensities in the whole image such that the 2nd and 98th percentiles are

rescaled to 0 and 255 respectively, and (2) by subtracting the mode of the intensities within a

specified structure. In the second stage it is usually the mode of the intensities in the

structure being modelled that is subtracted; however, there is also the option of using a

nearby structure instead. The effect of the structure used to normalise the intensity is

explored in the Results and discussion Section.
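The two-stage normalisation above can be sketched as follows. This is our illustration, not the FIRST code: the function name, bin count and histogram-based mode estimator are all assumptions:

```python
import numpy as np

def normalise_intensities(image, structure_mask, n_bins=256):
    """Two-stage intensity normalisation (illustrative sketch).

    Stage 1: linearly rescale so the 2nd/98th percentiles map to 0/255.
    Stage 2: subtract the modal intensity within the structure mask.
    """
    p2, p98 = np.percentile(image, [2, 98])
    scaled = (image - p2) * 255.0 / (p98 - p2)

    # Mode estimated from a histogram of the in-structure intensities.
    vals = scaled[structure_mask]
    counts, edges = np.histogram(vals, bins=n_bins)
    m = int(np.argmax(counts))
    mode = 0.5 * (edges[m] + edges[m + 1])
    return scaled - mode

# Synthetic example: a noisy image with a cubic "structure" mask.
rng = np.random.default_rng(0)
img = rng.normal(100.0, 10.0, size=(20, 20, 20))
mask = np.zeros_like(img, dtype=bool)
mask[5:15, 5:15, 5:15] = True
out = normalise_intensities(img, mask)
```

After normalisation the 2nd-to-98th percentile spread is exactly 255 (the offset from stage 2 does not change it), and the modal intensity inside the mask sits at zero.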


Probabilistic model

Our model is trained from a limited set of mesh vertices and intensity samples that

correspond to the limited set of volumetric training data. We treat all training data as column

vectors, either of concatenated vertex coordinates or coordinates and corresponding intensity

samples. For example, for a 2D rectangle parameterized by the corner vertices {(−2, 0), (−2, 3), (3, 3), (3, 0)}, the training vector for shape alone would be xi = [−2 0 −2 3 3 3 3 0]ᵀ. It

is essential that vertex correspondence (consistent ordering) is maintained across all the

training data. To include intensity data, the training vector would have the concatenated

intensity samples, in a consistent order, added after the vertex coordinates. We will find it

helpful later on to partition the vectors into separate shape and intensity parts.
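The rectangle example above can be written out directly as code; the per-vertex intensity values appended below are purely illustrative:

```python
import numpy as np

# Concatenate the (x, y) corner coordinates, in a fixed order, into a
# single training vector (the "shape" part).
corners = [(-2, 0), (-2, 3), (3, 3), (3, 0)]
x_i = np.array(corners, dtype=float).reshape(-1)

# Appending per-vertex intensity samples (hypothetical values here) in
# the same fixed order gives a combined shape-and-intensity vector,
# which can later be partitioned back into its shape and intensity parts.
intensity_samples = np.array([10.0, 12.0, 11.0, 9.0])
x_full = np.concatenate([x_i, intensity_samples])
```

Maintaining the same vertex (and sample) ordering for every training image is what gives the cross-subject correspondence the model relies on.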

Initially, we present the model in its most general form, without restricting the training

vectors to be of any particular form. This is useful since it allows us to use the model later

on for shape, shape–intensity or multiple shape (and intensity) modelling. To begin with we

assume that we have a limited set of training data, denoted by X = {x1, …, xns}, where each

observation comes from an underlying multivariate Gaussian distribution:

(5) xi ~ Nk(μ, λ⁻¹), i = 1, …, ns

where ns is the number of training samples, k is the dimensionality of a vector xi, μ is the mean and λ is a k×k positive-definite precision matrix. The precision matrix is equal to the inverse of the covariance matrix, Σ. Nk is the k-dimensional multivariate Gaussian distribution.

For the purposes of shape and intensity modelling we want to know what the distribution is

for a new observation, given what information is available in the training data. To do this we

start by using the multivariate Gaussian model and derive the distribution of the mean and

precision given the previously observed training data. We then perform a marginalisation

over the mean and precision to obtain a distribution for the new observation given the

training data. That is:

(6) p(xobs | X, θ) = ∫∫ p(xobs | μ, λ) p(μ, λ | X, θ) dμ dλ

where xobs is a new observation (image) sampled from the same distribution as that of the datasets in X, and θ represents a set of hyperparameters associated with the prior distribution for μ and λ.

The final form for this distribution can be arrived at by straightforward probabilistic

calculations and the selection of a non-informative prior (Appendix A shows the detailed

calculations). The resulting form is:

(7) p(xobs | X) = Stk(xobs | x̄, λ, ν), with λ = (ns − 1)(S + 2ε²I)⁻¹

where Stk is a multivariate Student distribution in k dimensions with ν degrees of freedom (defined in Appendix A, Eq. (A.10)), x̄ is the sample mean of the training vectors, S = ZZᵀ is the un-normalised sample covariance matrix of the training vectors, with Z being the matrix formed from all the demeaned training (column) vectors, and ε is a scalar hyperparameter relating to prior (unseen) variance.
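A small numerical illustration of why the 2ε²I term matters: with far fewer training vectors than dimensions, S is rank-deficient and has no inverse, whereas S + 2ε²I always does. The choice of ε² below (a fraction of the total observed variance) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
k, ns = 50, 10                      # dimensionality >> number of samples
X = rng.normal(size=(k, ns))        # columns are training vectors
Z = X - X.mean(axis=1, keepdims=True)   # demeaned training matrix
S = Z @ Z.T                         # un-normalised sample covariance

# eps^2 chosen here as a small fraction of the mean per-dimension variance.
eps2 = 0.01 * np.trace(S) / k
S_reg = S + 2.0 * eps2 * np.eye(k)

rank_S = np.linalg.matrix_rank(S)        # at most ns - 1: singular
rank_reg = np.linalg.matrix_rank(S_reg)  # full rank k: invertible
precision = (ns - 1) * np.linalg.inv(S_reg)  # usable in the Student form
```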


Specification of the prior variance

The prior distributions specify information, or beliefs, about the shape and intensities before

the training data are seen. As we want the training data to specify as much information as

possible, and we do not wish to introduce any bias, we used non-informative priors. See

Appendix A for full details.

The general non-informative prior leaves a single hyperparameter, ε², to be specified. This

can be interpreted as the variance associated with the variation in shape (and/or intensity)

that is not captured by the limited training set. One way of specifying this value is to use

empirical Bayes and estimate it from the limits of the variances contained in the training

data (e.g., a fraction of the lowest eigenvalues or of the total observed variance). Note that

the addition of the scaled identity matrix is analogous to ridge regression (Hastie et al.,

2001), where a scaled identity matrix is added to the covariance matrix estimate, and has

been used in the context of shape models previously (Cremers et al., 2002). This broadens

the distribution, reflecting unseen variations in the population, and here it arises naturally as

a consequence of needing to specify an unbiased prior distribution of shape (and/or

intensity).

Although an exact prescription for setting the ε parameter is not provided by this

framework, sensible limits can easily be established (e.g., in terms of the fraction of total

variance). Furthermore, its effect is generally small and reduces as the size of the training set

increases. In fact, it is not even required if there are more training datasets than parameters.

This is in contrast to the weightings between forces or shape–intensity relations, as used by

other approaches (Cootes and Taylor, 2001; Heimann and Meinzer, 2009; Lobregt and

Viergever, 1995; Montagnat et al., 2001; Pitiot et al., 2004; Shen et al., 2011), as they

typically have a significant effect on the results, independent of the size of the training set,

without necessarily obvious ways to interpret or set their values.

Conditional distributions

Within this framework we can also utilise shape–intensity or shape–shape relationships

which are useful, for example, in finding the shape boundary given the observed image

intensities (which is the typical segmentation problem) or to hierarchically determine the

boundary of one shape given the boundary of another. It is the conditional probability

distributions across partitions of the full multivariate Gaussian model which give

information about these relationships.

A partition of x is a subset, xj, corresponding to a particular attribute j (e.g. shape, intensity,

etc.). In the case of our training data, each partition will still have the same number of

samples ns, and in our application we partition the data into either shape and intensity or into

different shapes. The conditional distribution for shape given intensity is essential for

segmenting new images.

The partitioning is defined such that

(8a) xᵀ = (x1ᵀ, x2ᵀ)

(8b) μᵀ = (μ1ᵀ, μ2ᵀ)


(8c) S = [ S11 S12 ; S21 S22 ]

(8d) λ = [ λ11 λ12 ; λ21 λ22 ]

where kj is the dimensionality of the jth partition, so that the training set X can be partitioned in the same manner, with X = (X1, X2), where Xj = {x1j, …, xnsj}, with the first subscript denoting the image number (within the training dataset) and the second denoting the partition. For example, the first partition, X1, could be the concatenated vertex coordinates (shape) across all the training images and the second, X2, could be the corresponding intensity samples across all the training images. Alternatively, for shape–shape relationships (e.g., the caudate given the known location and shape of the thalamus), the first partition could consist of the concatenated vertex coordinates of the structure of interest (caudate) and the second partition could consist of the concatenated vertex coordinates of the predictive structure (thalamus).

The conditional distribution can be written as

(9a) p(x1 | x2, X) = Stk1(x1 | μ1|2, λ1|2, ν1|2)

using standard algebraic manipulations of the multivariate Student distribution (see Bernardo and Smith, 2000), and with

(9b) μ1|2 = μ1 − λ11⁻¹ λ12 (x2 − μ2)

(9c) ν1|2 = ν + k2

(9d) λ1|2 = ((ν + k2) / (ν + (x2 − μ2)ᵀ (λ22 − λ21 λ11⁻¹ λ12)(x2 − μ2))) λ11

where ν is the degrees of freedom of the joint Student distribution in Eq. (7).

For a partitioned covariance matrix we define the prior covariance, β, to be a piecewise-scaled identity matrix

β = [ ε1²I 0 ; 0 ε2²I ]

where εj² is the error variance corresponding to the jth partition. Thus for each partition a different error variance, εj², may be used, giving λ = (ns − 1)(S + 2β)⁻¹ as in Eq. (7).
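The core algebra behind the conditional mean (the Gaussian analogue of Eq. (9b), which carries over to the Student case) can be checked numerically. All dimensions and values below are illustrative:

```python
import numpy as np

# Conditional mean of partition 1 given partition 2, via the
# partitioned precision matrix: mu1 - lam11^{-1} lam12 (x2 - mu2).
k1, k2 = 2, 3
rng = np.random.default_rng(2)
A = rng.normal(size=(k1 + k2, k1 + k2))
lam = A @ A.T + (k1 + k2) * np.eye(k1 + k2)   # positive-definite precision
mu = rng.normal(size=k1 + k2)

lam11 = lam[:k1, :k1]
lam12 = lam[:k1, k1:]
x2 = rng.normal(size=k2)                      # observed second partition

mu_cond = mu[:k1] - np.linalg.solve(lam11, lam12 @ (x2 - mu[k1:]))

# Cross-check against the covariance form:
# mu1 + Sig12 Sig22^{-1} (x2 - mu2) gives the same answer.
sig = np.linalg.inv(lam)
mu_cond_cov = mu[:k1] + sig[:k1, k1:] @ np.linalg.solve(
    sig[k1:, k1:], x2 - mu[k1:])
```

This is the relationship used when segmenting: the "shape given intensity" mean is a linear correction of the mean shape driven by how the observed intensities deviate from their mean.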

Parameterization of the model

In our method it is extremely useful to parameterize the space of shapes (and/or intensities)

using the mean and eigenvectors of the training set, which is naturally related to the


multivariate Gaussian model. This has significant implications for simplifying the

expressions and massively reducing computational cost.

The eigenvectors are related to the un-normalised sample covariance matrix (see Eq. (7)) by

S = ZZᵀ = UDDᵀUᵀ, using the SVD of Z = UDVᵀ. We therefore parameterize our data in the

same way as for the standard shape and appearance models, which is

(10) x = μ + α U Dε b

where μ = x̄ is the mean vector across the training set; the matrix Dε is a diagonal matrix such that Dε² consists of the eigenvalues (correspondingly ordered) of S + 2ε²I; the scalar α = 1/√(ns − 1); and b is the model parameter vector that weights the linear combination of eigenvectors, U. The elements of b are scaled so that they indicate the number of standard deviations along each mode, such that

(11) (x − μ)ᵀ λ (x − μ) = bᵀ b

which is used to simplify many equations. Note that this parameterization is most usually

applied to the shape partition alone but can also be applied to shape and intensity, or just

intensity.
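A minimal numerical sketch of this parameterization, assuming a training matrix whose columns are the concatenated coordinate vectors; the function names and the `eps` argument are ours, purely illustrative:

```python
import numpy as np

def fit_shape_model(X, eps=1e-3):
    """Eigen-parameterization of training data X (k x ns): x = mu + alpha*U*D_eps*b."""
    k, ns = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Z = X - mu                                        # centred data
    U, d, _ = np.linalg.svd(Z, full_matrices=False)   # Z = U D V^T
    d_eps = np.sqrt(d ** 2 + 2.0 * eps ** 2)          # sqrt of eigenvalues of S + 2*eps^2*I
    alpha = 1.0 / np.sqrt(ns - 1)
    return mu, U, d_eps, alpha

def shape_from_params(mu, U, d_eps, alpha, b):
    """Reconstruct a shape instance from mode weights b (in units of std. dev.)."""
    return mu[:, 0] + alpha * U @ (d_eps * b)
```

Setting b = 0 returns the mean shape, and setting one element of b to 1 moves one standard deviation along the corresponding mode.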

Bayesian appearance models

To formulate the Bayesian Appearance Model we take the general mathematical framework

developed in the previous sections and apply it to intensity and shape partitions: xI and xs.

The shape partition is modelled using Eq. (10), so that given the intensities from a new image, the vector bs (new shape instance) can be estimated from the posterior conditional distribution p(xs | xI, …).

In the following sections we describe how the model is fit to new data. As discussed earlier,

when fitting to the new data, the model is first registered into the native space using the

inverse transform. This transform is applied only to the average shape and eigenvectors. One

slightly unusual consequence of the joint registration of all shapes is that not all pose

(orientation) is removed from every structure. That is, pose differences are contained in the

eigenvectors although, in practice, the residual pose differences are small because the initial

linear registration removes the overall joint pose of all the structures. This residual pose

represents changes in the relative pose between different structures and may be of interest.

However, there is no restriction within the framework that prevents the individual pose from

being completely removed, if desired.

Posterior as a similarity function

The conditional posterior probability, p(xs | xI, …), measures the probability, and hence the goodness of fit, of a proposed segmentation. We want to maximise this probability with respect to the shape model parameters bs. Simplifying this posterior, and expressing it in terms of its logarithm, we obtain the following


(12)

where C is a constant; kI and ks are the respective dimensionalities of the intensity and shape partitions (kI = 13nvert and ks = 3nvert for our implementation, where nvert is the number of vertices and 13 is chosen to sample an intensity profile at each vertex that extends for ±3 mm in steps of 0.5 mm, as discussed at the end of the Fitting the deformable model to the training labels Section); and xI is the vector containing the observed intensities. The quantities λcII (the un-scaled conditional precision matrix) and μI|s (the conditional mean given shape) are defined in Appendix B, where it is also shown how the different terms in this posterior can be calculated efficiently.

Optimization

To maximise the similarity function (the posterior probability) a conjugate gradient descent

search scheme is employed (Press et al., 1995). Since we parameterize the shape space by the mode parameters, bs, we are searching along the eigenvectors, or modes. As the eigenvectors are ordered by descending variance, the later modes contribute less information to the model (or even just noise) compared with the earlier modes. Since including these later modes increases the dimensionality of the search space, making the optimization more difficult as well as increasing computational time, it can be beneficial to truncate the number of modes considered. We have found empirically that there is an optimal number of modes to include in order to get the best performance (see the Results and discussion Section).
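The paper uses the conjugate gradient routine of Press et al.; the following self-contained toy version (Fletcher–Reeves with a backtracking line search and numerical gradients, our own simplification) illustrates the idea of searching over a truncated set of mode parameters bs:

```python
import numpy as np

def _num_grad(f, x, h=1e-6):
    """Central-difference gradient, standing in for analytic derivatives."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

def fit_shape_params(neg_log_posterior, n_modes, n_iter=200, tol=1e-8):
    """Minimise a (hypothetical) negative log-posterior over the first n_modes
    elements of bs, starting from the mean shape (bs = 0)."""
    b = np.zeros(n_modes)
    g = _num_grad(neg_log_posterior, b)
    d = -g
    for _ in range(n_iter):
        t, fb = 1.0, neg_log_posterior(b)
        # backtracking (Armijo) line search along the conjugate direction d
        while neg_log_posterior(b + t * d) > fb + 1e-4 * t * (g @ d) and t > 1e-12:
            t *= 0.5
        b = b + t * d
        g_new = _num_grad(neg_log_posterior, b)
        if np.linalg.norm(g_new) < tol:
            break
        d = -g_new + ((g_new @ g_new) / (g @ g)) * d   # Fletcher-Reeves update
        g = g_new
    return b
```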

Computational simplifications

To make this model computationally feasible in practice it is necessary to eliminate all

operations on k×k matrices. This is because the calculation of the inverse of the full k×k

covariance matrix is extremely computationally expensive (as typically k > 10,000), whereas

by exploiting the low rank (ns =336) of the sample covariance matrix, the computations can

be considerably simplified. This makes the calculations feasible in terms of time and

memory requirements. In this way we combine the benefits of having well-defined (full

rank) inverses, due to the additional scaled-identity matrix, and the computational efficiency

of working with the low-rank sample covariance.

Evaluations of the relevant terms for the posterior probability, including conditional probabilities, can be simplified so that only ns × ns, ns × k, ns × 1 and k × 1 matrices need to be used. Appendix B details the main computational simplifications.
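As an example of this kind of low-rank trick, products with the inverse of S + 2ε²I can be evaluated from the ns-column SVD factor alone, never forming the k × k matrix. This is a sketch under our own naming, of the same flavour as the identities detailed in Appendix B:

```python
import numpy as np

def lowrank_solve(Z, eps, v):
    """Compute (S + 2*eps^2*I)^{-1} v with S = Z Z^T (k x k, rank <= ns),
    using only the k x ns factor Z.

    Uses (S + 2e^2 I)^{-1} = I/(2e^2) + U diag(1/(d^2+2e^2) - 1/(2e^2)) U^T,
    where Z = U D V^T.
    """
    U, d, _ = np.linalg.svd(Z, full_matrices=False)   # U: k x ns
    w = 1.0 / (d ** 2 + 2.0 * eps ** 2) - 1.0 / (2.0 * eps ** 2)
    return v / (2.0 * eps ** 2) + U @ (w * (U.T @ v))
```

The cost is dominated by the k × ns SVD and matrix-vector products, rather than a k × k inversion.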

Boundary correction

In the preceding sections we model shape with a mesh model, parameterized by the

coordinates of its vertices. However, it is often necessary, or desirable, to create volumetric

(voxelated) masks that represent the structure segmentation. Therefore, there is a need to

convert between the mesh-based and volumetric representations of a segmented structure.

We create a volumetric output from the mesh by the following steps: (i) identifying the

voxels through which the mesh passes (i.e., partially filled voxels); (ii) marking these voxels

in a volumetric image as the boundary voxels; (iii) filling the interior of this boundary. Once

this is done we need a way to classify whether each boundary voxel should remain part of

the segmentation or not (if a binary segmentation is required). For this step we have

explored several classification methods and found that one based on the results of a 3-class


classification of the intensities (grey matter, white matter and CSF) using the FSL/FAST

method (Zhang et al., 2001) performs the best overall. A rectangular ROI that encompasses

the structure of interest (extended by two voxels) is used as input to the FAST method,

which models the intensity distribution as a Gaussian mixture-model in addition to a Markov

Random Field. We use this method of boundary correction for most of the rest of this paper.

It is also worth noting that other boundary correction methods could easily be used with our

implementation, as this is not core to the Bayesian Appearance Model in any way.
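Steps (i)–(iii) above can be sketched as follows. This is a crude illustration under our own simplifications: triangle coverage is approximated by dense barycentric sampling rather than exact rasterization, the interior is found by flood-filling the exterior from a corner voxel (assumed to lie outside the mesh), and FIRST's actual voxelisation is more careful:

```python
import numpy as np
from collections import deque

def mesh_to_mask(vertices, faces, shape):
    """(i) mark voxels the mesh passes through, (ii) record them as boundary
    voxels, (iii) fill the interior of the resulting shell."""
    mask = np.zeros(shape, dtype=np.uint8)
    for tri in faces:
        a, b, c = (vertices[i] for i in tri)
        # (i)/(ii): dense barycentric sampling of each triangle
        for s in np.linspace(0.0, 1.0, 12):
            for t in np.linspace(0.0, 1.0 - s, 12):
                p = a + s * (b - a) + t * (c - a)
                i, j, k = np.floor(p).astype(int)
                mask[i, j, k] = 1                      # boundary voxel
    # (iii): flood-fill the exterior; whatever remains unreached is interior
    outside = np.zeros(shape, dtype=bool)
    q = deque([(0, 0, 0)])
    while q:
        v = q.popleft()
        if any(c < 0 or c >= n for c, n in zip(v, shape)):
            continue
        if outside[v] or mask[v]:
            continue
        outside[v] = True
        i, j, k = v
        q.extend([(i + 1, j, k), (i - 1, j, k), (i, j + 1, k),
                  (i, j - 1, k), (i, j, k + 1), (i, j, k - 1)])
    mask[~outside] = 1                                 # interior + boundary
    return mask
```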

There are three exceptions to the use of the FAST-based boundary correction method. These

are for the putamen, pallidum and thalamus where a slightly different surface model was fit.

In these cases the surfaces were fit to a slightly eroded version of manual labels (at the

surface parameterization phase) such that the eroded edge was inset by 0.5 mm from the

outer boundary of the manual labels. For these structures applying no thresholding and

simply including all the boundary voxels in the final volumetric output gives good results.

This is referred to as a boundary correction of “none” in Table 3 and is compared, for these

structures, with the FAST-based boundary correction. Note that this different surface model

was also tried for the other structures but was found to be significantly worse, and

sometimes problematic, compared to the standard surface fitting.

Vertex analysis

The output from a subcortical segmentation method can be used in many ways; one

application is to look for differences in these structures between different groups (e.g.,

disease versus healthy controls). Many such group difference studies have been carried out

based on volumetric measures of the structures of interest (e.g., caudate, hippocampus, etc.).

However, volumetric studies do not show where the changes are occurring in the structure,

and this may be of critical importance when investigating relationships with neighbouring

structures, connectivity with more distant structures, or changes in the different subcortical

nuclei within the structure.

Local changes can be investigated directly by analysing vertex locations, comparing the mean vertex position between two (or more) groups. This type of analysis

does not require boundary correction to be performed, as it works directly with the vertex

coordinates (in continuous space) of the underlying meshes. The Vertex Analysis is

performed by carrying out a multivariate test on the three-dimensional coordinates of

corresponding vertices. Each vertex is analysed independently, with appropriate multiple-

comparison correction methods (e.g., FDR or surface-based cluster corrections) performed

later, as is the case for standard volumetric image analysis.

The coordinates for each vertex can either be analysed directly in standard space (where the

models are generated) or can be mapped back into any other space (e.g., group mean).

Moreover, changes in pose (global rotation and translation) between different subjects are

often of no interest, and these can be removed by a rigid alignment of the individual meshes,

if desired. For this purpose we have implemented an alignment based on minimising the sum of squared distances between corresponding mesh coordinates.
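This least-squares rigid alignment can be sketched with the standard Kabsch/Procrustes solution; the function below is our own illustration (the FIRST implementation may differ in detail):

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares rigid (rotation + translation) alignment of mesh
    vertices P onto Q (both n x 3), assuming vertex correspondence."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)              # cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T)) # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return (P @ R.T) + t
```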

Once the coordinates are appropriately transformed, a multivariate F-test is performed for

each vertex separately using the MultiVariate General Linear Model (MVGLM) with Pillai’s

Trace as the test statistic. The details can be found in Appendix C. By implementing a

general linear model it is also possible to investigate more complex changes, not just the

mean difference between two groups. For instance, it is possible to jointly look at the effect

of age and disease with a MANCOVA design, which is easily implemented within the

MVGLM. Whatever the design, the F-statistics that are formed are sensitive to changes of

the coordinates in any direction. To differentiate between the different directions (e.g.,


atrophy versus expansion) it is necessary to use the orientations of the vectors, particularly

the difference of the mean locations, to disambiguate the direction of the change.
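For the simplest case of a two-group design, the per-vertex test can be sketched by forming the between- and within-group sums-of-squares-and-cross-products matrices and taking Pillai's trace. This is our simplified stand-in for the general design-matrix MVGLM of Appendix C:

```python
import numpy as np

def pillai_two_group(A, B):
    """Pillai's trace for a two-group multivariate test on the 3-D
    coordinates of one vertex: A (n1 x 3) and B (n2 x 3)."""
    X = np.vstack([A, B])
    m = X.mean(axis=0)
    H = np.zeros((3, 3))                        # hypothesis (between-group) SSCP
    E = np.zeros((3, 3))                        # error (within-group) SSCP
    for G in (A, B):
        d = (G.mean(axis=0) - m)[:, None]
        H += len(G) * (d @ d.T)
        C = G - G.mean(axis=0)
        E += C.T @ C
    return np.trace(H @ np.linalg.inv(H + E))   # Pillai's trace
```

Identical groups give a trace of zero; a large mean displacement drives the statistic towards its upper bound of one for this one-degree-of-freedom hypothesis.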

Certain localised changes in brain structures can also be detected using other methods, such

as voxel-based morphometry (VBM). However, with VBM the inference is based on

locally-averaged grey-matter segmentations and is therefore sensitive to the inaccuracies of

tissue-type classification and arbitrary smoothing extents. The vertex analysis method,

however, utilises the joint shape and appearance model to robustly determine the structural

boundary, and so is purely local, being based directly on the geometry/location of the

structure boundary. Therefore it has the potential to localise changes more precisely, as there

is no additional smoothing and it directly measures changes in geometry.

Results and discussion

In this section we test the Bayesian Appearance Model both qualitatively and quantitatively

on real MRI data. This includes a set of leave-one-out (LOO) cross-validation tests comparing volumetric overlap with the manually-labelled images from the training set. In

addition, vertex analysis is tested by comparing cohorts of patients with matched healthy

controls.

All the results shown here use a C++ implementation of the Bayesian Appearance Model

that is distributed as part of FSL (Woolrich et al., 2009), where it is called FIRST (FMRIB’s

Integrated Registration and Segmentation Tool). The run time for FIRST on a modern

desktop, with default parameters, is approximately 5 min (for linear registration) plus 1 min

per structure being segmented.

Before showing the testing and validation results we will illustrate the changes captured by

the Bayesian Appearance Model. Fig. 3 shows the change in shape and conditional intensity

distribution as the shape parameters vary along one of the individual modes of variation. It

shows a graphical depiction of ±3 standard deviations along the first mode of variation for

the left thalamus and the conditional intensity mean associated with it; the shape is shown

overlaid on the MNI152 template in each case. The first mode is predominantly associated with translation, which typically correlates with ventricle size: for enlarged ventricles (lower thalamus position) there is an enlarged dark band above the thalamus in the conditional mean intensity, corresponding to the extra surrounding CSF. In contrast, when the thalamus is in a higher position,

correlated with smaller ventricles, the conditional mean intensity is brighter, consistent with

the higher prevalence of white matter nearer to the thalamic border. The intensities in this

latter case are close to the white-matter mean values as the partial volume fraction of the

CSF for the datasets with the smallest ventricles is very low across much of the superior

thalamus border. Furthermore, the conditional distribution captures the changing variance of

the intensities for different shapes.

Validation and accuracy

Leave-one-out tests—To quantitatively evaluate the accuracy of the algorithm, a Leave-

One-Out (LOO) cross-validation method was used across the 336 training sets. For all

evaluations the manual segmentations were regarded as the gold standard, and the

segmentation performance was measured using the standard Dice overlap metric:

Dice = 2TP / (2TP + FP + FN) (13)


where TP is the true positive (overlap) volume, FP is the false positive volume and FN is the

false negative volume.
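Eq. (13) in code form, assuming binary NumPy masks:

```python
import numpy as np

def dice(seg, gold):
    """Dice overlap between two binary volumetric masks (Eq. (13))."""
    seg, gold = seg.astype(bool), gold.astype(bool)
    tp = np.sum(seg & gold)     # true positive (overlap) volume
    fp = np.sum(seg & ~gold)    # false positive volume
    fn = np.sum(~seg & gold)    # false negative volume
    return 2.0 * tp / (2.0 * tp + fp + fn)
```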

The LOO tests were run independently for all 15 subcortical structures. For each structure,

the LOO test selects each image, in turn, and segments it using a model built from the

remaining 335 images. All tests used the following empirically-tuned parameters: the prior constants εI and εs were both set to 0.0001% of the total variance in the training set; the

number of modes included in the fitting were chosen based on the mean Dice overlaps

across modes of variation (as seen in Fig. 6) and correspond to the number of modes

indicated in Table 3; and the FAST-based boundary correction was applied to all structures

except the thalamus and pallidum where all boundary voxels were included. Fig. 4 shows an

example of the segmentation of one image (for all of the 15 structures) from the LOO test,

while Fig. 5 shows the results of the quantitative Dice overlap measure, and Table 3 shows

the mean Dice values broken down by group.

The Dice results obtained are generally comparable to or better than those reported by other

methods (Babalola et al., 2008; Carmichael et al., 2005; Fischl et al., 2002; van Ginneken et

al., 2007; Morey et al., 2009; Pohl et al., 2007). For example, in Morey et al. (2009) similar

results overall were found for ASEG (Fischl et al., 2002) and our Bayesian Appearance

Model for the hippocampus and the amygdala, although in that paper a simpler method of

boundary correction was used, rather than the FAST-based approach which we have more

recently found to be superior. In Pohl et al. (2007) a hierarchical model is used to segment

the amygdala and hippocampus, in a mixed pathology dataset, with superior performance for

the amygdala but nearly identical performance for the hippocampus. Results of evaluating

caudate segmentations across many different methods can be found as part of the ongoing

“Caudate Segmentation Evaluation” (http://www.cause07.org; van Ginneken et al., 2007), which reports values below 80%, while our Dice results are often above 80%. However, their results are averages over five different metrics and the data involve paediatric images (younger than any in our training set), so the two sets of results cannot be directly compared. In general, accurately measuring the relative performance of different methods

over a rich set of subcortical structures will require conducting a specific and thorough study

using the same data for all methods, which is beyond the scope of this paper.

The results of the Bayesian Appearance Model show that the best structures in terms of Dice

overlap are the putamen and the thalamus, while in contrast, the worst structures are the

nucleus accumbens and the amygdala, being between 0.1 and 0.2 lower in their median Dice

overlap. The spread of results also differs greatly between structures: the accumbens, amygdala, hippocampus and pallidum show a wider spread, with some relatively poor performances, whereas the putamen and thalamus show quite a tight range of high Dice overlaps.

For small structures, such as the nucleus accumbens, or structures with large surface-area-to-

volume ratios, such as the caudate, the Dice overlap measure heavily penalises small

differences in surface error, since an average error of one voxel at the boundary will

substantially affect the volume overlap. By comparison, structures such as the thalamus,

which are larger and have a lower surface-area-to-volume ratio, are less affected by small

differences in boundary position. Therefore it is expected that, for example, the nucleus

accumbens would perform worse than the thalamus due to its small size. However, these

tendencies do not explain all of the features of the results, and the variability of the

structure’s shape and intensity characteristics also play an important role.

There are many factors that might affect the ability of the method to perform well, such as:

the ability of the multivariate Gaussian model to correctly represent the true shape/intensity


probabilities; the ability of the optimization method to converge on the globally optimal

solution; mismatches between the real underlying ground truth and the manual labellings;

errors at the boundary due to the necessity to produce binary segmentations for comparison

with the binary manual labels; the number of modes used in the fitting; and so on. Of these

factors, there are some that we have direct control over (e.g. the number of modes used) and

these can play an important part in the fitting process. The effects that these have are

investigated next.

Effect of the number of modes of variation—To start with we investigate the

dependence of the results on the number of modes included in the fitting process. This was

done by repeating the entire LOO test using fitting that was restricted to 20, 30, 40, 50, 60,

70, 80, 90 and 100 modes. Fig. 6 shows these results as a Dice overlap measure for each

structure and for the different number of modes (eigenvectors) that were included.

The number of modes of variation retained in the fitting has an impact on the amount of

detail that may be captured using the model, but the results did not vary strongly with the

number of modes included. The main structures where it had a noticeable effect were the

amygdala and brainstem. In both cases, including more modes increased the Dice overlap,

but each appeared to reach a plateau with around 80 to 100 modes. Therefore, the results are

not highly sensitive to the choice of the number of modes used for each structure.

Consequently, we believe that the values specified in Table 2 can be used quite generally

and provide a good compromise between including enough variation to capture the

structural detail, and avoiding too many modes which can make the optimization difficult

and significantly increase computational cost.

Results on sub-groups—The second investigation of the sensitivity of the method was

made by looking at the performance for the separate sub-groups in the training set, since

they have different image resolution, signal-to-noise ratio, and contrast-to-noise. The results

are shown in Table 3 (demographics and image resolution for each group are shown in Table

1). The overlap is shown using 80 modes of variation since, for all structures, the Dice

overlap (in Fig. 6) reached a plateau by this number. The prior constants and boundary

correction were the same as specified previously.

It can be seen in Table 3 that there is little difference in the Dice results between the groups,

despite the range of resolutions and general image quality. In particular, group 4, which has

a slice thickness of 3 mm, does not perform noticeably worse than other groups, where the

slice thickness is between 1.0 and 1.5 mm. There is also little effect evident when changing

the structure used to provide the intensity normalisation (the intensity reference structure). In

most cases the structure acts as its own reference, by subtracting the mode of the intensity

distribution. However, for small structures, using a larger structure, such as the thalamus, as a local reference can reduce the number of poorly performing outlier segmentations, even

though the mean performance is largely unaffected. In contrast, comparing the FAST-based

boundary correction method with no correction (including all the boundary voxels) for the

pallidum, putamen and thalamus, showed a small systematic improvement in favour of no

correction. This was not the case for the other structures (data not shown).

Sensitivity to priors—The third investigation of sensitivity of the method involved

varying the parameters εs and εI. These were both chosen to be fractions of the total

estimated shape and intensity variances respectively. They were chosen empirically, based on examination of the model's performance over a range of parameter values, all of which

were small fractions of the total variance, so that the model depends almost entirely on the

training data. This empirical approach to setting the parameter values is common to many

medical imaging problems (Friston et al., 2002). Although the choice of ε is not a proven,
