
Tina Memo No. 2001-014

Presented at a joint meeting of the Royal Statistical Society and the BMVA, 2001.

IVC Special Edition: The use of Probabilistic Models in Computer Vision, 21, 851-864, 2003.

Additional Appendix added to cover the problems of data fusion from Bayesian modules.

Bayesian and Non-Bayesian Probabilistic

Models for Image Analysis

P.A. Bromiley, N.A. Thacker, M.L.J. Scott,

M. Pokrić, A.J. Lacey and T.F. Cootes

Last updated: 25/5/2007

Imaging Science and Biomedical Engineering,

School of Cancer and Imaging Sciences,

University of Manchester, Stopford Building,

Oxford Road, Manchester M13 9PT, U.K.


Abstract

Bayesian approaches to data analysis are popular in machine vision, and yet the main advantage of

Bayes theory, the ability to incorporate prior knowledge in the form of the prior probabilities, may

lead to problems in quantitative tasks. In this paper we compare Bayesian and non-Bayesian techniques using selected examples from magnetic resonance image (MRI) analysis. Issues raised by these examples are used to illustrate common difficulties in Bayesian methods

and to motivate an approach based on frequentist methods. We believe this approach to be more suited

to quantitative data analysis, and provide a general theory for the use of these methods in learning

(Bayes risk) systems and for data fusion. Proofs are given for the more novel aspects of the theory.

We conclude with a discussion of the strengths and weaknesses, and the fundamental suitability, of

Bayesian and non-Bayesian approaches for MRI analysis in particular, and for machine vision systems

in general. Overall we advise caution regarding the common assertion that the best approaches to all

machine vision problems are necessarily Bayesian.

1 Introduction

This paper discusses the use of Bayes theorem in decision systems which make use of image data. We concern

ourselves only with the use of the equation

P(H_i|\mathrm{data}) = \frac{P(\mathrm{data}|H_i)\,P(H_i)}{\sum_i P(\mathrm{data}|H_i)\,P(H_i)} \qquad (1)

for the interpretation of mutually exclusive hypotheses H_i, as the basis for algorithmic design. We compare such

approaches with alternatives based on frequentist statistics. In the broadest possible terms, these two schools

of statistical inference vary in their definition of a probability. Frequentist statistics demands that a probability

should represent a genuine reflection of the frequency of occurrence of some event, whereas Bayesian statistics

defines a probability as a degree of belief, allowing the incorporation of prior knowledge in the form of the prior

probabilities.
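As a toy numerical illustration of equation (1), the posterior for a set of mutually exclusive hypotheses can be computed from the likelihoods and priors; the values below are assumed for illustration and are not taken from the paper:

```python
# Toy sketch of equation (1): posterior probabilities for mutually
# exclusive hypotheses H_i. Likelihoods and priors are assumed values.
def posteriors(likelihoods, priors):
    """Return P(H_i|data) given P(data|H_i) and P(H_i)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    norm = sum(joint)  # denominator: sum_i P(data|H_i) P(H_i)
    return [j / norm for j in joint]

# H_0 explains the data better, but the prior favours H_1,
# so the prior "biases" the interpretation towards H_1.
post = posteriors(likelihoods=[0.9, 0.3], priors=[0.2, 0.8])
```

Note that with these values the prior outweighs the likelihood: H_1 receives the larger posterior despite fitting the data less well, which is exactly the biasing behaviour discussed in the text.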

Bayes theorem is a cornerstone of modern probabilistic data analysis. It is used as a way of constructing prob-

abilistic decision systems so that prior knowledge can be incorporated into the data analysis in order to “bias”

the interpretation of the data in the direction of expectation. The prior probabilities therefore have the great-

est influence when the data under analysis is unable to adequately support any model hypothesis. Under these circumstances a purely data-driven solution may not be directly applicable. The use of Bayes

theorem can appear to provide spectacular improvements in the interpretation of data. However, despite their

popularity and widespread acceptance, there are often significant practical problems in the application of Bayesian

techniques.

Firstly, Bayesian approaches use information regarding the distribution of a whole group of data to influence

the interpretation of a single data set. Few would countenance modifying the estimate of a random variable by

averaging with some weighted combination of the group mean, as it is well known that such a procedure would

introduce bias, yet analogous procedures are accepted as reasonable in Bayesian approaches to data analysis.

Another aspect of this problem is the suppression of infrequently occurring data or “novelty”. Clearly, skepticism

concerning the use of Bayesian approaches in areas such as medical data analysis, where pathological cases are

often unique, would have some justification. We explain the effects of bias and novelty in Bayesian estimation

in Section 2, using multi-dimensional MR volumetric measurement as an example.
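The bias described above can be seen in a one-line toy calculation (all values assumed): pooling an individual estimate with a group mean systematically pulls atypical, "novel" cases towards the group.

```python
# Toy illustration of the bias introduced by pooling an individual
# estimate with a group mean (values assumed, not from the paper).
def shrink(x, group_mean, w):
    """Weighted combination of an individual estimate and the group mean."""
    return (1.0 - w) * x + w * group_mean

true_value = 10.0   # an atypical ("novel") individual
group_mean = 5.0    # the population it is pooled with
estimate = shrink(true_value, group_mean, w=0.3)
bias = estimate - true_value   # systematic error introduced by pooling
```

The estimate is pulled towards the group mean by an amount proportional to the weight, so the more atypical the case, the larger the bias.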

Emails: paul.bromiley@manchester.ac.uk (P.A. Bromiley), neil.thacker@manchester.ac.uk (N.A. Thacker).


Many researchers have concentrated on the source of prior probabilities [16, 3]. Ideally this prior information could

be established uniquely for a particular task, preferably emerging from a mathematical analysis. However, if Bayes

theory is to make predictions regarding the likely ratio of real world events it must be accepted that the data

samples used in Bayesian approaches must enforce the priors, i.e. must represent a stratified random sample of the types of data under analysis³. Thus, in the absence of a deterministic physical mechanism giving rise to the

data set there can be no theoretical justification for belief in the existence of a unique prior. If the circumstances

in which the system is used change the expected prior distributions must also change. This situation is often

encountered in data analysis problems, particularly those involving biological data sets where data are selected on

the basis of rules that can vary with time.

In order for Bayes theory to be applied correctly the likelihood distributions P(data|H_i) of all possible interpretations H_i of the data must be known. Unfortunately, in many practical circumstances, particularly in research, these distributions are not well known a priori. Much of the computer vision community accepts the need to construct

systems which learn in order to tackle difficult scene interpretation tasks. It could be regarded as a weakness if

the computational framework used in a learning system demands that all possible interpretations of the data are

available before a useful statistical inference can be drawn.

Beyond simple classification, any decisions based on Bayesian classification results should also be made on the

basis of the Bayes risk, in order to minimise the cost of the decision. The aim in clinical support systems, for

example, is to provide treatment which improves the prognosis of the patient, rather than just to provide the

correct diagnosis. To illustrate the problems of learning, priors and Bayes risk, in Section 3 we give a specific

example of a Bayesian system designed to evaluate the degree of atrophy in the brain arising from a variety of

dementing diseases, and explain some possible solutions. We conclude that Bayesian decision systems cannot form

useful components of learning systems without modifications that distance them from the original theory. We

explain how both this and the previous example illustrate the difficulty of using Bayes theory for quantitative

analysis, even on relatively simple problems.
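The Bayes risk decision mentioned above can be sketched in a few lines: choose the action that minimises the expected cost under the posterior class probabilities, rather than simply the most probable class. The posteriors and cost matrix below are assumed values, not taken from the clinical system described in Section 3:

```python
# Minimal sketch of a Bayes risk decision (assumed example values):
# pick the action with minimum expected cost under the posteriors.
def bayes_risk_decision(posteriors, cost):
    """cost[a][c] is the cost of taking action a when the true class is c."""
    risks = [sum(cost[a][c] * posteriors[c] for c in range(len(posteriors)))
             for a in range(len(cost))]
    return min(range(len(risks)), key=risks.__getitem__)

# Missing a disease (class 1) is assumed far costlier than a false alarm.
post = [0.7, 0.3]            # P(healthy|data), P(diseased|data)
cost = [[0.0, 10.0],         # action 0: treat as healthy
        [1.0, 0.0]]          # action 1: treat as diseased
action = bayes_risk_decision(post, cost)
```

Here the most probable class is "healthy", yet the minimum-risk action is to treat as diseased, because the asymmetric costs dominate the decision.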

If Bayes theory cannot be used directly to provide a useful diagnostic classification, a more appropriate method

for presenting results for clinical interpretation must be found. We present one potential solution in the form of

a single-model statistical decision system which has its origins in the so-called “frequentist” approaches to data

analysis. The use of only one likelihood distribution avoids the need to specify prior probabilities and sidesteps

issues of complexity. This can result in an approach to data analysis which is more in line with the need to

construct systems which can learn incrementally, and yet still be capable of generating useful results at early

stages of training. In Section 4 we illustrate a single model statistical analysis technique for the problem of

change detection, under circumstances where the statistical model can be bootstrapped from the image data. This

represents a significant step towards learning. We further describe a practical mechanism for data fusion within

this framework and in particular provide a derivation for problems of arbitrary dimensionality. These systems

generate data which is more quantitative than that generated by Bayesian methods, yet the data remains suitable

for use in a Bayes risk analysis. Therefore, we believe that such systems have advantages over Bayesian algorithms

and have great potential for use in computer vision.

2 Bias and Novelty in Multi-dimensional MR Image Segmentation

Magnetic resonance imaging represents a very flexible way of generating images from biological tissues. From the

point of view of conventional computer vision, analysis of these images is relatively simple as, for a given protocol,

particular tissues generate a narrow range of fixed values in the data. Bayesian probability theory has been applied

by several groups in order to devise a frequency-based representation of different tissues [13]. The conditional

probabilities provide an estimate of the probability that a given grey level was generated by a particular part

of the model. The conditional probability for a particular tissue class given the data can be derived using the

knowledge of image intensity probability density distributions for each class and their associated priors.

The data density distributions are often assumed to be Gaussian, but for many clinical scans there is a high

probability that individual values may be generated by fractional contributions from two distinct tissue components

(partial voluming). In [19] we adopted a multi-dimensional model for the probability density functions which can

take account of this effect. The conditional probability that a grey level g is due to a certain mechanism k (either

a pure or mixture tissue component) can be calculated using Bayes theory,

P(k|g) = \frac{d_k(g)\,f_k}{f_0 + \sum_i d_i(g)\,f_i + \sum_i \sum_j d_{ij}(g)\,f_{ij}} \qquad (2)

³ Strong Bayesians would argue that we can deal with degrees of belief, but then we have to accept that the computed probabilities do not necessarily correspond to any objective prediction of likely outcome.


Figure 1: Image Sequences: IRTSE (a), VE(PD) (b), VE(T2) (c), and FLAIR (d).

where d_k(g), d_i(g), and d_{ij}(g) are the multi-dimensional probability density functions for tissue component k, pure tissue i, and a mixture of tissues i and j respectively. The corresponding priors, f_k, f_0, f_i and f_{ij}, are expressed as frequencies (i.e. the number of voxels which belong to a particular tissue type), whether pure tissues or a mixture of tissues. Note that, in the first departure from a pure (fully justifiable) Bayesian model, a fixed extra term f_0 (making an arbitrary assumption of a uniform distribution) is included to model infrequently occurring outlier data [12].
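Equation (2) can be sketched numerically. For brevity the pure and mixture mechanisms are flattened into a single list, the densities are one-dimensional Gaussians, and all means, widths and frequencies below are assumed illustrative values rather than the paper's fitted model:

```python
import math

# Sketch of equation (2) with assumed 1-D Gaussian densities.
def gauss(g, mean, sd):
    return math.exp(-0.5 * ((g - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def p_mechanism(g, k, densities, freqs, f0):
    """P(k|g) = d_k(g) f_k / (f0 + sum over all mechanisms of d(g) f)."""
    denom = f0 + sum(d(g) * f for d, f in zip(densities, freqs))
    return densities[k](g) * freqs[k] / denom

densities = [lambda g: gauss(g, 100.0, 10.0),   # pure tissue A
             lambda g: gauss(g, 200.0, 15.0),   # pure tissue B
             lambda g: gauss(g, 150.0, 25.0)]   # A/B partial-volume mixture
freqs = [4000.0, 6000.0, 1500.0]                # priors as voxel frequencies
p = p_mechanism(150.0, 2, densities, freqs, f0=1.0)
```

At a grey level midway between the two pure tissue means, the partial-volume mechanism dominates the posterior, while the fixed f_0 term prevents the probabilities from summing exactly to one and thereby absorbs outlier data.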

The parameters of the model, such as covariance matrices, mean vectors and priors, can be iteratively adjusted by

maximising the likelihood of the data distribution using the Expectation Maximisation (EM) algorithm [27] (Appendix A). Once the data density models are obtained, the conditional probabilities can be calculated and probability

maps derived for each tissue type, which estimate the most likely tissue volume fraction within each voxel.
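A minimal one-dimensional analogue of the EM fit (synthetic data, two pure-tissue components only, no partial-volume terms, all initial values assumed) can be sketched as:

```python
import math, random

# Minimal 1-D EM for a two-component Gaussian mixture; a far simpler
# analogue of the multi-dimensional model fitted in the paper.
def gauss(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def em_two_gaussians(data, m, s, w, iters=20):
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [w[k] * gauss(x, m[k], s[k]) for k in (0, 1)]
            t = sum(p)
            resp.append([pk / t for pk in p])
        # M-step: re-estimate means, widths and priors (frequencies)
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            m[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            s[k] = max(math.sqrt(sum(r[k] * (x - m[k]) ** 2
                                     for r, x in zip(resp, data)) / nk), 1e-6)
            w[k] = nk / len(data)
    return m, s, w

random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])
m, s, w = em_two_gaussians(data, m=[-1.0, 6.0], s=[1.0, 1.0], w=[0.5, 0.5])
```

After a few iterations the means, widths and priors converge to the sample statistics of the two well-separated clusters; the multi-dimensional case in the paper follows the same E/M alternation with covariance matrices in place of scalar widths.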

The probabilistic segmentation algorithm has been implemented and tested on co-registered MRI brain images

of different modalities chosen for their good tissue separation and availability in a clinical environment. The use

of multi-spectral data enables decorrelation of statistical distributions and better estimation of partial volumes.

The images used were variable echo proton density (VE(PD)), variable echo T2 (VE(T2)), inversion recovery

turbo spin-echo (IRTSE), and fluid attenuated inversion recovery (FLAIR) (see Fig. 1). These images provide

good separation between air and bone, fat, soft tissue (such as skin and muscle), cerebro-spinal fluid (CSF), grey

matter (GM), and white matter (WM). Fig. 2 shows a scatter plot of the IRTSE and VE(PD) images, together

with the model after 10 iterations of the EM algorithm. The final model agrees well with the original data. The

partial volume distributions link the otherwise compact pure tissue distributions along the lines between them in

accordance with the Bloch equations, which describe the signal generation process in MR. The final segmentation

result is represented by probability maps for each tissue class and can be seen in Fig. 3. The probability maps range

from 0 to 1 and can be used for boundary location extraction (e.g. a probability of 0.5 represents the boundary

location between two tissues) or volume visualisation [15].

                 Air/bone   Soft tissue  Fatty tissue  Cranial Fluid  Grey Matter  White Matter
Air and bone      16012.2      1032.0        21.9         126.7          371.5           0
Soft tissue        1032.0      4076.9      1219.3         520.4           46.2           0
Fatty tissue         21.9      1219.3      1517.8           0              0             0
Cranial Fluid       126.7       520.4         0           445.9          759.2          84.1
Grey Matter         371.5        46.2         0           759.2         6465.8        4105.6
White Matter          0           0           0            84.1         4105.6        3548.7

Table 1: Typical priors assigned to each class (pure and mixed tissues). Zero values are fixed to eliminate biologically implausible combinations.

Table 1 gives the priors (i.e. number of voxels) assigned to each class (pure and mixture of tissues) to model

different tissue types. As these values represent genuine frequencies of tissues they will change depending upon the

region selected, and so too will any estimates of tissue proportion. For example, if the intention was to segment

only the brain tissues (i.e. CSF, WM and GM) a region of interest could be chosen which contained only these

tissues.
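The region dependence of the priors can be made concrete using the pure-tissue voxel counts from Table 1: normalising the counts over the whole head gives one prior for grey matter, while restricting the region of interest to the brain tissues gives a substantially different one for the same data.

```python
# Priors as voxel frequencies (pure-tissue counts from Table 1):
# restricting the region of interest changes the prior for the same tissue.
whole_head = {"air/bone": 16012.2, "soft": 4076.9, "fat": 1517.8,
              "csf": 445.9, "gm": 6465.8, "wm": 3548.7}
brain_only = {k: whole_head[k] for k in ("csf", "gm", "wm")}

def to_priors(counts):
    """Normalise voxel counts into prior probabilities."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

p_gm_whole = to_priors(whole_head)["gm"]
p_gm_brain = to_priors(brain_only)["gm"]   # same tissue, larger prior
```

Since the posterior in equation (2) scales with these frequencies, any estimate of tissue proportion inherits this dependence on the arbitrary choice of region.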

Fig. 4 illustrates how change in the region chosen to generate the priors may lead to significant change in the


Figure 2: Scatter plots of IRTSE vs. VE(PD): original data (a); model using initial values of parameters (b);

model after 10 iterations of EM algorithm (c); and schematic showing the origins of the data clusters (d).


Figure 3: Probability maps for bone and air (a), fat (b), soft tissue (c), CSF (d), GM (e), and WM (f).

subsequent interpretation of the same data. Two overlapping regions of interest were defined, generating two sets

of prior probabilities for the same region (their intersection). A subtraction of the two grey matter probability

maps produced for this region shows significant differences between them: it can be seen from the histogram that

the probabilities differ at approximately the 10% level.
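The comparison described above amounts to a direct subtraction of two probability maps computed for the shared region; the probability values below are assumed for illustration, chosen so the discrepancy is of the order reported:

```python
# Toy subtraction of two grey matter probability maps for the same
# voxels under two different sets of priors (values assumed).
map_a = [0.50, 0.62, 0.80, 0.31]   # priors from region of interest A
map_b = [0.42, 0.55, 0.71, 0.20]   # priors from region of interest B
diff = [a - b for a, b in zip(map_a, map_b)]
mean_abs_diff = sum(abs(d) for d in diff) / len(diff)   # of order 0.1
```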

Given the arbitrary nature of the process of region selection, it is evident that the prior information is not unique
