Tina Memo No. 2001-014
Presented at a joint meeting of the Royal Statistical Society and the BMVA, 2001.
IVC Special Edition: The use of Probabilistic Models in Computer Vision., 21, 851-864, 2003.
Additional Appendix added to cover the problems of data fusion from Bayesian modules.
Bayesian and Non-Bayesian Probabilistic
Models for Image Analysis
P.A. Bromiley, N.A. Thacker, M.L.J. Scott,
M. Pokrić, A.J. Lacey and T.F. Cootes
Last updated
25 / 5 / 2007
Imaging Science and Biomedical Engineering,
School of Cancer and Imaging Sciences,
University of Manchester, Stopford Building,
Oxford Road, Manchester M13 9PT, U.K.
Bayesian and Non-Bayesian Probabilistic Models for Image
Analysis
P.A. Bromiley1, N.A. Thacker2, M.L.J. Scott,
M. Pokrić, A.J. Lacey and T.F. Cootes
Imaging Science and Biomedical Engineering,
School of Cancer and Imaging Sciences,
University of Manchester, Stopford Building,
Oxford Road, Manchester M13 9PT, U.K.
Abstract
Bayesian approaches to data analysis are popular in machine vision, and yet the main advantage of
Bayes theory, the ability to incorporate prior knowledge in the form of the prior probabilities, may
lead to problems in quantitative tasks. In this paper we demonstrate examples of Bayesian and non-
Bayesian techniques with the use of selected examples from the area of magnetic resonance image (MRI)
analysis. Issues raised by these examples are used to illustrate common difficulties in Bayesian methods
and to motivate an approach based on frequentist methods. We believe this approach to be more suited
to quantitative data analysis, and provide a general theory for the use of these methods in learning
(Bayes risk) systems and for data fusion. Proofs are given for the more novel aspects of the theory.
We conclude with a discussion of the strengths and weaknesses, and the fundamental suitability, of
Bayesian and non-Bayesian approaches for MRI analysis in particular, and for machine vision systems
in general. Overall we advise caution regarding the common assertion that the best approaches to all
machine vision problems are necessarily Bayesian.
1 Introduction
This paper discusses the use of Bayes theorem in decision systems which make use of image data. We concern
ourselves only with the use of the equation
P(Hi|data) = P(data|Hi) P(Hi) / Σi P(data|Hi) P(Hi)    (1)
for the interpretation of mutually exclusive hypotheses Hi, as the basis for algorithmic design. We compare such
approaches with alternatives based on frequentist statistics. In the broadest possible terms, these two schools
of statistical inference vary in their definition of a probability. Frequentist statistics demands that a probability
should represent a genuine reflection of the frequency of occurrence of some event, whereas Bayesian statistics
defines a probability as a degree of belief, allowing the incorporation of prior knowledge in the form of the prior
probabilities.
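For concreteness, Equation 1 can be evaluated directly once the likelihoods and priors are specified. The following is a minimal sketch with hypothetical numbers; the likelihood and prior values are illustrative and are not taken from any experiment in this paper:

```python
# Posterior P(Hi|data) for mutually exclusive hypotheses via Equation 1.
# The likelihood and prior values are illustrative, not from the paper.

def posteriors(likelihoods, priors):
    """Return P(Hi|data) given P(data|Hi) and P(Hi)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # denominator: sum over i of P(data|Hi) P(Hi)
    return [j / evidence for j in joint]

# Two hypotheses: the data favour H1 (likelihood 0.8 vs 0.2),
# but a strong prior on H0 pulls the posterior back towards H0.
post = posteriors([0.2, 0.8], [0.9, 0.1])  # post[0] = 0.18/0.26 ≈ 0.692
```

The example makes visible the behaviour discussed below: when the prior strongly favours one hypothesis, it can dominate the evidence supplied by the data.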
Bayes theorem is a cornerstone of modern probabilistic data analysis. It is used as a way of constructing prob-
abilistic decision systems so that prior knowledge can be incorporated into the data analysis in order to “bias”
the interpretation of the data in the direction of expectation. The prior probabilities therefore have the great-
est influence when the data under analysis is unable to adequately support any model hypothesis. Under these
circumstances direct application of a purely data driven solution may not be directly possible. The use of Bayes
theorem can appear to provide spectacular improvements in the interpretation of data. However, despite their
popularity and widespread acceptance, there are often significant practical problems in the application of Bayesian
techniques.
Firstly, Bayesian approaches use information regarding the distribution of a whole group of data to influence
the interpretation of a single data set. Few would countenance modifying the estimate of a random variable by
averaging with some weighted combination of the group mean, as it is well known that such a procedure would
introduce bias, yet analogous procedures are accepted as reasonable in Bayesian approaches to data analysis.
Another aspect of this problem is the suppression of infrequently occurring data or “novelty”. Clearly, scepticism
concerning the use of Bayesian approaches in areas such as medical data analysis, where pathological cases are
often unique, would have some justification. We explain the effects of bias and novelty in Bayesian estimation
in Section 2, using multi-dimensional MR volumetric measurement as an example.
1paul.bromiley@manchester.ac.uk
2neil.thacker@manchester.ac.uk
Many researchers have concentrated on the source of prior probabilities [16, 3]. Ideally this prior information could
be established uniquely for a particular task, preferably emerging from a mathematical analysis. However, if Bayes
theory is to make predictions regarding the likely ratio of real world events it must be accepted that the data
samples used in Bayesian approaches must enforce the priors, i.e. must represent a stratified random sample of
the types of data under analysis3. Thus, in the absence of a deterministic physical mechanism giving rise to the
data set there can be no theoretical justification for belief in the existence of a unique prior. If the circumstances
in which the system is used change the expected prior distributions must also change. This situation is often
encountered in data analysis problems, particularly those involving biological data sets where data are selected on
the basis of rules that can vary with time.
In order for Bayes theory to be applied correctly the likelihood distributions (P(data|Hi)) of all possible interpretations Hi of the data must be known. Unfortunately, in many practical circumstances, particularly in research, these
distributions are not well known a-priori. Much of the computer vision community accepts the need to construct
systems which learn in order to tackle difficult scene interpretation tasks. It could be regarded as a weakness if
the computational framework used in a learning system demands that all possible interpretations of the data are
available before a useful statistical inference can be drawn.
Beyond simple classification, any decisions based on Bayesian classification results should also be made on the
basis of the Bayes risk, in order to minimise the cost of the decision. The aim in clinical support systems, for
example, is to provide treatment which improves the prognosis of the patient, rather than just to provide the
correct diagnosis. To illustrate the problems of learning, priors and Bayes risk, in Section 3 we give a specific
example of a Bayesian system designed to evaluate the degree of atrophy in the brain arising from a variety of
dementing diseases, and explain some possible solutions. We conclude that Bayesian decision systems cannot form
useful components of learning systems without modifications that distance them from the original theory. We
explain how both this and the previous example illustrate the difficulty of using Bayes theory for quantitative
analysis, even on relatively simple problems.
If Bayes theory cannot be used directly to provide a useful diagnostic classification, a more appropriate method
for presenting results for clinical interpretation must be found. We present one potential solution in the form of
a single-model statistical decision system which has its origins in the so-called “frequentist” approaches to data
analysis. The use of only one likelihood distribution avoids the need to specify prior probabilities and sidesteps
issues of complexity. This can result in an approach to data analysis which is more in line with the need to
construct systems which can learn incrementally, and yet still be capable of generating useful results at early
stages of training. In Section 4 we illustrate a single model statistical analysis technique for the problem of
change detection, under circumstances where the statistical model can be bootstrapped from the image data. This
represents a significant step towards learning. We further describe a practical mechanism for data fusion within
this framework and in particular provide a derivation for problems of arbitrary dimensionality. These systems
generate data which is more quantitative than that generated by Bayesian methods, yet the data remains suitable
for use in a Bayes risk analysis. Therefore, we believe that such systems have advantages over Bayesian algorithms
and have great potential for use in computer vision.
2 Bias and Novelty in Multi-dimensional MR Image Segmentation
Magnetic resonance imaging represents a very flexible way of generating images from biological tissues. From the
point of view of conventional computer vision, analysis of these images is relatively simple as, for a given protocol,
particular tissues generate a narrow range of fixed values in the data. Bayesian probability theory has been applied
by several groups in order to devise a frequency-based representation of different tissues [13]. The conditional
probabilities provide an estimate of the probability that a given grey level was generated by a particular part
of the model. The conditional probability for a particular tissue class given the data can be derived using the
knowledge of image intensity probability density distributions for each class and their associated priors.
The data density distributions are often assumed to be Gaussian, but for many clinical scans there is a high
probability that individual values may be generated by fractional contributions from two distinct tissue components
(partial voluming). In [19] we adopted a multi-dimensional model for the probability density functions which can
take account of this effect. The conditional probability that a grey level g is due to a certain mechanism k (either
a pure or mixture tissue component) can be calculated using Bayes theory,
P(k|g) = dk(g) fk / [f0 + Σi di(g) fi + Σi Σj dij(g) fij]    (2)
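A minimal sketch of how Equation 2 can be evaluated, assuming for illustration one-dimensional Gaussian pure-tissue densities and a single broad Gaussian for the partial-volume mixture; the means, widths and priors below are invented, and this is not the multi-dimensional model of [19]:

```python
import math

# Sketch of Equation 2 in one dimension. Pure tissues are modelled as narrow
# Gaussians and the partial-volume mixture as a single broad Gaussian between
# them; all means, widths and priors are invented for illustration.

def gaussian(g, mean, sigma):
    return math.exp(-0.5 * ((g - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def conditional_probs(g, pures, mixtures, f0=1e-6):
    """pures: {k: (density, f_k)}; mixtures: {(i, j): (density, f_ij)}."""
    num = {k: d(g) * f for k, (d, f) in {**pures, **mixtures}.items()}
    norm = f0 + sum(num.values())  # f0 absorbs unmodelled outliers
    return {k: v / norm for k, v in num.items()}

pures = {"grey": (lambda g: gaussian(g, 60.0, 5.0), 0.5),
         "white": (lambda g: gaussian(g, 100.0, 5.0), 0.4)}
mixtures = {("grey", "white"): (lambda g: gaussian(g, 80.0, 12.0), 0.1)}
probs = conditional_probs(80.0, pures, mixtures)  # mixture class dominates at g = 80
```

A grey level midway between the two pure-tissue means is assigned almost entirely to the mixture mechanism, as the partial-volume model intends.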
3Strong Bayesians would argue that we can deal with degrees of belief, but then we have to accept that the computed probabilities
do not necessarily correspond to any objective prediction of likely outcome.
Figure 15: The Normal Distribution, annotated with the mean and variance.
A Model Parameter Update using Expectation Maximisation
The EM algorithm is implemented in such a way that the expectation step recalculates the multi-dimensional probability densities, dk(g), for pure tissues and mixtures of tissues using the current parameter values. Once the probability density functions have been calculated the conditional probabilities, P(k|g), can be derived and used for re-estimation of the model parameters in the maximisation step of the algorithm in a maximum likelihood (i.e. least squares) manner.
The model parameters for pure tissues i and for mixtures of tissues i and j which are iteratively updated are: the priors f′i and f′ij; the mean vector M′i; and the covariance matrix C′i. They take the forms
f′i = Σv P(i|gv)    (7)

f′ij = f′ji = (1/2) Σv [P(ij|gv) + P(ji|gv)]    (8)

M′i = (1/V) Σv P(i|gv) gv    (9)

and

C′i = (1/V) Σv P(i|gv) (gv − Mi) ⊗ (gv − Mi)T    (10)

where the sums run over all voxels v, gv is the observed intensity value in voxel v, and V is the total volume of all data analysed.
Using this representation it is possible to obtain the most probable volumetric measurement Vi for each tissue i given the observed data gv in voxel v,

Vi(gv) = P(i|gv) + Σj P(ij|gv)
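The update cycle above can be sketched in one dimension for pure tissue classes only (partial-volume components omitted). This is a conventional Gaussian-mixture EM step with responsibility-weighted re-estimation, offered as an illustration rather than as the paper's multi-dimensional implementation; all data values are synthetic:

```python
import math
import random

# One-dimensional EM sketch for pure tissue classes only (partial-volume
# components omitted), using conventional responsibility-weighted updates.
# Synthetic data; illustrates the E and M steps, not the paper's code.

def em_step(data, priors, means, variances):
    k = len(priors)
    # Expectation: conditional probabilities P(i|gv) from current parameters.
    resp = []
    for g in data:
        d = [priors[i]
             * math.exp(-0.5 * (g - means[i]) ** 2 / variances[i])
             / math.sqrt(2.0 * math.pi * variances[i]) for i in range(k)]
        s = sum(d)
        resp.append([x / s for x in d])
    # Maximisation: re-estimate priors, means and variances.
    n = [sum(r[i] for r in resp) for i in range(k)]
    priors = [n[i] / len(data) for i in range(k)]
    means = [sum(r[i] * g for r, g in zip(resp, data)) / n[i] for i in range(k)]
    variances = [sum(r[i] * (g - means[i]) ** 2 for r, g in zip(resp, data)) / n[i]
                 for i in range(k)]
    return priors, means, variances

random.seed(0)
data = ([random.gauss(60.0, 5.0) for _ in range(500)]
        + [random.gauss(100.0, 5.0) for _ in range(500)])
p, m, v = [0.5, 0.5], [50.0, 110.0], [100.0, 100.0]
for _ in range(30):
    p, m, v = em_step(data, p, m, v)
# the estimated means converge near the generating values 60 and 100
```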
B The Sampled Normal/Gaussian Distribution
The normal distribution (see Fig. 15) is described by the probability density function

f(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))    (11)

where µ is the mean and σ² the variance. Given a sample of data X1...XN, approximate values for µ and σ are given by X̄ = ΣXi/N and √(Σ(Xi − X̄)²/(N − 1)) respectively. The variance on the estimator of µ is σ²/N, but the true value of σ² is unknown. Replacing it with the estimator of σ² gives the estimator of the variance of the calculated mean, Σ(Xi − X̄)²/((N − 1) × N). The estimator of the area under the graph is the sum of the data points, and since this obeys Poisson statistics its error is ±√N.

The errors on the perfusion parameters can be treated as analogous to the errors on the parameters of the Gaussian distribution. The CBV is equivalent to the area under the graph (N), so has an error proportional to ±√CBV. TTM is equivalent to µ, and MTT to σ, so the error on the TTM is proportional to ±MTT/√CBV.
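These estimators can be sketched directly. The data below are synthetic, drawn from a Gaussian with assumed mean 10 and standard deviation 2:

```python
import math
import random

# Sketch of the estimators above: the sample mean, the sample variance with
# the N-1 denominator, and the estimated standard error of the mean,
# sqrt(sum((Xi - mean)^2) / ((N - 1) * N)). Synthetic data with assumed
# mean 10 and standard deviation 2.

def mean_and_errors(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # estimator of sigma^2
    return xbar, s2, math.sqrt(s2 / n)               # s2/n estimates var(xbar)

random.seed(1)
xs = [random.gauss(10.0, 2.0) for _ in range(400)]
xbar, s2, se = mean_and_errors(xs)
# xbar ≈ 10, s2 ≈ 4, se ≈ 2 / sqrt(400) = 0.1
```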
Figure 16: The sample space for the probability renormalisation in 3D, showing the element of integration (the
shaded region) used to relate this to the 2D problem. The contour of constant probability is shown by the curved
surface in the upper corner of the unit cube.
C Probability Renormalisation

Given n quantities each having a uniform probability distribution pi, i = 1...n, the product p = Πi pi can be renormalised to have a uniform probability distribution Fn(p) using

Fn(p) = p Σi=0..n−1 (−ln p)^i / i! = p + p Σi=1..n−1 (−ln p)^i / i!    (12)
The quantities pi can be plotted on the axes of an n dimensional sample space, bounded by the unit hypercube. Since they are uniform, and assuming no spatial correlation, the sample space will be uniformly populated. Therefore, the transformation to Fn(p) such that this quantity has a uniform probability distribution can be achieved using the probability integral transform, replacing any point in the sample space p with the integral of the volume under the contour of constant p passing through this point, which obeys Πi pi = p = constant. This can be expressed in terms of the volume of a hyper-region of one lower dimension by integrating over one dimension (let this be called x)

Fn(p) = p + ∫p..1 Fn−1(p/x) dx    (13)

This is equivalent to dividing the integration into two regions using a plane perpendicular to the x axis which intersects the axis at x = p. Fig. 16 shows the element of integration that would be used in the 3D case, to relate the volume of the unit cube under the contour of constant probability to the 2D case.
Now, in the simplest case of n = 1, clearly Fn(p) = p, as no renormalisation is required. The solution for higher dimensions can then be derived by iterative application of Equation 13. This involves integration of terms in (p/x)[−ln(p/x)]^n which enter in the n = 3 and higher cases. This integration can be performed using a simple substitution x = pu, dx = p du

∫p..1 (p/x)[−ln(p/x)]^n dx = p ∫1..1/p (1/u)[ln u]^n du = p [(1/(n+1))[ln u]^(n+1)] evaluated from u = 1 to u = 1/p = (p/(n+1))[−ln p]^(n+1)    (14)
Iterative application of Equation 13 therefore produces the series

Fn(p) = p − p ln p + p(ln p)²/2 − p(ln p)³/6 + p(ln p)⁴/24 ...    (15)

which can be written as

Fn(p) = p Σi=0..n−1 (−ln p)^i / i!    (16)
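Equation 16 can also be verified empirically: applying Fn to the product of n independent uniform variates should return a uniformly distributed quantity. A minimal sketch:

```python
import math
import random

# Empirical check of Equation 16: Fn applied to the product of n independent
# U(0,1) variates should itself be uniformly distributed on (0,1).

def renormalise(p, n):
    return p * sum((-math.log(p)) ** i / math.factorial(i) for i in range(n))

random.seed(2)
n = 4
samples = [renormalise(math.prod(random.random() for _ in range(n)), n)
           for _ in range(20000)]
mean = sum(samples) / len(samples)                    # ≈ 0.5 for a uniform variate
frac = sum(s < 0.25 for s in samples) / len(samples)  # ≈ 0.25
```

For n = 2, for example, renormalise(0.5, 2) returns 0.5(1 + ln 2), matching the closed form of Equation 12.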
D Difficulties with Data Fusion using Bayesian Modules
Let us imagine that a researcher wishes to build a modular vision system where a module delivers some evidence
regarding a particular hypothesis C, based upon data from independent sources (X and Y ) which are then to
be combined in a fusion process, such as that described in the paper above. Aside from the above mechanism of
combination of hypothesis probabilities we can also combine the likelihoods by taking the product
P(X,Y |C) = P(X|C)P(Y |C)
However the researcher chooses instead to build modules which deliver MAP estimates (P(C|X) ∝ P(X|C)P(C)
and P(C|Y ) ∝ P(Y |C)P(C)). This is a common tactic in computer vision in order to justify the use of prior
knowledge and so improve the apparent performance of a module. Staying within a MAP framework, the corre-
sponding fused output is P(C|X,Y ). However, we cannot compute this quantity using only the outputs from the
MAP modules as
P(C|X,Y ) ∝ P(X|C)P(Y |C)P(C) = P(C|X)P(C|Y )/P(C)
The process of data fusion therefore requires an understanding of the role that the prior probabilities had in the
construction of the output from a module. Unfortunately, if the prior knowledge was implicitly included in a way
which did not specify the probabilities then this is going to be difficult to address.
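The point can be made concrete with a small numerical sketch, using hypothetical probabilities: multiplying the two MAP outputs double-counts the prior, while dividing the product by P(C) recovers the correct fused posterior:

```python
# Numerical sketch of the fusion problem, with hypothetical probabilities.
# Naively multiplying two MAP posteriors double-counts the prior P(C);
# dividing the product by P(C) recovers the correct fused posterior.

def normalise(v):
    s = sum(v)
    return [x / s for x in v]

prior = [0.7, 0.3]   # P(C) over two mutually exclusive hypotheses
lik_x = [0.2, 0.8]   # P(X|C)
lik_y = [0.3, 0.7]   # P(Y|C)

post_x = normalise([l * p for l, p in zip(lik_x, prior)])  # MAP module 1 output
post_y = normalise([l * p for l, p in zip(lik_y, prior)])  # MAP module 2 output

correct = normalise([lx * ly * p for lx, ly, p in zip(lik_x, lik_y, prior)])
naive = normalise([px * py for px, py in zip(post_x, post_y)])  # double-counts P(C)
undone = normalise([px * py / p for px, py, p in zip(post_x, post_y, prior)])
# 'undone' equals 'correct'; 'naive' does not
```

Note that undoing the prior is only possible here because P(C) is known explicitly; when the prior knowledge was built into a module implicitly, this correction is unavailable.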
This analysis has assumed that both data sets were interpreted using the same fixed prior P(C). If the priors
are inconsistent (as illustrated in the tissue segmentation example in the paper) then we have no right to perform
any simple combination. The situation is equivalent to putting contradictory information in a database of facts or
using sets of equations which are based upon conflicting assumptions in an analytic analysis. In both cases there
is an immediate failure of logical consistency and any attempted inference must be expected to lead to a multitude
of invalid and contradictory conclusions. When performing data analysis it is common to invoke the requirement
of independence in order to avoid such problems, and we might suggest the use of independent priors. Initially this
might look like a possible solution, but the concept of independence follows from characteristics of data. A true
prior must, by definition, be the same for any set of data. By introducing a data sample into the specification of
our prior knowledge we just transform the prior into another likelihood term, we are therefore fixing the Bayesian
approach by abandoning it. This example goes to the very heart of the issue. The standard interpretation of prior
probabilities appears to provide a way of incorporating prior knowledge which avoids the constraints which we
would otherwise demand for any other source of information. Somewhere along the line, we must expect to have
to identify and deal with the consequences.
One of these consequences is that fusion of data from modules which deliver MAP estimates requires us to know precisely what effect any prior knowledge has on the output, and in general we must undo this effect and work back to the likelihood before we can legitimately fuse the data6. If you think it through, the idea that we can generally throw arbitrary prior probabilities into vision modules in order to improve performance has no theoretical credibility in the context of larger systems. However, the mathematical notation used to describe Bayesian systems does not tell us that there should be a problem, and with many publications describing Bayesian algorithms continuing to appear in journals and at conferences this conclusion appears to fly in the face of current received wisdom. This analysis demands that vision researchers develop a new level of critical thinking when formulating approaches for algorithm construction. We should reject the tendency to accept without question the introduction of arbitrary assumptions into algorithms just because someone presents them as priors.

6Additional problems of bias introduced by the use of MAP estimators are discussed in other documents on our web pages.