2 ESSENTIALS OF STATISTICAL INFERENCE
In the previous chapter we described our philosophy related to statistical inference.
Namely, we embrace a model-based view of inference that focuses on the
construction of abstract, but hopefully realistic and useful, statistical models
of things we can and cannot observe. These models always contain stochastic
components that express one’s assumptions about the variation in the observed data
and in any latent (unobserved) parameters that may be part of the model. However,
statistical models also may contain deterministic or structural components that are
usually specified in terms of parameters related to some ecological process or theory.
Often, one is interested in estimating these parameters given the available data.
We have not yet described how one uses statistical models to estimate parameters,
to conduct an inference (hypothesis test, model selection, model evaluation), or
to make predictions. These subjects are the focus of this chapter. Inference, by
definition, is an inductive process where one attempts to make general conclusions
from a collection of specific observations (data). Statistical theory provides the
conceptual and methodological framework for expressing uncertainty in these
conclusions. This framework allows an analyst to quantify his/her conclusions or
beliefs probabilistically, given the evidence in the data.
Statistical theory provides two paradigms for conducting model-based inference:
the classical (frequentist) approach and the Bayesian view. As noted in the previous
chapter, we find both approaches to be useful and do not dwell here on the profound,
foundational issues that distinguish the two modes of inference. Our intent in this
chapter is to describe the basis of both approaches and to illustrate their application
using inference problems that are likely to be familiar to many ecologists.
We do not attempt to present an exhaustive coverage of the subject of statistical
inference. For that, many excellent texts, such as Casella and Berger (2002), are
available. Instead, we provide what we regard as essentials of statistical inference in
a manner that we hope is accessible to many ecologists (i.e., without using too much
mathematics). The topics covered in this chapter are used throughout the book,
and it is our belief that they will be useful to anyone engaged in scientific research.
2.1 PRELIMINARIES
2.1.1 Statistical Concepts
Before we can begin a description of model-based inference, we need a foundation for
specifying statistical models. That foundation begins with the idea of recognizing
that any observation may be viewed as a realization (or outcome) of a stochastic
process. In other words, chance plays a part in what we observe.
Perhaps the simplest example is that of a binary observation, which takes one of
two mutually exclusive values. For example, we might observe the death or survival
of an animal exposed to a potentially lethal toxicant in an experimental setting.
Similarly, we might observe the outcomes, ‘mated’ or ‘did not mate’, in a study
of reproductive behavior. Other binary outcomes common in surveys of animal
abundance or occurrence are ‘present/absent’ and ‘detected/not detected.’ In all
of these examples, there is an element of chance in the observed outcome, and we
need a mechanism for specifying this source of uncertainty.
Statistical theory introduces the idea of a random variable to represent the role
of chance. For example, let Y denote a random variable for a binary outcome, such
as death or survival. We might codify a particular outcome, say y, using y = 0 for
death and y = 1 for survival. Notice that we have used uppercase to denote the
random variable Y and lowercase to denote an observed value y of that random
variable. This notation is a standard practice in the field of statistics. The random
variable Y is a theoretical construct, whereas y is real data.
A fully-specified statistical model provides a precise, unambiguous description of
its random variables. By that, we mean that the model specifies the probability
(or probability density) for every observable value of a random variable, i.e., for
every possible outcome. Let’s consider a model of the binary random variable Y
as an example. Suppose p = Pr(Y = 1) denotes the probability of success (e.g.,
survival) in a single trial or unit of observation. Since Y is binary, we need only
specify Pr(Y = 0) to complete the model. A fully-specified model requires

Pr(Y = 1) + Pr(Y = 0) = 1.

Given our definition of p, this equation implies Pr(Y = 0) = 1 − p. Thus, we can
express a statistical model of the observable outcomes (y = 0 or y = 1) succinctly
as follows:
Pr(Y = y) = p^y (1 − p)^{1−y},    (2.1.1)

where p ∈ [0, 1] is a formal parameter of the model.
Equation (2.1.1) is an example of a special kind of function in statistics, a
probability mass function (pmf), which is used to express the probability distribution
of a discrete-valued random variable. In fact, Eq. (2.1.1) is the pmf of a Bernoulli-distributed
random variable. A conventional notation for pmfs is f(y|θ), which is
intended to indicate that the probability of an observed value y depends on the
parameter(s) θ used to specify the distribution of the random variable Y. Thus,
using our example of a binary random variable, we would say that

f(y|p) = p^y (1 − p)^{1−y}    (2.1.2)

denotes the pmf for an observed value of the random variable Y, which has a
Bernoulli distribution with parameter p. As summaries of distributions, pmfs honor
two important restrictions:

f(y|θ) ≥ 0    (2.1.3)

\sum_y f(y|θ) = 1,    (2.1.4)

where the summation is taken over every observable value of the random
variable Y.
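As a quick concrete check of these restrictions (an illustration of ours, not part of the original text), the Bernoulli pmf of Eq. (2.1.2) can be evaluated in R over its entire support:

# Bernoulli pmf f(y|p) = p^y (1-p)^(1-y), evaluated at every observable value
p <- 0.7
f <- function(y) p^y * (1 - p)^(1 - y)
c(f(0), f(1))      # each probability is non-negative (Eq. 2.1.3)
f(0) + f(1)        # probabilities sum to 1 (Eq. 2.1.4)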
The probability distribution of a continuous random variable, which includes an
infinite set of observable values, is expressed using a probability density function
(pdf). An example of a continuous random variable might be body weight, which is
defined on the set of all positive real numbers. The notation used to indicate pdfs
is identical to that used for pmfs; thus, f(y|θ) denotes the pdf of the continuous
random variable Y. Of course, pdfs honor a similar set of restrictions:
f(y|θ) ≥ 0    (2.1.5)

\int_{-\infty}^{\infty} f(y|θ) \, dy = 1.    (2.1.6)

In practice, the above integral need only be evaluated over the range of observable
values (or support) of the random variable Y because f(y|θ) is zero elsewhere (by
definition). We describe a variety of pdfs in the next section.
Apart from their role in specifying models, pmfs and pdfs can be used to compute
important summaries of a random variable, such as its mean or variance. For
example, the mean or expected value of a discrete random variable Y with pmf
f(y|θ) is

E(Y) = \sum_y y f(y|θ).

Similarly, the mean of a continuous random variable is defined as

E(Y) = \int_{-\infty}^{\infty} y f(y|θ) \, dy.
The expectation operator E(·) is actually defined quite generally and applies to
functions of random variables. Therefore, given a function g(Y), its expectation is
computed as

E[g(Y)] = \sum_y g(y) f(y|θ)

if Y is discrete-valued and as

E[g(Y)] = \int_{-\infty}^{\infty} g(y) f(y|θ) \, dy

if Y is a continuous random variable. One function of particular interest defines
the variance,

Var(Y) = E[(Y − E(Y))^2].
In other words, the variance is really just a particular kind of expectation. These
formulae may seem imposing, but we will see in the next section that the means and
variances associated with some common distributions can be expressed in simpler
forms. We should not forget, however, that these simpler expressions are actually
derived from the general definitions provided above.
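To connect these general definitions with the simpler closed forms, the short R sketch below (our own addition) approximates the mean and variance of a Poisson random variable by summing over a truncated version of its support; both quantities should equal λ.

# Mean and variance of y ~ Po(lambda) computed from the general definitions,
# using a finite portion of the support where the pmf is numerically non-zero
lambda <- 4
y <- 0:100
f <- dpois(y, lambda)                 # pmf f(y|lambda)
E.y <- sum(y * f)                     # E(Y) = sum_y y f(y|lambda)
Var.y <- sum((y - E.y)^2 * f)         # Var(Y) = E[(Y - E(Y))^2]
c(E.y, Var.y)                         # both approximately equal lambda = 4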
2.1.2 Common Distributions and Notation
The construction of fully-specified statistical models requires a working knowledge
of some common distributions. In Tables 2.1 and 2.2 we provide a list of discrete
and continuous distributions that will be used throughout the book.
For each distribution we provide its pmf (or pdf) expressed as a function of y and
the (fixed) parameters of the distribution. We also use these tables to indicate our
choice of notation for the remainder of the book. We find it convenient to depart
slightly from statistical convention by using lower case to denote both a random
variable and its observed value. For example, we use y ∼ N(µ, σ²) to indicate that
a random variable Y is normally distributed with mean µ, variance σ², and pdf
f(y|µ, σ). We also find it convenient to represent pmfs and pdfs using both bracket
and shorthand notations. For example, we might represent the pmf of a binomially
distributed random variable Y in any of three equivalent ways:

• f(y|N, p)
• [y|N, p]
• Bin(y|N, p).
Table 2.1. Common distributions for modeling discrete random variables.

Poisson
  Notation: y ∼ Po(λ); [y|λ] = Po(y|λ)
  Probability mass function: f(y|λ) = exp(−λ) λ^y / y!,  y ∈ {0, 1, . . .}
  Mean and variance: E(y) = λ;  Var(y) = λ

Bernoulli
  Notation: y ∼ Bern(p); [y|p] = Bern(y|p)
  Probability mass function: f(y|p) = p^y (1 − p)^{1−y},  y ∈ {0, 1}
  Mean and variance: E(y) = p;  Var(y) = p(1 − p)

Binomial
  Notation: y ∼ Bin(N, p); [y|N, p] = Bin(y|N, p)
  Probability mass function: f(y|N, p) = \binom{N}{y} p^y (1 − p)^{N−y},  y ∈ {0, 1, . . . , N}
  Mean and variance: E(y) = Np;  Var(y) = Np(1 − p)

Multinomial
  Notation: y ∼ Multin(N, p); [y|N, p] = Multin(y|N, p)
  Probability mass function: f(y|N, p) = \binom{N}{y_1 · · · y_k} p_1^{y_1} p_2^{y_2} · · · p_k^{y_k},  y_j ∈ {0, 1, . . . , N}
  Mean and variance: E(y_j) = N p_j;  Var(y_j) = N p_j (1 − p_j);  Cov(y_i, y_j) = −N p_i p_j

Negative-binomial
  Notation: y ∼ NegBin(λ, α); [y|λ, α] = NegBin(y|λ, α)
  Probability mass function: f(y|λ, α) = [Γ(y + α) / (y! Γ(α))] (λ/(α + λ))^y (α/(α + λ))^α,  y ∈ {0, 1, . . .}
  Mean and variance: E(y) = λ;  Var(y) = λ + λ²/α

Beta-binomial
  Notation: y ∼ BeBin(N, α, β); [y|N, α, β] = BeBin(y|N, α, β)
  Probability mass function: f(y|N, α, β) = \binom{N}{y} [Γ(α + y) Γ(N + β − y) / Γ(α + β + N)] [Γ(α + β) / (Γ(α) Γ(β))],  y ∈ {0, 1, . . . , N}
  Mean and variance: E(y) = Nα/(α + β);  Var(y) = Nαβ(α + β + N) / [(α + β)²(α + β + 1)]
The bracket and shorthand notations are useful in describing hierarchical models
that contain several distributional assumptions. The bracket notation is also useful
when we want to convey probabilities or probability densities without specific
reference to a particular distribution. For example, we might use [y|θ] to denote the
probability of y given a parameter θ without reference to a particular distribution.
Similarly, we might use [y] to denote the probability of y without specifying a
particular distribution or its parameters.

We adhere to the common practice of using a regular font for scalars and a
bold font for vectors or matrices. For example, in our notation y = (y_1, y_2, . . . , y_n)′
indicates an n × 1 vector of scalars. A prime symbol is used to indicate the transpose of
a matrix or vector. There is one exception in which we deviate from this notational
convention. We sometimes use θ with regular font to denote a model parameter
Table 2.2. Common distributions for modeling continuous random variables.

Normal
  Notation: y ∼ N(µ, σ²); [y|µ, σ] = N(y|µ, σ²)
  Probability density function: f(y|µ, σ) = (1/(√(2π) σ)) exp(−(y − µ)²/(2σ²)),  y ∈ R
  Mean and variance: E(y) = µ;  Var(y) = σ²

Multivariate normal
  Notation: y ∼ N(µ, Σ); [y|µ, Σ] = N(y|µ, Σ)
  Probability density function: f(y|µ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp(−(1/2)(y − µ)′ Σ^{−1} (y − µ)),  y ∈ R^p
  Mean and variance: E(y) = µ;  Var(y) = Σ

Uniform
  Notation: y ∼ U(a, b); [y|a, b] = U(y|a, b)
  Probability density function: f(y|a, b) = 1/(b − a),  y ∈ [a, b]
  Mean and variance: E(y) = (a + b)/2;  Var(y) = (b − a)²/12

Beta
  Notation: y ∼ Be(α, β); [y|α, β] = Be(y|α, β)
  Probability density function: f(y|α, β) = [Γ(α + β)/(Γ(α) Γ(β))] y^{α−1} (1 − y)^{β−1},  y ∈ [0, 1]
  Mean and variance: E(y) = α/(α + β);  Var(y) = αβ/{(α + β)²(α + β + 1)}

Dirichlet
  Notation: y ∼ Dir(α); [y|α] = Dir(y|α)
  Probability density function: f(y|α) = [Γ(α_1 + · · · + α_k)/(Γ(α_1) · · · Γ(α_k))] y_1^{α_1−1} · · · y_k^{α_k−1},  y_j ∈ [0, 1], \sum_{j=1}^{k} y_j = 1
  Mean and variance: E(y_j) = α_j / \sum_l α_l;  Var(y_j) = α_j(\sum_l α_l − α_j) / [(\sum_l α_l)²(1 + \sum_l α_l)]

Gamma
  Notation: y ∼ Gamma(α, β); [y|α, β] = Gamma(y|α, β)
  Probability density function: f(y|α, β) = (β^α/Γ(α)) y^{α−1} exp(−βy),  y ∈ R⁺
  Mean and variance: E(y) = α/β;  Var(y) = α/β²
that may be a scalar or a vector depending on the context. To avoid confusion, in
these cases, we state explicitly that θis possibly vector-valued.
2.1.3 Probability Rules for Random Variables
Earlier we noted that a statistical model is composed of one or more random
variables. In fact, in most inferential problems it would be highly unusual to observe
the value of only one random variable because it is difficult to learn much from a
sample of size one. Therefore, we need to understand the ‘rules’ involved in modeling
multiple outcomes.
Since outcomes are modeled as observed values of random variables, it should
come as no surprise that the laws of probability provide the foundation for modeling
multiple outcomes. To illustrate, let’s consider the joint distribution of only two
random variables. Let (y, z) denote a vector of two discrete random variables. The
joint pmf of (y, z) is defined as follows:

f(y, z) = Pr(Y = y, Z = z),

where we suppress the conditioning on the parameter(s) needed to characterize the
distribution of (y, z). The notation for the joint pdf of two continuous random
variables is identical (i.e., f(y, z)). Now suppose we want to calculate a marginal
pmf or marginal pdf for each random variable. If y and z are discrete-valued, the
marginal pmf is calculated by summation:

f(y) = \sum_z f(y, z)
f(z) = \sum_y f(y, z).
If y and z are continuous random variables, their marginal pdfs are computed by
integration:

f(y) = \int_{-\infty}^{\infty} f(y, z) \, dz
f(z) = \int_{-\infty}^{\infty} f(y, z) \, dy.
Statistical models are often formulated in terms of conditional outcomes;
therefore, we often need to calculate conditional probabilities, such as Pr(Y = y | Z = z)
or Pr(Z = z | Y = y). Fortunately, conditional pmfs and conditional pdfs
are easily calculated from joint and marginal distribution functions. In particular,
the conditional pmf (or pdf) of y given z is

f(y|z) = f(y, z) / f(z).

Likewise, the conditional pmf (or pdf) of z given y is

f(z|y) = f(y, z) / f(y).
The above formulae may not seem useful now, but they will be used extensively
in later chapters, particularly in the construction of hierarchical models. However,
one immediate use is in evaluating the consequences of independence. Suppose
random variables y and z are assumed to be independent; thus, knowing the value
of z gives us no additional information about the value of y and vice versa. Then,
by definition, the joint pmf (or pdf) of the vector (y, z) equals the product of
the marginal pmfs (pdfs) as follows:

f(y, z) = f(y) f(z).

Now, recalling the definition of the conditional pmf (or pdf) of y given z and
substituting the above expression yields

f(y|z) = f(y, z) / f(z) = f(y) f(z) / f(z) = f(y).

Therefore, the conditional probability (or probability density) of y given z is
identical to the marginal probability (or probability density) of y. This result
confirms that knowledge of z provides no additional information about y when
the random variables y and z are independent.
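A small numerical example (ours, with arbitrary distributions) makes this concrete: build a joint pmf for two independent discrete random variables as the product of their marginals and verify that conditioning on z does not change the distribution of y.

# Independent y ~ Bern(0.3) and z ~ Po(2) (z truncated at 10 for illustration)
f.y <- dbinom(0:1, size = 1, prob = 0.3)   # marginal pmf of y
f.z <- dpois(0:10, lambda = 2)             # marginal pmf of z
f.yz <- outer(f.y, f.z)                    # joint pmf f(y, z) = f(y) f(z)

f.y.given.z <- f.yz[, 3] / sum(f.yz[, 3])  # conditional pmf of y given z = 2
cbind(f.y.given.z, f.y)                    # identical columns: f(y|z) = f(y)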
One well-known application of the laws of probability is called the law of total
probability. To illustrate, let's consider two discrete random variables, y and z,
and suppose z has n distinct values, z_1, z_2, . . . , z_n. The law of total probability
corresponds to the expression required for calculating the marginal probability f(y)
given only the conditional and marginal pmfs, f(y|z) and f(z), respectively. We
know how to calculate f(y) from the joint pmf f(y, z):

f(y) = \sum_{z=z_1}^{z_n} f(y, z).

By definition of conditional probability, f(y, z) = f(y|z) f(z). Therefore,
substituting this expression into the right-hand side of the above equation yields

f(y) = \sum_{z=z_1}^{z_n} f(y|z) f(z)
     = f(y|z_1) f(z_1) + f(y|z_2) f(z_2) + · · · + f(y|z_n) f(z_n),
which is the law of total probability. In this equation f(y) can be thought of
as a weighted average (or expectation) of the conditional probabilities where the
weights correspond to the marginal probabilities f(z). The law of total probability
is often used to remove latent random variables from hierarchical models so that
the parameters of the model can be estimated. We will see several examples of this
type of marginalization in subsequent chapters.
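As an informal illustration (our own) of this kind of marginalization, the following R sketch removes a latent Poisson variable N from a simple hierarchical model by summing over its possible values; in this particular case the marginal distribution of the observation is known to be Poisson with mean λp, which provides a check.

# Hierarchical model: N ~ Po(lambda) is latent, y | N ~ Bin(N, p) is observed.
# Law of total probability: f(y) = sum_N f(y|N) f(N); here f(y) = Po(y | lambda*p).
lambda <- 5
p <- 0.4
y <- 3
N <- y:100                                  # values of N for which f(y|N) > 0 (truncated)
f.y <- sum(dbinom(y, size = N, prob = p) * dpois(N, lambda))
c(f.y, dpois(y, lambda * p))                # the two probabilities agree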
2.2 THE ROLE OF APPROXIMATING MODELS
Inference begins with the data observed in a sample. These data may include field
records from an observational study (a survey), or they may be outcomes of a
designed experiment. In either case the observed data are manifestations of at least
two processes: the sampling or experimental procedure, i.e., the process used to
collect the data, and the ecological process that we hope to learn about.
Proponents of model-based inference base their conclusions on one or more
approximating models of the data. These models should account for the
observational (or data-gathering) process and for the underlying ecological process.
Let's consider a simple example. Suppose we randomly select n individual animals
from a particular location and measure each animal's body mass y_i (i = 1, . . . , n)
for the purpose of estimating the mean body mass of animals at that location.
In the absence of additional information, such as the age or size of an individual,
we might be willing to entertain a relatively simple model of each animal's mass,
e.g., y_i ∼ N(µ, σ²). This model has two parameters, µ and σ, and we cannot
hope to estimate them from the body mass of a single individual; therefore, an
additional modeling assumption is required. Because animals were selected at
random to obtain a representative sample of those present, it seems reasonable
to assume mutual independence among the n measurements of body mass. Given
this additional assumption, an approximating model of the sample data is

[y_1, y_2, . . . , y_n | µ, σ] = \prod_{i=1}^{n} N(y_i | µ, σ²).    (2.2.1)

Therefore, the joint pdf of observed body masses is modeled as a product of identical
marginal pdfs (in this case N(y|µ, σ²)). We describe such random variables as
independent and identically distributed (abbreviated as iid) and use the notation
y_i ∼iid N(µ, σ²) as a compact summary of these assumptions.
Now that we have formulated a model of the observed body masses, how is the
model used to estimate the mean body mass of animals that live at the sampled
locations? After all, that is the real inferential problem in our example. To answer
this question, we must recognize the connection between the parameters of the
model and the scientifically relevant estimand, the mean body mass of animals in the
population. In our model-based view each animal’s body mass is a random variable,
and we can prove that E(y_i) = µ under the assumptions of the approximating model.
Therefore, in estimating µ we solve the inference problem.
This equivalence may seem rather obvious; however, if we had chosen a different
approximating model of body masses, the mean body mass would necessarily be
specified in terms of that model’s parameters. For example, suppose the sample of
n animals includes both males and females but we don't observe the sex of each
individual. If we know that males and females of similar ages have different average
masses and if we have reason to expect an uneven sex ratio, then we might consider
a mixture of two normal distributions as an approximating model:

y_i ∼iid p N(µ_m, σ²) + (1 − p) N(µ_f, σ²),

where p denotes the unknown proportion of males in the population and µ_m and
µ_f denote the mean body masses of males and females, respectively. Under the
assumptions of this model we can show that E(y_i) = p µ_m + (1 − p) µ_f; therefore, to
solve the inference problem of estimating mean body mass, we must estimate the
model parameters p, µ_m, and µ_f.
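A brief simulation sketch (our own, with arbitrary parameter values) confirms that data generated from this mixture have mean p µ_m + (1 − p) µ_f:

# Simulate body masses from the two-component normal mixture and compare the
# sample mean with E(y) = p*mu.m + (1 - p)*mu.f
set.seed(1)
n <- 1e5
p <- 0.6; mu.m <- 9; mu.f <- 6; sigma <- 1
sex <- rbinom(n, size = 1, prob = p)                           # latent sex indicator
y <- rnorm(n, mean = ifelse(sex == 1, mu.m, mu.f), sd = sigma)
c(mean(y), p * mu.m + (1 - p) * mu.f)                          # both approximately 7.8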
We have used this example to illustrate the crucial role of modeling in inference
problems. The parameters of a model specify the theoretical properties of random
variables, and we can use those properties to deduce how a model’s parameters
are related to one or more scientifically relevant estimands. Often, but not always,
these estimands may be formulated as summaries of observations, such as the sample
mean. In these cases it is important to remember that such summary statistics may
be related to the parameters of a model, but the two are not generally equivalent.
2.3 CLASSICAL (FREQUENTIST) INFERENCE
In the previous section we established the role of modeling in inference. Here we
describe classical procedures for estimating model parameters and for using the
estimates to make some kind of inference or prediction.
Let y = (y_1, . . . , y_n) denote a sample of n observations. Suppose we develop an
approximating model of y that contains a (possibly vector-valued) parameter θ. The
model is a formal expression of the processes that are assumed to have produced the
observed data. In classical inference the model parameter θ is assumed to have a
fixed, but unknown, value. The observed data y are regarded as a single realization
of the stochastic processes specified in the model. Similarly, any summary of y,
such as the sample mean ȳ, is viewed as a random outcome.

Now suppose we have a procedure or method for estimating the value of θ given
the information in the sample, i.e., given y. In classical statistics such procedures are
called estimators, and the result of their application to a particular data set yields
an estimate θ̂ of the fixed parameter θ. Of course, different estimators can produce
different estimates given the same set of data, and considerable statistical theory
has been developed to evaluate the operating characteristics of different estimators
(e.g., bias, mean squared error, etc.) in different inference problems. However,
regardless of the estimator chosen for analysis, classical inference views the estimate
θ̂ as a random outcome because it is a function of y, which also is regarded as a
random outcome.
To make inferences about θ, classical statistics appeals to the idea of hypothetical
outcomes under repeated sampling. In other words, classical statistics views θ̂
as a single outcome that belongs to a distribution of estimates associated with
hypothetical repetitions of an experiment or survey. Under this view, the fixed
value θ and the assumptions of the model represent a mechanism for generating a
random, hypothetical sequence of data sets and parameter estimates:

(y_1, θ̂_1), (y_2, θ̂_2), (y_3, θ̂_3), . . .

Therefore, probability statements about θ (i.e., inferences) are made with
respect to the distribution of estimates of θ that could have been obtained in
repeated samples.
For this reason those who practice classical statistics are often referred to as
frequentists. In classical statistics the role of probability in computing inferences
is based on the relative frequency of outcomes in repeated samples (experiments
or surveys). Frequentists never use probability directly as an expression of degrees
of belief in the magnitude of θ. Probability statements are based entirely on the
hypothetical distribution of θ̂ generated under the model and repeated sampling.
We will have more to say about the philosophical and practical differences that
separate classical statistics and Bayesian statistics later in this chapter.
2.3.1 Maximum Likelihood Estimation
We have not yet described an example of a model-based estimator, i.e., a procedure
for estimating θ given the observed data y. One of the most widely adopted
examples in all of classical statistics is the maximum likelihood estimator (MLE),
which can be traced to the efforts of Daniel Bernoulli and Johann Heinrich Lambert
in the 18th century (Edwards, 1974). However, credit for the MLE is generally
awarded to the brilliant scientist, Ronald Aylmer Fisher, who in the early 20th
century fully developed the MLE for use in inference problems (Edwards, 1992).
Among his many scientific contributions, Fisher invented the concept of likelihood
and described its application in point estimation, hypothesis testing, and other
inference problems. The concept of likelihood also has connections to Bayesian
inference. Therefore, we have chosen to limit our description of classical statistics
to that of likelihood-based inference.
Let's assume, without loss of generality, that the observed data y are modeled
as continuous random variables and that f(y|θ) denotes the joint pdf of y given a
model indexed by the parameter θ. In many cases the observations are mutually
independent so that the joint pdf can be expressed as a product of individual pdfs
as follows:
f(y|θ) = \prod_{i=1}^{n} g(y_i|θ).

However, the concept of likelihood applies equally well to samples of dependent
observations, so we will not limit our notation to cases of independence.

To define the MLE of θ, Fisher viewed the joint pdf of y as a function of θ,

L(θ|y) ≡ f(y|θ),

which he called the likelihood function. The MLE of θ, which we denote by θ̂, is
defined as the particular value of θ that maximizes the likelihood function L(θ|y).
Heuristically, one can think of θ̂ as the value of θ that is most likely given the
data because θ̂ assigns the highest chance to the observations in y. Fisher always
intended the likelihood L(θ|y) to be interpreted as a measure of relative support for
different hypotheses (i.e., different values of θ); he never considered the likelihood
function to have an absolute scale, such as a probability. This distinction provides
an important example of the profound differences between the classical and Bayesian
views of inference, as we shall see later (Section 2.4).
The MLE of θ is, by definition, the solution of an optimization problem. In some
cases this problem can be solved analytically using calculus, which allows the MLE
to be expressed in closed form as a function of y. If this is not possible, numerical
methods of optimization must be used to compute an approximation of θ̂; however,
these calculations can usually be done quite accurately and quickly with modern
computers and optimization algorithms.
2.3.1.1 Example: estimating the probability of occurrence (analytically)
Suppose a survey is designed to estimate the average occurrence of an animal species
in a region. As part of the survey, the region is divided into a lattice of sample units
of uniform size and shape, which we will call ‘locations’, for simplicity. Now imagine
that n of these locations are selected at random and that we are able to determine
with certainty whether the species is present (y = 1) or absent (y = 0) at each
location.¹

¹ Observations of animal occurrence are rarely made with absolute certainty when sampling
natural populations; however, we assume certainty in this example to keep the model simple.

On completion of the survey, we have a sample of binary observations y =
(y_1, . . . , y_n). Because the sample locations are selected at random, it seems
reasonable to assume that the observations are mutually independent. If we further
assume that the probability of occurrence is identical at each location, this implies
a rather simple model of the data:
y_i ∼iid Bern(ψ),    (2.3.1)

where ψ denotes the probability of occurrence.
We could derive the MLE of ψ based on Eq. (2.3.1) alone; however, we can take a
mathematically equivalent approach by considering the implications of Eq. (2.3.1)
for a summary of the binary data. Let v = \sum_{i=1}^{n} y_i denote the total number of
sample locations where the species is present. This definition allows the information
in y to be summarized as the frequency of ones, v, and the frequency of zeros,
n − v. Given the assumptions in Eq. (2.3.1), the probability of any particular set
of observations in y is

ψ^v (1 − ψ)^{n−v}    (2.3.2)

for v = 0, 1, . . . , n. However, to formulate the model in terms of v, we must account
for the total number of ways that v ones and n − v zeros could have been observed,
which is given by the combinatorial

\binom{n}{v} = n! / (v! (n − v)!).    (2.3.3)

Combining Eqs. (2.3.2) and (2.3.3) yields the total probability of observing v ones
and n − v zeros independent of the ordering of the binary observations in y:

f(v|ψ) = \binom{n}{v} ψ^v (1 − ψ)^{n−v}.    (2.3.4)

The astute reader will recognize that f(v|ψ) is just the pmf of a binomial distribution
with index n and parameter ψ (see Table 2.1).

This simple model provides the likelihood of ψ given v (and the sample size n),

L(ψ|v) = ψ^v (1 − ψ)^{n−v},    (2.3.5)

where the combinatorial term has been omitted; under Fisher's definition of
likelihood it may be ignored because \binom{n}{v} does not involve ψ and only
contributes a multiplicative constant. The MLE of ψ is the value of ψ that maximizes
L(ψ|v). Because L(ψ|v) is non-negative for admissible values of ψ, the MLE of
ψ also maximizes

log L(ψ|v) = v log ψ + (n − v) log(1 − ψ).    (2.3.6)

This result stems from the fact that the natural logarithm is a one-to-one, monotone
function of its argument. The MLE of ψ is the solution of either of the following
equations,

dL(ψ|v)/dψ = 0    or    d log L(ψ|v)/dψ = 0,

and equals ψ̂ = v/n. Therefore, the MLE of ψ is equivalent to the sample mean of
the binary observations: ȳ = (1/n) \sum_{i=1}^{n} y_i = v/n.
2.3.1.2 Example: estimating the probability of occurrence (numerically)
In the previous example we were able to derive ψ̂ in closed form; however, in
many estimation problems the MLE cannot be obtained as the analytic solution
of a differential equation (or system of differential equations for models with ≥ 2
parameters). In such problems the MLE must be estimated numerically.

To illustrate, let's consider the previous example and behave as though we could
not have determined that ψ̂ = v/n. Suppose we select a random sample of n = 5
locations and observe a species to be present at only v = 1 of those locations. How
do we compute the MLE of ψ numerically?

One possibility is a brute-force calculation. Because ψ is bounded on the interval
[0, 1], we can evaluate the log-likelihood function in Eq. (2.3.6) for an arbitrarily
large number of ψ values that span this interval (e.g., ε, 2ε, . . . , 1 − ε, where ε > 0 is
an arbitrarily small, positive number). Then, we identify ψ̂ as the particular value
of ψ with the highest log-likelihood (see Figure 2.1). Panel 2.1 contains the R code
needed to do these calculations and yields ψ̂ = 0.2, which is assumed to be correct
within ±10⁻⁶ (= ε). In fact, the answer is exactly correct, since v/n = 1/5 = 0.2.
Another possibility for computing a numerical approximation of ψ̂ is to employ an
optimization algorithm. For example, R contains two procedures, nlm and optim,
which can be used to find the minima of functions of arbitrary form. Of course,
these procedures can also be used to find maxima by simply changing the sign of the
objective function. For example, maximizing a log-likelihood function is equivalent
to minimizing a negative log-likelihood function. In addition to defining the function
to be minimized, R's optimization algorithms require a starting point. Ideally, the
starting point should approximate the solution. Panel 2.2 contains an example
of R code that uses optim to do these calculations and produces an estimate of
ψ̂ that is very close to the correct answer. Note, however, that R also reports a
warning message upon fitting the model. During the optimization optim apparently
Figure 2.1. Log likelihood for a binomial outcome (v = 1 successes in n = 5 trials) evaluated
over the admissible range of ψ values. Dashed vertical line indicates the MLE.
> v=1
> n=5
> eps=1e-6
> psi = seq(eps, 1-eps, by=eps)
> logLike = v*log(psi) + (n-v)*log(1-psi)
> psi[logLike==max(logLike)]
[1] 0.2
Panel 2.1. R code for brute-force calculation of MLE of ψ.
attempted to evaluate the function dbinom using values of ψ that were outside the
interval [0,1]. In this case the message can be ignored, but it’s an indication that
computational improvements are possible. We will return to this issue later.
2.3.1.3 Example: estimating parameters of normally distributed data
In Section 2.2 we described an example where body mass was measured for each of
n randomly selected animals. As a possible model, we assumed that the body
masses in the sample were independent and identically distributed as follows:
y_i ∼iid N(µ, σ²). Here, we illustrate methods for computing the MLE of the parameter
vector (µ, σ²).
Given our modeling assumptions, the joint pdf of the observed body masses y is
a product of identical marginal pdfs, N(y_i|µ, σ²), as noted in Eq. (2.2.1). Therefore,
the likelihood function for these data is

L(µ, σ²|y) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) \sum_{i=1}^{n} (y_i − µ)² ).    (2.3.7)

To find the MLE, we need to solve the following set of simultaneous equations:

∂ log L(µ, σ²)/∂µ = 0
∂ log L(µ, σ²)/∂σ² = 0.

It turns out that an analytical solution exists in closed form. The MLE is
(µ̂, σ̂²) = (ȳ, ((n−1)/n) s²), where ȳ and s² denote the sample mean and variance,
respectively, of the observed body masses. Note that σ̂² is strictly less than s²,
the usual (unbiased) estimator of σ²; however, the difference becomes negligible as
sample size n increases.
Suppose we could not have found the MLE of (µ, σ²) in closed form. In this
case we need to compute a numerical approximation. We could try the brute-force
approach used in the earlier example, but in this example we would need to evaluate
the log-likelihood function over 2 dimensions. Furthermore, the parameter space is
unbounded (R × R⁺), so we would need to restrict evaluations of the log-likelihood
to be in the vicinity of the MLE, which we do not know!

It turns out that the brute-force approach is seldom feasible, particularly as
the number of model parameters becomes large. Therefore, numerical methods of
> v=1
> n=5
> neglogLike = function(psi) -dbinom(v, size=n, prob=psi, log=TRUE)
> fit = optim(par=0.5, fn=neglogLike, method='BFGS')
Warning messages: 1: NaNs produced in: dbinom(x, size, prob, log) 2:
NaNs produced in: dbinom(x, size, prob, log)
>
> fit$par
[1] 0.2000002
>
Panel 2.2. R code for numerically maximizing the likelihood function to approximate ψ̂.
optimization are often used to compute an approximation of the MLE. To illustrate,
suppose we observe the following body masses of 10 animals:

y = (8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58),

which have sample mean ȳ = 7.035 and sample variance s² = 3.638. Panel 2.3
contains R code for approximating the MLE and yields the estimates µ̂ = 7.035
and σ̂² = 3.274, which are the correct answers.
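As an additional check (ours, not part of the original text), the closed-form MLEs can be computed directly from these data and compared with the numerical results reported above:

# Closed-form MLEs for the normal model: mu.hat = ybar, sigma2.hat = ((n-1)/n) s^2
y <- c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)
n <- length(y)
mu.hat <- mean(y)                       # 7.035
sigma2.hat <- (n - 1) / n * var(y)      # 3.274; var() returns the unbiased estimate s^2
c(mu.hat, sigma2.hat)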
2.3.2 Properties of MLEs
Maximum likelihood estimators have several desirable properties. In this section we
describe these properties and illustrate their consequences in inference problems.
In particular, we show how MLEs are used in the construction of confidence
intervals.
2.3.2.1 Invariance to reparameterization
If θ̂ is the MLE of θ, then for any one-to-one function τ(θ), the MLE of τ(θ)
is τ(θ̂). This invariance to reparameterization can be extremely helpful in computing
> y = c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)
>
> neglogLike = function(param) {
+   mu = param[1]
+   sigma = exp(param[2])
+   -sum(dnorm(y, mean=mu, sd=sigma, log=TRUE))
+ }
>
> fit = optim(par=c(0,0), fn=neglogLike, method='BFGS')
> fit$par
[1] 7.0350020 0.5930949
>
> exp(fit$par[2])^2
[1] 3.274581
>

Panel 2.3. R code for numerically maximizing the likelihood function to approximate (µ̂, log σ̂).
MLEs by numerical approximation. For example, let γ = τ(θ) and suppose we can
compute γ̂ that maximizes L_1(γ|y) easily; then, by the property of invariance we
can deduce that θ̂ = τ^{−1}(γ̂) maximizes L_2(θ|y) without actually computing the
solution of dL_2(θ|y)/dθ = 0, which may involve numerical difficulties.

To illustrate, let's consider the problem of estimating the probability of occurrence
ψ that we described in Section 2.3.1.1. The logit function, a one-to-one transformation
of ψ, is defined as follows:

logit(ψ) = log( ψ / (1 − ψ) )

and provides a mapping from the domain of ψ ([0, 1]) to the entire real line. Let
θ = logit(ψ) denote a reparameterization of ψ. We can maximize the likelihood of θ
given v to obtain θ̂ and then calculate ψ̂ by inverting the transformation as follows:

ψ̂ = logit^{−1}(θ̂) = 1/(1 + exp(−θ̂)).

We will use this inversion often; therefore, in the remainder of this book we let
expit(θ) denote the function logit^{−1}(θ) as a matter of notational convenience.
Panel 2.4 contains R code for computing θ̂ by numerical optimization and for
computing ψ̂ by inverting the transformation. Notice that the definition of the R
function neglogLike is identical to that used earlier (Panel 2.2) except that we
have substituted theta for psi as the function's argument and expit(theta) for
psi in the body of the function. Therefore, the extra coding required to compute
ψ on the logit scale is minimal. Notice also in Panel 2.4 that in maximizing
the likelihood function of θ, R did not produce the somewhat troubling warning
messages that appeared in maximizing the likelihood function of ψ (cf. Panel 2.2).
The default behavior of R's optimization functions, optim and nlm, is to provide
an unconstrained minimization wherein no constraints are placed on the magnitude
of the argument of the function being minimized. In other words, if the function's
argument is a vector of p components, their value is assumed to lie anywhere in R^p.
In our example the admissible values of θ include the entire real line; in contrast,
the admissible values of ψ are confined to a subset of the real line ([0, 1]).

The lesson learned from this example is that when using unconstrained
optimization algorithms to maximize a likelihood function of p parameters, one
should typically try to formulate the likelihood so that the parameters are defined
in R^p. The invariance of MLEs always allows us to back-transform the parameter
estimates if that is necessary in the context of the problem.
2.3.2.2 Consistency
Suppose the particular set of modeling assumptions summarized in the joint pdf
f(y|θ) is true, i.e., the approximating model of the data y correctly describes the
process that generated the data. Under these conditions, we can prove that θ̂,
the MLE of θ, converges to θ as the sample size n increases, which we denote
mathematically as follows: θ̂ → θ as n → ∞.

Although the assumptions of an approximating model are unlikely to hold exactly,
it is reassuring to know that with enough data, the MLE is guaranteed to provide
the 'correct' answer.
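A quick simulation (our own addition) illustrates this behavior for the occupancy example: as the number of sample locations grows, the MLE ψ̂ = v/n settles down around the true value of ψ.

# Consistency of psi.hat = v/n in the Bernoulli occupancy model
set.seed(1)
psi <- 0.3                                   # 'true' probability of occurrence
n <- c(10, 100, 1000, 100000)                # increasing sample sizes
psi.hat <- sapply(n, function(k) mean(rbinom(k, size = 1, prob = psi)))
round(cbind(n, psi.hat), 3)                  # psi.hat approaches 0.3 as n grows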
> v=1
> n=5
> expit = function(x) 1/(1+exp(-x))
>
> neglogLike = function(theta) -dbinom(v, size=n, prob=expit(theta), log=TRUE)
> fit = optim(par=0, fn=neglogLike, method='BFGS')
>
> fit$par
[1] -1.386294
>
> expit(fit$par)
[1] 0.2

Panel 2.4. R code for numerically maximizing the likelihood function to estimate θ̂ = logit(ψ̂).
2.3.2.3 Asymptotic normality
As in the previous section, suppose the particular set of modeling assumptions
summarized in the joint pdf f(y|θ) is true. If, in addition, a set of 'regularity
conditions' that have to do with technical details² are satisfied, we can prove the
following limiting behavior of the MLE of θ: in a hypothetical set of repeated
samples with θ fixed and with n → ∞,

(θ̂ − θ) | θ ∼ N(0, [I(θ̂)]^{−1}),    (2.3.8)

where I(θ̂) = −∂² log L(θ|y)/∂θ ∂θ evaluated at θ = θ̂ is called the observed information.

² Such as identifiability of the model's parameters and differentiability of the likelihood function.
See page 516 of Casella and Berger (2002) for a complete list of conditions.
If θ is a vector of p parameters, then I(θ̂) is a p × p matrix called the observed
information matrix.

According to Eq. (2.3.8), the distribution of the discrepancy, θ̂ − θ, obtained under
repeated sampling is approximately normal with mean zero as n → ∞. Therefore,
θ̂ is an asymptotically unbiased estimator of θ. Similarly, Eq. (2.3.8) implies that
the inverse of the observed information provides the estimated asymptotic variance
(or asymptotic covariance matrix) of θ̂.

The practical utility of asymptotic normality is evident in the construction of
100(1 − α) percent confidence intervals for θ. For example, suppose θ is scalar-valued;
then in repeated samples, the random interval

θ̂ ± z_{1−α/2} ([I(θ̂)]^{−1})^{1/2}    (2.3.9)

'covers' the fixed value θ 100(1 − α) percent of the time, provided n is sufficiently
large. Here, z_{1−α/2} denotes the (1 − α/2) quantile of a standard normal distribution.
Note that Eq. (2.3.9) does not imply that any individual confidence interval includes
θ with probability 1 − α. This misinterpretation of the role of probability is an
all-too-common occurrence in applications of statistics. An individual confidence
interval either includes θ or it doesn't. A correct probability statement (or inference)
refers to the proportion of confidence intervals that include θ in a hypothetical,
infinitely long sequence of repeated samples. In this sense 1 − α is the probability
(relative frequency) that an interval constructed using Eq. (2.3.9) includes the fixed
value θ.
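The repeated-sampling interpretation is easy to check by simulation. The sketch below (ours, using the occupancy example) generates many binomial data sets with ψ fixed, constructs the interval of Eq. (2.3.9) for each, and reports the proportion of intervals that cover the true value.

# Empirical coverage of the 95 percent interval psi.hat +/- 1.96*sqrt(psi.hat*(1-psi.hat)/n)
set.seed(1)
psi <- 0.3                                   # fixed, 'true' value
n <- 100                                     # sample size in each hypothetical survey
nsim <- 10000                                # number of repeated samples
v <- rbinom(nsim, size = n, prob = psi)
psi.hat <- v / n
se <- sqrt(psi.hat * (1 - psi.hat) / n)
covered <- (psi.hat - 1.96 * se <= psi) & (psi <= psi.hat + 1.96 * se)
mean(covered)                                # close to (typically slightly below) 0.95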
Example: estimating the probability of occurrence
As an illustration, let's compute a 95 percent confidence interval for ψ, the
probability of occurrence, that was defined earlier in an example (Section 2.3.1.1).
The information is easily derived using calculus:

−d² log L(ψ|v)/dψ² = I(ψ) = n / (ψ(1 − ψ)).

The model has only one parameter ψ; therefore, we simply take the reciprocal of
I(ψ) to compute its inverse. Substituting ψ̂ for ψ yields the 95 percent confidence
interval for ψ:

ψ̂ ± 1.96 \sqrt{ ψ̂(1 − ψ̂) / n }.
Suppose we had not been able to derive the observed information or to compute
its inverse analytically. In this case we would need to compute a numerical
approximation of [I(ψ̂)]^{−1}. Panel 2.5 contains the R code for computing the MLE of
ψ and a 95 percent confidence interval having observed only v = 1 occupied site in a
sample of n = 5 sites. As before, we estimate ψ̂ = 0.20. A numerical approximation
of I(ψ̂) is computed by adding hessian=TRUE to the list of optim's arguments.
After rounding, our R code yields the following 95 percent confidence interval for
ψ: [−0.15, 0.55]. This is the correct answer, but it includes negative values of ψ,
which don't really make sense because ψ is bounded on [0, 1] by definition.

One solution to this problem is to compute a confidence interval for θ = logit(ψ)
and then transform the upper and lower confidence limits back to the ψ scale (see
Panel 2.6). This approach produces an asymmetrical confidence interval for ψ
([0.027, 0.691]), but the interval is properly contained in [0, 1].

Another solution to the problem of nonsensical confidence limits is to use a
procedure which produces limits that are invariant to reparameterization. We will
describe such procedures in the context of hypothesis testing (see Section 2.5). For
now, we simply note that confidence limits computed using these procedures and
those computed using Eq. (2.3.9) are asymptotically equivalent. The construction
of intervals based on Eq. (2.3.9) is far more popular because the confidence limits
are relatively easy to compute. In contrast, the calculation of confidence limits
based on alternative procedures is more challenging in many instances.

Before leaving our example of interval estimation for ψ, let's examine the influence
of sample size. Suppose we had examined a sample of n = 50 randomly selected
> v=1
> n=5
> neglogLike = function(psi) -dbinom(v, size=n, prob=psi, log=TRUE)
> fit = optim(par=0.5, fn=neglogLike, method='BFGS', hessian=TRUE)
Warning messages: 1: NaNs produced in: dbinom(x, size, prob, log) 2:
NaNs produced in: dbinom(x, size, prob, log)
>
> fit$par
[1] 0.2000002
>
> psi.mle = fit$par
> psi.se = sqrt(1/fit$hessian)
> zcrit = qnorm(.975)
> c(psi.mle-zcrit*psi.se, psi.mle+zcrit*psi.se)
[1] -0.1506020 0.5506024
>
Panel 2.5. R code for computing a 95 percent confidence interval for ψ.
locations (a tenfold increase in sample size) and had observed the species to be
present at v = 10 of these locations. Obviously, our estimate of ψ is unchanged
because ψ̂ = 10/50 = 0.2. But how has the uncertainty in our estimate changed?

Earlier we showed that I(ψ) = n/(ψ(1 − ψ)). Because ψ̂ is identical in both
samples, we may conclude that the observed information in the sample of n = 50
locations is ten times higher than that in the sample of n = 5 locations, as
shown in Table 2.3. In fact, this is easily illustrated by plotting the log-likelihood
functions for each sample (Figure 2.2). Notice that the curvature of the log-likelihood
function in the vicinity of the MLE is greater for the larger sample
(n = 50). This is consistent with the differences in observed information because
I(ψ̂) is the negative of the second derivative of log L(ψ|v) evaluated at ψ̂, which
essentially measures the curvature of log L(ψ|v) at the MLE. The log-likelihood
function decreases with distance from ψ̂ more rapidly in the sample of n = 50
locations than in the sample of n = 5 locations; therefore, we might expect the
estimated precision of ψ̂ to be higher in the larger sample. This is exactly the
case; the larger sample yields a narrower confidence interval for ψ. Table 2.3 and
Figure 2.2 also illustrate the effects of parameterizing the log-likelihood in terms
of θ = logit(ψ). The increase in observed information associated with the larger
sample is the same (a tenfold increase), and this results in a narrower confidence
interval. The confidence limits for ψ based on the asymptotic normality of θ̂ are
not identical to those based on the asymptotic normality of ψ̂, as mentioned earlier;
however, the discrepancy between the two confidence intervals is much lower in the
larger sample.
> v=1
> n=5
> expit = function(x) 1/(1+exp(-x))
>
> neglogLike = function(theta) -dbinom(v, size=n, prob=expit(theta), log=TRUE)
> fit = optim(par=0, fn=neglogLike, method='BFGS', hessian=TRUE)
>
> theta.mle = fit$par
> theta.se = sqrt(1/fit$hessian)
> zcrit = qnorm(.975)
> expit(c(theta.mle-zcrit*theta.se, theta.mle+zcrit*theta.se))
[1] 0.02718309 0.69104557
>
Panel 2.6. R code for computing a 95 percent confidence interval for ψ by back-transforming
the lower and upper limits of θ = logit(ψ).
Figure 2.2. Comparison of log-likelihood functions for binomial outcomes based on different
sample sizes, n = 5 (dashed line) and n = 50 (solid line), and different parameterizations, ψ
(upper panel) and θ = logit(ψ) (lower panel). Dashed vertical line indicates the MLE, which is
identical in both samples.
Table 2.3. Effects of sample size n and parameterization on 95 percent confidence intervals
for ψ.

n     I(ψ̂)      [I(ψ̂)]^{−1}   95% C.I. for ψ
5     31.25     0.0320        [−0.15, 0.55]
50    312.50    0.0032        [0.09, 0.31]

n     I(θ̂)      [I(θ̂)]^{−1}   95% C.I. for ψ = expit(θ)
5     0.8       1.250         [0.03, 0.69]
50    8.0       0.125         [0.11, 0.33]
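The entries in Table 2.3 are straightforward to reproduce; the following R sketch (our own check) computes the information and the two flavors of interval for both sample sizes.

# Reproduce Table 2.3: Wald intervals for psi on the probability and logit scales
expit <- function(x) 1 / (1 + exp(-x))
zcrit <- qnorm(0.975)
for (n in c(5, 50)) {
  v <- n / 5                                  # v = 1 when n = 5, v = 10 when n = 50
  psi.hat <- v / n                            # 0.2 in both samples
  I.psi <- n / (psi.hat * (1 - psi.hat))      # information, psi scale
  I.theta <- n * psi.hat * (1 - psi.hat)      # information, logit scale
  theta.hat <- log(psi.hat / (1 - psi.hat))
  ci.psi <- psi.hat + c(-1, 1) * zcrit / sqrt(I.psi)
  ci.theta <- expit(theta.hat + c(-1, 1) * zcrit / sqrt(I.theta))
  cat(n, I.psi, round(ci.psi, 2), I.theta, round(ci.theta, 2), "\n")
}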
Example: estimating parameters of normally distributed data
We conclude this section with a slightly more complicated example to illustrate
the construction of confidence intervals when the model contains two or more
parameters. In an earlier example, the body weights of n animals were modeled
as y_i ∼iid N(µ, σ²), and we determined that the MLE is (µ̂, σ̂²) = (ȳ, ((n−1)/n) s²), where
ȳ and s² denote the sample mean and variance, respectively, of the observed body
weights.

Suppose we want to compute a 95 percent confidence interval for the model
parameter µ. To do this, we rely on the asymptotic normality of MLEs stated
in Eq. (2.3.8). Let θ = (µ, σ²). It turns out that [I(θ̂)]^{−1} may be expressed in
closed form:

[I(θ̂)]^{−1} = \begin{bmatrix} (s²/n) · ((n−1)/n) & 0 \\ 0 & (s⁴/n²) · (2(n−1)²/n) \end{bmatrix}.

Therefore, the asymptotic normality of MLEs justifies the following approximation,

\begin{pmatrix} µ̂ − µ \\ σ̂² − σ² \end{pmatrix} ∼ N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} (s²/n) · ((n−1)/n) & 0 \\ 0 & (s⁴/n²) · (2(n−1)²/n) \end{bmatrix} \right),

which implies that the 95 percent confidence interval for µ may be computed as
follows:

ȳ ± 1.96 \sqrt{ (s²/n) · ((n−1)/n) }.

Applying this formula to the sample of n = 10 body weights yields a confidence
interval of [5.913, 8.157].
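Before turning to the numerical approach in Panel 2.7, here is a short R check (our own) of this closed-form interval using the body-mass data:

# Closed-form 95 percent confidence interval for mu, using Var(mu.hat) = (s^2/n)*((n-1)/n)
y <- c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)
n <- length(y)
se.mu <- sqrt(var(y) / n * (n - 1) / n)
mean(y) + c(-1, 1) * qnorm(0.975) * se.mu    # approximately [5.913, 8.157]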
If we had not been able to derive [I(θ̂)]^{−1} in closed form, we still could have
computed it by numerical approximation. For example, Panel 2.7 contains the R
code for computing the MLE of µ and its 95 percent confidence interval. R contains
several functions for inverting matrices, including chol2inv and solve. In Panel 2.7
we use the function chol to compute the Cholesky decomposition of I(θ̂) and then
chol2inv to compute its inverse. This procedure is particularly accurate because
I(θ̂) is a positive-definite symmetric matrix (by construction).
2.4 BAYESIAN INFERENCE
In this section we describe the Bayesian approach to model-based inference. To
facilitate comparisons with classical inference procedures, we will apply the Bayesian
approach to some of the same examples used in Section 2.3.
Let y = (y_1, . . . , y_n) denote a sample of n observations, and suppose we develop
an approximating model of y that contains a (possibly vector-valued) parameter θ.
As in classical statistics, the approximating model is a formal expression of the
processes that are assumed to have produced the observed data. However, in the
Bayesian view the model parameter θ is treated as a random variable and the
approximating model is elaborated to include a probability distribution for θ that
specifies one's beliefs about the magnitude of θ prior to having observed the data.
This elaboration of the model is therefore called the prior distribution.

In the Bayesian view, computing an inference about θ is fundamentally just a
probability calculation that yields the probable magnitude of θ given the assumed
> y = c(8.51, 4.03, 8.20, 4.19, 8.72, 6.15, 5.40, 8.66, 7.91, 8.58)
>
> neglogLike = function(param) {
+   mu = param[1]
+   sigma = exp(param[2])
+   -sum(dnorm(y, mean=mu, sd=sigma, log=TRUE))
+ }
>
> fit = optim(par=c(0,0), fn=neglogLike, method='BFGS', hessian=TRUE)
>
> fit$hessian
              [,1]          [,2]
[1,]  3.053826e+00 -1.251976e-05
[2,] -1.251976e-05  2.000005e+01
>
> covMat = chol2inv(chol(fit$hessian))
> covMat
             [,1]         [,2]
[1,] 3.274581e-01 2.049843e-07
[2,] 2.049843e-07 4.999987e-02
>
> mu.mle = fit$par[1]
> mu.se = sqrt(covMat[1,1])
> zcrit = qnorm(.975)
>
> c(mu.mle-zcrit*mu.se, mu.mle+zcrit*mu.se)
[1] 5.913433 8.156571

Panel 2.7. R code for computing a 95 percent confidence interval for µ.
prior distribution and given the evidence in the data. To accomplish this calculation,
the observed data y are assumed to be fixed (once the sample has been obtained),
and all inferences about θ are made with respect to the fixed observations y. Unlike
classical statistics, Bayesian inferences do not rely on the idea of hypothetical
repeated samples or on the asymptotic properties of estimators of θ. In fact,
probability statements (i.e., inferences) about θ are exact for any sample size under
the Bayesian paradigm.
2.4.1 Bayes’ Theorem and the Problem of ‘Inverse Probability’
To describe the principles of Bayesian inference in more concrete terms, it's
convenient to begin with some definitions. Let's assume, without loss of generality,
that the observed data y are modeled as continuous random variables and that
f(y|θ) denotes the joint pdf of y given a model indexed by the parameter θ. In
other words, f(y|θ) is an approximating model of the data. Let π(θ) denote the
pdf of an assumed prior distribution of θ. Note that f(y|θ) provides the probability
of the data given θ. However, once the data have been collected the value of y is
known; therefore, to compute an inference about θ, we really need the probability
of θ given the evidence in the data, which we denote by π(θ|y).

Historically, the question of how to compute π(θ|y) was called the 'problem of
inverse probability.' In the 18th century Reverend Thomas Bayes (1763) provided
a solution to this problem, showing that π(θ|y) can be calculated to update one's
prior beliefs (as summarized in π(θ)) using the laws of probability³:

π(θ|y) = f(y|θ) π(θ) / m(y),    (2.4.1)

where m(y) = \int f(y|θ) π(θ) dθ denotes the marginal probability of y. Eq. (2.4.1) is
known as Bayes' theorem (or Bayes' rule), and θ|y is called the posterior distribution
of θ to remind us that π(θ|y) summarizes one's beliefs about the magnitude of θ
after having observed the data. Bayes' theorem provides a coherent, probability-based
framework for inference because it specifies how prior beliefs about θ can be
converted into posterior beliefs in light of the evidence in the data.

Close ties obviously exist between Bayes' theorem and likelihood-based inference
because f(y|θ) is also the basis of Fisher's likelihood function (Section 2.3.1).
However, Fisher was vehemently opposed to the 'theory of inverse probability',

³ Based on the definition of conditional probability, we know [θ|y] = [y, θ]/[y] and [y|θ] =
[y, θ]/[θ]. Rearranging the second equation yields the joint pdf, [y, θ] = [y|θ][θ], which when
substituted into the first equation produces Bayes' rule: [θ|y] = ([y|θ][θ])/[y].
as applications of Bayes’ theorem were called in his day. Fisher sought inference
procedures that did not rely on the specification of a prior distribution, and he
deliberately used the term ‘likelihood’ for f(y|θ) instead of calling it a probability.
Therefore, it is important to remember that although the likelihood function
is present in both inference paradigms (i.e., classical and Bayesian), dramatic
differences exist in the way that f(y|θ) is used and interpreted.
2.4.1.1 Example: estimating the probability of occurrence
Let’s reconsider the problem introduced in Section 2.3.1.1 of computing inferences
about the probability of occurrence ψ. Our approximating model of v, the total
number of sample locations where the species is present, is given by the binomial pmf
f(v|ψ) given in Eq. (2.3.4). A prior density π(ψ) is required to compute inferences
about ψfrom the posterior density
π(ψ|v) = f(v|ψ)π(ψ)
m(v),
where m(v) = R1
0f(v|ψ)π(ψ) dψ. It turns out that π(ψ|v) can be expressed in
closed form if the prior π(ψ) = Be(ψ|a,b) is assumed, where the values of a
and b are fixed (by assumption). To be specific, this choice of prior implies
that the posterior distribution of ψis Be(a + v, b + n−v). Thus, the prior and
posterior distributions belong to the same class of distributions (in this case, the
class of beta distributions). This equivalence, known as conjugacy, identifies the
beta distribution as the conjugate prior for the success parameter of a binomial
distribution. We will encounter other examples of conjugacy throughout this book.
For now, let’s continue with the example.
Suppose we assume prior indifference in the magnitude of ψ. In other words,
before observing the data, we assume that all values of ψ are equally probable.
This assumption is specified with a Be(1,1) prior (≡ U(0,1) prior) and implies that
the posterior distribution of ψ is Be(1 + v, 1 + n − v). It's worth noting that the
mode of this distribution equals v/n, which is equivalent to the MLE of ψ obtained
in a classical, likelihood-based analysis. Now suppose a sample of n = 5 locations
contains only v = 1 occupied site; then the Be(2,5) distribution, illustrated in
Figure 2.3, may be used to compute inferences for ψ. For example, the posterior
mean and mode of ψ are 0.29 and 0.20, respectively. Furthermore, we can compute
the α/2 and 1 − α/2 quantiles of the Be(2,5) posterior and use these to obtain
a 100(1 − α) percent credible interval for ψ. (Bayesians use the term 'credible
interval' to distinguish it from the frequentist concept of a confidence interval.)
For example, the 95 percent credible interval for ψ is [0.04, 0.64].
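These summaries are easy to reproduce numerically. A minimal sketch in Python
using scipy.stats (Python and scipy are our choices for illustration only; the
quantities computed are exactly those quoted above):

from scipy.stats import beta

# Data and prior from the example: n = 5 sites, v = 1 occupied, Be(1,1) prior
n, v = 5, 1
a, b = 1, 1

# Conjugate update: the posterior is Be(a + v, b + n - v) = Be(2, 5)
post = beta(a + v, b + n - v)

print(post.mean())                    # posterior mean, about 0.29
print((a + v - 1) / (a + b + n - 2))  # posterior mode = v/n = 0.20
print(post.interval(0.95))            # 95% credible interval, about (0.04, 0.64)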
Figure 2.3. Posterior distribution for the probability of occurrence assuming a uniform prior.
Vertical line indicates the posterior mode.
We use this example to emphasize that a Bayesian credible interval and a frequentist
confidence interval have completely different interpretations. The Bayesian
credible interval is the result of a probability calculation and reflects our posterior
belief in the probable range of ψ values given the evidence in the observed data.
Thus, we might choose to summarize the analysis by saying, "the probability that
ψ lies in the interval [0.04, 0.64] is 0.95." In contrast, the probability statement
associated with a confidence interval corresponds to the proportion of confidence
intervals that contain the fixed, but unknown, value of ψ in an infinite sequence of
hypothetical, repeated samples (see Section 2.3.2). The frequentist’s interval there-
fore requires considerably more explanation and is far from a direct statement of
probability. Unfortunately, the difference in interpretation of credible intervals and
confidence intervals is often ignored in practice, much to the consternation of many
statisticians.
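To make the frequentist interpretation concrete, the 'infinite sequence of
hypothetical, repeated samples' can be approximated by simulation. The following
Python sketch is illustrative only: the true value ψ = 0.3, the sample size n = 5,
and the use of the Be(1 + v, 1 + n − v) interval as the interval procedure are all
assumptions made here, not part of the example above. The quantity it estimates,
the long-run proportion of intervals covering the fixed ψ, is what the frequentist
calibration refers to.

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
psi_true, n, reps = 0.3, 5, 10_000   # assumed true value, sample size, replicates

covered = 0
for _ in range(reps):
    v = rng.binomial(n, psi_true)                    # simulate one repeated sample
    lo, hi = beta(1 + v, 1 + n - v).interval(0.95)   # interval computed from that sample
    covered += (lo <= psi_true <= hi)

print(covered / reps)  # long-run proportion of intervals containing the fixed psi_true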
2.4.2 Pros and Cons of Bayesian Inference
Earlier we mentioned that one of the virtues of Bayesian inference is that probability
statements about θ are exact for any sample size. This is especially meaningful when
one considers that a Bayesian analysis yields the entire posterior pdf of θ, π(θ|y),
as opposed to a single point estimate of θ. Therefore, in addition to computing
summaries of the posterior, such as its mean E(θ|y) or variance Var(θ|y), any
function of θ can be calculated while accounting for all of the posterior uncertainty
in θ. The benefits of being able to manage errors in estimation in this way are
especially evident in computing inferences for latent parameters of hierarchical
models, as we will illustrate in Section 2.6, or in computing predictions that depend
on the estimated value of θ.
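For instance, in the occupancy example a posterior for any derived quantity, say
the odds of occurrence ψ/(1 − ψ) (our choice here, purely for illustration), is
obtained by transforming posterior draws of ψ; the posterior uncertainty propagates
automatically. A minimal Python sketch:

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)

# Posterior from the occupancy example: Be(2, 5), i.e., n = 5, v = 1, Be(1,1) prior
draws = beta(2, 5).rvs(size=100_000, random_state=rng)

# Derived quantity: the odds of occurrence; its posterior is just the transformed draws
odds = draws / (1 - draws)

print(odds.mean(), np.quantile(odds, [0.025, 0.975]))  # posterior mean and 95% interval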
Specification of the prior distribution may be perceived as a benefit or as a
disadvantage of the Bayesian mode of inference. In scientific problems where
prior information about θ may exist or can be elicited (say, from expert opinion),
Bayes’ theorem reveals precisely how such information may be used when computing
inferences for θ. In other (or perhaps most) scientific problems, little may be known
about the probable magnitude of θ in advance of an experiment or survey. In these
cases an objective approach would be to use a prior that places equal (or nearly
equal) probability on all values of θ. Such priors are often called ‘vague’ or ‘non-
informative.’ A problem with this approach is that priors are not invariant to
transformation of the parameters. In other words, a prior that is 'non-informative'
for θ can be quite informative for g(θ), a one-to-one transformation of θ.
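A quick numerical sketch of this point (Python; the logit is our choice of the
transformation g): draws from a U(0,1) prior on ψ induce a standard logistic, and
hence decidedly non-uniform, prior on logit(ψ) = log(ψ/(1 − ψ)).

import numpy as np

rng = np.random.default_rng(1)

psi = rng.uniform(0, 1, size=100_000)    # 'non-informative' U(0,1) prior draws for psi
logit_psi = np.log(psi / (1 - psi))      # induced prior on g(psi) = logit(psi)

# The induced prior is concentrated near 0 (standard logistic), not flat:
print(np.quantile(logit_psi, [0.1, 0.5, 0.9]))   # roughly -2.2, 0.0, 2.2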
One solution to this problem is to develop a prior that is both non-informative
and invariant to transformation of its parameters. A variety of such ‘objective
priors’, as they are currently called (see Chapter 5 of Ghosh et al. (2006)), have
been developed for models with relatively few parameters. Objective priors are often
improper (that is, ∫ π(θ) dθ = ∞); therefore, if an objective prior is to be used, the
analyst must prove that the resulting posterior distribution is proper (that
is, ∫ f(y|θ)π(θ) dθ < ∞). Such proofs often require considerable mathematical
expertise, particularly for models that contain many parameters.
A second solution to the problem of constructing a non-informative prior is to
identify a particular parameterization of the model for which a uniform (or nearly
uniform) prior makes sense. Of course, this approach is possible only if we are
able to assign scientific relevance and context to the model’s parameters. We have
found this approach to be useful in the analysis of ecological data, and we use this
approach throughout the book.
Specification of the prior distribution can be viewed as the ‘price’ paid for the
exactness of inferences computed using Bayes’ theorem. When the sample size
is small, the price of an exact inference may be high. As the size of a sample
increases, the price of an exact inference declines because the information in the
data eventually exceeds the information in the prior. We will return to this tradeoff
in the next section, where we describe some asymptotic properties of posteriors.
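A small numerical sketch of this tradeoff (Python; the Be(4,2) 'opinionated' prior
and the fixed observed proportion v/n = 0.2 are illustrative assumptions, not part
of the text): with n = 5 the posteriors under the Be(1,1) and Be(4,2) priors differ
noticeably, but with n = 500 and the same observed proportion they are nearly
indistinguishable.

from scipy.stats import beta

for n in (5, 500):
    v = int(0.2 * n)                 # keep the observed proportion fixed at v/n = 0.2
    for a, b in ((1, 1), (4, 2)):    # vague prior vs. an opinionated prior
        post = beta(a + v, b + n - v)
        print(n, (a, b), round(post.mean(), 3),
              tuple(round(q, 3) for q in post.interval(0.95)))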
2.4.3 Asymptotic Properties of the Posterior Distribution
We have noted already that the Bayesian approach to model-based inference has
several appealing characteristics. In this section we describe additional features
that are associated with computing inferences from large samples.
Figure 2.4. A normal approximation (dashed line) of the posterior distribution of the probability
of occurrence (solid line). Vertical line indicates the posterior mode.
2.4.3.1 Approximate normality
Let [θ|y] denote the posterior distribution of θ given an observed set of data y.
If a set of 'regularity conditions' that have to do with technical details, such as
identifiability of the model's parameters and differentiability of the posterior density
function π(θ|y), are satisfied, we can prove that as sample size n → ∞,
(θ − θ̂) | y ∼ N(0, [I(θ̂)]⁻¹),                                        (2.4.2)
where θ̂ is the posterior mode and I(θ̂) = −∂² log π(θ|y)/∂θ ∂θ |θ=θ̂ is called the
generalized observed information (Ghosh et al., 2006). The practical utility of this
limiting behavior is that the posterior distribution of θ can be approximated by a
normal distribution N(θ̂, [I(θ̂)]⁻¹) if n is sufficiently large. In other words, when n
is large, we can expect the posterior to become highly concentrated around the
posterior mode θ̂.
Example: estimating the probability of occurrence
Recall from Section 2.4.1.1 that the posterior mode for the probability of
occurrence was ψ̂ = v/n when a Be(1,1) prior was assumed for ψ. It is easily
proved that [I(ψ̂)]⁻¹ = ψ̂(1 − ψ̂)/n given this choice of prior; therefore, according
to Eq. (2.4.2) we can expect a N(ψ̂, ψ̂(1 − ψ̂)/n) distribution to approximate the true
posterior, a Be(1 + v, 1 + n − v) distribution, when n is sufficiently large. Figure 2.4
illustrates that the approximation holds very well for a sample of n = 50 locations,
of which v = 10 are occupied.
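The comparison shown in Figure 2.4 is easy to reproduce numerically. A minimal
Python sketch (plotting omitted; a quantile comparison stands in for the figure):

import numpy as np
from scipy.stats import beta, norm

n, v = 50, 10
psi_hat = v / n                                   # posterior mode under the Be(1,1) prior

exact = beta(1 + v, 1 + n - v)                    # exact posterior, Be(11, 41)
approx = norm(psi_hat, np.sqrt(psi_hat * (1 - psi_hat) / n))   # normal approximation

qs = [0.025, 0.25, 0.5, 0.75, 0.975]
print(exact.ppf(qs))    # quantiles of the exact posterior
print(approx.ppf(qs))   # nearly the same quantiles under the normal approximation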
The asymptotic normality of the posterior (indicated in Eq. (2.4.2)) is an
important result because it establishes formally that the relative importance of
the prior distribution must decrease with an increase in sample size. To see this,
note that I(θ) is the sum of two components, one due to the likelihood function
f(y|θ) and another due to the prior density π(θ):
I(θ) = −∂² log π(θ|y)/∂θ ∂θ
     = −∂² log f(y|θ)/∂θ ∂θ − ∂² log π(θ)/∂θ ∂θ.                    (2.4.3)
As n increases, only the magnitude of the first term on the right-hand side of
Eq. (2.4.3) increases, whereas the magnitude of the second term, which quantifies
the information in the prior, remains constant. An important consequence of this
result is that we can expect inferences to