METRON (2014) 72:201–215
DOI 10.1007/s40300-014-0044-1
Empirical Bayes methods in classical and Bayesian
inference
Sonia Petrone · Stefano Rizzelli · Judith Rousseau · Catia Scricciolo

Received: 27 April 2014 / Accepted: 5 May 2014 / Published online: 3 June 2014
© Sapienza Università di Roma 2014

S. Petrone (B) · S. Rizzelli · C. Scricciolo
Bocconi University, Milan, Italy
e-mail: sonia.petrone@unibocconi.it

J. Rousseau
CREST-ENSAE and CEREMADE, Université Paris Dauphine, Paris, France
Abstract Empirical Bayes methods are often thought of as a bridge between classical and
Bayesian inference. In fact, in the literature the term empirical Bayes is used in quite diverse
contexts and with different motivations. In this article, we provide a brief overview of empiri-
cal Bayes methods highlighting their scopes and meanings in different problems. We focus on
recent results about merging of Bayes and empirical Bayes posterior distributions that regard
popular, but otherwise debatable, empirical Bayes procedures as computationally convenient
approximations of Bayesian solutions.
Keywords Bayesian weak merging · Compound experiments · Frequentist strong merging · Hyper-parameter oracle value · Latent distribution · Maximum marginal likelihood estimation · Shrinkage estimation
1 Introduction
Empirical Bayes methods are popularly employed by researchers and practitioners and are
attractive in appearing to bridge frequentist and Bayesian approaches to inference. In fact, a
frequentist statistician would find just a formal Bayesian flavor in empirical Bayes methods,
while a Bayesian statistician would say that there is nobody less Bayesian than an empirical
Bayesian (Lindley, in [6]). Adding to the confusion, in the literature the term empirical Bayes is used in quite diverse contexts, with different motivations. Classical empirical Bayes methods arose in the context of compound experiments, where a latent distribution driving experiment-specific parameters formally acts as a prior on each such parameter and is estimated
from the data, usually by maximum likelihood. The term empirical Bayes is also used in the
context of purely Bayesian inference when hyper-parameters of a subjective prior distribution are selected through the data. Empirical Bayes estimates are also popularly employed
to deal with nuisance parameters. All these situations are different and require specific
analysis.
In this article, we give a brief overview of classical and recent results on empirical Bayes
methods, discussing their use in these different contexts. Section 2 recalls classical empirical Bayes methods in compound sampling problems and mixture models. Although arising as a way to by-pass the need of specifying a prior distribution in computing optimal Bayesian solutions [31], here the approach is purely frequentist. The empirical Bayes solution is basically shrinkage estimation, where the introduction of a latent distribution may facilitate interpretation and modeling, thus helping to design efficient shrinkage.
In a broader sense, the term empirical Bayes is used to denote a data-driven selection
of prior hyper-parameters in Bayesian inference. We discuss this case in Sect. 3. Here, the
prior distribution can only have an interpretation in terms of subjective probability on an
unknown, but fixed, parameter. Although not rigorous, it is a common practice to try to
overcome difficulties in specifying the prior distribution by plugging in some estimates of
the prior hyper-parameters. From a Bayesian viewpoint, in such cases one should rather
assign a hyper-prior distribution, which however makes computations more involved. The
empirical Bayes selection thus appears as a convenient way out that is expected to give similar
inferential results as the hierarchical Bayesian solution for large sample sizes and better results
for finite samples than a “wrong choice” of the prior hyper-parameters. Although commonly
trusted, these facts are not rigorously proved. Recent results [30] address the presumed
asymptotic equivalence of Bayesian and empirical Bayes solutions in terms of merging.
Roughly speaking, they show that, in regular parametric problems, the empirical Bayes and
the Bayesian posterior distributions generally tend to merge, that is, to be asymptotically
close, but also that possible divergent behavior may arise. Thus, the use of empirical Bayes
prior selection requires much care.
Section 4 discusses another popular use of empirical Bayes methods in problems with
nuisance parameters. We extend, in particular, the results of [30] on weak merging of
empirical Bayes procedures to nuisance parameter problems, which we illustrate with partial
linear regression models.
The result about merging recalled in Sect. 3 only gives a first-order asymptotic comparison
between empirical Bayes and any Bayesian posterior distributions. A higher-order compar-
ison would be needed to distinguish among them. We conclude the article with some hints
on the finite-sample behavior of the empirical Bayes posterior distribution in a simple but
insightful example (Sect. 5). The results suggest that, when merging holds, the empirical
Bayes posterior distribution can indeed be a computationally convenient approximation of
an efficient, in a sense to be specified, Bayesian solution.
2 Classical empirical Bayes
The introduction of the empirical Bayes method is traditionally associated with Robbins’
article [31] on compound sampling problems. Compound sampling models arise in a variety
of situations including multi-site clinical trials, estimation of disease rates in small geographical areas, and longitudinal studies. In this setting, $n$ values $\theta_1,\ldots,\theta_n$ are drawn at random from a latent distribution $G$. Then, conditionally on $\theta_1,\ldots,\theta_n$, observable random variables $X_1,\ldots,X_n$ are drawn independently from probability distributions $p(\cdot\mid\theta_1),\ldots,p(\cdot\mid\theta_n)$, respectively. The framework can thus be described:
$$X_i\mid\theta_i\overset{\text{indep}}{\sim}p(\cdot\mid\theta_i),\qquad \theta_i\mid G\overset{\text{iid}}{\sim}G(\cdot),\quad i=1,\ldots,n,$$
where the index $i$ refers to the $i$th experiment. Interest lies in estimating an experiment-specific parameter $\theta_i$ when all the $n$ observations $X_1,\ldots,X_n$ are available. For the generic $i$th experiment, one has $X_i\mid\theta_i\sim p(\cdot\mid\theta_i)$ and $\theta_i\sim G$; thus, the latent distribution $G$ formally plays the role of a prior distribution on $\theta_i$ in a Bayesian flavor. Were $G$ known, inference on $\theta_i$ would be carried out through Bayes' rule, computing the posterior distribution of $\theta_i$ given $X_i$, $dG(\theta_i\mid X_i)\propto p(X_i\mid\theta_i)\,dG(\theta_i)$, and $\theta_i$ could be estimated by the Bayes estimator with respect to squared error loss, i.e., the posterior mean $E_G[\theta_i\mid X_i]=\int\theta\,dG(\theta\mid X_i)$.
In fact, in general $G$ is unknown and the Bayes estimator $E_G[\theta_i\mid X_i]$ is not computable. One can, however, use an estimate of the "prior distribution" $G$ based on the available observations $X_1,\ldots,X_n$, which is what originated the term "empirical Bayes". Were $\theta_1,\ldots,\theta_n$ observable, their common distribution $G$ could be pointwise consistently estimated by the empirical cumulative distribution function (cdf) $\widehat G_n(\theta)=n^{-1}\sum_{i=1}^{n}\mathbf 1_{(-\infty,\theta]}(\theta_i)$. As the $\theta_i$ are not observable, the empirical Bayes approach suggests estimating $G$ from the data $X_1,\ldots,X_n$, exploiting the fact that
$$X_i\mid G\overset{\text{iid}}{\sim}f_G(\cdot)=\int p(\cdot\mid\theta)\,dG(\theta),\quad i=1,\ldots,n.$$
We still denote by $\widehat G_n$ any estimator for $G$ based on $X_1,\ldots,X_n$. As in [31], consider $i=n$, that is, estimating $\theta_n$. The unit-specific unknown $\theta_n$ can be estimated by the empirical Bayes version $E_{\widehat G_n}[\theta\mid X_n]$ of the posterior mean. The empirical Bayes methods considered in [31] have been named nonparametric empirical Bayes, because $G$ is assumed to be completely unknown, to distinguish them from the parametric empirical Bayes methods later developed by Efron and Morris [11–15], where $G$ is assumed to be known up to a finite-dimensional parameter. If $G$ is completely unknown, then the cdf $F_G(x)=\int_{-\infty}^{x}f_G(u)\,du$, $x\in\mathbb R$, can be estimated from the empirical cdf $\widehat F_n(x)=n^{-1}\sum_{i=1}^{n}\mathbf 1_{(-\infty,x]}(X_i)$ which, for every fixed $x$, tends to $F_G(x)$ as $n\to\infty$, whichever the mixing distribution $G$. Thus, depending on the kernel density $p(\cdot\mid\theta)$ and the class $\mathcal G$ to which $G$ belongs, the estimator $\widehat G_n$ entailed by $\widehat F_n$ approximates $G$ for large $n$, and the corresponding empirical Bayes estimator $E_{\widehat G_n}[\theta\mid X_n]$ for $\theta_n$ approximates the posterior mean $E_G[\theta\mid X_n]$. To illustrate this, we consider the following example due to Robbins [31], which deals with the Poisson case. Here, whatever the unknown distribution $G$, the posterior mean can be written as the ratio of the probability mass function $f_G(\cdot)$ evaluated at different points. These terms can be estimated by the corresponding values of the empirical mass function.
Example 1 Let $X_i\mid\theta_i\sim\text{Poisson}(\theta_i)$ independently, with $\theta_i\mid G\overset{\text{iid}}{\sim}G$, $i=1,\ldots,n$, where $G$ is a cdf on $\mathbb R^+$. In this case, $E_G[\theta\mid X=x]=(x+1)f_G(x+1)/f_G(x)$, $x=0,1,\ldots$, which can be estimated by $\varphi_n(x)=(x+1)\sum_{i=1}^{n}\mathbf 1_{\{x+1\}}(X_i)/\sum_{i=1}^{n}\mathbf 1_{\{x\}}(X_i)$. Then, whatever the unknown distribution $G$, for any fixed $x$, $\varphi_n(x)\to E_G[\theta\mid X=x]$ as $n\to\infty$, with probability 1. This naturally suggests using $\varphi_n(X_n)$ as an estimator for $\theta_n$. Robbins [31] extended this technique to the cases where the $X_i$ have a geometric, binomial or Laplace distribution.
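To make the mechanics concrete, the following sketch simulates a compound Poisson experiment and computes Robbins' estimator $\varphi_n(x)$; the Gamma mixing distribution, the sample size and the seed are illustrative choices, not taken from the article.

```python
# A minimal sketch of Robbins' nonparametric empirical Bayes estimator for the
# Poisson compound problem of Example 1. The mixing distribution G (a Gamma,
# chosen only for the simulation) is unknown to the analyst.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta = rng.gamma(shape=2.0, scale=1.5, size=n)   # theta_i ~ G
x = rng.poisson(theta)                            # X_i | theta_i ~ Poisson(theta_i)

def robbins_estimate(x_target, x_all):
    """phi_n(x) = (x+1) * #{X_i = x+1} / #{X_i = x}."""
    num = np.sum(x_all == x_target + 1)
    den = np.sum(x_all == x_target)
    return (x_target + 1) * num / den if den > 0 else float("nan")

# Estimate theta_n for the last experiment using all n observations.
x_last = x[-1]
print("X_n =", x_last, " Robbins estimate:", robbins_estimate(x_last, x))

# For the Gamma(shape=2, scale=1.5) mixing distribution used in the simulation,
# the oracle posterior mean is available in closed form: (x + 2) * 1.5 / 2.5.
print("oracle posterior mean:", (x_last + 2.0) * 1.5 / 2.5)
```

For common values of $x$ the two numbers are close, illustrating the convergence of $\varphi_n(x)$ to $E_G[\theta\mid X=x]$ without any knowledge of $G$.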
As discussed by Morris [29], parametric empirical Bayes procedures are needed in those cases where $n$ is too small for the Bayesian solution to be well approximated, but a substantial improvement over standard methods can still be made, as with the James–Stein estimator. When the mixing distribution is assumed to have a specific parametric form $G(\cdot\mid\psi)$, it is common practice to estimate the unknown parameter $\psi$ from the data by maximum likelihood, computing $\widehat\psi_n\equiv\widehat\psi(X_1,\ldots,X_n)$ as $\widehat\psi_n=\arg\max_\psi\prod_{i=1}^{n}\int p(X_i\mid\theta)\,dG(\theta\mid\psi)$. Inference on $\theta_n$ is then carried out using $G(\cdot\mid\widehat\psi_n)$ to compute $E_{G(\cdot\mid\widehat\psi_n)}[\theta\mid X_n]$. Empirical Bayes estimation of $\theta_n$ has the advantage of doing asymptotically as well as the Bayes estimator without knowing the "prior" distribution. However, the Bayesian approach and the empirical Bayes approach are only seemingly related: there is, indeed, a clear-cut difference between them. In the empirical Bayes approach to compound problems, although $G$ formally acts as a prior distribution on a single parameter, its introduction is motivated in a frequentist sense, as the common distribution of the random sample $(\theta_1,\ldots,\theta_n)$; indeed, estimation of $G$ is carried out by frequentist methods. In the Bayesian approach, a prior distribution can be assigned to a fixed unknown parameter, being interpreted as a formalization of subjective information. In the context of multiple independent experiments, a Bayesian statistician would rather assume probabilistic dependence across experiments by regarding the $\theta_i$ as exchangeable and assigning a prior probability law to $G$ (in the nonparametric case) or to $\psi$ (in the parametric case); see, e.g., [1,4,8].
Rather than in comparison with Bayesian inference, the advantage of empirical Bayes
methods can be appreciated in comparison with classical maximum likelihood estimators.
The empirical Bayes estimate of $\theta_i$ makes efficient use of the available information because
all data are used when estimating G. In other terms, empirical Bayes techniques involve
learning from the experience of others or, using Tukey’s evocative expression, “borrowing
strength”. To illustrate this crucial aspect, we consider the following classical example.
Example 2 Let $(X_1,\ldots,X_p)\sim N_p(\theta,\sigma^2 I_p)$ be a $p$-variate Gaussian vector, where $\theta=(\theta_1,\ldots,\theta_p)$ and $I_p$ is the $p$-dimensional identity matrix. The $X_i$ can be the means of random samples $X_{i,j}$, $j=1,\ldots,n$, within the $i$th experiment. Suppose $\sigma^2$ is known. Let $\theta_i\mid\psi\overset{\text{iid}}{\sim}N(0,\psi)$, with unknown variance $\psi$. Then, the maximum marginal likelihood estimator for $\psi$ is $\widehat\psi_p=\max\{0,\,s^2-\sigma^2\}$, where $s^2=\sum_{i=1}^{p}x_i^2/p$. The empirical Bayes estimator for $\theta_i$ is $E_{N(0,\widehat\psi_p)}[\theta\mid X_i]=[1-(p-2)\sigma^2/\sum_{i=1}^{p}x_i^2]\,X_i$, which coincides with the James–Stein estimator [23,33] that dominates the maximum likelihood estimator $\hat\theta_i=X_i$ for $p\ge 3$ with respect to the overall quadratic loss $\sum_{i=1}^{p}(\theta_i-\hat\theta_i)^2$.
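The shrinkage in Example 2 is easy to reproduce numerically. The sketch below simulates one such compound Gaussian experiment and compares the maximum likelihood estimator with the plug-in shrinkage factor $\widehat\psi_p/(\widehat\psi_p+\sigma^2)$ and with the James–Stein coefficient $1-(p-2)\sigma^2/\sum_i x_i^2$, which are close for moderately large $p$; the number of experiments, $\sigma^2$ and the distribution of the true means are illustrative assumptions.

```python
# A minimal sketch of the parametric empirical Bayes / James-Stein shrinkage of
# Example 2, under illustrative simulation settings.
import numpy as np

rng = np.random.default_rng(1)
p, sigma2 = 50, 1.0
theta = rng.normal(0.0, 2.0, size=p)        # unknown experiment-specific means
x = rng.normal(theta, np.sqrt(sigma2))      # X_i ~ N(theta_i, sigma^2)

# Maximum marginal likelihood estimate of psi in theta_i ~ N(0, psi)
s2 = np.mean(x**2)
psi_hat = max(0.0, s2 - sigma2)

shrink_eb = psi_hat / (psi_hat + sigma2)            # plug-in empirical Bayes factor
shrink_js = 1.0 - (p - 2) * sigma2 / np.sum(x**2)   # James-Stein coefficient
theta_eb = shrink_eb * x
theta_js = shrink_js * x

loss = lambda est: np.sum((est - theta) ** 2)       # overall quadratic loss
print("MLE loss          :", round(loss(x), 2))
print("EB plug-in loss   :", round(loss(theta_eb), 2))
print("James-Stein loss  :", round(loss(theta_js), 2))
```

Both shrinkage estimates typically incur a visibly smaller overall quadratic loss than the unshrunken maximum likelihood estimate, which is the "borrowing strength" effect described above.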
As remarked by Morris [29], the James–Stein estimator is minimax-optimal for the sum of the individual squared error loss functions only in the equal-variances case. Optimality is lost, for example, if global loss functions that weight the individual squared losses differently are used. Other forms of shrinkage, possibly suggested by the empirical Bayes approach, are
then necessary.
We conclude this section with a historical note. Although the introduction of the empirical Bayes method is traditionally associated with Robbins' article [31], the idea was partially anticipated by, among others, Gini [21] who, as pointed out by Forcina [18], provided pioneering empirical Bayes solutions for estimating the parameter of a binomial distribution, and by Fisher et al. [17], who applied the parametric empirical Bayes technique to the so-called species sampling problem assuming a Gamma "prior" distribution; see also Good [22]. Since then, the field has witnessed tremendous growth, both in terms of theoretical developments and in diversity of applications; see, e.g., the monographs [27] and [10].
3 Empirical Bayes selection of prior hyper-parameters
In a broader sense, the term empirical Bayes is commonly associated with general techniques
that make use of a data-driven choice of the prior distribution in Bayesian inference. Here,
the basic setting is inference on an exchangeable sequence $(X_i)$. Exchangeability is intended in a subjective sense: the data are physically independent, but probabilistic dependence is expressed among them, as past observations give information on future values, and such incomplete information is described probabilistically through the conditional distribution of $X_{n+1},X_{n+2},\ldots$, given $X_1=x_1,\ldots,X_n=x_n$. Exchangeability is the basic dependence assumption which, by de Finetti's representation theorem for exchangeable sequences, is equivalent to assuming a statistical model $p(\cdot\mid\theta)$ such that the $X_i$ are conditionally independent and identically distributed (iid) given $\theta$ and expressing a prior distribution on $\theta$. Thus, the statistical model and the prior are together a way of expressing the probability law of the observable sequence $(X_i)$, and in such a way they should be chosen. In fact, choosing an honest subjective prior in Bayesian inference can be a difficult task. A way of formulating such uncertainty is to assign the prior on $\theta$ hierarchically, assuming $\theta\mid\lambda\sim\Pi(\cdot\mid\lambda)$, a parametric distribution depending on hyper-parameters $\lambda$, and $\lambda\sim H(\lambda)$. However, this often complicates computations, so that it is a common practice to plug in some estimate $\hat\lambda_n$ of the prior hyper-parameters as a shortcut. The resulting data-dependent prior $\Pi(\cdot\mid\hat\lambda_n)$, combined with the likelihood, results in a pseudo-posterior distribution $\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)$ that is commonly referred to as empirical Bayes. Many types of estimators for $\lambda$ are considered, the most popular being the maximum marginal likelihood estimator, defined as
$$\hat\lambda_n\in\arg\max_{\lambda\in\bar\Lambda}\int\prod_{i=1}^{n}p(X_i\mid\theta)\,\Pi(d\theta\mid\lambda),$$
where $\bar\Lambda$ is the closure of $\Lambda$.
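As an illustration of this maximum marginal likelihood selection, the sketch below maximizes the marginal likelihood numerically for the Gaussian conjugate model $X_i\mid\theta\sim N(\theta,\sigma^2)$, $\theta\mid\lambda\sim N(0,\lambda)$ used later in Sects. 3 and 5, and checks the result against the closed-form estimator $\hat\lambda_n=\max\{0,\bar x_n^2-\sigma^2/n\}$; the simulated data, $\theta_0$, $\sigma^2$ and the optimization bounds are illustrative assumptions.

```python
# A minimal sketch of maximum marginal likelihood estimation of a prior
# hyper-parameter lambda in the Gaussian conjugate model.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
sigma2, theta0, n = 1.0, 2.0, 50
x = rng.normal(theta0, np.sqrt(sigma2), size=n)

def neg_log_marginal(lam):
    # Marginally, (X_1,...,X_n) is Gaussian with mean 0 and covariance
    # sigma^2 I + lambda * 11'; its eigenvalues are sigma^2 (n-1 times) and
    # sigma^2 + n*lambda, which gives the log-determinant and quadratic form below.
    xbar = x.mean()
    ss = np.sum((x - xbar) ** 2)
    logdet = (n - 1) * np.log(sigma2) + np.log(sigma2 + n * lam)
    quad = ss / sigma2 + n * xbar**2 / (sigma2 + n * lam)
    return 0.5 * (n * np.log(2 * np.pi) + logdet + quad)

res = minimize_scalar(neg_log_marginal, bounds=(1e-8, 100.0), method="bounded")
print("numerical MMLE :", round(res.x, 3))
print("closed form    :", round(max(0.0, x.mean() ** 2 - sigma2 / n), 3))
```

In richer models no closed form is available and the numerical maximization (or an EM-type algorithm) is the practical route; the conjugate case simply makes the check transparent.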
Such an empirical Bayes approach is appealing in offering the possibility of making Bayesian inference while by-passing a complete specification of the prior, and it is largely used in practical applications and in the literature: see, e.g., [7,19,25,32] in the context of variable selection in regression, [5] for wavelet shrinkage estimation, [26] and [28] in Bayesian nonparametric mixture models, [16] in Bayesian nonparametric inference for species diversity, and [2,3] and [34] in Bayesian nonparametric procedures for curve estimation.
Although popular, this mixed approach is not rigorous from a Bayesian point of view. Its
interest mainly lies in being a computationally simpler alternative to a more rigorous, but
usually analytically more complex, hierarchical specification of the prior: one expects that,
when the sample size is large, the empirical Bayes posterior distribution will be close to
some Bayesian posterior distribution. Moreover, for finite samples, a data-driven empirical
Bayes selection of the prior hyper-parameters is expected to give better inferential results
than a “wrong choice” of λ. These commonly believed facts do not seem to be rigorously
proved in the literature. A recent work by Petrone et al. [30] addresses the supposed asymp-
totic equivalence between empirical Bayes and Bayesian posterior distributions in terms of
merging.
Two notions of merging are considered: Bayesian weak merging in the sense of [9], and
frequentist strong merging in the sense of [20]. Bayesian weak merging compares posterior
distributions in terms of weak convergence, with respect to (wrt) the exchangeable probability
law of (Xi). Roughly speaking, we have weak merging of the empirical Bayes and Bayesian
posterior distributions if any Bayesian statistician is sure that her/his posterior distribution
and the empirical Bayes posterior distribution will eventually be close, in the sense of weak
convergence. This is a minimal requirement, but it is not guaranteed. From results in [9], it can
be proved that weak merging holds if and only if the empirical Bayes posterior distribution is consistent in the frequentist sense at the true value $\theta_0$, whatever $\theta_0$. Consistency at $\theta_0$ means that the sequence of empirical Bayes posterior distributions weakly converges to a point mass at $\theta_0$, almost surely wrt $P^{\infty}_{\theta_0}$, where $P^{\infty}_{\theta_0}$ denotes the probability law of $(X_i)$ such that the $X_i$ are iid according to $P_{\theta_0}$.
Sufficient conditions for consistency of empirical Bayes posterior distributions are pro-
vided in [30], Section 3. In general, consistency of Bayesian posterior distributions does not
imply consistency of empirical Bayes posteriors. For the latter, one also has to control the asymptotic behavior of the estimator $\hat\lambda_n$. If $\hat\lambda_n$ is the maximum marginal likelihood estimator, its properties can be exploited to show that the empirical Bayes posterior distribution is consistent at $\theta_0$ under essentially the same conditions which ensure consistency of Bayesian posterior distributions. For more general estimators, the conditions become more cumbersome. When $\hat\lambda_n$ is a convergent sequence, sufficient conditions are given in Proposition 3 of [30], based on a change of the prior probability measure such that the dependence on the data is transferred from the prior to the likelihood.
Even when consistency and weak merging hold, the empirical Bayes posterior distribution may underestimate the uncertainty on $\theta$ and diverge from any Bayesian posterior, relative to a stronger metric than that of weak convergence. This behavior is illustrated in the following example.
Example 3 Let $X_i\mid\theta\sim N(\theta,\sigma^2)$ independently, with $\sigma^2$ known, and $\theta\sim N(\mu,\tau^2)$. Consider empirical Bayes inference where the prior variance $\lambda=\tau^2$ is estimated by the maximum marginal likelihood estimator, the prior mean $\mu$ being fixed. Then, see, e.g., [24], p. 263, $\sigma^2+n\hat\tau^2_n=\max\{\sigma^2,\,n(\bar X_n-\mu)^2\}$, so that $\hat\tau^2_n=(\sigma^2/n)\max\{n(\bar X_n-\mu)^2/\sigma^2-1,\,0\}$. The resulting empirical Bayes posterior distribution $\Pi(\cdot\mid\hat\tau^2_n,X_1,\ldots,X_n)$ is Gaussian with mean $\mu_n=\frac{\sigma^2/n}{\hat\tau^2_n+\sigma^2/n}\,\mu+\frac{\hat\tau^2_n}{\hat\tau^2_n+\sigma^2/n}\,\bar X_n$ and variance $(1/\hat\tau^2_n+n/\sigma^2)^{-1}$. Since $\hat\tau^2_n$ can be equal to zero with positive probability, the empirical Bayes posterior can be degenerate at $\mu$. The probability of the event $\hat\tau_n=0$ converges to zero when $\theta_0\ne\mu$, but remains strictly positive when $\theta_0=\mu$. This suggests that, if $\theta_0\ne\mu$, the hierarchical and the empirical Bayes posterior densities can asymptotically be close relative to some distance; however, if $\theta_0=\mu$, there is a positive probability that the empirical Bayes and the Bayesian posterior distributions are singular. The possible degeneracy of the empirical Bayes posterior distribution is pathological in the sense that the uncertainty on the parameter is a posteriori underestimated.
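A small Monte Carlo sketch of Example 3 makes the degeneracy visible: it computes $\hat\tau^2_n$ from the formula above and estimates the probability of the event $\hat\tau^2_n=0$ when $\theta_0=\mu$ and when $\theta_0\ne\mu$; $\sigma^2$, $\mu$, $n$ and the number of replications are illustrative assumptions.

```python
# A minimal sketch of the degeneracy in Example 3: probability that the maximum
# marginal likelihood estimate of the prior variance is exactly zero.
import numpy as np

rng = np.random.default_rng(3)
sigma2, mu, n, reps = 1.0, 0.0, 100, 20000

def tau2_hat(xbar):
    # sigma^2 + n * tau2_hat = max{sigma^2, n * (xbar - mu)^2}
    return (sigma2 / n) * max(n * (xbar - mu) ** 2 / sigma2 - 1.0, 0.0)

def prob_degenerate(theta0):
    # Only the sample mean matters: Xbar_n ~ N(theta0, sigma^2 / n).
    xbars = rng.normal(theta0, np.sqrt(sigma2 / n), size=reps)
    return np.mean([tau2_hat(b) == 0.0 for b in xbars])

print("P(tau2_hat = 0), theta0 = mu    :", prob_degenerate(0.0))  # about 0.68, for any n
print("P(tau2_hat = 0), theta0 = mu + 1:", prob_degenerate(1.0))  # essentially 0
```

When $\theta_0=\mu$ the event $\hat\tau^2_n=0$ has probability $P(|Z|\le 1)\approx 0.68$ with $Z$ standard normal, for every $n$, so the empirical Bayes posterior collapses to a point mass at $\mu$ with non-vanishing probability, exactly the pathological behavior described above.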
Such behaviour is not restricted to the Gaussian distribution and applies more generally to location-scale families of priors. If the model admits a maximum likelihood estimator $\hat\theta_n$ and the prior density is of the form $\tau^{-1}g((\cdot-\mu)/\tau)$, with $\lambda=(\mu,\tau)$, for some unimodal density $g$ that attains its maximum at zero, then $\hat\lambda_n=(\hat\theta_n,0)$ and the empirical Bayes posterior is a point mass at $\hat\theta_n$. These families of priors should not be used jointly with maximum marginal likelihood empirical Bayes procedures.
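This boundary degeneracy can also be checked numerically. In the Gaussian model of Example 3 with prior $N(\mu,\tau^2)$ and $\lambda=(\mu,\tau)$ both free, the sketch below evaluates the marginal likelihood at $\mu=\bar x_n$ for decreasing values of $\tau^2$ and shows that it keeps increasing as $\tau^2\downarrow 0$, so the maximizer is the degenerate "prior" at the maximum likelihood estimate; the simulated data and the grid of $\tau^2$ values are illustrative.

```python
# A minimal sketch of the location-scale degeneracy: with mu set at the sample
# mean, the marginal likelihood grows as tau^2 shrinks to zero.
import numpy as np

rng = np.random.default_rng(4)
sigma2, n = 1.0, 50
x = rng.normal(2.0, np.sqrt(sigma2), size=n)
xbar, ss = x.mean(), np.sum((x - x.mean()) ** 2)

def log_marginal(mu, tau2):
    # (X_1,...,X_n) is jointly Gaussian with mean mu and covariance sigma^2 I + tau^2 11'.
    logdet = (n - 1) * np.log(sigma2) + np.log(sigma2 + n * tau2)
    quad = ss / sigma2 + n * (xbar - mu) ** 2 / (sigma2 + n * tau2)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

for tau2 in [1.0, 0.1, 0.01, 1e-4, 1e-8]:
    print(f"tau^2 = {tau2:8.0e}   log marginal at mu = xbar: {log_marginal(xbar, tau2):.3f}")
# The values increase monotonically as tau^2 decreases, so the maximum marginal
# likelihood estimate sits on the boundary tau = 0: a point-mass prior at xbar.
```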
A way to refine the analysis to better understand the impact of a data-dependent prior
on the posterior distribution is to study frequentist strong merging in the sense of [20]. Two
sequences of posterior distributions are said to merge strongly if their total variation distance converges to zero almost surely wrt $P^{\infty}_{\theta_0}$.
Strong merging of Bayesian posterior distributions in nonparametric contexts is often impossible since pairs of priors are typically singular. Petrone et al. [30] study the problem for regular parametric models, comparing Bayesian posterior distributions and empirical Bayes posterior distributions based on the maximum marginal likelihood estimator of $\lambda$. Informally, their results show that strong merging may hold for some true values $\theta_0$, but may fail for others. That is, for values of $\theta_0$ in an appropriate set, say $\Theta_0$, the empirical Bayes posterior distribution strongly merges with any Bayesian posterior distribution corresponding to a prior distribution $q$ which is continuous and bounded at $\theta_0$,
$$\|\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)-\Pi_q(\cdot\mid X_1,\ldots,X_n)\|_{TV}\to 0 \qquad (1)$$
almost surely wrt $P^{\infty}_{\theta_0}$, where $\|\cdot\|_{TV}$ denotes the total variation distance. However, for $\theta_0\notin\Theta_0$, strong merging fails: the empirical Bayes posterior can indeed be singular wrt any smooth Bayesian posterior distribution.
More precisely, suppose that the prior distribution has density $\pi(\cdot)$ with respect to some dominating measure and includes $\theta_0$ in its Kullback–Leibler support. Furthermore, suppose that the parameter space is totally bounded. Assume that conditions hold that guarantee consistency for the empirical Bayes and the Bayesian posterior distributions. Under such assumptions, and some additional requirements that are satisfied by regular parametric models, it can be shown ([30], Theorem 1) that the maximum marginal likelihood estimator $\hat\lambda_n$ converges to a value $\lambda^*$ (here assumed to be unique for brevity) such that $\pi(\theta_0\mid\lambda^*)\ge\pi(\theta_0\mid\lambda)$ for every $\lambda$ in the hyper-parameter space $\Lambda$. Such a value can be interpreted as the "oracle value" of the hyper-parameter, that is, the value of the hyper-parameter for which the prior mostly favors the true value $\theta_0$. Furthermore, it is proved that if $\theta_0$ is such that $\pi(\theta_0\mid\lambda^*)<\infty$, then strong merging holds, namely
$$\|\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)-\Pi(\cdot\mid\lambda^*,X_1,\ldots,X_n)\|_{TV}\to 0 \qquad (2)$$
almost surely wrt $P^{\infty}_{\theta_0}$. Since $\|\Pi(\cdot\mid\lambda^*,X_1,\ldots,X_n)-\Pi_q(\cdot\mid X_1,\ldots,X_n)\|_{TV}$ goes to zero $P^{\infty}_{\theta_0}$-almost surely for any prior $q$ that is continuous and bounded at $\theta_0$ ([20], Theorem 1.3.1), by the triangle inequality one has (1). However, if $\theta_0$ is such that $\pi(\theta_0\mid\lambda^*)=\infty$, then strong merging fails. This is the case if, for such $\theta_0$, $\lambda^*$ is on the boundary of $\Lambda$ and the prior distribution is degenerate at $\theta_0$ for $\lambda\to\lambda^*$. In this case, the empirical Bayes posterior distribution is degenerate too, thus it is singular wrt any smooth Bayesian posterior.
Result (1), which holds only in the non-degenerate case, ensures that the empirical Bayes
posterior distribution will be close in total variation to the Bayesian posterior, whatever
the prior distribution. But this result only provides a first-order asymptotic comparison that
does not distinguish among Bayesian solutions. In fact, from (2), one could expect that the
empirical Bayes approach can actually give a closer approximation of an efficient, in the
sense of using the prior distribution that mostly favors the true value of θ0, Bayesian solution.
Higher-order asymptotic results are beyond the scope of this note, but we will return to this
issue in Sect. 5, providing a simple, but we believe insightful, example.
4 Empirical Bayes selection of nuisance parameters
Another relevant context of application of empirical Bayes methods concerns Bayesian analy-
sis in semi-parametric models, where estimation of nuisance parameters is preliminarily
considered to carry out inference on the component of interest. The framework can thus be described: observations $X_1,\ldots,X_n$ are drawn independently from a distribution with density $p_{\psi,\lambda}(\cdot)$,
$$X_i\mid(\psi,\lambda)\overset{\text{iid}}{\sim}p_{\psi,\lambda}(\cdot),\quad i=1,\ldots,n,$$
where $\psi\in\Psi\subseteq\mathbb R^k$ is the parameter of interest and $\lambda\in\Lambda\subseteq\mathbb R$ a nuisance parameter. Bayesian inference with nuisance parameters does not conceptually present particular difficulties: a prior distribution is assigned to the overall parameter $(\psi,\lambda)$,
$$\Pi(d\psi,d\lambda)=\Pi(d\psi\mid\lambda)\,\Pi(d\lambda),$$
and inference on $\psi$ is carried out by marginalizing the joint posterior distribution $\Pi(d\psi,d\lambda\mid X_1,\ldots,X_n)$. However, this can be computationally cumbersome. A common approach is thus to plug in some estimator $\hat\lambda_n$ of $\lambda$ and use a data-dependent prior
$$\Pi(d\psi\mid\hat\lambda_n)\,\delta_{\hat\lambda_n}(d\lambda).\qquad(3)$$
We highlight the difference between the present context and the one described in Sect. 3: there, $\hat\lambda_n$ is used to estimate a hyper-parameter $\lambda$, a parameter of the prior only, whereas here $\hat\lambda_n$ is used to estimate $\lambda$, which is, in the first place, a component of the overall parameter of the model: it is part of the model itself.
The results developed in [30] can be extended to prove the asymptotic equivalence, in
terms of weak merging, between the empirical Bayes posterior and any Bayesian posterior
for $\psi$, provided $\hat\lambda_n$ is a sequence of consistent estimators for the true value $\lambda_0$ corresponding to the density $p_{\psi_0,\lambda_0}$ generating the observations. It is known from Proposition 1 in [30] that a necessary and sufficient condition for weak merging is that the empirical Bayes posterior for $\psi$ is consistent at $\psi_0$, namely, $\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)$ weakly converges to a point mass at $\psi_0$, $\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)\Rightarrow\delta_{\psi_0}$, along almost all sample paths when sampling from the infinite product measure $P^{\infty}_{\psi_0,\lambda_0}$. To illustrate how this assertion can be shown, we present an example on partially linear regression.
Example 4 Suppose we observe a random sample from the distribution of $X=(Y,V,W)$, in which, for some unobservable error $e$ independent of $(V,W)$, the relationship among the components is described by
$$Y=\psi V+\eta_\lambda(W)+e.$$
The response variable $Y$ is a regression on $(V,W)$ that is linear in $V$ with slope $\psi$, but may depend on $W$ in a nonlinear way through $\eta_\lambda(W)$, which represents an additive contamination of the linear structure of $Y$. We assume that $V$ and $W$ take values in $[0,1]$ and that, for $\lambda\in\Lambda\subseteq\mathbb R$, the function $w\mapsto\eta_\lambda(w)$ is known up to $\lambda$. If the error $e$ is assumed to be normal, $e\sim N(0,\sigma_0^2)$ with known variance $\sigma_0^2$, then the density of $X$ is given by
$$p_{\psi,\lambda}(x)=\phi_{\sigma_0}(y-\psi v-\eta_\lambda(w))\,p_{V,W}(v,w),\qquad x=(y,v,w)\in\mathbb R\times[0,1]^2,$$
where $\phi_{\sigma_0}(\cdot)=\sigma_0^{-1}\phi(\cdot/\sigma_0)$, with $\phi$ the standard Gaussian density, and $p_{V,W}$ the joint density of $(V,W)$. Consider an empirical Bayes approach that estimates $\lambda$ by any sequence $\hat\lambda_n$ of consistent estimators for $\lambda_0$ and uses the empirical Bayes posterior $\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)$ corresponding to a prior of the form in (3) to carry out inference on $\psi$. The empirical Bayes posterior $\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)$ weakly merges with the posterior for $\psi$ corresponding to any genuine prior on $(\psi,\lambda)$ if and only if $\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)$ is consistent at $\psi_0$. We show that, for every $\delta>0$, the empirical Bayes posterior probability $\Pi(|\psi-\psi_0|>\delta\mid\hat\lambda_n,X_1,\ldots,X_n)\to 0$ in $P^n_{\psi_0,\lambda_0}$-probability. Let $m_{\psi,\lambda}(v,w)=\psi v+\eta_\lambda(w)$. Assume there exists a constant $B>0$ such that $\sup_{\psi,\lambda}\|m_{\psi,\lambda}\|_\infty\le B$. Since the Hellinger distance $h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})\ge (E[V^2])^{1/2}e^{-B^2/4\sigma_0^2}\,|\psi-\psi_0|/2\sigma_0$, the inclusion $\{\psi:|\psi-\psi_0|>\delta\}\subseteq\{\psi:h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\}$ holds for a suitable positive constant $M$. To prove the claim, it is therefore enough to study the asymptotic behavior of $\Pi(h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\mid\hat\lambda_n,X_1,\ldots,X_n)$ which, if the prior for $\psi$, given $\lambda$, belongs to a location family of $\nu$-densities generated by $\pi_0(\cdot)$, i.e., $\pi(\cdot\mid\lambda)=\pi_0(\cdot-\lambda)$, is equal to
$$\Pi(h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\mid\hat\lambda_n,X_1,\ldots,X_n)
=\frac{\int_{\{h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\}}\prod_{i=1}^{n}\phi_{\sigma_0}(Y_i-m_{\psi,\hat\lambda_n}(V_i,W_i))\,\pi_0(\psi-\hat\lambda_n)\,\nu(d\psi)}{\int\prod_{i=1}^{n}\phi_{\sigma_0}(Y_i-m_{\psi_0,\lambda_0}(V_i,W_i))\,\pi_0(\psi-\hat\lambda_n)\,\nu(d\psi)}
=\frac{N(X_1,\ldots,X_n)}{D(X_1,\ldots,X_n)}.$$
Assume there exist a continuous function $g:[0,1]\to\mathbb R$ and $\alpha>0$ such that, for any $\lambda,\lambda_0\in\Lambda$, the difference $|\eta_\lambda(w)-\eta_{\lambda_0}(w)|\le|g(w)|\,\|\lambda-\lambda_0\|^\alpha\le\|g\|_\infty\,\|\lambda-\lambda_0\|^\alpha$ for every $w\in[0,1]$. Then, on the event $\Omega_n=\{-a_n\le\min_i Y_i\le\max_i Y_i\le a_n,\ \|\hat\lambda_n-\lambda_0\|\le u_n\}$, which, for sequences $u_n\downarrow 0$ and $a_n=O((\log n)^\kappa)$, $\kappa>0$, has probability $P^n_{\psi_0,\lambda_0}(\Omega_n)=1+o(1)$, we have
$$N(X_1,\ldots,X_n)/D(X_1,\ldots,X_n)\le\exp\{2n(u_n+\|g\|_\infty u_n^\alpha)(a_n+B)+nu_n^2\}\times\Pi(h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\mid\lambda_0,X_1,\ldots,X_n).$$
If the Bayesian posterior $\Pi(\cdot\mid\lambda_0,X_1,\ldots,X_n)$ is Hellinger consistent at $P_{\psi_0,\lambda_0}$ and the convergence is exponentially fast, then also the empirical Bayes posterior $\Pi(\cdot\mid\hat\lambda_n,X_1,\ldots,X_n)$ is consistent at $P_{\psi_0,\lambda_0}$ and the claim that $\Pi(|\psi-\psi_0|>\delta\mid\hat\lambda_n,X_1,\ldots,X_n)\to 0$ follows.
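The following sketch runs the plug-in scheme of Example 4 on simulated data, under purely illustrative assumptions: a linear contamination $\eta_\lambda(w)=\lambda w$, a consistent least-squares estimator $\hat\lambda_n$ of $\lambda_0$, and a conjugate Gaussian location prior $\pi_0$ for $\psi$ centered at $\hat\lambda_n$, so that the empirical Bayes posterior of $\psi$ is available in closed form.

```python
# A minimal sketch of plug-in empirical Bayes in a partially linear regression,
# under illustrative choices of eta_lambda, the estimator of lambda, and the prior.
import numpy as np

rng = np.random.default_rng(5)
n, psi0, lam0, sigma0 = 500, 1.5, 0.7, 0.5
V, W = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
Y = psi0 * V + lam0 * W + rng.normal(0, sigma0, n)   # Y = psi V + eta_lambda(W) + e

# A consistent estimator of lambda_0 (here: ordinary least squares on (V, W)).
design = np.column_stack([V, W])
lam_hat = np.linalg.lstsq(design, Y, rcond=None)[0][1]

# Empirical Bayes posterior for psi: plug lambda_hat into the prior and the likelihood.
R = Y - lam_hat * W                      # residuals after removing eta_{lambda_hat}(W)
tau0_2 = 1.0                             # variance of the Gaussian prior pi_0
prec = np.sum(V**2) / sigma0**2 + 1.0 / tau0_2
post_mean = (np.sum(V * R) / sigma0**2 + lam_hat / tau0_2) / prec
post_sd = prec ** -0.5
print(f"lambda_hat = {lam_hat:.3f}, EB posterior for psi: N({post_mean:.3f}, {post_sd:.3f}^2)")
```

As $n$ grows, the empirical Bayes posterior mean approaches $\psi_0$ and the posterior contracts, which is the consistency at $\psi_0$ that, by Proposition 1 in [30], yields weak merging in this nuisance-parameter setting.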
5 Higher-order comparisons and finite-sample properties
We return to the discussion at the end of Sect. 3, providing a simple example. Although limited
to the Gaussian case, this gives some hints about finer comparisons between Bayesian and
empirical Bayes posterior distributions. The evidence in this example could be extended to
more general contexts, such as Bayesian inference and variable selection in linear regression
with g-priors.
As discussed, an empirical Bayes choice of the prior hyper-parameters in Bayesian infer-
ence is not rigorous, but can be of interest as an approximation of a computationally more
involved hierarchical Bayesian posterior distribution. In fact, the results recalled in Sect. 3
show that, even for regular parametric models, the empirical Bayes posterior distribution can
be singular wrt any smooth Bayesian posterior, depending on the form of the prior distribution
and on the nature of the prior hyper-parameters. Thus, care is needed when using empirical
Bayes methods as an approximation of Bayesian solutions. On the positive side, these results
show that, in non-degenerate cases, the empirical Bayes posterior distribution does merge
strongly with any smooth Bayesian posterior distribution. However, this first-order asymp-
totic comparison does not distinguish among Bayesian posterior distributions arising from
different priors. The aim here is to grasp some evidence for finer comparisons. We explore
the following two issues.
Asymptotically, in regular parametric models, any smooth Bayesian posterior distribution
is approximated by a Gaussian distribution centered at the maximum likelihood estimate $\hat\theta_n$, by the Bernstein–von Mises theorem. Strong merging of Bayesian and empirical Bayes posterior distributions implies that, when a Bernstein–von Mises behavior holds for the Bayesian posterior distribution, it also holds for the empirical Bayes posterior, which is a particularly interesting implication in nonparametric problems. In fact, one would expect that
particularly interesting implication in nonparametric problems. In fact, one would expect that
the empirical Bayes posterior distribution can provide a closer approximation to a hierarchical
Bayesian posterior than the Bernstein–von Mises Gaussian distribution.
Less informally, based on the results of Sect. 3, one would conjecture that the hierarchical posterior distribution concentrates around the oracle value $\lambda^*$ of the prior hyper-parameters for increasing sample sizes and that, since $\hat\lambda_n\to\lambda^*$, the empirical Bayes posterior distribution $\Pi(\theta\mid\hat\lambda_n,X_1,\ldots,X_n)$ and the hierarchical Bayesian posterior distribution $\Pi_h(\theta\mid X_1,\ldots,X_n)$ can be close even for moderate sample sizes. The following example suggests that, although this is the case asymptotically by the results in Sect. 3, the posterior distribution $\Pi_h(\lambda\mid X_1,\ldots,X_n)$ incorporates the sample information slowly, so that, for finite samples, the empirical Bayes posterior distribution $\Pi(\theta\mid\hat\lambda_n,X_1,\ldots,X_n)$ is a close approximation of $\Pi_h(\theta\mid X_1,\ldots,X_n)$ only if the prior distribution on $\lambda$ is sufficiently concentrated around the oracle value $\lambda^*$. In other words, the example suggests that, in the non-degenerate case, the empirical Bayes posterior distribution is a higher-order approximation of the posterior distribution of a "well informed" Bayesian researcher whose prior highly favors the true value of $\theta$.
Example 5 Consider the simple example of the Gaussian conjugate model introduced in Sect. 3, now with a hierarchical specification of the prior. Let $X_i\mid\theta\sim N(\theta,\sigma^2)$ independently, with $\sigma^2$ known. Let $\theta\mid\lambda\sim N(0,\lambda)$ and $1/\lambda\sim G(\alpha,\beta)$, a Gamma distribution where $\beta>0$ is the scale parameter. Then, $E(\lambda)=\beta/(\alpha-1)$ and $V(\lambda)=\beta^2/[(\alpha-1)^2(\alpha-2)]$. The prior of $\theta$ obtained by integrating out $\lambda$ is a Student's $t$ with zero mean, $2\alpha$ degrees of freedom and scaling factor $\beta/\alpha$. The prior variance of $\theta$ equals the prior guess on the hyper-parameter $\lambda$, $V(\theta)=E(\lambda)$. Although this is a simple model, computation of the posterior distribution of $\theta$ becomes analytically complicated. The conditional distribution of $\theta$, given $\lambda$ and the data, is
$$\theta\mid(\lambda,x_1,\ldots,x_n)\sim N\big((n\lambda+\sigma^2)^{-1}n\lambda\bar x_n,\;(n\lambda+\sigma^2)^{-1}\sigma^2\lambda\big),$$
and the posterior distribution of $\theta$ is obtained by integrating $\lambda$ out wrt its posterior distribution $\Pi_h(\lambda\mid x_1,\ldots,x_n)$. This integration step is not analytically manageable, and approximation by Markov chain Monte Carlo (MCMC) is usually employed.
The empirical Bayes selection of $\lambda$ is an attractive, computationally simpler, shortcut. Estimation of $\lambda$ via maximum marginal likelihood gives $\hat\lambda_n=\max\{0,\,\bar x_n^2-\sigma^2/n\}$. Thus, the maximum marginal likelihood estimator $\hat\lambda_n$ may take the value zero, on the boundary of $\Lambda=(0,\infty)$, with positive probability. If $\hat\lambda_n=0$, then the empirical Bayes prior distribution of $\theta$ is a point mass at the prior guess and the resulting posterior distribution is degenerate. As seen in Sect. 3, if the true value $\theta_0=E[\theta]=0$, the probability of degeneracy remains positive even when $n\to\infty$, thus determining an asymptotic divergence between the empirical Bayes posterior distribution and the hierarchical Bayesian posterior distribution. If $\theta_0\ne 0$, such probability goes to zero and strong merging holds. Interest is in investigating higher-order approximations in this case.
We first focus on point estimation with quadratic loss. The Bayes estimate is the posterior expectation
$$E[\theta\mid x_1,\ldots,x_n]=\frac{\int(1+\theta^2/2\beta)^{-(2\alpha+1)/2}\exp\{-n[-n^{-1}\log\theta+\tfrac{1}{2\sigma^2}(\theta-\bar x_n)^2]\}\,d\theta}{\int(1+\theta^2/2\beta)^{-(2\alpha+1)/2}\exp\{-\tfrac{n}{2\sigma^2}(\theta-\bar x_n)^2\}\,d\theta},\qquad(4)$$
for which a closed-form expression is not available.
Table 1  Comparing empirical Bayes and Laplace point estimates as approximations of the hierarchical Bayes point estimate $E[\theta\mid x_1,\ldots,x_n]$. Columns correspond to prior guesses $E[\lambda]=1/3,\ 1,\ 3,\ 4,\ 10$ (with $\alpha=4$ fixed). Simulated data with $\theta_0=2$ and $\sigma^2=1$.

(a) $n=20$; $\bar x_n=1.835$
  Laplace approx. $\hat E_{LC}[\theta\mid x_{1:n}]$:                1.797, 1.769, 1.749, 1.745, 1.738
  Hierarchical Bayes $E[\theta\mid x_{1:n}]$ (MCMC; st. dev.):      1.683 (0.0029), 1.750 (0.0031), 1.801 (0.0033), 1.805 (0.0034), 1.821 (0.0033)
  $E[\lambda\mid x_{1:n}]$ (Bayes, MCMC; st. dev.):                 0.074 (0.0051), 1.301 (0.0113), 3.018 (0.0261), 3.902 (0.0332), 8.915 (0.0721)
  Maximum marginal likelihood $\hat\lambda_n$:                      3.320 (same in all columns)
  Empirical Bayes $E[\theta\mid\hat\lambda_n,x_{1:n}]$:             1.809 (same in all columns)

(b) $n=50$; $\bar x_n=2.009$
  Laplace approx. $\hat E_{LC}[\theta\mid x_{1:n}]$:                1.994, 1.982, 1.972, 1.970, 1.967
  Hierarchical Bayes $E[\theta\mid x_{1:n}]$ (MCMC; st. dev.):      1.951 (0.0018), 1.974 (0.0019), 1.993 (0.0022), 1.999 (0.0022), 2.005 (0.0024)
  $E[\lambda\mid x_{1:n}]$ (Bayes, MCMC; st. dev.):                 0.833 (0.0074), 1.403 (0.0123), 3.117 (0.0251), 4.012 (0.0342), 9.103 (0.0911)
  Maximum marginal likelihood $\hat\lambda_n$:                      4.016 (same in all columns)
  Empirical Bayes $E[\theta\mid\hat\lambda_n,x_{1:n}]$:             1.999 (same in all columns)
[Fig. 1: two rows of two panels; left panels show the posterior density of $\lambda$ (Gibbs), right panels the posterior densities of $\theta$ (hierarchical Bayes by Gibbs, empirical Bayes, and Bernstein–von Mises).]

Fig. 1 Comparing empirical Bayes and hierarchical Bayesian posterior densities. Simulated data from a Gaussian distribution $N(2,6)$; $n=20$; $\bar x_n=1.667$. $E[\lambda]=1/3$ (first row) and $E[\lambda]=1$ (second row). First column: MCMC estimate of the posterior density of $\lambda$; the full square denotes $E(\lambda\mid x_1,\ldots,x_n)$ and the empty square denotes the marginal maximum likelihood estimate $\hat\lambda_n$. Second column: hierarchical Bayesian posterior density of $\theta$ (MCMC estimate; solid curve), empirical Bayes posterior density of $\theta$ (dashed curve) and limit Gaussian density $N(\bar x_n,\sigma^2/n)$ (bold solid curve). The empty triangle denotes $E[\theta\mid x_1,\ldots,x_n]$; the star denotes $E[\theta\mid\hat\lambda_n,x_1,\ldots,x_n]$; the full triangle denotes the sample mean $\bar x_n$.
Its empirical Bayes approximation is obtained by plugging $\hat\lambda_n$ into the expression of $E[\theta\mid\lambda,x_1,\ldots,x_n]$:
$$E[\theta\mid\hat\lambda_n,x_1,\ldots,x_n]=\frac{n\hat\lambda_n}{n\hat\lambda_n+\sigma^2}\,\bar x_n=\Big(1-\frac{\sigma^2}{n\hat\lambda_n+\sigma^2}\Big)\bar x_n.\qquad(5)$$
We may expect that
$$E[\theta\mid x_1,\ldots,x_n]=\int E[\theta\mid\lambda,x_1,\ldots,x_n]\,\Pi_h(\lambda\mid x_1,\ldots,x_n)\,d\lambda=E[\theta\mid\hat\lambda_n,x_1,\ldots,x_n]+O(n^{-k}),$$
since, as $n$ increases, $\hat\lambda_n$ tends to the oracle value $\lambda^*$ and $\Pi_h(\lambda\mid x_1,\ldots,x_n)$ could collapse to a point mass at $\lambda^*$. It is interesting to investigate the order of the error term $O(n^{-k})$. To grasp some evidence, we compare the empirical Bayes point estimate with the Laplace approximation developed by [24], p. 270, given in (6) below.
[Fig. 2: same layout as Fig. 1.]

Fig. 2 Comparing empirical Bayes and hierarchical Bayesian posterior densities. Simulated data from a Gaussian distribution $N(2,6)$; $n=20$; $\bar x_n=1.667$. $E[\lambda]=3$ (first row) and $E[\lambda]=4$ (second row). Legend as for Fig. 1.
The approximation
$$\hat E_{LC}[\theta\mid x_1,\ldots,x_n]=\left[1-\frac{(2\alpha+1)/2\alpha}{1+\bar x_n^2/2\beta}\,\frac{\sigma^2}{n}\right]\bar x_n\qquad(6)$$
is a special case of the Laplace approximation with error term $O(n^{-3/2})$.
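A short numerical check of this comparison, for one column of Table 1, is sketched below: the hierarchical posterior mean (4) is computed by one-dimensional quadrature instead of the MCMC used in the article, and then compared with the empirical Bayes plug-in estimate (5) and the Laplace-type approximation (6); the configuration ($\alpha=4$, $\beta=12$, $n=50$, $\bar x_n=2.009$, $\sigma^2=1$) is the one reported in Table 1(b).

```python
# A minimal sketch comparing (4), (5) and (6) for one configuration of Table 1,
# with the hierarchical posterior mean obtained by quadrature rather than MCMC.
import numpy as np
from scipy.integrate import quad

alpha, beta, sigma2 = 4.0, 12.0, 1.0
n, xbar = 50, 2.009

def unnorm_post(theta):
    # Student-t prior (2*alpha df, scaling factor beta/alpha) times the Gaussian
    # likelihood factor of xbar: the integrand of (4) up to a constant.
    prior = (1.0 + theta**2 / (2.0 * beta)) ** (-(2.0 * alpha + 1.0) / 2.0)
    lik = np.exp(-0.5 * n * (theta - xbar) ** 2 / sigma2)
    return prior * lik

lo, hi = xbar - 1.0, xbar + 1.0   # the likelihood factor concentrates near xbar
post_mean_hier = quad(lambda t: t * unnorm_post(t), lo, hi)[0] / quad(unnorm_post, lo, hi)[0]

# Empirical Bayes plug-in estimate (5) with lambda_hat = max{0, xbar^2 - sigma^2/n}.
lam_hat = max(0.0, xbar**2 - sigma2 / n)
post_mean_eb = (1.0 - sigma2 / (n * lam_hat + sigma2)) * xbar

# Laplace-type approximation (6).
post_mean_lc = (1.0 - ((2 * alpha + 1) / (2 * alpha)) / (1.0 + xbar**2 / (2 * beta)) * sigma2 / n) * xbar

print("hierarchical Bayes (4), quadrature:", round(post_mean_hier, 3))
print("empirical Bayes (5)               :", round(post_mean_eb, 3))
print("Laplace approximation (6)         :", round(post_mean_lc, 3))
```

For this configuration the plug-in estimate (5) lands much closer to the hierarchical value than (6), in line with the column $E[\lambda]=4$ of Table 1(b).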
Table 1 compares $E[\theta\mid\hat\lambda_n,x_1,\ldots,x_n]$ and $\hat E_{LC}[\theta\mid x_1,\ldots,x_n]$ as approximations of the hierarchical Bayes point estimate $E[\theta\mid x_1,\ldots,x_n]$ in a simulation study where $\theta_0=2$ and $\sigma^2=1$. Along the columns, the value of $\alpha$ is fixed at 4, while $\beta$ varies, thus resulting in different prior guesses $E[\lambda]$. Since $E[\lambda]=V(\theta)$, increasing values of $\beta$ correspond to smaller precision of the hierarchical prior. When $\beta=12$, the prior guess equals the oracle value, i.e., $E[\lambda]=\lambda^*=4$. In this case, the empirical Bayes point estimate provides a clearly better approximation of $E[\theta\mid x_1,\ldots,x_n]$ than $\hat E_{LC}[\theta\mid x_1,\ldots,x_n]$. For example, Table 1b shows how $E[\theta\mid x_1,\ldots,x_n]$ and $E[\theta\mid\hat\lambda_n,x_1,\ldots,x_n]$ coincide up to the thousandths digit for $n=50$ and $E[\lambda]=4$. This suggests a higher-order form of merging between the empirical Bayes posterior distribution and the hierarchical posterior distribution of a "more informed" Bayesian statistician, i.e., the one who assigns a hyper-prior such that $E[\lambda]=\lambda^*$. In order to shed light on this point, we now consider density approximation.
We first want to check whether the empirical Bayes posterior distribution provides a better approximation of the hierarchical Bayesian posterior distribution than the Bernstein–von Mises Gaussian approximating distribution, $N(\bar x_n,\sigma^2/n)$. This comparison has been investigated in several simulation studies, each one giving similar indications. We report the results for simulated data from a Gaussian distribution with mean $\theta_0=2$ and variance $\sigma^2=6$ (Figs. 1, 2). The hierarchical Bayesian posterior densities are computed by Gibbs sampling. The first column in the plots shows the posterior density $\Pi_h(\lambda\mid x_1,\ldots,x_n)$ of $\lambda$. This appears to concentrate slowly towards the oracle value $\lambda^*=4$. The second column shows the MCMC approximation of the hierarchical Bayesian posterior density of $\theta$, together with the empirical Bayes posterior density (dashed curve) and the limit Gaussian density $N(\bar x_n,\sigma^2/n)$ (bold curve). What emerges is that, for a prior guess of $\lambda$ close to the oracle value, the empirical Bayes posterior density provides a better approximation of the hierarchical Bayesian posterior density already for the small sample size $n=20$. This seems to confirm the previously formulated conjecture: for finite sample sizes, empirical Bayes provides a good approximation of the hierarchical Bayesian procedure adopted by the more informed statistician, and strong merging may hold up to a higher-order approximation.
References
1. Antoniak, C.E.: Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems.
Ann. Stat. 2, 1152–1174 (1974)
2. Belitser, E., Enikeeva, F.: Empirical Bayesian test of the smoothness. Math. Methods Stat. 17, 1–18 (2008)
3. Belitser, E., Levit, B.: On the empirical Bayes approach to adaptive filtering. Math. Methods Stat. 12,
131–154 (2003)
4. Berry, D.A., Christensen, R.: Empirical Bayes estimation of a binomial parameter via mixtures of Dirichlet
processes. Ann. Stat. 7, 558–568 (1979)
5. Clyde, M.A., George, E.I.: Flexible empirical Bayes estimation for wavelets. J. R. Stat. Soc. Ser. B 62,
681–698 (2000)
6. Copas, J.B.: Compound decisions and empirical Bayes (with discussion). J. R. Stat. Soc. Ser. B 31,
397–425 (1969)
7. Cui, W., George, E.I.: Empirical Bayes vs. fully Bayes variable selection. J. Stat. Plann. Inference 138,
888–900 (2008)
8. Deely, J.J., Lindley, D.V.: Bayes empirical Bayes. J. Am. Stat. Assoc. 76, 833–841 (1981)
9. Diaconis, P., Freedman, D.: On the consistency of Bayes estimates. Ann. Stat. 14, 1–26 (1986)
10. Efron, B.: Large-scale inference. Empirical Bayes methods for estimation, testing, and prediction. Cam-
bridge University Press, Cambridge (2010)
11. Efron, B., Morris, C.: Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes
case. J. Am. Stat. Assoc. 67, 130–139 (1972a)
12. Efron, B., Morris, C.: Empirical Bayes on vector observations: an extension of Stein’s method. Biometrika
59, 335–347 (1972b)
13. Efron, B., Morris, C.: Stein’s estimation rule and its competitors-an empirical Bayes approach. J. Am.
Stat. Assoc. 68, 117–130 (1973a)
14. Efron, B., Morris, C.: Combining possibly related estimation problems. (With discussion by Lindley, D.V.,
Copas, J.B., Dickey, J.M., Dawid, A.P., Smith, A.F.M., Birnbaum, A., Bartlett, M.S., Wilkinson, G.N.,
Nelder, J.A., Stein, C., Leonard, T., Barnard, G.A., Plackett, R.L.). J. R. Stat. Soc. Ser. B 35, 379–421
(1973b)
15. Efron, B., Morris, C.N.: Data analysis using Stein’s estimator and its generalizations. J. Am. Stat. Assoc.
70, 311–319 (1973c)
16. Favaro, S., Lijoi, A., Mena, R.H., Prünster, I.: Bayesian nonparametric inference for species variety with
a two parameter Poisson–Dirichlet process prior. J. R. Stat. Soc. Ser. B 71, 993–1008 (2009)
17. Fisher, R.A., Corbet, A.S., Williams, C.B.: The relation between the number of species and the number
of individuals in a random sample of an animal population. J. Anim. Ecol. 12, 42–58 (1943)
18. Forcina, A.: Gini’s contributions to the theory of inference. Int. Stat. Rev. 50, 65–70 (1982)
19. George, E.I., Foster, D.P.: Calibration and empirical Bayes variable selection. Biometrika 87, 731–747
(2000)
20. Ghosh, J.K., Ramamoorthi, R.V.: Bayesian nonparametrics. Springer, New York (2003)
21. Gini, C.: Considerazioni sulla probabilità a posteriori e applicazioni al rapporto dei sessi nelle nascite
umane. Studi Economico-Giuridici. Università di Cagliari. III. Reprinted in Metron, vol. 15, pp. 133–172
(1911)
22. Good, I.J.: Breakthroughs in statistics: foundations and basic theory. In: Johnson, N.L., Kotz, S. (eds.)
Introduction to Robbins (1992) An empirical Bayes approach to statistics, pp. 379–387. Springer, Berlin
(1995)
23. James, W., Stein, C.: Estimation with quadratic loss. In: Proceedings of Fourth Berkeley Symposium on
Mathematics Statistics and Probability, vol. 1, pp. 361–379. University of California Press, California
(1961)
24. Lehmann, E.L., Casella, G.: Theory of point estimation, 2nd edn. Springer, New York (1998)
25. Liang, F., Paulo, R., Molina, G., Clyde, M.A., Berger, J.O.: Mixtures of g-priors for Bayesian variable
selection. J. Am. Stat. Assoc. 103, 410–423 (2008)
26. Liu, J.S.: Nonparametric hierarchical Bayes via sequential imputation. Ann. Stat. 24, 911–930 (1996)
27. Maritz, J.S., Lwin, T.: Empirical Bayes methods, 2nd edn. Chapman and Hall, London (1989)
28. McAuliffe, J.D., Blei, D.M., Jordan, M.I.: Nonparametric empirical Bayes for the Dirichlet process
mixture model. Stat. Comput. 16, 5–14 (2006)
29. Morris, C.N.: Parametric empirical Bayes inference: theory and applications. J. Am. Stat. Assoc. 78,
47–55 (1983)
30. Petrone, S., Rousseau, J., Scricciolo, C.: Bayes and empirical Bayes: do they merge? Biometrika 101(2),
285–302 (2014)
31. Robbins, H.: An empirical Bayes approach to statistics. In: Proceedings of Third Berkeley Symposium
on Mathematics, Statistics and Probability, vol. 1, pp. 157–163. University of California Press, California
(1956)
32. Scott, J.G., Berger, J.O.: Bayes and empirical-Bayes multiplicity adjustment in the variable-selection
problem. Ann. Stat. 38, 2587–2619 (2010)
33. Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In:
Proceedings of Third Berkeley Symposium on Mathematics, Statistics and Probability, vol. 1, pp. 197–
206. University of California Press, California (1956)
34. Szabó, B.T., van der Vaart, A.W., van Zanten, J.H.: Empirical Bayes scaling of Gaussian priors in the
white noise model. Electron. J. Stat. 7, 991–1018 (2013)