


METRON (2014) 72:201–215
DOI 10.1007/s40300-014-0044-1

Empirical Bayes methods in classical and Bayesian inference

Sonia Petrone · Stefano Rizzelli · Judith Rousseau · Catia Scricciolo

Received: 27 April 2014 / Accepted: 5 May 2014 / Published online: 3 June 2014
© Sapienza Università di Roma 2014

S. Petrone (B) · S. Rizzelli · C. Scricciolo
Bocconi University, Milan, Italy
e-mail: sonia.petrone@unibocconi.it

J. Rousseau
CREST-ENSAE and CEREMADE, Université Paris Dauphine, Paris, France

Abstract Empirical Bayes methods are often thought of as a bridge between classical and Bayesian inference. In fact, in the literature the term empirical Bayes is used in quite diverse contexts and with different motivations. In this article, we provide a brief overview of empirical Bayes methods, highlighting their scopes and meanings in different problems. We focus on recent results about merging of Bayes and empirical Bayes posterior distributions that regard popular, but otherwise debatable, empirical Bayes procedures as computationally convenient approximations of Bayesian solutions.

Keywords Bayesian weak merging · Compound experiments · Frequentist strong merging · Hyper-parameter oracle value · Latent distribution · Maximum marginal likelihood estimation · Shrinkage estimation

1 Introduction

Empirical Bayes methods are popularly employed by researchers and practitioners and are attractive in appearing to bridge frequentist and Bayesian approaches to inference. In fact, a frequentist statistician would find just a formal Bayesian flavor in empirical Bayes methods, while a Bayesian statistician would say that there is nobody less Bayesian than an empirical Bayesian (Lindley, in [6]). To add to the confusion, in the literature the term empirical Bayes is used in quite diverse contexts, with different motivations. Classical empirical Bayes methods arose in the context of compound experiments, where a latent distribution driving experiment-specific parameters formally acts as a prior on each such parameter and is estimated from the data, usually by maximum likelihood. The term empirical Bayes is also used in the context of purely Bayesian inference when hyper-parameters of a subjective prior distribution are selected through the data. Empirical Bayes estimates are also popularly employed to deal with nuisance parameters. All these situations are different and require specific analysis.

In this article, we give a brief overview of classical and recent results on empirical Bayes methods, discussing their use in these different contexts. Section 2 recalls classical empirical Bayes methods in compound sampling problems and mixture models. Although arising as a way to bypass the need to specify a prior distribution in computing optimal Bayesian solutions [31], here the approach is purely frequentist. The empirical Bayes solution is basically shrinkage estimation, where the introduction of a latent distribution may facilitate interpretation and modeling, thus helping to design efficient shrinkage.

In a broader sense, the term empirical Bayes is used to denote a data-driven selection of prior hyper-parameters in Bayesian inference. We discuss this case in Sect. 3. Here, the prior distribution can only have an interpretation in terms of subjective probability on an unknown, but fixed, parameter. Although not rigorous, it is common practice to try to overcome difficulties in specifying the prior distribution by plugging in some estimates of the prior hyper-parameters. From a Bayesian viewpoint, in such cases one should rather assign a hyper-prior distribution, which however makes computations more involved. The empirical Bayes selection thus appears as a convenient way out that is expected to give inferential results similar to the hierarchical Bayesian solution for large sample sizes, and better results for finite samples than a "wrong choice" of the prior hyper-parameters. Although commonly trusted, these facts are not rigorously proved. Recent results [30] address the presumed asymptotic equivalence of Bayesian and empirical Bayes solutions in terms of merging. Roughly speaking, they show that, in regular parametric problems, the empirical Bayes and the Bayesian posterior distributions generally tend to merge, that is, to be asymptotically close, but also that possible divergent behavior may arise. Thus, the use of empirical Bayes prior selection requires much care.

Section 4 discusses another popular use of empirical Bayes methods in problems with nuisance parameters. We extend, in particular, the results of [30] on weak merging of empirical Bayes procedures to nuisance parameter problems, which we illustrate with partial linear regression models.

The result about merging recalled in Sect. 3 only gives a first-order asymptotic comparison between empirical Bayes and any Bayesian posterior distributions. A higher-order comparison would be needed to distinguish among them. We conclude the article with some hints on the finite-sample behavior of the empirical Bayes posterior distribution in a simple but insightful example (Sect. 5). The results suggest that, when merging holds, the empirical Bayes posterior distribution can indeed be a computationally convenient approximation of an efficient, in a sense to be specified, Bayesian solution.

2 Classical empirical Bayes

The introduction of the empirical Bayes method is traditionally associated with Robbins' article [31] on compound sampling problems. Compound sampling models arise in a variety of situations, including multi-site clinical trials, estimation of disease rates in small geographical areas, and longitudinal studies. In this setting, $n$ values $\theta_1,\dots,\theta_n$ are drawn at random from a latent distribution $G$. Then, conditionally on $\theta_1,\dots,\theta_n$, observable random variables $X_1,\dots,X_n$ are drawn independently from probability distributions $p(\cdot \mid \theta_1),\dots,p(\cdot \mid \theta_n)$, respectively. The framework can thus be described:


$$X_i \mid \theta_i \overset{\text{indep}}{\sim} p(\cdot \mid \theta_i), \qquad \theta_i \mid G \overset{\text{iid}}{\sim} G(\cdot), \quad i = 1,\dots,n,$$

where the index $i$ refers to the $i$-th experiment. Interest lies in estimating an experiment-specific parameter $\theta_i$ when all the $n$ observations $X_1,\dots,X_n$ are available. For the generic $i$-th experiment, one has $X_i \mid \theta_i \sim p(\cdot \mid \theta_i)$ and $\theta_i \sim G$; thus, the latent distribution $G$ formally plays the role of a prior distribution on $\theta_i$, in a Bayesian flavor. Were $G$ known, inference on $\theta_i$ would be carried out through Bayes' rule, computing the posterior distribution of $\theta_i$ given $X_i$, $dG(\theta_i \mid X_i) \propto p(X_i \mid \theta_i)\,dG(\theta_i)$, and $\theta_i$ could be estimated by the Bayes estimator with respect to squared error loss, i.e., the posterior mean $E_G[\theta_i \mid X_i] = \int \theta\,dG(\theta \mid X_i)$.

In fact, in general $G$ is unknown and the Bayes estimator $E_G[\theta_i \mid X_i]$ is not computable. One can however use an estimate of the "prior distribution" $G$ based on the available observations $X_1,\dots,X_n$, which is what originated the term "empirical Bayes". Were $\theta_1,\dots,\theta_n$ observable, their common distribution $G$ could be pointwise consistently estimated by the empirical cumulative distribution function (cdf) $\bar G_n(\theta) = n^{-1}\sum_{i=1}^n \mathbb{1}_{(-\infty,\theta]}(\theta_i)$. As the $\theta_i$ are not observable, the empirical Bayes approach suggests estimating $G$ from the data $X_1,\dots,X_n$, exploiting the fact that
$$X_i \mid G \overset{\text{iid}}{\sim} f_G(\cdot) = \int p(\cdot \mid \theta)\,dG(\theta), \quad i = 1,\dots,n.$$

We still denote by $\hat G_n$ any estimator for $G$ based on $X_1,\dots,X_n$. As in [31], consider $i = n$, that is, estimating $\theta_n$. The unit-specific unknown $\theta_n$ can be estimated by the empirical Bayes version $E_{\hat G_n}[\theta \mid X_n]$ of the posterior mean. Empirical Bayes methods considered in [31] have been named nonparametric empirical Bayes, because $G$ is assumed to be completely unknown, to distinguish them from parametric empirical Bayes methods later developed by Efron and Morris [11–15], where $G$ is assumed to be known up to a finite-dimensional parameter. If $G$ is completely unknown, then the cdf $F_G(x) = \int_{-\infty}^{x} f_G(u)\,du$, $x \in \mathbb{R}$, can be estimated by the empirical cdf $\hat F_n(x) = n^{-1}\sum_{i=1}^n \mathbb{1}_{(-\infty,x]}(X_i)$ which, for every fixed $x$, tends to $F_G(x)$ as $n \to \infty$, whichever the mixing distribution $G$. Thus, depending on the kernel density $p(\cdot \mid \theta)$ and the class $\mathcal{G}$ to which $G$ belongs, the estimator $\hat G_n$ entailed by $\hat F_n$ approximates $G$ for large $n$, and the corresponding empirical Bayes estimator $E_{\hat G_n}[\theta \mid X_n]$ for $\theta_n$ approximates the posterior mean $E_G[\theta \mid X_n]$. To illustrate this, we consider the following example due to Robbins [31], which deals with the Poisson case. Here, whatever the unknown distribution $G$, the posterior mean can be written as a ratio of values of the probability mass function $f_G(\cdot)$ evaluated at different points. These terms can be estimated by the corresponding values of the empirical mass function.

Example 1 Let $X_i \mid \theta_i \sim \text{Poisson}(\theta_i)$ independently, with $\theta_i \mid G \overset{\text{iid}}{\sim} G$, $i = 1,\dots,n$, where $G$ is a cdf on $\mathbb{R}^+$. In this case, $E_G[\theta \mid X = x] = (x+1) f_G(x+1)/f_G(x)$, $x = 0, 1, \dots$, which can be estimated by $\varphi_n(x) = (x+1)\sum_{i=1}^n \mathbb{1}_{\{x+1\}}(X_i) / \sum_{i=1}^n \mathbb{1}_{\{x\}}(X_i)$. Then, whatever the unknown distribution $G$, for any fixed $x$, $\varphi_n(x) \to E_G[\theta \mid X = x]$ as $n \to \infty$, with probability 1. This naturally suggests using $\varphi_n(X_n)$ as an estimator for $\theta_n$. Robbins [31] extended this technique to the cases where the $X_i$ have geometric, binomial or Laplace distributions.
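Robbins' rule is simple enough to state in a few lines of code. The following is a minimal simulation sketch of the estimator in Example 1; the Gamma mixing distribution, the sample size and the seed are arbitrary illustrative choices, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a compound Poisson experiment: latent rates theta_i ~ G (here G = Gamma(2, 1),
# an arbitrary illustrative choice), then X_i | theta_i ~ Poisson(theta_i).
n = 10_000
theta = rng.gamma(shape=2.0, scale=1.0, size=n)
x = rng.poisson(theta)

def robbins_estimate(x_new, data):
    """Robbins' estimator (x+1) * #{X_i = x+1} / #{X_i = x} of E_G[theta | X = x]."""
    num = np.sum(data == x_new + 1)
    den = np.sum(data == x_new)
    return (x_new + 1) * num / den if den > 0 else np.nan

for x_val in range(5):
    # Under G = Gamma(2, 1) the true posterior mean is (x + 2) / 2; printed for comparison.
    print(x_val, robbins_estimate(x_val, x), (x_val + 2) / 2)
```

Under the Gamma(2, 1) choice the posterior mean is available in closed form, $(x+2)/2$, which the printout uses as a check; no knowledge of $G$ enters the estimator itself.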

As discussed by Morris [29], parametric empirical Bayes procedures are needed to deal with those cases where $n$ is too small to approximate the Bayesian solution well, but a substantial improvement over standard methods can still be made, as with the James–Stein estimator. When the mixing distribution is assumed to have a specific parametric form $G(\cdot \mid \psi)$, it is common practice to estimate the unknown parameter $\psi$ from the data by maximum likelihood, computing $\hat\psi_n \equiv \hat\psi(X_1,\dots,X_n)$ as
$$\hat\psi_n = \arg\max_\psi \prod_{i=1}^n \int p(X_i \mid \theta)\,dG(\theta \mid \psi).$$
Inference on $\theta_n$ is then carried out using $G(\cdot \mid \hat\psi_n)$ to compute $E_{G(\cdot \mid \hat\psi_n)}[\theta \mid X_n]$. Empirical Bayes estimation of $\theta_n$ has the advantage of doing asymptotically as well as the Bayes estimator without knowing the "prior" distribution. However, the Bayesian approach and the empirical Bayes approach are only seemingly related: there is, indeed, a clear-cut difference between them. In the empirical Bayes approach to compound problems, although $G$ formally acts as a prior distribution on a single parameter, its introduction is motivated in a frequentist sense, as the common distribution of the random sample $(\theta_1,\dots,\theta_n)$; indeed, estimation of $G$ is carried out by frequentist methods. In the Bayesian approach, a prior distribution can be assigned to a fixed unknown parameter, being interpreted as a formalization of subjective information. In the context of multiple independent experiments, a Bayesian statistician would rather assume probabilistic dependence across experiments by regarding the $\theta_i$ as exchangeable and assigning a prior probability law to $G$ (in the nonparametric case) or to $\psi$ (in the parametric case); see, e.g., [1,4,8].

Rather than in comparison with Bayesian inference, the advantage of empirical Bayes methods can be appreciated in comparison with classical maximum likelihood estimators. The empirical Bayes estimate of $\theta_i$ makes efficient use of the available information because all data are used when estimating $G$. In other terms, empirical Bayes techniques involve learning from the experience of others or, using Tukey's evocative expression, "borrowing strength". To illustrate this crucial aspect, we consider the following classical example.

Example 2 Let $(X_1,\dots,X_p) \sim N_p(\theta, \sigma^2 I_p)$, a $p$-variate Gaussian distribution, where $\theta = (\theta_1,\dots,\theta_p)$ and $I_p$ is the $p$-dimensional identity matrix. Each $X_i$ can be the mean of a random sample $X_{i,j}$, $j = 1,\dots,n$, within the $i$-th experiment. Suppose $\sigma^2$ is known. Let $\theta_i \mid \psi \overset{\text{iid}}{\sim} N(0, \psi)$, with unknown variance $\psi$. Then, the maximum marginal likelihood estimator for $\psi$ is $\hat\psi_p = \max\{0, s^2 - \sigma^2\}$, where $s^2 = \sum_{i=1}^p x_i^2/p$. The empirical Bayes estimator for $\theta_i$ is $E_{N(0,\hat\psi_p)}[\theta \mid X_i] = [1 - (p-2)\sigma^2/\sum_{i=1}^p x_i^2]\,X_i$, which coincides with the James–Stein estimator [23,33] that dominates the maximum likelihood estimator $\hat\theta_i = X_i$ for $p \ge 3$ with respect to the overall quadratic loss $\sum_{i=1}^p (\theta_i - \hat\theta_i)^2$.
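The risk dominance in Example 2 is easy to check by simulation. The sketch below compares the overall quadratic loss of the James–Stein estimator and of the maximum likelihood estimator; the dimension, the true means and the replication count are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Risk of the James-Stein / empirical Bayes estimator vs. the MLE under X ~ N_p(theta, I_p).
p, reps = 10, 20_000
theta = np.linspace(-1.0, 1.0, p)
x = theta + rng.standard_normal((reps, p))     # sigma^2 = 1

s2 = np.sum(x**2, axis=1, keepdims=True)
js = (1.0 - (p - 2) / s2) * x                  # James-Stein shrinkage toward 0

risk_mle = np.mean(np.sum((x - theta)**2, axis=1))
risk_js = np.mean(np.sum((js - theta)**2, axis=1))
print(f"MLE risk ~ {risk_mle:.2f}, James-Stein risk ~ {risk_js:.2f}")  # JS < MLE for p >= 3
```

The MLE risk is exactly $p$ here, so the printout makes the "borrowing strength" gain directly visible.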

As remarked by Morris [29], the James–Stein estimator is minimax-optimal for the sum of the individual squared error loss functions only in the equal-variances case. Optimality is lost, for example, if global loss functions that weight the individual squared losses differently are used. Other forms of shrinkage, possibly suggested by the empirical Bayes approach, are then necessary.

We conclude this section with a historical note. Although the introduction of the empirical Bayes method is traditionally associated with Robbins' article [31], the idea was partially anticipated by, among others, Gini [21] who, as pointed out by Forcina [18], provided pioneering empirical Bayes solutions for estimating the parameter of a binomial distribution, and by Fisher et al. [17], who applied the parametric empirical Bayes technique to the so-called species sampling problem assuming a Gamma "prior" distribution; see also Good [22]. Since then, the field has witnessed tremendous growth, both in theoretical developments and in diversity of applications; see, e.g., the monographs [27] and [10].

3 Empirical Bayes selection of prior hyper-parameters

In a broader sense, the term empirical Bayes is commonly associated with general techniques that make use of a data-driven choice of the prior distribution in Bayesian inference. Here, the basic setting is inference on an exchangeable sequence $(X_i)$. Exchangeability is intended in a subjective sense: the data are physically independent, but probabilistic dependence is expressed among them, as past observations give information on future values, and such incomplete information is described probabilistically through the conditional distribution of $X_{n+1}, X_{n+2},\dots$, given $X_1 = x_1,\dots,X_n = x_n$. Exchangeability is the basic dependence assumption which, by de Finetti's representation theorem for exchangeable sequences, is equivalent to assuming a statistical model $p(\cdot \mid \theta)$ such that the $X_i$ are conditionally independent and identically distributed (iid) given $\theta$, and expressing a prior distribution on $\theta$. Thus, the statistical model and the prior are together a way of expressing the probability law of the observable sequence $(X_i)$, and in such a way they should be chosen. In fact, choosing an honest subjective prior in Bayesian inference can be a difficult task. A way of formulating such uncertainty is to assign the prior on $\theta$ hierarchically, assuming $\theta \mid \lambda \sim \pi(\cdot \mid \lambda)$, a parametric distribution depending on hyper-parameters $\lambda$, and $\lambda \sim H(\lambda)$. However, this often complicates computations, so that it is common practice to plug in some estimate $\hat\lambda_n$ of the prior hyper-parameters as a shortcut. The resulting data-dependent prior $\pi(\cdot \mid \hat\lambda_n)$, combined with the likelihood, results in a pseudo-posterior distribution $\pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n)$ that is commonly referred to as empirical Bayes. Many types of estimators for $\lambda$ are considered, the most popular being the maximum marginal likelihood estimator, defined as
$$\hat\lambda_n \in \arg\max_{\lambda \in \bar\Lambda} \int \prod_{i=1}^n p(X_i \mid \theta)\,\pi(d\theta \mid \lambda),$$
where $\bar\Lambda$ is the closure of $\Lambda$.
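For a concrete instance of this definition, the following sketch computes the maximum marginal likelihood estimator numerically in the Gaussian conjugate model used in Example 3 and in Sect. 5 below ($X_i \mid \theta \sim N(\theta, \sigma^2)$, $\theta \mid \lambda \sim N(0, \lambda)$); the data-generating values are arbitrary, and the closed-form expression used as a check is the one appearing in Sect. 5.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)

# Maximum marginal likelihood in the Gaussian conjugate model
# X_i | theta ~ N(theta, sigma2), theta | lambda ~ N(0, lambda); settings arbitrary.
sigma2, n = 1.0, 50
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=n)

def neg_log_marginal(lam):
    # As a function of lambda, the marginal likelihood depends on the data only
    # through xbar ~ N(0, lambda + sigma2/n); the remaining factors are constant in lambda.
    return -norm.logpdf(x.mean(), loc=0.0, scale=np.sqrt(lam + sigma2 / n))

res = minimize_scalar(neg_log_marginal, bounds=(1e-8, 100.0), method="bounded")
closed_form = max(0.0, x.mean()**2 - sigma2 / n)   # closed form, cf. Sect. 5
print(res.x, closed_form)                          # numerical and analytic MMLE agree
```

The one-dimensional optimization is all the empirical Bayes shortcut requires, which is precisely its computational appeal compared with a full hyper-prior.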

Such an empirical Bayes approach is appealing in offering the possibility of making Bayesian inference while bypassing a complete specification of the prior, and it is largely used in practical applications and in the literature: see, e.g., [7,19,25,32] in the context of variable selection in regression, [5] for wavelet shrinkage estimation, [26] and [28] in Bayesian nonparametric mixture models, [16] in Bayesian nonparametric inference for species diversity, and [2,3] and [34] in Bayesian nonparametric procedures for curve estimation.

Although popular, this mixed approach is not rigorous from a Bayesian point of view. Its interest mainly lies in being a computationally simpler alternative to a more rigorous, but usually analytically more complex, hierarchical specification of the prior: one expects that, when the sample size is large, the empirical Bayes posterior distribution will be close to some Bayesian posterior distribution. Moreover, for finite samples, a data-driven empirical Bayes selection of the prior hyper-parameters is expected to give better inferential results than a "wrong choice" of $\lambda$. These commonly believed facts do not seem to be rigorously proved in the literature. A recent work by Petrone et al. [30] addresses the supposed asymptotic equivalence between empirical Bayes and Bayesian posterior distributions in terms of merging.

Two notions of merging are considered: Bayesian weak merging in the sense of [9], and frequentist strong merging in the sense of [20]. Bayesian weak merging compares posterior distributions in terms of weak convergence, with respect to (wrt) the exchangeable probability law of $(X_i)$. Roughly speaking, we have weak merging of the empirical Bayes and Bayesian posterior distributions if any Bayesian statistician is sure that her/his posterior distribution and the empirical Bayes posterior distribution will eventually be close, in the sense of weak convergence. This is a minimal requirement, but it is not guaranteed. From results in [9], it can be proved that weak merging holds if and only if the empirical Bayes posterior distribution is consistent in the frequentist sense at the true value $\theta_0$, whatever $\theta_0$. Consistency at $\theta_0$ means that the sequence of empirical Bayes posterior distributions weakly converges to a point mass at $\theta_0$, almost surely wrt $P_{\theta_0}^\infty$, where $P_{\theta_0}^\infty$ denotes the probability law of $(X_i)$ such that the $X_i$ are iid according to $P_{\theta_0}$.

Sufficient conditions for consistency of empirical Bayes posterior distributions are provided in [30], Section 3. In general, consistency of Bayesian posterior distributions does not imply consistency of empirical Bayes posteriors. For the latter, one has to control the asymptotic behavior of the estimator $\hat\lambda_n$, too. If $\hat\lambda_n$ is the maximum marginal likelihood estimator, its properties can be exploited to show that the empirical Bayes posterior distribution is consistent at $\theta_0$ under essentially the same conditions which ensure consistency of Bayesian posterior distributions. For more general estimators, conditions become more cumbersome. When $\hat\lambda_n$ is a convergent sequence, sufficient conditions are given in Proposition 3 of [30], based on a change of the prior probability measure such that the dependence on the data is transferred from the prior to the likelihood.

Even when consistency and weak merging hold, the empirical Bayes posterior distribution may underestimate the uncertainty on $\theta$ and diverge from any Bayesian posterior relative to a stronger metric than the one of weak convergence. This behavior is illustrated in the following example.

Example 3 Let $X_i \mid \theta \sim N(\theta, \sigma^2)$ independently, with $\sigma^2$ known, and $\theta \sim N(\mu, \tau^2)$. Consider empirical Bayes inference where the prior variance $\lambda = \tau^2$ is estimated by the maximum marginal likelihood estimator, the prior mean $\mu$ being fixed. Then, see, e.g., [24], p. 263, $\sigma^2 + n\hat\tau_n^2 = \max\{\sigma^2, n(\bar X_n - \mu)^2\}$, so that $\hat\tau_n^2 = (\sigma^2/n)\max\{n(\bar X_n - \mu)^2/\sigma^2 - 1, 0\}$. The resulting empirical Bayes posterior distribution $\pi(\cdot \mid \hat\tau_n^2, X_1,\dots,X_n)$ is Gaussian with mean
$$\mu_n = \frac{\sigma^2/n}{\hat\tau_n^2 + \sigma^2/n}\,\mu + \frac{\hat\tau_n^2}{\hat\tau_n^2 + \sigma^2/n}\,\bar X_n$$
and variance $(1/\hat\tau_n^2 + n/\sigma^2)^{-1}$. Since $\hat\tau_n^2$ can be equal to zero with positive probability, the empirical Bayes posterior can be degenerate at $\mu$. The probability of the event $\hat\tau_n = 0$ converges to zero when $\theta_0 \ne \mu$, but remains strictly positive when $\theta_0 = \mu$. This suggests that, if $\theta_0 \ne \mu$, the hierarchical and the empirical Bayes posterior densities can asymptotically be close relative to some distance; however, if $\theta_0 = \mu$, there is a positive probability that the empirical Bayes and the Bayesian posterior distributions are singular. The possible degeneracy of the empirical Bayes posterior distribution is pathological in the sense that the uncertainty on the parameter is a posteriori underestimated.
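The degeneracy probability in Example 3 can be checked directly: under $\theta_0 = \mu$, $n(\bar X_n - \mu)^2/\sigma^2$ is $\chi^2_1$-distributed, so $P(\hat\tau_n^2 = 0) = P(\chi^2_1 \le 1) \approx 0.68$ for every $n$. A minimal simulation sketch follows; the numerical settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Frequency of the degenerate event tau_hat_n^2 = 0 in Example 3,
# under theta_0 = mu versus theta_0 != mu.
sigma2, mu, n, reps = 1.0, 0.0, 200, 10_000

def degeneracy_freq(theta0):
    x = theta0 + np.sqrt(sigma2) * rng.standard_normal((reps, n))
    xbar = x.mean(axis=1)
    tau2_hat = (sigma2 / n) * np.maximum(n * (xbar - mu)**2 / sigma2 - 1.0, 0.0)
    return np.mean(tau2_hat == 0.0)

print("theta_0 = mu:  P(tau2_hat = 0) ~", degeneracy_freq(mu))    # stays near 0.68
print("theta_0 != mu: P(tau2_hat = 0) ~", degeneracy_freq(0.5))   # vanishes for large n
```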

Such behaviour is not restricted to the Gaussian distribution and applies more generally to location-scale families of priors. If the model admits a maximum likelihood estimator $\hat\theta_n$ and the prior density is of the form $\tau^{-1} g((\cdot - \mu)/\tau)$, with $\lambda = (\mu, \tau)$, for some unimodal density $g$ that attains its maximum at zero, then $\hat\lambda_n = (\hat\theta_n, 0)$ and the empirical Bayes posterior is a point mass at $\hat\theta_n$. These families of priors should not be jointly used with maximum marginal likelihood empirical Bayes procedures.

A way to refine the analysis to better understand the impact of a data-dependent prior on the posterior distribution is to study frequentist strong merging in the sense of [20]. Two sequences of posterior distributions are said to merge strongly if their total variation distance converges to zero almost surely wrt $P_{\theta_0}^\infty$.

Strong merging of Bayesian posterior distributions in nonparametric contexts is often impossible, since pairs of priors are typically singular. Petrone et al. [30] study the problem for regular parametric models, comparing Bayesian posterior distributions and empirical Bayes posterior distributions based on the maximum marginal likelihood estimator of $\lambda$. Informally, their results show that strong merging may hold for some true values $\theta_0$, but may fail for others. That is, for values of $\theta_0$ in an appropriate set, say $\Theta_0$, the empirical Bayes posterior distribution strongly merges with any Bayesian posterior distribution corresponding to a prior distribution $q$ which is continuous and bounded at $\theta_0$:
$$\|\pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n) - q(\cdot \mid X_1,\dots,X_n)\|_{TV} \to 0 \qquad (1)$$
almost surely wrt $P_{\theta_0}^\infty$, where $\|\cdot\|_{TV}$ denotes the total variation distance. However, for $\theta_0 \notin \Theta_0$, strong merging fails: the empirical Bayes posterior can indeed be singular wrt any smooth Bayesian posterior distribution.

More precisely, suppose that the prior distribution has density $\pi(\cdot)$ with respect to some dominating measure and includes $\theta_0$ in its Kullback–Leibler support. Furthermore, suppose that the parameter space is totally bounded. Assume that conditions hold that guarantee consistency for the empirical Bayes and the Bayesian posterior distributions. Under such assumptions, and some additional requirements that are satisfied by regular parametric models, it can be shown ([30], Theorem 1) that the maximum marginal likelihood estimator $\hat\lambda_n$ converges to a value $\lambda^*$ (here assumed to be unique, for brevity) such that $\pi(\theta_0 \mid \lambda^*) \ge \pi(\theta_0 \mid \lambda)$ for every $\lambda$ in the hyper-parameter space $\Lambda$. Such a value can be interpreted as the "oracle value" of the hyper-parameter, that is, the value of the hyper-parameter for which the prior mostly favors the true value $\theta_0$. Furthermore, it is proved that if $\theta_0$ is such that $\pi(\theta_0 \mid \lambda^*) < \infty$, then strong merging holds, namely
$$\|\pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n) - \pi(\cdot \mid \lambda^*, X_1,\dots,X_n)\|_{TV} \to 0 \qquad (2)$$
almost surely wrt $P_{\theta_0}^\infty$. Since $\|\pi(\cdot \mid \lambda^*, X_1,\dots,X_n) - q(\cdot \mid X_1,\dots,X_n)\|_{TV}$ goes to zero $P_{\theta_0}^\infty$-almost surely for any prior $q$ that is continuous and bounded at $\theta_0$ ([20], Theorem 1.3.1), by the triangle inequality one has (1). However, if $\theta_0$ is such that $\pi(\theta_0 \mid \lambda^*) = \infty$, then strong merging fails. This is the case if, for such $\theta_0$, $\lambda^*$ is on the boundary of $\Lambda$ and the prior distribution is degenerate at $\theta_0$ for $\lambda \to \lambda^*$. In this case, the empirical Bayes posterior distribution is degenerate too, thus it is singular wrt any smooth Bayesian posterior.

Result (1), which holds only in the non-degenerate case, ensures that the empirical Bayes posterior distribution will be close in total variation to the Bayesian posterior, whatever the prior distribution. But this result only provides a first-order asymptotic comparison that does not distinguish among Bayesian solutions. In fact, from (2), one could expect that the empirical Bayes approach can actually give a closer approximation of an efficient Bayesian solution, in the sense of one using the prior distribution that mostly favors the true value $\theta_0$. Higher-order asymptotic results are beyond the scope of this note, but we will return to this issue in Sect. 5, providing a simple, but we believe insightful, example.

4 Empirical Bayes selection of nuisance parameters

Another relevant context of application of empirical Bayes methods concerns Bayesian analysis in semi-parametric models, where estimation of nuisance parameters is preliminarily considered in order to carry out inference on the component of interest. The framework can thus be described: observations $X_1,\dots,X_n$ are drawn independently from a distribution with density $p_{\psi,\lambda}(\cdot)$,
$$X_i \mid (\psi, \lambda) \overset{\text{iid}}{\sim} p_{\psi,\lambda}(\cdot), \quad i = 1,\dots,n,$$
where $\psi \in \Psi \subseteq \mathbb{R}^k$ is the parameter of interest and $\lambda \in \Lambda \subseteq \mathbb{R}$ a nuisance parameter. Bayesian inference with nuisance parameters does not conceptually present particular difficulties: a prior distribution is assigned to the overall parameter $(\psi, \lambda)$,


$$\Pi(d\psi, d\lambda) = \Pi(d\psi \mid \lambda)\,\Pi(d\lambda),$$
and inference on $\psi$ is carried out by marginalizing the joint posterior distribution $\Pi(d\psi, d\lambda \mid X_1,\dots,X_n)$. However, this can be computationally cumbersome. A common approach is thus to plug in some estimator $\hat\lambda_n$ of $\lambda$ and use a data-dependent prior
$$\Pi(d\psi \mid \hat\lambda_n)\,\delta_{\hat\lambda_n}(d\lambda). \qquad (3)$$

We highlight the difference between the present context and the one described in Sect. 3: there, $\hat\lambda_n$ is used to estimate a hyper-parameter $\lambda$, a parameter of the prior only, whereas here $\hat\lambda_n$ is used to estimate $\lambda$, which is, in the first place, a component of the overall parameter of the model: it is part of the model.

The results developed in [30] can be extended to prove the asymptotic equivalence, in terms of weak merging, between the empirical Bayes posterior and any Bayesian posterior for $\psi$, provided $\hat\lambda_n$ is a sequence of consistent estimators for the true value $\lambda_0$ corresponding to the density $p_{\psi_0,\lambda_0}$ generating the observations. It is known from Proposition 1 in [30] that a necessary and sufficient condition for weak merging is that the empirical Bayes posterior for $\psi$ is consistent at $\psi_0$, namely, $\Pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n)$ weakly converges to a point mass at $\psi_0$, $\Pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n) \Rightarrow \delta_{\psi_0}$, along almost all sample paths when sampling from the infinite product measure $P_{\psi_0,\lambda_0}^\infty$. To illustrate how this assertion can be shown, we present an example on partially linear regression.

Example 4 Suppose we observe a random sample from the distribution of $X = (Y, V, W)$ in which, for some unobservable error $e$ independent of $(V, W)$, the relationship among the components is described by
$$Y = \psi V + \eta_\lambda(W) + e.$$
The dependent variable $Y$ is a regression on $(V, W)$ that is linear in $V$ with slope $\psi$, but may depend on $W$ in a nonlinear way through $\eta_\lambda(W)$, which represents an additive contamination of the linear structure of $Y$. We assume that $V$ and $W$ take values in $[0,1]$ and that, for $\lambda \in \Lambda \subseteq \mathbb{R}$, the function $w \mapsto \eta_\lambda(w)$ is known up to $\lambda$. If the error $e$ is assumed to be normal, $e \sim N(0, \sigma_0^2)$ with known variance $\sigma_0^2$, then the density of $X$ is given by
$$p_{\psi,\lambda}(x) = \phi_{\sigma_0}(y - \psi v - \eta_\lambda(w))\,p_{V,W}(v, w), \quad x = (y, v, w) \in \mathbb{R} \times [0,1]^2,$$
where $\phi_{\sigma_0}(\cdot) = \sigma_0^{-1}\phi(\cdot/\sigma_0)$, with $\phi$ the standard Gaussian density, and $p_{V,W}$ the joint density of $(V, W)$. Consider an empirical Bayes approach that estimates $\lambda$ by any sequence $\hat\lambda_n$ of consistent estimators for $\lambda_0$ and uses the empirical Bayes posterior $\Pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n)$, corresponding to a prior of the form in (3), to carry out inference on $\psi$. The empirical Bayes posterior $\Pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n)$ weakly merges with the posterior for $\psi$ corresponding to any genuine prior on $(\psi, \lambda)$ if only $\Pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n)$ is consistent at $\psi_0$. We show that, for every $\delta > 0$, the empirical Bayes posterior probability $\Pi(|\psi - \psi_0| > \delta \mid \hat\lambda_n, X_1,\dots,X_n) \to 0$ in $P_{\psi_0,\lambda_0}^n$-probability. Let $m_{\psi,\lambda}(v, w) = \psi v + \eta_\lambda(w)$. Assume there exists a constant $B > 0$ such that $\sup_{\psi,\lambda} \|m_{\psi,\lambda}\|_\infty \le B$. Since the Hellinger distance $h(p_{\psi,\lambda_0}, p_{\psi_0,\lambda_0}) \ge (E[V^2])^{1/2}\,e^{-B^2/4\sigma_0^2}\,|\psi - \psi_0|/2\sigma_0$, the inclusion $\{\psi : |\psi - \psi_0| > \delta\} \subseteq \{\psi : h(p_{\psi,\lambda_0}, p_{\psi_0,\lambda_0}) > M\delta\}$ holds for a suitable positive constant $M$. To prove the claim, it is therefore enough to study the asymptotic behavior of $\Pi(h(p_{\psi,\lambda_0}, p_{\psi_0,\lambda_0}) > M\delta \mid \hat\lambda_n, X_1,\dots,X_n)$ which, if the prior for $\psi$, given $\lambda$, belongs to a location family of $\nu$-densities generated by $\pi_0(\cdot)$, i.e., $\pi(\cdot \mid \lambda) = \pi_0(\cdot - \lambda)$, is equal to
$$\Pi(h(p_{\psi,\lambda_0}, p_{\psi_0,\lambda_0}) > M\delta \mid \hat\lambda_n, X_1,\dots,X_n) = \frac{\int_{\{h(p_{\psi,\lambda_0},\,p_{\psi_0,\lambda_0}) > M\delta\}} \prod_{i=1}^n \phi_{\sigma_0}(Y_i - m_{\psi,\hat\lambda_n}(V_i, W_i))\,\pi_0(\psi - \hat\lambda_n)\,\nu(d\psi)}{\int \prod_{i=1}^n \phi_{\sigma_0}(Y_i - m_{\psi_0,\lambda_0}(V_i, W_i))\,\pi_0(\psi - \hat\lambda_n)\,\nu(d\psi)} =: \frac{N(X_1,\dots,X_n)}{D(X_1,\dots,X_n)}.$$

Assume there exist a continuous function $g : [0,1] \to \mathbb{R}$ and $\alpha > 0$ such that, for any $\lambda, \lambda_0 \in \Lambda$, the difference $|\eta_\lambda(w) - \eta_{\lambda_0}(w)| \le |g(w)|\,|\lambda - \lambda_0|^\alpha \le \|g\|_\infty\,|\lambda - \lambda_0|^\alpha$ for every $w \in [0,1]$. Then, on the event $\Omega_n = \{-a_n \le \min_i Y_i \le \max_i Y_i \le a_n,\ |\hat\lambda_n - \lambda_0| \le u_n\}$ which, for sequences $u_n \downarrow 0$ and $a_n = O((\log n)^\kappa)$, $\kappa > 0$, has probability $P_{\psi_0,\lambda_0}^n(\Omega_n) = 1 + o(1)$, we have
$$N(X_1,\dots,X_n)/D(X_1,\dots,X_n) \le \exp\{2n(u_n + \|g\|_\infty u_n^\alpha)(a_n + B) + n u_n^2\} \times \Pi(h(p_{\psi,\lambda_0}, p_{\psi_0,\lambda_0}) > M\delta \mid \lambda_0, X_1,\dots,X_n).$$
If the Bayesian posterior $\Pi(\cdot \mid \lambda_0, X_1,\dots,X_n)$ is Hellinger consistent at $P_{\psi_0,\lambda_0}$ and the convergence is exponentially fast, then also the empirical Bayes posterior $\Pi(\cdot \mid \hat\lambda_n, X_1,\dots,X_n)$ is consistent at $P_{\psi_0,\lambda_0}$, and the claim that $\Pi(|\psi - \psi_0| > \delta \mid \hat\lambda_n, X_1,\dots,X_n) \to 0$ follows.
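The argument in Example 4 is asymptotic, but the plug-in construction itself is straightforward to implement. The following sketch assumes a hypothetical contamination $\eta_\lambda(w) = \sin(\lambda w)$ and a $N(0, \tau^2)$ prior on $\psi$, neither of which is specified in the example; $\lambda$ is estimated by profile least squares and then plugged into the conjugate Gaussian posterior for $\psi$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch of Example 4 with a hypothetical contamination eta_lambda(w) = sin(lambda * w);
# psi0, lambda0, sigma0 and the N(0, tau2) prior on psi are illustrative choices.
psi0, lam0, sigma0, n = 1.5, 3.0, 0.5, 500
v, w = rng.uniform(size=n), rng.uniform(size=n)
y = psi0 * v + np.sin(lam0 * w) + sigma0 * rng.standard_normal(n)

def profile_sse(lam):
    # Profile out psi by least squares at each candidate lambda.
    r = y - np.sin(lam * w)
    psi_hat = (v @ r) / (v @ v)
    return np.sum((r - psi_hat * v)**2)

# A consistent plug-in estimator lambda_hat_n: profile least squares over a grid.
grid = np.linspace(0.0, 6.0, 2001)
lam_hat = grid[np.argmin([profile_sse(l) for l in grid])]

# Empirical Bayes posterior for psi given lambda_hat: with a Gaussian prior
# psi ~ N(0, tau2) and Gaussian errors, the posterior is conjugate Gaussian.
tau2 = 10.0
r = y - np.sin(lam_hat * w)
post_var = 1.0 / (1.0 / tau2 + (v @ v) / sigma0**2)
post_mean = post_var * (v @ r) / sigma0**2
print(f"lambda_hat = {lam_hat:.3f},  EB posterior for psi: N({post_mean:.3f}, {post_var:.2e})")
```

Once $\lambda$ is replaced by $\hat\lambda_n$, inference on $\psi$ reduces to a standard conjugate computation, which is exactly what the data-dependent prior (3) buys.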

5 Higher-order comparisons and finite-sample properties

We return to the discussion at the end of Sect. 3, providing a simple example. Although limited to the Gaussian case, this gives some hints about finer comparisons between Bayesian and empirical Bayes posterior distributions. The evidence in this example could be extended to more general contexts, such as Bayesian inference and variable selection in linear regression with $g$-priors.

As discussed, an empirical Bayes choice of the prior hyper-parameters in Bayesian inference is not rigorous, but can be of interest as an approximation of a computationally more involved hierarchical Bayesian posterior distribution. In fact, the results recalled in Sect. 3 show that, even for regular parametric models, the empirical Bayes posterior distribution can be singular wrt any smooth Bayesian posterior, depending on the form of the prior distribution and on the nature of the prior hyper-parameters. Thus, care is needed when using empirical Bayes methods as an approximation of Bayesian solutions. On the positive side, these results show that, in non-degenerate cases, the empirical Bayes posterior distribution does merge strongly with any smooth Bayesian posterior distribution. However, this first-order asymptotic comparison does not distinguish among Bayesian posterior distributions arising from different priors. The aim here is to grasp some evidence for finer comparisons. We explore the following two issues.

Asymptotically, in regular parametric models, any smooth Bayesian posterior distribution is approximated by a Gaussian distribution centered at the maximum likelihood estimate $\hat\theta_n$, by the Bernstein–von Mises theorem. Strong merging of Bayesian and empirical Bayes posterior distributions implies that, when a Bernstein–von Mises behavior holds for the Bayesian posterior distribution, it also holds for the empirical Bayes posterior, which is a particularly interesting implication in nonparametric problems. In fact, one would expect that the empirical Bayes posterior distribution can provide a closer approximation to a hierarchical Bayesian posterior than the Bernstein–von Mises Gaussian distribution.


Less informally, based on the results of Sect. 3, one would conjecture that the hierarchical posterior distribution concentrates around the oracle value $\lambda^*$ of the prior hyper-parameters for increasing sample sizes and that, since $\hat\lambda_n \to \lambda^*$, the empirical Bayes posterior distribution $\pi(\theta \mid \hat\lambda_n, X_1,\dots,X_n)$ and the hierarchical Bayesian posterior distribution $h(\theta \mid X_1,\dots,X_n)$ can be close even for moderate sample sizes. The following example suggests that, although this is the case asymptotically by the results in Sect. 3, the posterior distribution $h(\lambda \mid X_1,\dots,X_n)$ incorporates the sample information slowly, so that, for finite samples, the empirical Bayes posterior distribution $\pi(\theta \mid \hat\lambda_n, X_1,\dots,X_n)$ is a close approximation of $h(\theta \mid X_1,\dots,X_n)$ only if the prior distribution on $\lambda$ is concentrated enough around the oracle value $\lambda^*$. In other words, the example suggests that, in the non-degenerate case, the empirical Bayes posterior distribution is a higher-order approximation of the posterior distribution of a "well informed" Bayesian researcher whose prior highly favors the true value of $\theta$.

Example 5 Consider the simple example of the Gaussian conjugate model introduced in Sect. 3, now with a hierarchical specification of the prior. Let $X_i \mid \theta \sim N(\theta, \sigma^2)$ independently, with $\sigma^2$ known. Let $\theta \mid \lambda \sim N(0, \lambda)$ and $1/\lambda \sim \text{Ga}(\alpha, \beta)$, a Gamma distribution with rate parameter $\beta > 0$, so that $\lambda$ has an inverse-Gamma distribution with shape $\alpha$ and scale $\beta$. Then, $E(\lambda) = \beta/(\alpha - 1)$ and $V(\lambda) = \beta^2/[(\alpha - 1)^2(\alpha - 2)]$. The prior of $\theta$ obtained by integrating out $\lambda$ is a Student's $t$ with zero mean, $2\alpha$ degrees of freedom and scaling factor $\beta/\alpha$. The prior variance of $\theta$ equals the prior guess on the hyper-parameter $\lambda$: $V(\theta) = E(\lambda)$. Although this is a simple model, computation of the posterior distribution of $\theta$ becomes analytically complicated. The conditional distribution of $\theta$, given $\lambda$ and the data, is
$$\theta \mid (\lambda, x_1,\dots,x_n) \sim N\big((n\lambda + \sigma^2)^{-1} n\lambda \bar x_n,\ (n\lambda + \sigma^2)^{-1}\sigma^2\lambda\big),$$
and the posterior distribution of $\theta$ is obtained by integrating $\lambda$ out wrt its posterior distribution $h(\lambda \mid x_1,\dots,x_n)$. This integration step is not analytically manageable, and approximation by Markov chain Monte Carlo (MCMC) is usually employed.
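A Gibbs sampler for this model alternates between the two conjugate full conditionals: $\theta \mid \lambda$, data, is the Gaussian displayed above, and $1/\lambda \mid \theta \sim \text{Ga}(\alpha + 1/2,\ \text{rate} = \beta + \theta^2/2)$. A minimal sketch follows; the data settings mirror the simulation study reported below, but the sampler itself is our illustration, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

# Gibbs sampler for the hierarchical model of Example 5:
# X_i | theta ~ N(theta, sigma2), theta | lambda ~ N(0, lambda), 1/lambda ~ Gamma(alpha, rate=beta).
sigma2, n, alpha, beta = 1.0, 50, 4.0, 12.0            # beta = 12 gives E[lambda] = 4
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=n)
xbar = x.mean()

iters, burn = 20_000, 2_000
theta, lam = 0.0, 1.0
theta_draws, lam_draws = [], []
for t in range(iters):
    # theta | lambda, data: conjugate Gaussian update.
    post_var = 1.0 / (1.0 / lam + n / sigma2)
    theta = rng.normal(post_var * n * xbar / sigma2, np.sqrt(post_var))
    # 1/lambda | theta ~ Gamma(alpha + 1/2, rate = beta + theta^2 / 2).
    lam = 1.0 / rng.gamma(shape=alpha + 0.5, scale=1.0 / (beta + theta**2 / 2.0))
    if t >= burn:
        theta_draws.append(theta)
        lam_draws.append(lam)

# Empirical Bayes shortcut: plug lambda_hat_n = max(0, xbar^2 - sigma2/n) into E[theta | lambda, data].
lam_hat = max(0.0, xbar**2 - sigma2 / n)
eb_mean = n * lam_hat / (n * lam_hat + sigma2) * xbar
print(f"hierarchical E[theta | data] ~ {np.mean(theta_draws):.3f},  EB plug-in mean = {eb_mean:.3f}")
print(f"hierarchical E[lambda | data] ~ {np.mean(lam_draws):.3f},  lambda_hat_n = {lam_hat:.3f}")
```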

The empirical Bayes selection of $\lambda$ is an attractive, computationally simpler shortcut. Estimation of $\lambda$ via maximum marginal likelihood gives $\hat\lambda_n = \max\{0, \bar x_n^2 - \sigma^2/n\}$. Thus, the maximum marginal likelihood estimator $\hat\lambda_n$ may take the value zero, on the boundary of $\Lambda = (0, \infty)$, with positive probability. If $\hat\lambda_n = 0$, then the empirical Bayes prior distribution of $\theta$ is a point mass at the prior guess and the resulting posterior distribution is degenerate. As seen in Sect. 3, if the true value $\theta_0 = E[\theta] = 0$, the probability of degeneracy remains positive even when $n \to \infty$, thus determining an asymptotic divergence between the empirical Bayes posterior distribution and the hierarchical Bayesian posterior distribution. If $\theta_0 \ne 0$, such probability goes to zero and strong merging holds. Interest is in investigating higher-order approximations in this case.

We first focus on point estimation with quadratic loss. The Bayes estimate is the posterior expectation
$$E[\theta \mid x_1,\dots,x_n] = \frac{\int (1 + \theta^2/2\beta)^{-(2\alpha+1)/2} \exp\!\big\{-n\big[-n^{-1}\log\theta + \tfrac{1}{2\sigma^2}(\theta - \bar x_n)^2\big]\big\}\,d\theta}{\int (1 + \theta^2/2\beta)^{-(2\alpha+1)/2} \exp\!\big\{-\tfrac{n}{2\sigma^2}(\theta - \bar x_n)^2\big\}\,d\theta}, \qquad (4)$$


Table 1  Comparing empirical Bayes and Laplace point estimates as approximations of the hierarchical Bayes point estimates. Simulated data with $\theta_0 = 2$ and $\sigma^2 = 1$.

(a) $n = 20$; $\bar x_n = 1.835$

                                             E[λ]=1/3    E[λ]=1      E[λ]=3      E[λ]=4      E[λ]=10
Ê_LC[θ | x_1,…,x_n] (Laplace appr.)           1.797       1.769       1.749       1.745       1.738
E[θ | x_1,…,x_n] (hierarc. Bayes, by MCMC)    1.683       1.750       1.801       1.805       1.821
  (standard dev.)                            (0.0029)    (0.0031)    (0.0033)    (0.0034)    (0.0033)
E[λ | x_1,…,x_n] (Bayes, by MCMC)             0.074       1.301       3.018       3.902       8.915
  (standard dev.)                            (0.0051)    (0.0113)    (0.0261)    (0.0332)    (0.0721)
λ̂_n (maximum marginal lik.)                   3.320       3.320       3.320       3.320       3.320
E[θ | λ̂_n, x_1,…,x_n] (empirical Bayes)       1.809       1.809       1.809       1.809       1.809

(b) $n = 50$; $\bar x_n = 2.009$

                                             E[λ]=1/3    E[λ]=1      E[λ]=3      E[λ]=4      E[λ]=10
Ê_LC[θ | x_1,…,x_n] (Laplace appr.)           1.994       1.982       1.972       1.970       1.967
E[θ | x_1,…,x_n] (hierarc. Bayes, by MCMC)    1.951       1.974       1.993       1.999       2.005
  (standard dev.)                            (0.0018)    (0.0019)    (0.0022)    (0.0022)    (0.0024)
E[λ | x_1,…,x_n] (Bayes, by MCMC)             0.833       1.403       3.117       4.012       9.103
  (standard dev.)                            (0.0074)    (0.0123)    (0.0251)    (0.0342)    (0.0911)
λ̂_n (maximum marginal lik.)                   4.016       4.016       4.016       4.016       4.016
E[θ | λ̂_n, x_1,…,x_n] (empirical Bayes)       1.999       1.999       1.999       1.999       1.999


[Figure 1 appears here: a 2×2 grid of density plots.]
Fig. 1 Comparing empirical Bayes and hierarchical Bayesian posterior densities. Simulated data from a Gaussian distribution $N(2, 6)$; $n = 20$; $\bar x_n = 1.667$. $E[\lambda] = 1/3$ (first row) and $E[\lambda] = 1$ (second row). First column: MCMC estimate of the posterior density of $\lambda$; the full square denotes $E(\lambda \mid x_1,\dots,x_n)$ and the empty square denotes the maximum marginal likelihood estimate $\hat\lambda_n$. Second column: hierarchical Bayesian posterior density of $\theta$ (MCMC estimate; solid curve), empirical Bayes posterior density of $\theta$ (dashed curve) and limit Gaussian density $N(\bar x_n, \sigma^2/n)$ (bold solid curve). The empty triangle denotes $E[\theta \mid x_1,\dots,x_n]$; the star denotes $E[\theta \mid \hat\lambda_n, x_1,\dots,x_n]$; the full triangle denotes the sample mean $\bar x_n$.

for which a closed-form expression is not available. Its empirical Bayes approximation is obtained by plugging $\hat\lambda_n$ into the expression of $E[\theta \mid \lambda, x_1,\dots,x_n]$:
$$E[\theta \mid \hat\lambda_n, x_1,\dots,x_n] = \frac{n\hat\lambda_n}{n\hat\lambda_n + \sigma^2}\,\bar x_n = \Big(1 - \frac{\sigma^2}{n\hat\lambda_n + \sigma^2}\Big)\bar x_n. \qquad (5)$$

We may expect that
$$E[\theta \mid x_1,\dots,x_n] = \int E[\theta \mid \lambda, x_1,\dots,x_n]\,h(\lambda \mid x_1,\dots,x_n)\,d\lambda = E[\theta \mid \hat\lambda_n, x_1,\dots,x_n] + O(n^{-k}),$$
since, as $n$ increases, $\hat\lambda_n$ tends to the oracle value $\lambda^*$ and $h(\lambda \mid x_1,\dots,x_n)$ could collapse to a point mass at $\lambda^*$. It is interesting to investigate the order of the error term $O(n^{-k})$. To grasp some evidence, we compare the empirical Bayes point estimate with the Laplace approximation developed in [24], p. 270:


[Figure 2 appears here: a 2×2 grid of density plots.]
Fig. 2 Comparing empirical Bayes and hierarchical Bayesian posterior densities. Simulated data from a Gaussian distribution $N(2, 6)$; $n = 20$; $\bar x_n = 1.667$. $E[\lambda] = 3$ (first row) and $E[\lambda] = 4$ (second row). Legend as for Fig. 1.

$$\hat E_{LC}[\theta \mid x_1,\dots,x_n] = \left(1 - \frac{(2\alpha+1)/(2\alpha)}{1 + \bar x_n^2/(2\beta)}\,\frac{\sigma^2}{n}\right)\bar x_n, \qquad (6)$$
which is a special case of the Laplace approximation, with error term $O(n^{-3/2})$.
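Since the model is one-dimensional, the hierarchical posterior mean (4) can also be computed by direct numerical quadrature, which gives an exact benchmark for the plug-in estimate (5) and the Laplace approximation (6). The sketch below uses the data summaries of Table 1(a); the quadrature code is our addition, and its output need not match the Monte Carlo values in the table exactly.

```python
import numpy as np
from scipy.integrate import quad

# Numerical comparison of the hierarchical posterior mean (4), the empirical Bayes
# plug-in (5), and the Laplace approximation (6); data summaries follow Table 1(a).
n, xbar, sigma2, alpha = 20, 1.835, 1.0, 4.0

def hb_mean(beta):
    # Hierarchical Bayes posterior mean (4), by one-dimensional quadrature.
    prior = lambda t: (1.0 + t**2 / (2.0 * beta)) ** (-(2.0 * alpha + 1.0) / 2.0)
    kern = lambda t: prior(t) * np.exp(-n * (t - xbar)**2 / (2.0 * sigma2))
    num, _ = quad(lambda t: t * kern(t), -np.inf, np.inf)
    den, _ = quad(kern, -np.inf, np.inf)
    return num / den

def eb_mean():
    lam_hat = max(0.0, xbar**2 - sigma2 / n)            # maximum marginal likelihood estimate
    return n * lam_hat / (n * lam_hat + sigma2) * xbar  # plug-in posterior mean (5)

def laplace_mean(beta):
    shrink = ((2.0 * alpha + 1.0) / (2.0 * alpha)) / (1.0 + xbar**2 / (2.0 * beta))
    return (1.0 - shrink * sigma2 / n) * xbar           # approximation (6)

for beta in (1.0, 3.0, 9.0, 12.0, 30.0):                # E[lambda] = beta / (alpha - 1)
    print(f"beta={beta:5.1f}: HB={hb_mean(beta):.3f}  Laplace={laplace_mean(beta):.3f}  EB={eb_mean():.3f}")
```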

Table 1 compares $E[\theta \mid \hat\lambda_n, x_1,\dots,x_n]$ and $\hat E_{LC}[\theta \mid x_1,\dots,x_n]$ as approximations of the hierarchical Bayes point estimate $E[\theta \mid x_1,\dots,x_n]$ in a simulation study where $\theta_0 = 2$ and $\sigma^2 = 1$. Along the columns, the value of $\alpha$ is fixed at 4, while $\beta$ varies, thus resulting in different prior guesses $E[\lambda]$. Since $E[\lambda] = V(\theta)$, increasing values of $\beta$ correspond to smaller precision of the hierarchical prior. When $\beta = 12$, the prior guess equals the oracle value, i.e., $E[\lambda] = \lambda^* = 4$. In this case, the empirical Bayes point estimate provides a clearly better approximation of $E[\theta \mid x_1,\dots,x_n]$ than $\hat E_{LC}[\theta \mid x_1,\dots,x_n]$. For example, Table 1(b) shows how $E[\theta \mid x_1,\dots,x_n]$ and $E[\theta \mid \hat\lambda_n, x_1,\dots,x_n]$ coincide up to the thousandths digit for $n = 50$ and $E[\lambda] = 4$. This suggests a higher-order form of merging between the empirical Bayes posterior distribution and the hierarchical posterior distribution of a "more informed" Bayesian statistician, i.e., one who assigns a hyper-prior such that $E[\lambda] = \lambda^*$. In order to shed light on this point, we now consider density approximation.

We first want to check whether the empirical Bayes posterior distribution provides a better approximation of the hierarchical Bayesian posterior distribution than the Bernstein–von Mises Gaussian approximating distribution, $N(\bar x_n, \sigma^2/n)$. This comparison has been investigated in several simulation studies, each one giving similar indications. We report the results for simulated data from a Gaussian distribution with mean $\theta_0 = 2$ and variance $\sigma^2 = 6$ (Figs. 1, 2). The hierarchical Bayesian posterior densities are computed by Gibbs sampling. The first column in the plots shows the posterior density $h(\lambda \mid x_1,\dots,x_n)$ of $\lambda$. This appears to concentrate slowly towards the oracle value $\lambda^* = 4$. The second column shows the MCMC approximation of the hierarchical Bayesian posterior density of $\theta$, together with the empirical Bayes posterior density (dashed curve) and the limit Gaussian density $N(\bar x_n, \sigma^2/n)$ (bold curve). What emerges is that, for a prior guess of $\lambda$ close to the oracle value, the empirical Bayes posterior density provides a better approximation of the hierarchical Bayesian posterior density already for the small sample size $n = 20$. This seems to confirm the previously formulated conjecture: for finite sample sizes, empirical Bayes provides a good approximation of the hierarchical Bayesian procedure adopted by the more informed statistician, and strong merging may hold up to a higher-order approximation.

References

1. Antoniak, C.E.: Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Stat. 2, 1152–1174 (1974)
2. Belitser, E., Enikeeva, F.: Empirical Bayesian test of the smoothness. Math. Methods Stat. 17, 1–18 (2008)
3. Belitser, E., Levit, B.: On the empirical Bayes approach to adaptive filtering. Math. Methods Stat. 12, 131–154 (2003)
4. Berry, D.A., Christensen, R.: Empirical Bayes estimation of a binomial parameter via mixtures of Dirichlet processes. Ann. Stat. 7, 558–568 (1979)
5. Clyde, M.A., George, E.I.: Flexible empirical Bayes estimation for wavelets. J. R. Stat. Soc. Ser. B 62, 681–698 (2000)
6. Copas, J.B.: Compound decisions and empirical Bayes (with discussion). J. R. Stat. Soc. Ser. B 31, 397–425 (1969)
7. Cui, W., George, E.I.: Empirical Bayes vs. fully Bayes variable selection. J. Stat. Plann. Inference 138, 888–900 (2008)
8. Deely, J.J., Lindley, D.V.: Bayes empirical Bayes. J. Am. Stat. Assoc. 76, 833–841 (1981)
9. Diaconis, P., Freedman, D.: On the consistency of Bayes estimates. Ann. Stat. 14, 1–26 (1986)
10. Efron, B.: Large-scale inference. Empirical Bayes methods for estimation, testing, and prediction. Cambridge University Press, Cambridge (2010)
11. Efron, B., Morris, C.: Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes case. J. Am. Stat. Assoc. 67, 130–139 (1972a)
12. Efron, B., Morris, C.: Empirical Bayes on vector observations: an extension of Stein's method. Biometrika 59, 335–347 (1972b)
13. Efron, B., Morris, C.: Stein's estimation rule and its competitors: an empirical Bayes approach. J. Am. Stat. Assoc. 68, 117–130 (1973a)
14. Efron, B., Morris, C.: Combining possibly related estimation problems (with discussion by Lindley, D.V., Copas, J.B., Dickey, J.M., Dawid, A.P., Smith, A.F.M., Birnbaum, A., Bartlett, M.S., Wilkinson, G.N., Nelder, J.A., Stein, C., Leonard, T., Barnard, G.A., Plackett, R.L.). J. R. Stat. Soc. Ser. B 35, 379–421 (1973b)
15. Efron, B., Morris, C.N.: Data analysis using Stein's estimator and its generalizations. J. Am. Stat. Assoc. 70, 311–319 (1975)
16. Favaro, S., Lijoi, A., Mena, R.H., Prünster, I.: Bayesian nonparametric inference for species variety with a two-parameter Poisson–Dirichlet process prior. J. R. Stat. Soc. Ser. B 71, 993–1008 (2009)
17. Fisher, R.A., Corbet, A.S., Williams, C.B.: The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12, 42–58 (1943)
18. Forcina, A.: Gini's contributions to the theory of inference. Int. Stat. Rev. 50, 65–70 (1982)
19. George, E.I., Foster, D.P.: Calibration and empirical Bayes variable selection. Biometrika 87, 731–747 (2000)
20. Ghosh, J.K., Ramamoorthi, R.V.: Bayesian nonparametrics. Springer, New York (2003)
21. Gini, C.: Considerazioni sulla probabilità a posteriori e applicazioni al rapporto dei sessi nelle nascite umane. Studi Economico-Giuridici, Università di Cagliari, III (1911). Reprinted in Metron 15, 133–172
22. Good, I.J.: Introduction to Robbins (1956) "An empirical Bayes approach to statistics". In: Johnson, N.L., Kotz, S. (eds.) Breakthroughs in statistics: foundations and basic theory, pp. 379–387. Springer, Berlin (1992)
23. James, W., Stein, C.: Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, vol. 1, pp. 361–379. University of California Press, California (1961)
24. Lehmann, E.L., Casella, G.: Theory of point estimation, 2nd edn. Springer, New York (1998)
25. Liang, F., Paulo, R., Molina, G., Clyde, M.A., Berger, J.O.: Mixtures of g-priors for Bayesian variable selection. J. Am. Stat. Assoc. 103, 410–423 (2008)
26. Liu, J.S.: Nonparametric hierarchical Bayes via sequential imputation. Ann. Stat. 24, 911–930 (1996)
27. Maritz, J.S., Lwin, T.: Empirical Bayes methods, 2nd edn. Chapman and Hall, London (1989)
28. McAuliffe, J.D., Blei, D.M., Jordan, M.I.: Nonparametric empirical Bayes for the Dirichlet process mixture model. Stat. Comput. 16, 5–14 (2006)
29. Morris, C.N.: Parametric empirical Bayes inference: theory and applications. J. Am. Stat. Assoc. 78, 47–55 (1983)
30. Petrone, S., Rousseau, J., Scricciolo, C.: Bayes and empirical Bayes: do they merge? Biometrika 101(2), 285–302 (2014)
31. Robbins, H.: An empirical Bayes approach to statistics. In: Proceedings of the Third Berkeley Symposium on Mathematics, Statistics and Probability, vol. 1, pp. 157–163. University of California Press, California (1956)
32. Scott, J.G., Berger, J.O.: Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann. Stat. 38, 2587–2619 (2010)
33. Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In: Proceedings of the Third Berkeley Symposium on Mathematics, Statistics and Probability, vol. 1, pp. 197–206. University of California Press, California (1956)
34. Szabó, B.T., van der Vaart, A.W., van Zanten, J.H.: Empirical Bayes scaling of Gaussian priors in the white noise model. Electron. J. Stat. 7, 991–1018 (2013)
