METRON (2014) 72:201–215
DOI 10.1007/s40300-014-0044-1
Empirical Bayes methods in classical and Bayesian inference

Sonia Petrone · Stefano Rizzelli · Judith Rousseau · Catia Scricciolo

Received: 27 April 2014 / Accepted: 5 May 2014 / Published online: 3 June 2014
© Sapienza Università di Roma 2014

S. Petrone · S. Rizzelli · C. Scricciolo
Bocconi University, Milan, Italy
e-mail: sonia.petrone@unibocconi.it

J. Rousseau
CREST-ENSAE and CEREMADE, Université Paris Dauphine, Paris, France
Abstract Empirical Bayes methods are often thought of as a bridge between classical and
Bayesian inference. In fact, in the literature the term empirical Bayes is used in quite diverse
contexts and with different motivations. In this article, we provide a brief overview of empirical Bayes methods highlighting their scopes and meanings in different problems. We focus on
recent results about merging of Bayes and empirical Bayes posterior distributions that regard
popular, but otherwise debatable, empirical Bayes procedures as computationally convenient
approximations of Bayesian solutions.
Keywords Bayesian weak merging · Compound experiments · Frequentist strong merging · Hyper-parameter oracle value · Latent distribution · Maximum marginal likelihood estimation · Shrinkage estimation
1 Introduction
Empirical Bayes methods are popularly employed by researchers and practitioners and are
attractive in appearing to bridge frequentist and Bayesian approaches to inference. In fact, a
frequentist statistician would find only a formal Bayesian flavor in empirical Bayes methods,
while a Bayesian statistician would say that there is nobody less Bayesian than an empirical
Bayesian (Lindley, in [6]). Adding to the confusion, in the literature the term empirical Bayes is
used in quite diverse contexts, with different motivations. Classical empirical Bayes methods
arose in the context of compound experiments, where a latent distribution driving experiment-specific
parameters formally acts as a prior on each such parameter and is estimated
from the data, usually by maximum likelihood. The term empirical Bayes is also used in the
context of purely Bayesian inference when hyper-parameters of a subjective prior
distribution are selected through the data. Empirical Bayes estimates are also popularly employed
to deal with nuisance parameters. All these situations are different and require specific
analysis.
In this article, we give a brief overview of classical and recent results on empirical Bayes
methods, discussing their use in these different contexts. Section 2 recalls classical empirical
Bayes methods in compound sampling problems and mixture models. Although arising as
a way to by-pass the need of specifying a prior distribution in computing optimal Bayesian
solutions [31], here the approach is purely frequentist. The empirical Bayes solution is basically
shrinkage estimation, where the introduction of a latent distribution may facilitate
interpretation and modeling, thus helping to design efficient shrinkage.
In a broader sense, the term empirical Bayes is used to denote a data-driven selection
of prior hyper-parameters in Bayesian inference. We discuss this case in Sect. 3. Here, the
prior distribution can only have an interpretation in terms of subjective probability on an
unknown, but fixed, parameter. Although not rigorous, it is a common practice to try to
overcome difficulties in specifying the prior distribution by plugging in some estimates of
the prior hyper-parameters. From a Bayesian viewpoint, in such cases one should rather
assign a hyper-prior distribution, which however makes computations more involved. The
empirical Bayes selection thus appears as a convenient way out that is expected to give, for large
sample sizes, inferential results similar to those of the hierarchical Bayesian solution and, for
finite samples, better results than a "wrong choice" of the prior hyper-parameters. Although commonly
asymptotic equivalence of Bayesian and empirical Bayes solutions in terms of merging.
Roughly speaking, they show that, in regular parametric problems, the empirical Bayes and
the Bayesian posterior distributions generally tend to merge, that is, to be asymptotically
close, but also that possible divergent behavior may arise. Thus, the use of empirical Bayes
prior selection requires much care.
Section 4 discusses another popular use of empirical Bayes methods in problems with
nuisance parameters. We extend, in particular, the results of [30] on weak merging of
empirical Bayes procedures to nuisance parameter problems, which we illustrate with partial
linear regression models.
The result about merging recalled in Sect. 3 only gives a first-order asymptotic comparison
between empirical Bayes and Bayesian posterior distributions. A higher-order comparison
would be needed to distinguish among them. We conclude the article with some hints
on the finite-sample behavior of the empirical Bayes posterior distribution in a simple but
insightful example (Sect. 5). The results suggest that, when merging holds, the empirical
Bayes posterior distribution can indeed be a computationally convenient approximation of
an efficient, in a sense to be specified, Bayesian solution.
2 Classical empirical Bayes
The introduction of the empirical Bayes method is traditionally associated with Robbins’
article [31] on compound sampling problems. Compound sampling models arise in a variety
of situations including multi-site clinical trials, estimation of disease rates in small geographical areas, and longitudinal studies. In this setting, $n$ values $\theta_1,\dots,\theta_n$ are drawn at random from a latent distribution $G$. Then, conditionally on $\theta_1,\dots,\theta_n$, observable random variables $X_1,\dots,X_n$ are drawn independently from probability distributions $p(\cdot\mid\theta_1),\dots,p(\cdot\mid\theta_n)$, respectively. The framework can be thus described:
$$X_i\mid\theta_i \overset{\text{ind}}{\sim} p(\cdot\mid\theta_i), \qquad \theta_i\mid G \overset{\text{iid}}{\sim} G(\cdot), \qquad i=1,\dots,n,$$
where the index $i$ refers to the $i$th experiment. Interest lies in estimating an experiment-specific parameter $\theta_i$ when all the $n$ observations $X_1,\dots,X_n$ are available. For the generic $i$th experiment, one has $X_i\mid\theta_i \sim p(\cdot\mid\theta_i)$ and $\theta_i \sim G$; thus, the latent distribution $G$ formally plays the role of a prior distribution on $\theta_i$, in a Bayesian flavor. Were $G$ known, inference on $\theta_i$ would be carried out through Bayes' rule, computing the posterior distribution of $\theta_i$ given $X_i$, $dG(\theta_i\mid X_i) \propto p(X_i\mid\theta_i)\,dG(\theta_i)$, and $\theta_i$ could be estimated by the Bayes estimator with respect to squared error loss, i.e., the posterior mean $E_G[\theta_i\mid X_i]=\int\theta\,dG(\theta\mid X_i)$.

In fact, in general $G$ is unknown and the Bayes estimator $E_G[\theta_i\mid X_i]$ is not computable. One can however use an estimate of the "prior distribution" $G$ based on the available observations $X_1,\dots,X_n$, which is what originated the term "empirical Bayes". Were $\theta_1,\dots,\theta_n$ observable, their common distribution $G$ could be pointwise consistently estimated by the empirical cumulative distribution function (cdf) of $\theta_1,\dots,\theta_n$, $n^{-1}\sum_{i=1}^{n}\mathbf{1}_{(-\infty,\theta]}(\theta_i)$. As the $\theta_i$ are not observable, the empirical Bayes approach suggests estimating $G$ from the data $X_1,\dots,X_n$, exploiting the fact that
$$X_i\mid G \overset{\text{iid}}{\sim} f_G(\cdot)=\int p(\cdot\mid\theta)\,dG(\theta), \qquad i=1,\dots,n.$$
We denote by $\widehat G_n$ any estimator for $G$ based on $X_1,\dots,X_n$. As in [31], consider $i=n$, that is, estimating $\theta_n$. The unit-specific unknown $\theta_n$ can be estimated by the empirical Bayes version $E_{\widehat G_n}[\theta\mid X_n]$ of the posterior mean. Empirical Bayes methods considered in [31] have been named nonparametric empirical Bayes, because $G$ is assumed to be completely unknown, to distinguish them from parametric empirical Bayes methods later developed by Efron and Morris [11–15], where $G$ is assumed to be known up to a finite-dimensional parameter. If $G$ is completely unknown, then the cdf $F_G(x)=\int_{-\infty}^{x}f_G(u)\,du$, $x\in\mathbb{R}$, can be estimated from the empirical cdf $\widehat F_n(x)=n^{-1}\sum_{i=1}^{n}\mathbf{1}_{(-\infty,x]}(X_i)$ which, for every fixed $x$, tends to $F_G(x)$ as $n\to\infty$, whichever the mixing distribution $G$. Thus, depending on the kernel density $p(\cdot\mid\theta)$ and the class $\mathcal{G}$ to which $G$ belongs, the estimator $\widehat G_n$ entailed by $\widehat F_n$ approximates $G$ for large $n$, and the corresponding empirical Bayes estimator $E_{\widehat G_n}[\theta\mid X_n]$ for $\theta_n$ approximates the posterior mean $E_G[\theta\mid X_n]$. To illustrate this, we consider the following example due to Robbins [31], which deals with the Poisson case. Here, whatever the unknown distribution $G$, the posterior mean can be written as a ratio of values of the probability mass function $f_G(\cdot)$ at different points. These terms can be estimated by the corresponding values of the empirical mass function.
Example 1 Let $X_i\mid\theta_i \sim \text{Poisson}(\theta_i)$ independently, with $\theta_i\mid G \overset{\text{iid}}{\sim} G$, $i=1,\dots,n$, where $G$ is a cdf on $\mathbb{R}^+$. In this case, $E_G[\theta\mid X=x]=(x+1)f_G(x+1)/f_G(x)$, $x=0,1,\dots$, which can be estimated by $\varphi_n(x)=(x+1)\sum_{i=1}^{n}\mathbf{1}_{\{x+1\}}(X_i)/\sum_{i=1}^{n}\mathbf{1}_{\{x\}}(X_i)$. Then, whatever the unknown distribution $G$, for any fixed $x$, $\varphi_n(x)\to E_G[\theta\mid X=x]$ as $n\to\infty$, with probability 1. This naturally suggests using $\varphi_n(X_n)$ as an estimator for $\theta_n$. Robbins [31] extended this technique to the cases where the $X_i$ have a geometric, binomial or Laplace distribution.
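To make the mechanics concrete, the following sketch simulates a Poisson compound experiment and applies Robbins' estimator; the Gamma mixing distribution, the sample size and the random seed are our own illustrative choices, not taken from the article.

```python
# Robbins' nonparametric empirical Bayes estimator for the Poisson compound problem
# (Example 1). The mixing distribution G (a Gamma) and the sample size are
# illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
theta = rng.gamma(shape=2.0, scale=1.5, size=n)   # theta_i ~ G (latent, never observed)
x = rng.poisson(theta)                            # X_i | theta_i ~ Poisson(theta_i)

def robbins(x_new, x_sample):
    """phi_n(x) = (x + 1) * #{i : X_i = x + 1} / #{i : X_i = x}."""
    num = np.sum(x_sample == x_new + 1)
    den = np.sum(x_sample == x_new)
    return (x_new + 1) * num / den if den > 0 else float(x_new)  # crude fallback for empty cells

# Empirical Bayes estimate of theta_n from X_n, "borrowing strength" from all n observations
print("X_n =", x[-1], "  EB estimate:", round(robbins(x[-1], x), 3))
# Oracle posterior mean E_G[theta | X = x], available in closed form here only because
# the illustrative G is a conjugate Gamma(shape 2, scale 1.5): (x + 2) * 1.5 / 2.5
print("oracle posterior mean:", round((x[-1] + 2) * 1.5 / 2.5, 3))
```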
As discussed by Morris [29], parametric empirical Bayes procedures are needed to deal with those cases where $n$ is too small to well approximate the Bayesian solution, but a substantial improvement over standard methods can still be made, as with the James–Stein estimator. When the mixing distribution is assumed to have a specific parametric form $G(\cdot\mid\psi)$, it is common practice to estimate the unknown parameter $\psi$ from the data by maximum likelihood, computing $\hat\psi_n\equiv\hat\psi(X_1,\dots,X_n)$ as
$$\hat\psi_n=\operatorname*{argmax}_{\psi}\ \prod_{i=1}^{n}\int p(X_i\mid\theta)\,dG(\theta\mid\psi).$$
Inference on $\theta_n$ is then carried out using $G(\cdot\mid\hat\psi_n)$ to compute $E_{G(\cdot\mid\hat\psi_n)}[\theta\mid X_n]$. Empirical Bayes estimation of $\theta_n$ has the advantage of doing asymptotically as well as the Bayes estimator without knowing the "prior" distribution. However, the Bayesian approach and the empirical Bayes approach are only seemingly related: there is, indeed, a clear-cut difference between them. In the empirical Bayes approach to compound problems, although $G$ formally acts as a prior distribution on a single parameter, its introduction is motivated in a frequentist sense, as the common distribution of the random sample $(\theta_1,\dots,\theta_n)$; indeed, estimation of $G$ is carried out by frequentist methods. In the Bayesian approach, a prior distribution can be assigned to a fixed unknown parameter, being interpreted as a formalization of subjective information. In the context of multiple independent experiments, a Bayesian statistician would rather assume probabilistic dependence across experiments by regarding the $\theta_i$ as exchangeable and assigning a prior probability law to $G$ (in the nonparametric case) or to $\psi$ (in the parametric case); see, e.g., [1,4,8].
Rather than in comparison with Bayesian inference, the advantage of empirical Bayes
methods can be appreciated in comparison with classical maximum likelihood estimators.
The empirical Bayes estimate of θimakes efficient use of the available information because
all data are used when estimating G. In other terms, empirical Bayes techniques involve
learning from the experience of others or, using Tukey’s evocative expression, “borrowing
strength”. To illustrate this crucial aspect, we consider the following classical example.
Example 2 Let $(X_1,\dots,X_p)\sim N_p(\theta,\sigma^2 I_p)$, a $p$-variate Gaussian distribution, where $\theta=(\theta_1,\dots,\theta_p)$ and $I_p$ is the $p$-dimensional identity matrix. Each $X_i$ can be the mean of a random sample $X_{i,j}$, $j=1,\dots,n$, within the $i$th experiment. Suppose $\sigma^2$ is known. Let $\theta_i\mid\psi \overset{\text{iid}}{\sim} N(0,\psi)$, with unknown variance $\psi$. Then, the maximum marginal likelihood estimator for $\psi$ is $\hat\psi_p=\max\{0,\ s^2-\sigma^2\}$, where $s^2=\sum_{i=1}^{p}x_i^2/p$. The empirical Bayes estimator for $\theta_i$ is $E_{N(0,\hat\psi_p)}[\theta\mid X_i]=\bigl[1-(p-2)\sigma^2/\sum_{i=1}^{p}x_i^2\bigr]X_i$, which coincides with the James–Stein estimator [23,33] that dominates the maximum likelihood estimator $\hat\theta_i=X_i$ for $p\ge 3$ with respect to the overall quadratic loss $\sum_{i=1}^{p}(\theta_i-\hat\theta_i)^2$.
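The shrinkage of Example 2 is easy to reproduce on simulated data. In the sketch below (our illustration, with arbitrary $p$, $\psi$ and $\sigma^2$), the shrinkage weight is estimated in two ways: by plugging in $\hat\psi_p=\max\{0,s^2-\sigma^2\}$, and by the unbiased estimate $(p-2)\sigma^2/\sum x_i^2$ of $\sigma^2/(\psi+\sigma^2)$, which yields exactly the James–Stein form.

```python
# Parametric empirical Bayes shrinkage in the Gaussian compound problem (Example 2).
# The settings (p, psi, sigma^2) are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(1)
p, sigma2, psi = 50, 1.0, 2.0
theta = rng.normal(0.0, np.sqrt(psi), size=p)        # theta_i | psi ~ N(0, psi)
x = rng.normal(theta, np.sqrt(sigma2))               # X_i | theta_i ~ N(theta_i, sigma^2)

s2 = np.mean(x**2)
psi_hat = max(0.0, s2 - sigma2)                      # maximum marginal likelihood estimate of psi
eb = psi_hat / (psi_hat + sigma2) * x                # plug-in empirical Bayes rule
js = (1.0 - (p - 2) * sigma2 / np.sum(x**2)) * x     # James-Stein estimator

print("total squared error  MLE: %.1f   plug-in EB: %.1f   James-Stein: %.1f"
      % (np.sum((x - theta)**2), np.sum((eb - theta)**2), np.sum((js - theta)**2)))
```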
As remarked by Morris [29], the James–Stein estimator is minimax-optimal for the sum of the individual squared error loss functions only in the equal-variances case. Optimality is lost, for example, if global loss functions that weight the individual squared losses differently are used. Other forms of shrinkage, possibly suggested by the empirical Bayes approach, are then necessary.
We conclude this section with a historical note. Although the introduction of the empirical Bayes method is traditionally associated with Robbins' article [31], the idea was partially anticipated by, among others, Gini [21] who, as pointed out by Forcina [18], provided pioneering empirical Bayes solutions for estimating the parameter of a binomial distribution, and by Fisher et al. [17], who applied the parametric empirical Bayes technique to the so-called species sampling problem assuming a Gamma "prior" distribution; see also Good [22]. Since then, the field has witnessed a tremendous growth, both in terms of theoretical developments and in diversity of applications; see, e.g., the monographs [27] and [10].
3 Empirical Bayes selection of prior hyper-parameters
In a broader sense, the term empirical Bayes is commonly associated with general techniques
that make use of a data-driven choice of the prior distribution in Bayesian inference. Here,
the basic setting is inference on an exchangeable sequence $(X_i)$. Exchangeability is intended in a subjective sense: the data are physically independent, but probabilistic dependence is expressed among them, as past observations give information on future values, and such incomplete information is described probabilistically through the conditional distribution of $X_{n+1},X_{n+2},\dots$, given $X_1=x_1,\dots,X_n=x_n$. Exchangeability is the basic dependence assumption, which, by de Finetti's representation theorem for exchangeable sequences, is equivalent to assuming a statistical model $p(\cdot\mid\theta)$ such that the $X_i$ are conditionally independent and identically distributed (iid) given $\theta$, and expressing a prior distribution on $\theta$. Thus, the statistical model and the prior are together a way of expressing the probability law of the observable sequence $(X_i)$, and in such a way they should be chosen. In fact, choosing an honest subjective prior in Bayesian inference can be a difficult task. A way of formulating
such uncertainty is to assign the prior on $\theta$ hierarchically, assuming $\theta\mid\lambda \sim \Pi(\cdot\mid\lambda)$, a parametric distribution depending on hyper-parameters $\lambda$, and $\lambda \sim H(\lambda)$. However, this often complicates computations, so that it is a common practice to plug in some estimate $\hat\lambda_n$ of the prior hyper-parameters as a shortcut. The resulting data-dependent prior $\Pi(\cdot\mid\hat\lambda_n)$, combined with the likelihood, results in a pseudo-posterior distribution $\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)$ that is commonly referred to as empirical Bayes. Many types of estimators for $\lambda$ are considered, the most popular being the maximum marginal likelihood estimator, defined as
$$\hat\lambda_n \in \operatorname*{argmax}_{\lambda\in\bar\Lambda}\ \int \prod_{i=1}^{n} p(X_i\mid\theta)\,\Pi(d\theta\mid\lambda),$$
where $\bar\Lambda$ is the closure of $\Lambda$.
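As a concrete illustration of the maximum marginal likelihood selection of $\lambda$, the sketch below uses a Beta–Bernoulli model with a symmetric Beta($\lambda,\lambda$) prior (our choice; the model, data and optimizer are not from the article): the marginal likelihood is available in closed form and is maximized numerically over $\lambda$.

```python
# Maximum marginal likelihood selection of a prior hyper-parameter (Section 3).
# Illustrative Beta-Bernoulli model with prior theta ~ Beta(lambda, lambda); the
# marginal likelihood is B(lambda + s, lambda + n - s) / B(lambda, lambda).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.7, size=100)     # X_i | theta iid Bernoulli(theta), with theta_0 = 0.7
s, n = int(x.sum()), x.size

def neg_log_marginal(log_lam):
    lam = np.exp(log_lam)              # optimize on the log scale so that lambda stays positive
    return -(betaln(lam + s, lam + n - s) - betaln(lam, lam))

opt = minimize_scalar(neg_log_marginal, bounds=(-10.0, 10.0), method="bounded")
lam_hat = float(np.exp(opt.x))
print("maximum marginal likelihood estimate of lambda:", round(lam_hat, 3))
# The empirical Bayes (pseudo-)posterior for theta is then Beta(lam_hat + s, lam_hat + n - s)
print("empirical Bayes posterior mean of theta:", round((lam_hat + s) / (2 * lam_hat + n), 3))
```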
Such an empirical Bayes approach is appealing in offering the possibility of making Bayesian inference while by-passing a complete specification of the prior, and it is largely used in practical applications and in the literature: see, e.g., [7,19,25,32] in the context of variable selection in regression, [5] for wavelet shrinkage estimation, [26] and [28] in Bayesian nonparametric mixture models, [16] in Bayesian nonparametric inference for species diversity, and [2,3] and [34] in Bayesian nonparametric procedures for curve estimation.
Although popular, this mixed approach is not rigorous from a Bayesian point of view. Its
interest mainly lies in being a computationally simpler alternative to a more rigorous, but
usually analytically more complex, hierarchical specification of the prior: one expects that,
when the sample size is large, the empirical Bayes posterior distribution will be close to
some Bayesian posterior distribution. Moreover, for finite samples, a data-driven empirical
Bayes selection of the prior hyper-parameters is expected to give better inferential results
than a “wrong choice” of λ. These commonly believed facts do not seem to be rigorously
proved in the literature. A recent work by Petrone et al. [30] addresses the supposed asymp-
totic equivalence between empirical Bayes and Bayesian posterior distributions in terms of
merging.
Two notions of merging are considered: Bayesian weak merging in the sense of [9], and
frequentist strong merging in the sense of [20]. Bayesian weak merging compares posterior
distributions in terms of weak convergence, with respect to (wrt) the exchangeable probability
law of (Xi). Roughly speaking, we have weak merging of the empirical Bayes and Bayesian
posterior distributions if any Bayesian statistician is sure that her/his posterior distribution
and the empirical Bayes posterior distribution will eventually be close, in the sense of weak
convergence. This is a minimal requirement, but it is not guaranteed. From results in [9], it can be proved that weak merging holds if and only if the empirical Bayes posterior distribution is consistent in the frequentist sense at the true value $\theta_0$, whatever $\theta_0$. Consistency at $\theta_0$ means
that the sequence of empirical Bayes posterior distributions weakly converges to a point mass at $\theta_0$, almost surely wrt $P_{\theta_0}^{\infty}$, where $P_{\theta_0}^{\infty}$ denotes the probability law of $(X_i)$ such that the $X_i$ are iid according to $P_{\theta_0}$.
Sufficient conditions for consistency of empirical Bayes posterior distributions are provided in [30], Section 3. In general, consistency of Bayesian posterior distributions does not imply consistency of empirical Bayes posteriors. For the latter, one also has to control the asymptotic behavior of the estimator $\hat\lambda_n$. If $\hat\lambda_n$ is the maximum marginal likelihood estimator, its properties can be exploited to show that the empirical Bayes posterior distribution is consistent at $\theta_0$ under essentially the same conditions which ensure consistency of Bayesian posterior distributions. For more general estimators, conditions become more cumbersome. When $\hat\lambda_n$ is a convergent sequence, sufficient conditions are given in Proposition 3 of [30], based on a change of the prior probability measure such that the dependence on the data is transferred from the prior to the likelihood.
Even when consistency and weak merging hold, the empirical Bayes posterior distribution may underestimate the uncertainty on $\theta$ and diverge from any Bayesian posterior relative to a metric stronger than that of weak convergence. This behavior is illustrated in the following example.
Example 3 Let $X_i\mid\theta \sim N(\theta,\sigma^2)$ independently, with $\sigma^2$ known, and $\theta \sim N(\mu,\tau^2)$. Consider empirical Bayes inference where the prior variance $\lambda=\tau^2$ is estimated by the maximum marginal likelihood estimator, the prior mean $\mu$ being fixed. Then, see, e.g., [24], p. 263, $\sigma^2+n\hat\tau_n^2=\max\{\sigma^2,\ n(\bar X_n-\mu)^2\}$, so that $\hat\tau_n^2=(\sigma^2/n)\max\{n(\bar X_n-\mu)^2/\sigma^2-1,\ 0\}$. The resulting empirical Bayes posterior distribution $\Pi(\cdot\mid\hat\tau_n^2,X_1,\dots,X_n)$ is Gaussian with mean $\mu_n=\dfrac{\sigma^2/n}{\hat\tau_n^2+\sigma^2/n}\,\mu+\dfrac{\hat\tau_n^2}{\hat\tau_n^2+\sigma^2/n}\,\bar X_n$ and variance $(1/\hat\tau_n^2+n/\sigma^2)^{-1}$. Since $\hat\tau_n^2$ can be equal to zero with positive probability, the empirical Bayes posterior can be degenerate at $\mu$. The probability of the event $\hat\tau_n=0$ converges to zero when $\theta_0\neq\mu$, but remains strictly positive when $\theta_0=\mu$. This suggests that, if $\theta_0\neq\mu$, the hierarchical and the empirical Bayes posterior densities can asymptotically be close relative to some distance; however, if $\theta_0=\mu$, there is a positive probability that the empirical Bayes and the Bayesian posterior distributions are singular. The possible degeneracy of the empirical Bayes posterior distribution is pathological in the sense that the uncertainty on the parameter is a posteriori underestimated.
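A small simulation (ours, with arbitrary settings) makes the pathology of Example 3 tangible: it estimates the probability of the degenerate event $\hat\tau_n^2=0$, which stays bounded away from zero when $\theta_0=\mu$ and vanishes when $\theta_0\neq\mu$.

```python
# Frequency of the degenerate event tau_hat_n^2 = 0 in Example 3, i.e. of
# n * (xbar_n - mu)^2 <= sigma^2, under theta_0 = mu and theta_0 != mu.
# mu, sigma^2, n and the number of replications are illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.0, 1.0, 200, 20_000

def degeneracy_rate(theta0):
    xbar = rng.normal(theta0, np.sqrt(sigma2 / n), size=reps)   # sampling distribution of xbar_n
    tau2_hat = np.maximum(0.0, (xbar - mu)**2 - sigma2 / n)     # maximum marginal likelihood estimate
    return np.mean(tau2_hat == 0.0)

print("P(tau_hat^2 = 0), theta_0 = mu     :", degeneracy_rate(mu))      # ~ P(chi^2_1 <= 1) = 0.68, for any n
print("P(tau_hat^2 = 0), theta_0 = mu + 1 :", degeneracy_rate(mu + 1))  # -> 0 as n grows
```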
Such behaviour is not restricted to the Gaussian distribution and applies more generally to location-scale families of priors. If the model admits a maximum likelihood estimator $\hat\theta_n$ and the prior density is of the form $\tau^{-1}g((\cdot-\mu)/\tau)$, with $\lambda=(\mu,\tau)$, for some unimodal density $g$ that attains its maximum at zero, then $\hat\lambda_n=(\hat\theta_n,0)$ and the empirical Bayes posterior is a point mass at $\hat\theta_n$. These families of priors should not be jointly used with maximum marginal likelihood empirical Bayes procedures.
A way to refine the analysis, to better understand the impact of a data-dependent prior on the posterior distribution, is to study frequentist strong merging in the sense of [20]. Two sequences of posterior distributions are said to merge strongly if their total variation distance converges to zero almost surely wrt $P_{\theta_0}^{\infty}$.
Strong merging of Bayesian posterior distributions in nonparametric contexts is often
impossible since pairs of priors are typically singular. Petrone et al. [30] study the problem
for regular parametric models, comparing Bayesian posterior distributions and empirical
Bayes posterior distributions based on the maximum marginal likelihood estimator of λ.
Informally, their results show that strong merging may hold for some true values $\theta_0$, but may fail for others. That is, for values of $\theta_0$ in an appropriate set, say $\Theta_0$, the empirical Bayes posterior distribution strongly merges with any Bayesian posterior distribution corresponding to a prior distribution $q$ which is continuous and bounded at $\theta_0$,
$$\|\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)-\Pi_q(\cdot\mid X_1,\dots,X_n)\|_{TV}\to 0 \qquad (1)$$
almost surely wrt $P_{\theta_0}^{\infty}$, where $\|\cdot\|_{TV}$ denotes the total variation distance. However, for $\theta_0\notin\Theta_0$, strong merging fails: the empirical Bayes posterior can indeed be singular wrt any smooth Bayesian posterior distribution.
More precisely, suppose that the prior distribution has density $\pi(\cdot\mid\lambda)$ with respect to some dominating measure and includes $\theta_0$ in its Kullback–Leibler support. Furthermore, suppose that the parameter space is totally bounded. Assume that conditions hold that guarantee consistency for the empirical Bayes and the Bayesian posterior distributions. Under such assumptions, and some additional requirements that are satisfied by regular parametric models, it can be shown ([30], Theorem 1) that the maximum marginal likelihood estimator $\hat\lambda_n$ converges to a value $\lambda^*$ (here assumed to be unique for brevity) such that $\pi(\theta_0\mid\lambda^*)\ge\pi(\theta_0\mid\lambda)$ for every $\lambda$ in the hyper-parameter space $\Lambda$. Such a value can be interpreted as the "oracle value" of the hyper-parameter, that is, the value of the hyper-parameter for which the prior most favors the true value $\theta_0$. Furthermore, it is proved that if $\theta_0$ is such that $\pi(\theta_0\mid\lambda^*)<\infty$, then strong merging holds, namely
$$\|\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)-\Pi(\cdot\mid\lambda^*,X_1,\dots,X_n)\|_{TV}\to 0 \qquad (2)$$
almost surely wrt $P_{\theta_0}^{\infty}$. Since $\|\Pi(\cdot\mid\lambda^*,X_1,\dots,X_n)-\Pi_q(\cdot\mid X_1,\dots,X_n)\|_{TV}$ goes to zero $P_{\theta_0}^{\infty}$-almost surely for any prior $q$ that is continuous and bounded at $\theta_0$ ([20], Theorem 1.3.1), by the triangle inequality one has (1). However, if $\theta_0$ is such that $\pi(\theta_0\mid\lambda^*)=\infty$, then strong merging fails. This is the case if, for such $\theta_0$, $\lambda^*$ is on the boundary of $\Lambda$ and the prior distribution degenerates at $\theta_0$ as $\lambda\to\lambda^*$. In this case, the empirical Bayes posterior distribution is degenerate too, thus it is singular wrt any smooth Bayesian posterior.
Result (1), which holds only in the non-degenerate case, ensures that the empirical Bayes posterior distribution will be close in total variation to the Bayesian posterior, whatever the prior distribution. But this result only provides a first-order asymptotic comparison that does not distinguish among Bayesian solutions. In fact, from (2), one could expect that the empirical Bayes approach actually gives a closer approximation of an efficient Bayesian solution, efficient in the sense of using the prior distribution that most favors the true value $\theta_0$.
Higher-order asymptotic results are beyond the scope of this note, but we will return to this
issue in Sect. 5, providing a simple, but we believe insightful, example.
4 Empirical Bayes selection of nuisance parameters
Another relevant context of application of empirical Bayes methods concerns Bayesian analysis in semi-parametric models, where estimation of nuisance parameters is preliminarily considered in order to carry out inference on the component of interest. The framework can be thus described: observations $X_1,\dots,X_n$ are drawn independently from a distribution with density $p_{\psi,\lambda}(\cdot)$,
$$X_i\mid(\psi,\lambda) \overset{\text{iid}}{\sim} p_{\psi,\lambda}(\cdot), \qquad i=1,\dots,n,$$
where $\psi\in\mathbb{R}^k$ is the parameter of interest and $\lambda\in\mathbb{R}$ a nuisance parameter. Bayesian inference with nuisance parameters does not conceptually present particular difficulties: a prior distribution is assigned to the overall parameter $(\psi,\lambda)$,
$$\Pi(d\psi,d\lambda)=\Pi(d\psi\mid\lambda)\,\Pi(d\lambda),$$
and inference on $\psi$ is carried out by marginalizing the joint posterior distribution $\Pi(d\psi,d\lambda\mid X_1,\dots,X_n)$. However, this can be computationally cumbersome. A common approach is thus to plug in some estimator $\hat\lambda_n$ of $\lambda$ and use a data-dependent prior
$$\Pi(d\psi\mid\hat\lambda_n)\times\delta_{\hat\lambda_n}(d\lambda). \qquad (3)$$
We highlight the difference between the present context and the one described in Sect. 3: there, $\hat\lambda_n$ is used to estimate a hyper-parameter $\lambda$, a parameter of the prior only, whereas here $\hat\lambda_n$ is used to estimate a $\lambda$ which is, in the first place, a component of the overall parameter of the model: it is part of the model.
The results developed in [30] can be extended to prove the asymptotic equivalence, in terms of weak merging, between the empirical Bayes posterior and any Bayesian posterior for $\psi$, provided $\hat\lambda_n$ is a sequence of consistent estimators for the true value $\lambda_0$ corresponding to the density $p_{\psi_0,\lambda_0}$ generating the observations. It is known from Proposition 1 in [30] that a necessary and sufficient condition for weak merging is that the empirical Bayes posterior for $\psi$ is consistent at $\psi_0$, namely, $\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)$ weakly converges to a point mass at $\psi_0$, $\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)\Rightarrow\delta_{\psi_0}$, along almost all sample paths when sampling from the infinite product measure $P_{\psi_0,\lambda_0}^{\infty}$. To illustrate how this assertion can be shown, we present an example on partially linear regression.
Example 4 Suppose we observe a random sample from the distribution of $X=(Y,V,W)$ in which, for some unobservable error $e$ independent of $(V,W)$, the relationship among the components is described by
$$Y=\psi V+\eta_\lambda(W)+e.$$
The response variable $Y$ is a regression on $(V,W)$ that is linear in $V$ with slope $\psi$, but may depend on $W$ in a nonlinear way through $\eta_\lambda(W)$, which represents an additive contamination of the linear structure of $Y$. We assume that $V$ and $W$ take values in $[0,1]$ and that, for $\lambda\in\mathbb{R}$, the function $w\mapsto\eta_\lambda(w)$ is known up to $\lambda$. If the error $e$ is assumed to be normal, $e\sim N(0,\sigma_0^2)$ with known variance $\sigma_0^2$, then the density of $X$ is given by
$$p_{\psi,\lambda}(x)=\phi_{\sigma_0}\bigl(y-\psi v-\eta_\lambda(w)\bigr)\,p_{V,W}(v,w),\qquad x=(y,v,w)\in\mathbb{R}\times[0,1]^2,$$
where $\phi_{\sigma_0}(\cdot)=\sigma_0^{-1}\phi(\cdot/\sigma_0)$, with $\phi$ the standard Gaussian density, and $p_{V,W}$ is the joint density of $(V,W)$. Consider an empirical Bayes approach that estimates $\lambda$ by any sequence $\hat\lambda_n$ of consistent estimators for $\lambda_0$ and uses the empirical Bayes posterior $\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)$, corresponding to a prior of the form (3), to carry out inference on $\psi$. The empirical Bayes posterior $\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)$ weakly merges with the posterior for $\psi$ corresponding to any genuine prior on $(\psi,\lambda)$ if only $\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)$ is consistent at $\psi_0$. We show that, for every $\delta>0$, the empirical Bayes posterior probability $\Pi(|\psi-\psi_0|>\delta\mid\hat\lambda_n,X_1,\dots,X_n)\to 0$ in $P^n_{\psi_0,\lambda_0}$-probability. Let $m_{\psi,\lambda}(v,w)=\psi v+\eta_\lambda(w)$. Assume there exists a constant $B>0$ such that $\sup_{\psi,\lambda}\|m_{\psi,\lambda}\|_\infty\le B$. Since the Hellinger distance $h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})\ge (E[V^2])^{1/2}e^{-B^2/4\sigma_0^2}\,|\psi-\psi_0|/2\sigma_0$, the inclusion $\{\psi:\ |\psi-\psi_0|>\delta\}\subseteq\{\psi:\ h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\}$ holds for a suitable positive constant $M$. To prove the claim, it is therefore enough to study the asymptotic behavior of $\Pi(h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\mid\hat\lambda_n,X_1,\dots,X_n)$ which, if the prior for $\psi$, given $\lambda$, belongs to a location family of $\nu$-densities generated by $\pi_0(\cdot)$, i.e., $\pi(\cdot\mid\lambda)=\pi_0(\cdot-\lambda)$, is equal to
$$\Pi\bigl(h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\mid\hat\lambda_n,X_1,\dots,X_n\bigr)
=\frac{\displaystyle\int_{h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta}\ \prod_{i=1}^{n}\phi_{\sigma_0}\bigl(Y_i-m_{\psi,\hat\lambda_n}(V_i,W_i)\bigr)\,\pi_0(\psi-\hat\lambda_n)\,\nu(d\psi)}{\displaystyle\int\prod_{i=1}^{n}\phi_{\sigma_0}\bigl(Y_i-m_{\psi,\hat\lambda_n}(V_i,W_i)\bigr)\,\pi_0(\psi-\hat\lambda_n)\,\nu(d\psi)}
=\frac{N(X_1,\dots,X_n)}{D(X_1,\dots,X_n)}.$$
Assume there exists a continuous function $g:[0,1]\to\mathbb{R}$ and $\alpha>0$ such that, for any $\lambda,\lambda_0$, the difference $|\eta_\lambda(w)-\eta_{\lambda_0}(w)|\le|g(w)|\,\|\lambda-\lambda_0\|^\alpha\le\|g\|_\infty\,\|\lambda-\lambda_0\|^\alpha$ for every $w\in[0,1]$. Then, on the event $\Omega_n=\bigl(-a_n\le\min_i Y_i\le\max_i Y_i\le a_n,\ \|\hat\lambda_n-\lambda_0\|\le u_n\bigr)$, which, for sequences $u_n\to 0$ and $a_n=O((\log n)^\kappa)$, $\kappa>0$, has probability $P^n_{\psi_0,\lambda_0}(\Omega_n)=1+o(1)$, we have
$$N(X_1,\dots,X_n)/D(X_1,\dots,X_n)\le \exp\bigl\{2n\bigl(u_n+\|g\|_\infty u_n^\alpha\bigr)(a_n+B)+nu_n^2\bigr\}\times\Pi\bigl(h(p_{\psi,\lambda_0},p_{\psi_0,\lambda_0})>M\delta\mid\lambda_0,X_1,\dots,X_n\bigr).$$
If the Bayesian posterior $\Pi(\cdot\mid\lambda_0,X_1,\dots,X_n)$ is Hellinger consistent at $P_{\psi_0,\lambda_0}$ and the convergence is exponentially fast, then the empirical Bayes posterior $\Pi(\cdot\mid\hat\lambda_n,X_1,\dots,X_n)$ is also consistent at $P_{\psi_0,\lambda_0}$ and the claim that $\Pi(|\psi-\psi_0|>\delta\mid\hat\lambda_n,X_1,\dots,X_n)\to 0$ follows.
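The following sketch turns Example 4 into a toy computation. All specifics are ours and are not in the article: we take $\eta_\lambda(w)=\lambda w^2$, estimate the nuisance parameter $\lambda$ by ordinary least squares (a consistent estimator in this setting), and then compute the empirical Bayes posterior for $\psi$ under a Gaussian prior on $\psi$ given $\lambda$.

```python
# Empirical Bayes inference on the slope psi in a partially linear model
# Y = psi * V + eta_lambda(W) + e (Example 4). Illustrative choices:
# eta_lambda(w) = lambda * w^2, lambda estimated by OLS, prior psi ~ N(0, 10).
import numpy as np

rng = np.random.default_rng(4)
n, psi0, lam0, sigma0 = 500, 1.5, 3.0, 0.5
V, W = rng.uniform(size=n), rng.uniform(size=n)
Y = psi0 * V + lam0 * W**2 + rng.normal(0.0, sigma0, size=n)

# Step 1: consistent plug-in estimate of the nuisance parameter lambda
design = np.column_stack([V, W**2])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
lam_hat = coef[1]

# Step 2: empirical Bayes posterior for psi given lambda_hat (conjugate Gaussian update)
resid = Y - lam_hat * W**2                       # remove the estimated nonlinear component
prec = 1.0 / 10.0 + np.sum(V**2) / sigma0**2     # posterior precision of psi
post_mean = (np.sum(V * resid) / sigma0**2) / prec
post_sd = np.sqrt(1.0 / prec)
print("lambda_hat = %.3f   EB posterior for psi: N(%.3f, %.3f^2)" % (lam_hat, post_mean, post_sd))
```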
5 Higher-order comparisons and finite-sample properties
We return to the discussion at the end of Sect. 3, providing a simple example. Although limited
to the Gaussian case, this gives some hints about finer comparisons between Bayesian and
empirical Bayes posterior distributions. The evidence in this example could be extended to
more general contexts, such as Bayesian inference and variable selection in linear regression
with g-priors.
As discussed, an empirical Bayes choice of the prior hyper-parameters in Bayesian infer-
ence is not rigorous, but can be of interest as an approximation of a computationally more
involved hierarchical Bayesian posterior distribution. In fact, the results recalled in Sect. 3
show that, even for regular parametric models, the empirical Bayes posterior distribution can
be singular wrt any smooth Bayesian posterior, depending on the form of the prior distribution
and on the nature of the prior hyper-parameters. Thus, care is needed when using empirical
Bayes methods as an approximation of Bayesian solutions. On the positive side, these results
show that, in non-degenerate cases, the empirical Bayes posterior distribution does merge
strongly with any smooth Bayesian posterior distribution. However, this first-order asymp-
totic comparison does not distinguish among Bayesian posterior distributions arising from
different priors. The aim here is to grasp some evidence for finer comparisons. We explore
the following two issues.
Asymptotically, in regular parametric models, any smooth Bayesian posterior distribution is approximated by a Gaussian distribution centered at the maximum likelihood estimate $\hat\theta_n$, by the Bernstein–von Mises theorem. Strong merging of Bayesian and empirical Bayes posterior distributions implies that, when a Bernstein–von Mises behavior holds for the Bayesian posterior distribution, it also holds for the empirical Bayes posterior, which is a particularly interesting implication in nonparametric problems. In fact, one would expect that the empirical Bayes posterior distribution can provide a closer approximation to a hierarchical Bayesian posterior than the Bernstein–von Mises Gaussian distribution.
More precisely, based on the results of Sect. 3, one would conjecture that the hierarchical posterior distribution concentrates around the oracle value $\lambda^*$ of the prior hyper-parameters for increasing sample sizes and that, since $\hat\lambda_n\to\lambda^*$, the empirical Bayes posterior distribution $\pi(\theta\mid\hat\lambda_n,X_1,\dots,X_n)$ and the hierarchical Bayesian posterior distribution $\pi_h(\theta\mid X_1,\dots,X_n)$ can be close even for moderate sample sizes. The following example suggests that, although this is the case asymptotically by the results in Sect. 3, the posterior distribution $\pi_h(\lambda\mid X_1,\dots,X_n)$ incorporates the sample information only slowly, so that, for finite samples, the empirical Bayes posterior distribution $\pi(\theta\mid\hat\lambda_n,X_1,\dots,X_n)$ is a close approximation of $\pi_h(\theta\mid X_1,\dots,X_n)$ only if the prior distribution on $\lambda$ is concentrated enough around the oracle value $\lambda^*$. In other words, the example suggests that, in the non-degenerate case, the empirical Bayes posterior distribution is a higher-order approximation of the posterior distribution of a "well informed" Bayesian researcher whose prior highly favors the true value of $\theta$.
Example 5 Consider the simple example of the Gaussian conjugate model introduced in Sect. 3, with now a hierarchical specification of the prior. Let $X_i\mid\theta \sim N(\theta,\sigma^2)$ independently, with $\sigma^2$ known. Let $\theta\mid\lambda \sim N(0,\lambda)$ and $1/\lambda \sim G(\alpha,\beta)$, a Gamma distribution where $\beta>0$ is the scale parameter. Then, $E(\lambda)=\beta/(\alpha-1)$ and $V(\lambda)=\beta^2/[(\alpha-1)^2(\alpha-2)]$. The prior of $\theta$ obtained by integrating out $\lambda$ is a Student's $t$ with zero mean, $2\alpha$ degrees of freedom and scaling factor $\beta/\alpha$. The prior variance of $\theta$ equals the prior guess on the hyper-parameter $\lambda$, $V(\theta)=E(\lambda)$. Although this is a simple model, computations of the posterior distribution of $\theta$ become analytically complicated. The conditional distribution of $\theta$, given $\lambda$ and the data, is
$$\theta\mid(\lambda,x_1,\dots,x_n)\sim N\bigl((n\lambda+\sigma^2)^{-1}n\lambda\bar x_n,\ (n\lambda+\sigma^2)^{-1}\sigma^2\lambda\bigr),$$
and the posterior distribution of $\theta$ is obtained by integrating $\lambda$ out wrt its posterior distribution $\pi_h(\lambda\mid x_1,\dots,x_n)$. This integration step is not analytically manageable, and approximation by Markov chain Monte Carlo (MCMC) is usually employed.

The empirical Bayes selection of $\lambda$ is an attractive, computationally simpler, shortcut. Estimation of $\lambda$ via maximum marginal likelihood gives $\hat\lambda_n=\max\{0,\ \bar x_n^2-\sigma^2/n\}$. Thus, the maximum marginal likelihood estimator $\hat\lambda_n$ may take the value zero, on the boundary of $\Lambda=(0,\infty)$, with positive probability. If $\hat\lambda_n=0$, then the empirical Bayes prior distribution of $\theta$ is a point mass at the prior guess and the resulting posterior distribution is degenerate. As seen in Sect. 3, if the true value $\theta_0=E[\theta]=0$, the probability of degeneracy remains positive even when $n\to\infty$, thus determining an asymptotic divergence between the empirical Bayes posterior distribution and the hierarchical Bayesian posterior distribution. If $\theta_0\neq 0$, such probability goes to zero and strong merging holds. Interest is in investigating higher-order approximations in this case.
We first focus on point estimation with quadratic loss. The Bayes estimate is the posterior expectation
$$E[\theta\mid x_1,\dots,x_n]=\frac{\displaystyle\int\bigl(1+\theta^2/2\beta\bigr)^{-(2\alpha+1)/2}\exp\Bigl\{-n\Bigl(-\tfrac{1}{n}\log\theta+\tfrac{1}{2\sigma^2}(\theta-\bar x_n)^2\Bigr)\Bigr\}\,d\theta}{\displaystyle\int\bigl(1+\theta^2/2\beta\bigr)^{-(2\alpha+1)/2}\exp\Bigl\{-\tfrac{n}{2\sigma^2}(\theta-\bar x_n)^2\Bigr\}\,d\theta},\qquad(4)$$
for which a closed-form expression is not available.
Table 1  Comparing empirical Bayes and Laplace point estimates as approximations of the hierarchical Bayes point estimate. Simulated data with θ0 = 2 and σ² = 1; Monte Carlo standard deviations in parentheses.

(a) n = 20, x̄n = 1.835
                                             E[λ]=1/3        E[λ]=1          E[λ]=3          E[λ]=4          E[λ]=10
Ê_LC[θ|x1,...,xn]  (Laplace appr.)           1.797           1.769           1.749           1.745           1.738
E[θ|x1,...,xn]  (hierarch. Bayes, MCMC)      1.683 (0.0029)  1.750 (0.0031)  1.801 (0.0033)  1.805 (0.0034)  1.821 (0.0033)
E[λ|x1,...,xn]  (Bayes, MCMC)                0.074 (0.0051)  1.301 (0.0113)  3.018 (0.0261)  3.902 (0.0332)  8.915 (0.0721)
λ̂n  (maximum marginal lik.)                  3.320           3.320           3.320           3.320           3.320
E[θ|λ̂n,x1,...,xn]  (empirical Bayes)         1.809           1.809           1.809           1.809           1.809

(b) n = 50, x̄n = 2.009
                                             E[λ]=1/3        E[λ]=1          E[λ]=3          E[λ]=4          E[λ]=10
Ê_LC[θ|x1,...,xn]  (Laplace appr.)           1.994           1.982           1.972           1.970           1.967
E[θ|x1,...,xn]  (hierarch. Bayes, MCMC)      1.951 (0.0018)  1.974 (0.0019)  1.993 (0.0022)  1.999 (0.0022)  2.005 (0.0024)
E[λ|x1,...,xn]  (Bayes, MCMC)                0.833 (0.0074)  1.403 (0.0123)  3.117 (0.0251)  4.012 (0.0342)  9.103 (0.0911)
λ̂n  (maximum marginal lik.)                  4.016           4.016           4.016           4.016           4.016
E[θ|λ̂n,x1,...,xn]  (empirical Bayes)         1.999           1.999           1.999           1.999           1.999
[Figure 1: four-panel posterior density plots, omitted in this text-only version]
Fig. 1 Comparing empirical Bayes and hierarchical Bayesian posterior densities. Simulated data from a Gaussian distribution $N(2,6)$; $n=20$; $\bar x_n=1.667$. $E[\lambda]=1/3$ (first row) and $E[\lambda]=1$ (second row). First column: MCMC estimate of the posterior density of $\lambda$; the full square denotes $E(\lambda\mid x_1,\dots,x_n)$ and the empty square denotes the maximum marginal likelihood estimate $\hat\lambda_n$. Second column: hierarchical Bayesian posterior density of $\theta$ (MCMC estimate; solid curve), empirical Bayes posterior density of $\theta$ (dashed curve) and limit Gaussian density $N(\bar x_n,\sigma^2/n)$ (bold solid curve). The empty triangle denotes $E[\theta\mid x_1,\dots,x_n]$; the star denotes $E[\theta\mid\hat\lambda_n,x_1,\dots,x_n]$; the full triangle denotes the sample mean $\bar x_n$.
Its empirical Bayes approximation is obtained by plugging $\hat\lambda_n$ into the expression of $E[\theta\mid\lambda,x_1,\dots,x_n]$:
$$E[\theta\mid\hat\lambda_n,x_1,\dots,x_n]=\frac{n\hat\lambda_n}{n\hat\lambda_n+\sigma^2}\,\bar x_n=\Bigl(1-\frac{\sigma^2}{n\hat\lambda_n+\sigma^2}\Bigr)\bar x_n.\qquad(5)$$
We may expect that
$$E[\theta\mid x_1,\dots,x_n]=\int E[\theta\mid\lambda,x_1,\dots,x_n]\,\pi_h(\lambda\mid x_1,\dots,x_n)\,d\lambda = E[\theta\mid\hat\lambda_n,x_1,\dots,x_n]+O(n^{-k}),$$
since, as $n$ increases, $\hat\lambda_n$ tends to the oracle value $\lambda^*$ and $\pi_h(\lambda\mid x_1,\dots,x_n)$ could collapse to a point mass at $\lambda^*$. It is interesting to investigate the order of the error term $O(n^{-k})$. To grasp some evidence, we compare the empirical Bayes point estimate with the Laplace approximation developed in [24], p. 270,
$$\hat E_{LC}[\theta\mid x_1,\dots,x_n]=\biggl[1-\frac{(2\alpha+1)/(2\alpha)}{1+\bar x_n^2/(2\beta)}\,\frac{\sigma^2}{n}\biggr]\bar x_n,\qquad(6)$$
which is a special case of the Laplace approximation with error term $O(n^{-3/2})$.

[Figure 2: four-panel posterior density plots, omitted in this text-only version]
Fig. 2 Comparing empirical Bayes and hierarchical Bayesian posterior densities. Simulated data from a Gaussian distribution $N(2,6)$; $n=20$; $\bar x_n=1.667$. $E[\lambda]=3$ (first row) and $E[\lambda]=4$ (second row). Legend as in Fig. 1.
Table 1 compares $E[\theta\mid\hat\lambda_n,x_1,\dots,x_n]$ and $\hat E_{LC}[\theta\mid x_1,\dots,x_n]$ as approximations of the hierarchical Bayes point estimate $E[\theta\mid x_1,\dots,x_n]$ in a simulation study where $\theta_0=2$ and $\sigma^2=1$. Along the columns, the value of $\alpha$ is fixed at 4, while $\beta$ varies, thus resulting in different prior guesses $E[\lambda]$. Since $E[\lambda]=V(\theta)$, increasing values of $\beta$ correspond to a smaller precision of the hierarchical prior. When $\beta=12$, the prior guess equals the oracle value, i.e., $E[\lambda]=\lambda^*=4$. In this case, the empirical Bayes point estimate provides a clearly better approximation of $E[\theta\mid x_1,\dots,x_n]$ than $\hat E_{LC}[\theta\mid x_1,\dots,x_n]$. For example, Table 1b shows how $E[\theta\mid x_1,\dots,x_n]$ and $E[\theta\mid\hat\lambda_n,x_1,\dots,x_n]$ coincide up to the thousandths digit for $n=50$ and $E[\lambda]=4$. This suggests a higher-order form of merging between the empirical Bayes posterior distribution and the hierarchical posterior distribution of a "more informed" Bayesian statistician, i.e., the one who assigns a hyper-prior such that $E[\lambda]=\lambda^*$. In order to shed light on this point, we now consider density approximation.

We first want to check whether the empirical Bayes posterior distribution provides a better approximation of the hierarchical Bayesian posterior distribution than the Bernstein–von Mises Gaussian approximating distribution, $N(\bar x_n,\sigma^2/n)$. This comparison has been investigated in several simulation studies, each one giving similar indications. We report the results for simulated data from a Gaussian distribution with mean $\theta_0=2$ and variance
$\sigma^2=6$ (Figs. 1, 2). The hierarchical Bayesian posterior densities are computed by Gibbs sampling. The first column in the plots shows the posterior density $\pi_h(\lambda\mid x_1,\dots,x_n)$ of $\lambda$. This appears to concentrate only slowly towards the oracle value $\lambda^*=4$. The second column shows the MCMC approximation of the hierarchical Bayesian posterior density of $\theta$, together with the empirical Bayes posterior density (dashed curve) and the limit Gaussian density $N(\bar x_n,\sigma^2/n)$ (bold curve). What emerges is that, for a prior guess of $\lambda$ close to the oracle value, the empirical Bayes posterior density provides a better approximation of the hierarchical Bayesian posterior density already for the small sample size $n=20$. This seems to confirm the previously formulated conjecture: for finite sample sizes, empirical Bayes provides a good approximation of the hierarchical Bayesian procedure adopted by the more informed statistician, and strong merging may hold up to a higher-order approximation.
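The comparisons of this section are straightforward to reproduce. The sketch below is ours: the data, the number of MCMC iterations and the random seed are illustrative, and $\lambda$ is given the inverse-Gamma prior implied by $E(\lambda)=\beta/(\alpha-1)$ as stated in Example 5. It runs the two-step Gibbs sampler for the hierarchical model and prints the hierarchical posterior mean of $\theta$ next to the empirical Bayes plug-in estimate (5) and the Laplace-type approximation (6).

```python
# Gibbs sampler for the hierarchical model of Example 5 (X_i | theta ~ N(theta, sigma^2),
# theta | lambda ~ N(0, lambda), lambda ~ inverse-Gamma(alpha, beta)), compared with the
# empirical Bayes plug-in posterior mean (5) and the Laplace-type approximation (6).
import numpy as np

rng = np.random.default_rng(5)
theta0, sigma2, n = 2.0, 1.0, 20
x = rng.normal(theta0, np.sqrt(sigma2), size=n)
xbar = x.mean()
alpha, beta = 4.0, 12.0                     # prior guess E[lambda] = beta/(alpha-1) = 4 (the oracle value)

# --- hierarchical Bayes via Gibbs sampling ---
iters, burn = 20_000, 2_000
theta, lam = xbar, beta / (alpha - 1.0)
draws = np.empty(iters)
for t in range(iters):
    var = sigma2 * lam / (n * lam + sigma2)                       # theta | lambda, data
    theta = rng.normal(n * lam * xbar / (n * lam + sigma2), np.sqrt(var))
    lam = (beta + 0.5 * theta**2) / rng.gamma(alpha + 0.5)        # lambda | theta ~ inv-Gamma(alpha+1/2, beta+theta^2/2)
    draws[t] = theta
hier_mean = draws[burn:].mean()

# --- empirical Bayes plug-in, Eq. (5) ---
lam_hat = max(0.0, xbar**2 - sigma2 / n)
eb_mean = n * lam_hat / (n * lam_hat + sigma2) * xbar

# --- Laplace-type approximation, Eq. (6) ---
laplace_mean = (1.0 - (2 * alpha + 1) / (2 * alpha) / (1 + xbar**2 / (2 * beta)) * sigma2 / n) * xbar

print("hierarchical (Gibbs): %.3f   empirical Bayes: %.3f   Laplace: %.3f"
      % (hier_mean, eb_mean, laplace_mean))
```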
References
1. Antoniak, C.E.: Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems.
Ann. Stat. 2, 1152–1174 (1974)
2. Belitser, E., Enikeeva, F.: Empirical Bayesian test of the smoothness. Math. Methods Stat. 17, 1–18 (2008)
3. Belitser, E., Levit, B.: On the empirical Bayes approach to adaptive filtering. Math. Methods Stat. 12,
131–154 (2003)
4. Berry, D.A., Christensen, R.: Empirical Bayes estimation of a binomial parameter via mixtures of Dirichlet
processes. Ann. Stat. 7, 558–568 (1979)
5. Clyde, M.A., George, E.I.: Flexible empirical Bayes estimation for wavelets. J. R. Stat. Soc. Ser. B 62,
681–698 (2000)
6. Copas, J.B.: Compound decisions and empirical Bayes (with discussion). J. R. Stat. Soc. Ser. B 31,
397–425 (1969)
7. Cui, W., George, E.I.: Empirical Bayes vs. fully Bayes variable selection. J. Stat. Plann. Inference 138,
888–900 (2008)
8. Deely, J.J., Lindley, D.V.: Bayes empirical Bayes. J. Am. Stat. Assoc. 76, 833–841 (1981)
9. Diaconis, P., Freedman, D.: On the consistency of Bayes estimates. Ann. Stat. 14, 1–26 (1986)
10. Efron, B.: Large-scale inference. Empirical Bayes methods for estimation, testing, and prediction. Cam-
bridge University Press, Cambridge (2010)
11. Efron, B., Morris, C.: Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes
case. J. Am. Stat. Assoc. 67, 130–139 (1972a)
12. Efron, B., Morris, C.: Empirical Bayes on vector observations: an extension of Stein’s method. Biometrika
59, 335–347 (1972b)
13. Efron, B., Morris, C.: Stein’s estimation rule and its competitors-an empirical Bayes approach. J. Am.
Stat. Assoc. 68, 117–130 (1973a)
14. Efron, B., Morris, C.: Combining possibly related estimation problems. (With discussion by Lindley, D.V.,
Copas, J.B., Dickey, J.M., Dawid, A.P., Smith, A.F.M., Birnbaum, A., Bartlett, M.S., Wilkinson, G.N.,
Nelder, J.A., Stein, C., Leonard, T., Barnard, G.A., Plackett, R.L.). J. R. Stat. Soc. Ser. B 35, 379–421
(1973b)
15. Efron, B., Morris, C.N.: Data analysis using Stein’s estimator and its generalizations. J. Am. Stat. Assoc.
70, 311–319 (1973c)
16. Favaro, S., Lijoi, A., Mena, R.H., Prünster, I.: Bayesian nonparametric inference for species variety with
a two parameter Poisson–Dirichlet process prior. J. R. Stat. Soc. Ser. B 71, 993–1008 (2009)
17. Fisher, R.A., Corbet, A.S., Williams, C.B.: The relation between the number of species and the number
of individuals in a random sample of an animal population. J. Anim. Ecol. 12, 42–58 (1943)
18. Forcina, A.: Gini’s contributions to the theory of inference. Int. Stat. Rev. 50, 65–70 (1982)
19. George, E.I., Foster, D.P.: Calibration and empirical Bayes variable selection. Biometrika 87, 731–747
(2000)
20. Ghosh, J.K., Ramamoorthi, R.V.: Bayesian nonparametrics. Springer, New York (2003)
21. Gini, C.: Considerazioni sulla probabilità a posteriori e applicazioni al rapporto dei sessi nelle nascite
umane. Studi Economico-Giuridici. Università di Cagliari. III. Reprinted in Metron, vol. 15, pp. 133–172
(1911)
22. Good, I.J.: Breakthroughs in statistics: foundations and basic theory. In: Johnson, N.L., Kotz, S. (eds.)
Introduction to Robbins (1992) An empirical Bayes approach to statistics, pp. 379–387. Springer, Berlin
(1995)
23. James, W., Stein, C.: Estimation with quadratic loss. In: Proceedings of Fourth Berkeley Symposium on
Mathematics Statistics and Probability, vol. 1, pp. 361–379. University of California Press, California
(1961)
24. Lehmann, E.L., Casella, G.: Theory of point estimation, 2nd edn. Springer, New York (1998)
25. Liang, F., Paulo, R., Molina, G., Clyde, M.A., Berger, J.O.: Mixtures of g-priors for Bayesian variable
selection. J. Am. Stat. Assoc. 103, 410–423 (2008)
26. Liu, J.S.: Nonparametric hierarchical Bayes via sequential imputation. Ann. Stat. 24, 911–930 (1996)
27. Maritz, J.S., Lwin, T.: Empirical Bayes methods, 2nd edn. Chapman and Hall, London (1989)
28. McAuliffe, J.D., Blei, D.M., Jordan, M.I.: Nonparametric empirical Bayes for the Dirichlet process
mixture model. Stat. Comput. 16, 5–14 (2006)
29. Morris, C.N.: Parametric empirical Bayes inference: theory and applications. J. Am. Stat. Assoc. 78,
47–55 (1983)
30. Petrone, S., Rousseau, J., Scricciolo, C.: Bayes and empirical Bayes: do they merge? Biometrika 101(2),
285–302 (2014)
31. Robbins, H.: An empirical Bayes approach to statistics. In: Proceedings of Third Berkeley Symposium
on Mathematics, Statistics and Probability, vol. 1, pp. 157–163. University of California Press, California
(1956)
32. Scott, J.G., Berger, J.O.: Bayes and empirical-Bayes multiplicity adjustment in the variable-selection
problem. Ann. Stat. 38, 2587–2619 (2010)
33. Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In:
Proceedings of Third Berkeley Symposium on Mathematics, Statistics and Probability, vol. 1, pp. 197–
206. University of California Press, California (1956)
34. Szabó, B.T., van der Vaart, A.W., van Zanten, J.H.: Empirical Bayes scaling of Gaussian priors in the
white noise model. Electron. J. Stat. 7, 991–1018 (2013)
... Implementers of naïve Bayes and other Bayesian network models usually rely on noninformative uniform priors such as Bayes-Laplace and Jeffreys' priors, widely used by Bayesian statisticians (Jaynes 1968;Good 1983). Assigning meaningful priors is always a challenging task (Petrone et al. 2014). When dealing with Big Data or data on which we have no insight, assigning meaningful priors becomes infeasible; hence, the reliance of the machine learning community on these rather arbitrary, noninformative priors is understandable. ...
... The traditional approach rejects the idea of determining priors from data (Deely and Lindley 1981), while others do not have such a strict adherence to that philosophy (Casella 1985;Lwin and Maritz 1989). The traditional approach either assigns priors based on belief and prior knowledge, which can be challenging (Petrone et al. 2014), or assigns noninformative uniform priors as described above. ...
Preprint
Full-text available
Noninformative uniform priors are staples of Bayesian inference, especially in Bayesian machine learning. This study challenges the assumption that they are optimal and their use in Bayesian inference yields optimal outcomes. Instead of using arbitrary noninformative uniform priors, we propose a machine learning based alternative method, learning optimal priors from data by maximizing a target function of interest. Applying na\"ive Bayes text classification methodology and a search algorithm developed for this study, our system learned priors from data using the positive predictive value metric as the target function. The task was to find Wikipedia articles that had not (but should have) been categorized under certain Wikipedia categories. We conducted five sets of experiments using separate Wikipedia categories. While the baseline models used the popular Bayes-Laplace priors, the study models learned the optimal priors for each set of experiments separately before using them. The results showed that the study models consistently outperformed the baseline models with a wide margin of statistical significance (p < 0.001). The measured performance improvement of the study model over the baseline was as high as 443% with the mean value of 193% over five Wikipedia categories.
... Previous studies conducted by Yan and Gendai [36] and Shi et al. [37] employed the MLE method to estimate the hyper-parameter of the prior distribution. This estimation was utilized to analyze the Bayesian reliability quantitative indexes of the cold standby system, for more details see [38]. In (14) and (17), the hyper-parameter a is an unknown constant, which renders direct utilization of η impossible. ...
Article
Full-text available
This study addresses the issue of estimating the shape parameter of the inverted exponentiated Rayleigh distribution, along with the assessment of reliability and failure rate, by utilizing Type-I progressive hybrid censored data. The study explores the estimators based on maximum likelihood, Bayes, and empirical Bayes methodologies. Additionally, the study focuses on the development of Bayes and empirical Bayes estimators with balanced loss functions. A concrete example based on actual data from the field of medicine is used to illustrate the theoretical insights provided in this study. Monte Carlo simulations are employed to conduct numerical comparisons and evaluate the performance and accuracy of the estimation methods.
... Thus, we can use g(θ|ν * ) as a prior over the parameter θ in our inferences (Liang et al., 2008;Petrone et al., 2014). Note that, in this approach, the choice of the prior is in some sense datadriven, since ν * is obtained by the maximization of p(y|ν) (see also Section 5.3). ...
Article
Full-text available
The application of Bayesian inference for the purpose of model selection is very popular nowadays. In this framework, models are compared through their marginal likelihoods, or their quotients, called Bayes factors. However, marginal likelihoods depend on the prior choice. For model selection, even diffuse priors can be actually very informative, unlike for the parameter estimation problem. Furthermore, when the prior is improper, the marginal likelihood of the corresponding model is undetermined. In this work, we discuss the issue of prior sensitivity of the marginal likelihood and its role in model selection. We also comment on the use of uninformative priors, which are very common choices in practice. Several practical suggestions are discussed and many possible solutions, proposed in the literature, to design objective priors for model selection are described. Some of them also allow the use of improper priors. The connection between the marginal likelihood approach and the well‐known information criteria is also presented. We describe the main issues and possible solutions by illustrative numerical examples, providing also some related code. One of them involving a real‐world application on exoplanet detection. This article is categorized under: Statistical Models > Bayesian Models Statistical Models > Fitting Models Statistical Models > Model Selection
... In any case, note that, from a practical viewpoint, data are often used to specify priors, either informally (e.g., driving the choice of the prior mean) or more formally, as in empirical Bayes. A thorough discussion about empirical Bayes methods can be found in Petrone et al. [28]. Alternative approaches to handling the variances include the use of discount factors to define them or placing priors over them [10,11]. ...
Article
Monitoring is a major step in policy analysis used to assess whether a policy is actually working as desired. We provide a general policy monitoring approach based on Bayesian forecasting models. These are employed to predict the evolution of relevant monitoring variables over time and support expected utility calculations to assess the efficiency of the policy. We illustrate the approach by monitoring the Free Maternal Health Care and MDG5 Acceleration Framework policies aimed to reduce maternal and neonatal mortality in Ghana, using dynamic linear models for forecasting purposes. Despite major investments, results at national level suggest no significant improvement in maternal and neonatal survival between pre- and post-policy periods. However, regional analyses show that gains have actually been attained in certain regions, suggesting possible directions for improvements nationwide.
... Thus, we can use g(θ|ν * ) as a prior over the parameter θ in our inferences (Liang et al., 2008;Petrone et al., 2014). Note that, in this approach, the choice of the prior is in some sense datadriven, since ν * is obtained by the maximization of p(y|ν) (see also Section 5.3). ...
Preprint
Full-text available
The application of Bayesian inference for the purpose of model selection is very popular nowadays. In this framework, models are compared through their marginal likelihoods, or their quotients, called Bayes factors. However, marginal likelihoods depends on the prior choice. For model selection, even diffuse priors can be actually very informative, unlike for the parameter estimation problem. Furthermore, when the prior is improper, the marginal likelihood of the corresponding model is undetermined. In this work, we discuss the issue of prior sensitivity of the marginal likelihood and its role in model selection. We also comment on the use of uninformative priors, which are very common choices in practice. Several practical suggestions are discussed and many possible solutions, proposed in the literature, to design objective priors for model selection are described. Some of them also allow the use of improper priors. The connection between the marginal likelihood approach and the well-known information criteria is also presented. We describe the main issues and possible solutions by illustrative numerical examples, providing also some related code. One of them involving a real-world application on exoplanet detection.
... 29,31 When the sample size of random sample goes to infinity, p → under mild regularity conditions. [38][39][40] Suppose the eBass primary thresholdẑ (i) exists on support Ω . In the empirical Bayes estimated objective func- ...
Article
Clusterwise statistical inference is the most widely used technique for functional magnetic resonance imaging (fMRI) data analyses. Clusterwise statistical inference consists of two steps: (i) primary thresholding that excludes less significant voxels by a prespecified cut-off (eg, p < . 001 ); and (ii) clusterwise thresholding that controls the familywise error rate caused by clusters consisting of false positive suprathreshold voxels. The selection of the primary threshold is critical because it determines both statistical power and false discovery rate (FDR). However, in most existing statistical packages, the primary threshold is selected based on prior knowledge (eg, p < . 001 ) without taking into account the information in the data. In this article, we propose a data-driven approach to algorithmically select the optimal primary threshold based on an empirical Bayes framework. We evaluate the proposed model using extensive simulation studies and real fMRI data. In the simulation, we show that our method can effectively increase statistical power by 20% to over 100% while effectively controlling the FDR. We then investigate the brain response to the dose-effect of chlorpromazine in patients with schizophrenia by analyzing fMRI scans and generate consistent results.
... This can be seen as an empirical Bayes approach which approximates the fully Bayesian approach that we also consider. Petrone et al. (2014) provide a discussion on empirical Bayesian methods, including asymptotic results. The use of MAP estimates for hyperparameters is becoming increasingly popular in Bayesian nonparametrics, see e.g. ...
Article
We present a novel Bayesian nonparametric model for regression in survival analysis. Our model builds on the classical neutral to the right model of Doksum (1974) and on the Cox proportional hazards model of Kim and Lee (2003). The use of a vector of dependent Bayesian nonparametric priors allows us to efficiently model the hazard as a function of covariates whilst allowing nonproportionality. The model can be seen as having competing latent risks. We characterize the posterior of the underlying dependent vector of completely random measures and study the asymptotic behavior of the model. We show how an MCMC scheme can provide Bayesian inference for posterior means and credible intervals. The method is illustrated using simulated and real data.
Article
In some recent works, the authors have proposed and developed an empirical Bayes framework for frequency estimation. The unknown frequencies in a noisy oscillatory signal are modeled as uniform random variables supported on narrow frequency bands; the bandwidth and the respective band centers are hyperparameters that can be efficiently estimated using techniques from subspace identification. In the current paper, we examine carefully how the estimated frequency prior can be used to produce a Bayesian estimate of the unknown frequencies based on the same data used for hyperparameter estimation. To this end, we formulate the Bayesian maximum a posteriori (MAP) optimization problem and propose an iterative algorithm to compute its solution. We then perform extensive simulations under various parameter configurations, showing that the MAP estimates of the frequencies are asymptotically close to the band centers of the frequency priors. These results provide an attractive link between the conventional Bayesian method and the empirical Bayes method for frequency estimation and, in retrospect, justify the use of the latter.
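The paper's iterative MAP algorithm is not reproduced here; as a toy sketch of the underlying idea, the code below estimates a single frequency by maximizing the likelihood over a grid restricted to a narrow band around a (hypothetically pre-estimated) band center, which is the MAP estimate under a uniform prior on that band. The signal model, noise level, and band are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n = 200
t = np.arange(n)
true_freq = 0.21                               # cycles per sample (hypothetical)
y = np.cos(2 * np.pi * true_freq * t) + 0.5 * rng.normal(size=n)

# Uniform prior on a narrow band around a (hypothetically pre-estimated) center.
band_center, half_width = 0.20, 0.02
grid = np.linspace(band_center - half_width, band_center + half_width, 2001)

def log_likelihood(freq):
    # Gaussian noise; profiling out amplitude and phase by least squares
    # leaves a periodogram-type criterion in the frequency.
    X = np.column_stack([np.cos(2 * np.pi * freq * t), np.sin(2 * np.pi * freq * t)])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return -0.5 * np.sum(resid ** 2)

# With a flat prior on the band, the MAP estimate is the band-restricted MLE.
logpost = np.array([log_likelihood(f) for f in grid])
freq_map = grid[np.argmax(logpost)]
print(f"MAP frequency = {freq_map:.4f} (true {true_freq}, band center {band_center})")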
Article
We consider the common setting where one observes probability estimates for a large number of events, such as default risks for numerous bonds. Unfortunately, even with unbiased estimates, selecting events corresponding to the most extreme probabilities can result in systematically underestimating the true level of uncertainty. We develop an empirical Bayes approach, “Excess Certainty Adjusted Probabilities” (ECAP), using a variant of Tweedie's formula, which updates probability estimates to correct for selection bias. ECAP is a flexible nonparametric method that directly estimates the score function associated with the probability estimates, so it does not need to make any restrictive assumptions about the prior on the true probabilities. ECAP also works well in settings where the probability estimates are biased. We demonstrate through theoretical results, simulations, and an analysis of two real-world data sets that ECAP can provide significant improvements over the original probability estimates.
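ECAP itself operates on probability estimates; as a simpler, generic illustration of the Tweedie-formula idea it builds on, the sketch below applies the classical Gaussian version, E[theta | z] = z + sigma^2 * d/dz log m(z), with the marginal density m estimated from the data. The simulated effects and the use of a kernel density estimator are assumptions for illustration, not the ECAP estimator.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# Hypothetical setup: true effects theta_i, observed z_i = theta_i + N(0,1) noise.
theta = rng.normal(0, 2, size=5000)
z = theta + rng.normal(0, 1, size=5000)

# Estimate the marginal density m(z) by kernel smoothing and differentiate
# log m numerically on a grid.
kde = gaussian_kde(z)
grid = np.linspace(z.min(), z.max(), 400)
log_m = np.log(kde(grid))
dlog_m = np.gradient(log_m, grid)

# Tweedie's formula with sigma^2 = 1: E[theta | z] is approximately z + d/dz log m(z).
tweedie = grid + dlog_m
# Selection-bias check: the most extreme observed z gets pulled back toward the bulk.
z_max = z.max()
print(f"largest z = {z_max:.2f}, Tweedie estimate = {np.interp(z_max, grid, tweedie):.2f}")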
Article
Full-text available
The performance of nonparametric estimators is heavily dependent on a bandwidth parameter. In nonparametric Bayesian methods this parameter can be specified as a hyperparameter of the nonparametric prior, and its value may be made dependent on the data. The empirical Bayes method sets its value by maximizing the marginal likelihood of the data in the Bayesian framework. In this paper we analyze a particular version of this method, common in practice, in which the hyperparameter scales the prior variance. We characterize the behavior of the random hyperparameter and show that a nonparametric Bayes method using it gives optimal recovery over a scale of regularity classes. This scale is limited, however, by the regularity of the unscaled prior. While a prior can be scaled up to make it appropriate for arbitrarily rough truths, scaling cannot increase the nominal smoothness by much. Surprisingly, the standard empirical Bayes method is even more limited in this respect than an oracle, deterministic scaling method. The same can be said for the hierarchical Bayes method.
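A finite-dimensional toy analogue (not the nonparametric setting of the paper) may help fix ideas: in the many-normal-means model the marginal likelihood is available in closed form, so the empirical Bayes value of the prior-variance scaling can be found by a one-dimensional maximization and then plugged into the posterior. All quantities below, data included, are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
# Many-normal-means toy problem: theta_i ~ N(0, tau^2), y_i | theta_i ~ N(theta_i, 1).
tau_true = 1.5
theta = rng.normal(0, tau_true, size=500)
y = theta + rng.normal(0, 1, size=500)

# Marginally y_i ~ N(0, 1 + tau^2); maximize the marginal likelihood over tau^2.
def neg_log_marginal(log_tau2):
    v = 1.0 + np.exp(log_tau2)
    return 0.5 * np.sum(np.log(2 * np.pi * v) + y ** 2 / v)

res = minimize_scalar(neg_log_marginal, bounds=(-10, 10), method="bounded")
tau2_hat = np.exp(res.x)

# Plug the estimated scaling into the posterior: an empirical Bayes shrinkage rule.
shrinkage = tau2_hat / (1.0 + tau2_hat)
theta_hat = shrinkage * y
print(f"tau^2 MMLE = {tau2_hat:.2f} (true {tau_true**2:.2f}), shrinkage factor = {shrinkage:.2f}")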
Article
The analysis of a sequence of decision problems of identical structure is considered, and the similarities and differences between the compound decision and empirical Bayes approaches are explored. The paper includes a brief review of the literature and examines the criteria that have been used to compare compound decision and empirical Bayes procedures with conventional methods. The performance of one of these procedures is examined in detail, and the previously accepted view that the compound decision approach is appropriate for situations outside those covered by empirical Bayes is questioned.
Article
We have two sets of parameters we wish to estimate, and wonder whether the James–Stein estimator should be applied separately to the two sets or once to the combined problem. We show that there is a class of compromise estimators, Bayesian in nature, which will usually be preferred to either alternative. “The difficulty here is to know what problems are to be combined together — why should not all our estimation problems be lumped together into one grand melée?” George Barnard commenting on the James–Stein estimator, 1962.
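To make the question concrete, the following sketch applies the positive-part James–Stein estimator separately to two simulated groups of means and once to the pooled problem, and compares the realized squared-error losses; the two groups, their sizes, and their signal levels are hypothetical.

import numpy as np

rng = np.random.default_rng(5)

def james_stein(y):
    # Positive-part James-Stein estimator shrinking toward zero, unit noise variance.
    p = len(y)
    factor = max(0.0, 1.0 - (p - 2) / np.sum(y ** 2))
    return factor * y

# Two hypothetical estimation problems with rather different signal levels.
theta1 = rng.normal(0.0, 0.5, size=30)     # small effects
theta2 = rng.normal(5.0, 0.5, size=30)     # large effects
y1 = theta1 + rng.normal(size=30)
y2 = theta2 + rng.normal(size=30)

loss_separate = np.sum((james_stein(y1) - theta1) ** 2) + np.sum((james_stein(y2) - theta2) ** 2)
combined = james_stein(np.concatenate([y1, y2]))
loss_combined = np.sum((combined - np.concatenate([theta1, theta2])) ** 2)
print(f"separate loss: {loss_separate:.1f}   combined loss: {loss_combined:.1f}")
# With heterogeneous groups, lumping everything together can dilute the shrinkage
# where it helps most; the compromise estimators of the paper sit between the two.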
Article
We live in a new age for statistical inference, where modern scientific technology such as microarrays and fMRI machines routinely produce thousands and sometimes millions of parallel data sets, each with its own estimation or testing problem. Doing thousands of problems at once is more than repeated application of classical methods. Taking an empirical Bayes approach, Bradley Efron, inventor of the bootstrap, shows how information accrues across problems in a way that combines Bayesian and frequentist ideas. Estimation, testing, and prediction blend in this framework, producing opportunities for new methodologies of increased power. New difficulties also arise, easily leading to flawed inferences. This book takes a careful look at both the promise and pitfalls of large-scale statistical inference, with particular attention to false discovery rates, the most successful of the new statistical techniques. Emphasis is on the inferential ideas underlying technical developments, illustrated using a large number of real examples.
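Among the techniques discussed, false discovery rate control is the most widely adopted; as a self-contained reminder (not code from the book), the sketch below applies the Benjamini–Hochberg step-up rule to simulated z-values from a sparse-signal setting with hypothetical proportions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
# 10,000 parallel tests: 95% true nulls, 5% with mean 3 (hypothetical proportions).
z = np.concatenate([rng.normal(0, 1, 9500), rng.normal(3, 1, 500)])
pvals = norm.sf(z)                       # one-sided p-values
is_null = np.arange(z.size) < 9500

# Benjamini-Hochberg step-up procedure at level q = 0.10.
q = 0.10
order = np.argsort(pvals)
ranked = pvals[order]
below = np.nonzero(ranked <= q * (np.arange(1, z.size + 1) / z.size))[0]
rejected = np.zeros(z.size, dtype=bool)
if below.size:
    rejected[order[: below[-1] + 1]] = True

fdp = np.sum(rejected & is_null) / max(1, rejected.sum())
print(f"rejections = {rejected.sum()}, realized false discovery proportion = {fdp:.3f}")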
Article
In this paper we consider Gini's work in the field of statistical inference and reconstruct his approach as an early Bayesian. We also show that certain solutions put forward by Gini anticipate the empirical Bayes line of approach. On the other hand, we highlight the fact that Gini's knowledge of the debate current at his time was rather narrow and that he was probably not too well equipped to discuss the more subtle issues. /// In this article we examine C. Gini's contributions to the debate on the problem of statistical inference and seek to reconstruct his thinking as one of the precursors of the Bayesian position on this subject. We also show that some of the methods proposed by Gini are closely analogous to those later formalized by Robbins (1956). On the other hand, we point out that Gini had an insufficient knowledge of the debate to which he sought to contribute.
Article
A Bayesian approach is given for various kinds of empirical Bayes problems. In particular it is shown that empirical Bayes procedures are really non-Bayesian, asymptotically optimal, classical procedures for mixtures. In some situations these procedures are Bayes with respect to some prior and in other situations, there is no prior for which they are Bayes. Several examples of these concepts are given as well as a general theory showing the difference between an empirical Bayes model and a Bayes empirical Bayes model.
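To fix ideas on the distinction drawn here, a minimal sketch: in a beta-binomial compound problem, an empirical Bayes rule plugs the marginal maximum likelihood estimate of the hyperparameter into each posterior mean, whereas a Bayes empirical Bayes analysis places a (here, discrete and uniform) hyperprior on the same hyperparameter and averages over it. The grid, the symmetric Beta(c, c) latent distribution, and the data are illustrative assumptions.

import numpy as np
from scipy.stats import betabinom

rng = np.random.default_rng(7)
# Compound problem: K binomial experiments with theta_k ~ Beta(c, c), c unknown.
K, n, c_true = 40, 20, 3.0
theta = rng.beta(c_true, c_true, size=K)
x = rng.binomial(n, theta)

# Marginal (beta-binomial) likelihood of the whole compound sample as a function of c.
c_grid = np.linspace(0.2, 20.0, 400)
loglik = np.array([betabinom.logpmf(x, n, c, c).sum() for c in c_grid])

# Empirical Bayes: plug the marginal MLE of c into each posterior mean.
c_hat = c_grid[np.argmax(loglik)]
eb_mean = (x + c_hat) / (n + 2 * c_hat)

# Bayes empirical Bayes: uniform hyperprior on the same grid, average over it.
w = np.exp(loglik - loglik.max())
w /= w.sum()
beb_mean = np.sum(w[:, None] * (x[None, :] + c_grid[:, None]) / (n + 2 * c_grid[:, None]), axis=0)

print(f"c_hat = {c_hat:.2f}; max gap between EB and Bayes-EB posterior means = {np.abs(eb_mean - beb_mean).max():.4f}")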