
Bayesian Estimation of the Entropy of the Multivariate Gaussian

Santosh Srivastava
Fred Hutchinson Cancer Research Center
Seattle, WA 98109, USA
Email: ssrivast@fhcrc.org

Maya R. Gupta
Department of Electrical Engineering
University of Washington
Seattle, WA 98195, USA
Email: gupta@ee.washington.edu

Abstract—Estimating the entropy of a Gaussian distribution from samples drawn from the distribution is a difficult problem when the number of samples is smaller than the number of dimensions. A new Bayesian entropy estimator is proposed using an inverted Wishart distribution and a data-dependent prior that handles the small-sample case. Experiments for six different cases show that the proposed estimator provides good performance for the small-sample case compared to the standard nearest-neighbor entropy estimator. Additionally, it is shown that the Bayesian estimate formed by taking the expected entropy minimizes expected Bregman divergence.

I. INTRODUCTION

Entropy is a useful description of the predictability of a distribution, and has use in many applications of coding, machine learning, signal processing, communications, and chemistry [1]–[5]. In practice, many continuous generating distributions are modeled as Gaussian distributions. One reason for this is the central limit theorem; another is that the Gaussian is the maximum entropy distribution given an empirical mean and covariance, and as such is a least assumptive model. For the multivariate Gaussian distribution, the entropy goes as the log determinant of the covariance; specifically, the differential entropy of a d-dimensional random vector X drawn from the Gaussian N(\mu, \Sigma) is

h(X) = -\int N(x) \ln N(x)\, dx = \frac{d}{2} + \frac{d \ln(2\pi)}{2} + \frac{\ln|\Sigma|}{2}. \quad (1)
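As a concrete illustration, (1) can be evaluated directly from a covariance matrix. The following sketch (NumPy; the function name `gaussian_entropy` is our own) uses a sign-and-log-determinant routine rather than a raw determinant to avoid underflow for high-dimensional or near-singular covariances:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of N(mu, cov) per (1):
    h = d/2 + d ln(2*pi)/2 + ln|cov|/2 (independent of the mean)."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)  # ln|cov|, numerically stable
    return 0.5 * d + 0.5 * d * np.log(2.0 * np.pi) + 0.5 * logdet
```

For the identity covariance in d dimensions this reduces to (d/2)(1 + ln 2π).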

In this paper we consider the practical problem of estimating the entropy of a generating normal distribution N given samples x_1, x_2, \ldots, x_n that are assumed to be realizations of random vectors X_1, X_2, \ldots, X_n drawn iid from N(\mu, \Sigma). In particular, we focus on the difficult limited-data case that n \le d, which occurs often in practice with high-dimensional data.

One approach to estimating h(X) is to first estimate the Gaussian distribution, for example with maximum likelihood (ML) estimates of \mu and \Sigma, and then plug the covariance estimate into (1) [6]. Such estimates are usually infeasible when n \le d. The ML estimate is also negatively biased, so that it underestimates the entropy on average [5]. We believe it is more effective and creates fewer numerical problems to directly estimate a single value for the entropy, rather than first estimating the entire covariance matrix in order to produce the scalar entropy estimate. To this end, we propose a data-dependent Bayesian estimate using an inverted Wishart prior. The choice of prior in Bayesian estimation can dramatically change the results, but there is little practical guidance for choosing priors. The main contributions of this work are showing that the inverted Wishart prior enables estimates for n \le d, and that using a rough data-dependent estimate as a parameter to the inverted Wishart prior yields a more robust estimator than ignoring the data when creating the prior.

II. RELATED WORK

First we review related work in parametric entropy estimation, then in nonparametric estimation.

Ahmed and Gokhale investigated uniformly minimum variance unbiased (UMVU) entropy estimators for parametric distributions [6]. They showed that the UMVU entropy estimate for the Gaussian is

\hat{h}_{\mathrm{UMVU}} = \frac{d}{2} + \frac{d \ln \pi}{2} + \frac{\ln|S|}{2} - \frac{1}{2} \sum_{i=1}^{d} \psi\!\left(\frac{n+1-i}{2}\right), \quad (2)

where S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T, and the digamma function is defined \psi(z) = \frac{d}{dz} \ln \Gamma(z), where \Gamma is the standard gamma function.

Bayesian entropy estimation for the multivariate normal was first proposed in 2005 by Misra et al. [5]. They form an entropy estimate by substituting \widehat{\ln|\Sigma|} for \ln|\Sigma| in (1), where \widehat{\ln|\Sigma|} minimizes the squared-error risk of the entropy estimate. That is, \widehat{\ln|\Sigma|} solves

\arg\min_{\delta \in \mathbb{R}} E_{\mu,\Sigma}\!\left[ (\delta - \ln|\Sigma|)^2 \right], \quad (3)

where here \Sigma denotes a random covariance matrix and \mu a random vector, and the expectation is taken with respect to the posterior. We will denote by \tilde{\mu} and \tilde{\Sigma} respectively realizations of the random \mu and \Sigma. Misra et al. consider different priors with support over the set of symmetric positive definite \tilde{\Sigma}. They show that given the prior p(\tilde{\mu}, \tilde{\Sigma}) = \frac{1}{|\tilde{\Sigma}|^{\frac{d+1}{2}}}, the solution to (3) yields the Bayesian entropy estimate

\hat{h}_{\mathrm{Bayesian}} = \frac{d}{2} + \frac{d \ln \pi}{2} + \frac{\ln|S|}{2} - \frac{1}{2} \sum_{i=1}^{d} \psi\!\left(\frac{n-i}{2}\right). \quad (4)
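Both estimators can be sketched as follows (NumPy, with SciPy's `digamma`; the function names are our own, and the additive constant d/2 is the constant from (1)):

```python
import numpy as np
from scipy.special import digamma

def scatter_matrix(X):
    """S = sum_i (x_i - xbar)(x_i - xbar)^T for an n x d sample matrix X."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc

def h_umvu(X):
    """UMVU entropy estimate (2); requires n >= d samples."""
    n, d = X.shape
    _, logdet_S = np.linalg.slogdet(scatter_matrix(X))
    i = np.arange(1, d + 1)
    return 0.5 * d + 0.5 * d * np.log(np.pi) + 0.5 * logdet_S \
        - 0.5 * digamma((n + 1 - i) / 2.0).sum()

def h_bayesian(X):
    """Misra et al. Bayesian entropy estimate (4); requires n > d samples."""
    n, d = X.shape
    _, logdet_S = np.linalg.slogdet(scatter_matrix(X))
    i = np.arange(1, d + 1)
    return 0.5 * d + 0.5 * d * np.log(np.pi) + 0.5 * logdet_S \
        - 0.5 * digamma((n - i) / 2.0).sum()
```

Since \psi is increasing, for fixed n and d the two estimates differ by a constant, with \hat{h}_{\mathrm{Bayesian}} the larger.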

For discrete random variables, a Bayesian approach that uses binning for estimating functionals such as entropy was proposed the same year as Misra et al.'s work [7].

It is interesting to note that the estimates \hat{h}_{\mathrm{UMVU}} and \hat{h}_{\mathrm{Bayesian}} were derived from different perspectives, but differ only slightly in the digamma argument. Like \hat{h}_{\mathrm{UMVU}}, Misra et al. show that \hat{h}_{\mathrm{Bayesian}} is an unbiased entropy estimate. They also show that \hat{h}_{\mathrm{Bayesian}} is dominated by a Stein-type estimator,

\hat{h}_{\mathrm{Stein}} = \ln|S + n\bar{x}\bar{x}^T| - c_1,

where c_1 is a function of d and n. Further, they also show that the estimate \hat{h}_{\mathrm{Bayesian}} is dominated by a Brewster-Zidek-type estimator \hat{h}_{\mathrm{BZ}},

\hat{h}_{\mathrm{BZ}} = \ln|S + n\bar{x}\bar{x}^T| - c_2,

where c_2 is a function of |S| and \bar{x}\bar{x}^T that requires calculating the ratio of two definite integrals, stated in full in (4.3) of [5]. Misra et al. found that in simulated numerical experiments their Stein-type and Brewster-Zidek-type estimators achieved only roughly a 6% improvement over the simpler Bayesian estimate \hat{h}_{\mathrm{Bayesian}}, and thus they recommend using the computationally much simpler \hat{h}_{\mathrm{Bayesian}} in applications.

There are two practical problems with the previously proposed parametric estimators. First, the estimates given by (2), (4), and the other proposed Misra et al. estimators require calculating the determinant of S or S + n\bar{x}\bar{x}^T, which can be numerically infeasible if there are few samples. Second, the Bayesian estimate \hat{h}_{\mathrm{Bayesian}} uses the digamma function of n - d, which requires n > d samples so that the digamma has a positive argument, and similarly \hat{h}_{\mathrm{UMVU}} uses the digamma of n - d + 1, which requires n \ge d samples. Thus, although the knowledge that one is estimating the entropy of a Gaussian should be of use, for the n \le d case one must turn to nonparametric entropy estimators.

A thorough review of work in nonparametric entropy estimation up to 1997 was written by Beirlant et al. [4], including density estimation approaches, sample-spacing approaches, and nearest-neighbor estimators. Recently, Nilsson and Kleijn show that high-rate quantization approximations of Zador and Gray can be used to estimate Renyi entropy, and that the limiting case of Shannon entropy produces a nearest-neighbor estimate that depends on the number of quantization cells [8]. The special case of their nearest-neighbor estimate that best validates the high-rate quantization assumptions is when the number of quantization cells is as large as possible. They show that this special case produces the nearest-neighbor differential entropy estimator originally proposed by Kozachenko and Leonenko in 1987 [9]:

\hat{h}_{\mathrm{NN}} = \frac{d}{n} \sum_{i=1}^{n} \ln \|x_i - x_{i,1}\|_2 + \ln(n-1) + \gamma + \ln V_d, \quad (5)

where x_{i,1} is x_i's nearest neighbor in the sample set, \gamma is the Euler-Mascheroni constant, and V_d is the volume of the d-dimensional hypersphere with radius 1: V_d = \frac{\pi^{d/2}}{\Gamma(1+d/2)}. A problem with this approach is that in practice data samples may not be in general position; for example, image pixel data are usually quantized to 8 bits or 10 bits. Thus, it can happen in practice that two samples have the exact same measured value, so that \|x_i - x_{i,1}\| is zero and the entropy estimate is ill-defined. Though there are various fixes, such as pre-dithering the quantized data, it is not clear what effect these fixes could have on the estimated entropy.
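A direct implementation of (5) makes the general-position assumption explicit (NumPy sketch; the O(n^2) distance computation and the function name are our own choices, and duplicate samples produce ln 0 = -\infty exactly as described above):

```python
import numpy as np
from math import lgamma

def h_nn(X):
    """Kozachenko-Leonenko nearest-neighbor entropy estimate (5).
    Assumes the n x d samples are in general position (no duplicates)."""
    n, d = X.shape
    # pairwise Euclidean distances, with +inf on the diagonal so that
    # a sample is never its own nearest neighbor
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn_dist = D.min(axis=1)  # ||x_i - x_{i,1}||_2
    gamma = 0.5772156649015329  # Euler-Mascheroni constant
    log_Vd = 0.5 * d * np.log(np.pi) - lgamma(1.0 + 0.5 * d)  # ln V_d
    return (d / n) * np.log(nn_dist).sum() + np.log(n - 1) + gamma + log_Vd
```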

A different approach is taken by Costa and Hero [2], [3], [10]; they use the Beardwood-Halton-Hammersley result that a function of the length of a minimal spanning graph converges to the Renyi entropy [11] to form an estimator based on the empirical length of a minimum spanning tree of the data. Unfortunately, how to use this approach to estimate Shannon entropy remains an open question.

III. BAYESIAN ESTIMATE WITH INVERTED WISHART PRIOR

We propose to estimate the entropy as E_N[h(N)], where N is a random Gaussian, and the prior p(N) is an inverted Wishart distribution with scale parameter q and parameter matrix B. We use a Fisher information metric to define a measure over the Riemannian manifold formed by the set of Gaussian distributions. These choices for prior and measure are very similar to the choices that we found worked very well for Bayesian quadratic discriminant analysis [12], and further details on this framework can be found in that work. The resulting proposed inverted Wishart Bayesian entropy estimate is

\hat{h}_{\mathrm{iWBayesian}} = \frac{d}{2} + \frac{d \ln \pi}{2} + \frac{\ln|S+B|}{2} - \frac{1}{2} \sum_{i=1}^{d} \psi\!\left(\frac{n+q+1-i}{2}\right). \quad (6)
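Given the scatter matrix S and the prior parameters B and q (whose selection is discussed below), (6) can be computed as in this NumPy/SciPy sketch (the function name is our own):

```python
import numpy as np
from scipy.special import digamma

def h_iw_bayesian(S, B, n, q):
    """Inverted Wishart Bayesian entropy estimate (6), where S is the
    d x d scatter matrix of the n samples and (q, B) parameterize the prior."""
    d = S.shape[0]
    _, logdet = np.linalg.slogdet(S + B)  # ln|S + B|
    i = np.arange(1, d + 1)
    return 0.5 * d + 0.5 * d * np.log(np.pi) + 0.5 * logdet \
        - 0.5 * digamma((n + q + 1 - i) / 2.0).sum()
```

With q = -1 and B the zero matrix, the digamma arguments become (n - i)/2 and (6) reduces to \hat{h}_{\mathrm{Bayesian}} of (4).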

Proof: To show that (6) is E_N[h(N)], we will need to integrate

\int_{\tilde{\Sigma} > 0} \ln|\tilde{\Sigma}|\, \frac{\exp[-\mathrm{tr}(\tilde{\Sigma}^{-1}V)]}{|\tilde{\Sigma}|^{\frac{q}{2}}}\, d\tilde{\Sigma} \quad (7)

\equiv E_{\Sigma}[\ln|\Sigma|]\, \frac{\Gamma_d\!\left(\frac{q-d-1}{2}\right)}{|V|^{\frac{q-d-1}{2}}}, \quad (8)

where \Sigma is a random covariance matrix drawn from an inverted Wishart distribution with scale parameter q and matrix parameter 2V. Recall that for any matrix 2V, |\Sigma^{-1}|/|(2V)^{-1}| \sim \prod_{i=1}^{d} \chi^2_{q-d-i} [13, Corollary 7.3], where \chi^2 denotes the chi-squared random variable. Take the natural log of both sides and use the fact that |A^{-1}| = |A|^{-1} to show that \ln|\Sigma| \sim \ln|2V| - \sum_{i=1}^{d} \ln \chi^2_{q-d-i}. Then after taking E_{\Sigma}[\ln|\Sigma|], (8) becomes

\frac{\Gamma_d\!\left(\frac{q-d-1}{2}\right)}{|V|^{\frac{q-d-1}{2}}} \left( \ln|2V| - \sum_{i=1}^{d} E[\ln \chi^2_{q-d-i}] \right) = \frac{\Gamma_d\!\left(\frac{q-d-1}{2}\right)}{|V|^{\frac{q-d-1}{2}}} \left( \ln|V| - \sum_{i=1}^{d} \psi\!\left(\frac{q-d-i}{2}\right) \right), \quad (9)

where the second line uses the property of the \chi^2 distribution that E[\ln \chi^2_q] = \ln 2 + \psi\!\left(\frac{q}{2}\right).

Now we will use the above integral identity to find \hat{h}_{\mathrm{iWBayesian}} = E_N[h(N)]. Solving for E_N[h(N)] only requires computing

E_N[\ln|\Sigma|] = \int_{\tilde{\Sigma}} \ln|\tilde{\Sigma}|\, p(\tilde{\Sigma}\,|\,x_1, x_2, \ldots, x_n)\, \frac{d\tilde{\Sigma}}{|\tilde{\Sigma}|^{\frac{d+2}{2}}},

where the term 1/|\tilde{\Sigma}|^{(d+2)/2} results from the Fisher information metric, which converts the integral E_N[h(N)] from an integral over the statistical manifold of Gaussians to an integral over covariance matrices, and the posterior p(\tilde{\Sigma}\,|\,x_1, x_2, \ldots, x_n) is given in [12], such that

E_N[\ln|\Sigma|] = \left( \frac{|S+B|^{\frac{n+q}{2}}}{2^{\frac{(n+q)d}{2}}\, \Gamma_d\!\left(\frac{n+q}{2}\right)} \right) \left( \int_{\tilde{\Sigma} > 0} \ln|\tilde{\Sigma}|\, \frac{\exp[-\frac{1}{2}\mathrm{tr}(\tilde{\Sigma}^{-1}(S+B))]}{|\tilde{\Sigma}|^{\frac{n+q+d+1}{2}}}\, d\tilde{\Sigma} \right)

= \frac{|S+B|^{\frac{n+q}{2}}}{2^{\frac{(n+q)d}{2}}\, \Gamma_d\!\left(\frac{n+q}{2}\right)} \cdot \frac{\Gamma_d\!\left(\frac{n+q}{2}\right)}{\left|\frac{S+B}{2}\right|^{\frac{n+q}{2}}} \left( \ln\left|\frac{S+B}{2}\right| - \sum_{i=1}^{d} \psi\!\left(\frac{n+q+1-i}{2}\right) \right) \quad (10)

= \ln|S+B| - d\ln 2 - \sum_{i=1}^{d} \psi\!\left(\frac{n+q+1-i}{2}\right),

where equation (10) follows by using the fact that (7) is given by (9) for V = (S+B)/2. Then replacing \ln|\Sigma| in (1) with the computed E_N[\ln|\Sigma|] produces the estimator (6).

A. Choice of Prior Parameters q and B

The inverted Wishart distribution is a unimodal prior that gives maximum a priori probability to Gaussians with \Sigma = B/q. Previous work using the inverted Wishart for Bayesian estimation of Gaussians has used the identity matrix for B [14], or a scaled identity where the scale factor was learned by cross-validation given labeled training data (for classification) [15]. Using B = I sets the maximum of the prior at I/q, regardless of whether the data are measured in nanometers or trillions of dollars. To a rough approximation, the prior regularizes the likelihood towards the prior maximum at B/q, and thus the bias added by using the prior with B/q = I can be ill-suited to the problem. Instead, setting B/q to be a rough estimate of the covariance can add bias that is more appropriate for the problem. For example, we have shown that using B = q\,\mathrm{diag}(S) can work well when estimating Gaussians for classification by Bayesian quadratic discriminant analysis [12]. For entropy estimation, B = q\,\mathrm{diag}(S)/n works excellently if the true covariance is diagonal, but can perform poorly if the true covariance is a full covariance, because the determinant of B = q\,\mathrm{diag}(S)/n can be significantly higher than the determinant of S, biasing the entropy estimate to be too high. An optimal choice of B when there is no prior information about S remains an open question; we propose using

B = q\, \frac{\ln(\mathrm{diag}(S) + 1)}{n}, \quad (11)

for n \le d, and for n > d we let B be the d \times d matrix of zeros.

The scalar prior parameter q changes the peakedness of the inverted Wishart prior, with higher q corresponding to a more peaked prior, and thus higher bias. For the entropy problem there is only one number to estimate, and thus we believe that as little bias as possible should be used. To achieve this, we set q = \max(d - n, -1). Letting q = -1 when n > d and using the zero matrix for B in this range makes \hat{h}_{\mathrm{iWBayesian}} equivalent to \hat{h}_{\mathrm{Bayesian}} when n > d.
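The parameter choices above can be collected into a small helper (NumPy sketch; the function name is our own, and we write the rule as q = max(d - n, -1), which keeps every digamma argument in (6) positive for n \le d and gives q = -1 for n > d):

```python
import numpy as np

def prior_parameters(X):
    """Data-dependent prior parameters (q, B) for the estimator (6)."""
    n, d = X.shape
    q = max(d - n, -1)  # least-biased q that keeps (6) well-defined
    if n <= d:
        Xc = X - X.mean(axis=0)
        diag_S = np.diag((Xc ** 2).sum(axis=0))  # diag(S) as a d x d matrix
        B = q * np.log(diag_S + 1.0) / n         # equation (11), elementwise log
    else:
        B = np.zeros((d, d))                     # (6) then reduces to (4)
    return q, B
```

Note that the elementwise logarithm leaves the off-diagonal entries of B at ln(0 + 1) = 0, so B is diagonal, as (11) intends.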

B. Bregman Divergences and Bayesian Estimation

Misra et al. showed that E_{\Sigma}[\ln|\Sigma|] minimizes the squared error loss as stated in (3). Here we show that E_N[h(N)] minimizes any Bregman divergence loss. The Bregman divergences are a class of loss functions that includes squared error and relative entropy [16], [17].

Lemma: The mean entropy with respect to uncertainty in the generating distribution, E_N[h(N)], solves

\arg\min_{\hat{h} \in \mathbb{R}} E_N\!\left[ d_{\phi}(h(N), \hat{h}) \right],

where d_{\phi}(a, b) = \phi(a) - \phi(b) - \phi'(b)(a - b) for a, b \in \mathbb{R} is any Bregman divergence with strictly convex \phi.

Proof: Set the first derivative to zero:

0 = \frac{d}{d\hat{h}} E_N\!\left[ \phi(h(N)) - \phi(\hat{h}) - \frac{d}{d\hat{h}}\phi(\hat{h})\,(h(N) - \hat{h}) \right]
= -E_N\!\left[ \frac{d^2}{d\hat{h}^2}\phi(\hat{h})\, \big(h(N) - \hat{h}\big) \right].

Because \phi is strictly convex, \frac{d^2}{d\hat{h}^2}\phi(\hat{h}) > 0, and thus it must be that E_N[h(N) - \hat{h}] = 0, and thus by linearity that the minimizer is \hat{h} = E_N[h(N)].
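The lemma is easy to check numerically with a Bregman divergence other than squared error, for instance the one generated by \phi(x) = e^x. In this toy sketch (all names are our own), a grid search over scalar estimates recovers the sample mean as the minimizer:

```python
import numpy as np

def d_phi(a, b):
    """Bregman divergence generated by phi(x) = exp(x):
    d_phi(a, b) = phi(a) - phi(b) - phi'(b)(a - b)."""
    return np.exp(a) - np.exp(b) - np.exp(b) * (a - b)

# toy stand-in for entropy values h(N) under the posterior
h_samples = np.array([0.3, 1.1, 2.0, 0.7])

# expected Bregman loss as a function of the scalar estimate h_hat
grid = np.linspace(-1.0, 3.0, 4001)
risk = np.array([d_phi(h_samples, b).mean() for b in grid])
h_hat = grid[np.argmin(risk)]  # minimizer on the grid

# the lemma says the minimizer is the mean, E[h(N)]
```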

IV. EXPERIMENTS

We compare the proposed inverted Wishart Bayesian estimator \hat{h}_{\mathrm{iWBayesian}} with the data-dependent B given in (11) to \hat{h}_{\mathrm{iWBayesian}} with B = I, to the nearest-neighbor estimator given in (5), to the maximum likelihood estimator formed by replacing \mu and \Sigma in (1) by maximum likelihood estimates of \mu and \Sigma, to \hat{h}_{\mathrm{UMVU}} given in (2), and to \hat{h}_{\mathrm{Bayesian}} given in (4). All results were computed with Matlab 7.0. For the digamma function and Euler-Mascheroni constant we used the corresponding built-in Matlab commands.

Simulations were run for a fixed dimension of d = 20 with a varying number of iid samples n drawn from random Gaussians. For n \le d it was not possible to calculate the digamma functions for \hat{h}_{\mathrm{Bayesian}} and \hat{h}_{\mathrm{UMVU}}, or the determinant needed for the maximum likelihood estimate; thus \hat{h}_{\mathrm{Bayesian}}, \hat{h}_{\mathrm{UMVU}}, and the maximum likelihood estimates are only reported for n > d. The nearest-neighbor estimate and the inverted Wishart Bayesian estimates are compared down to n = 2 samples.

In the first simulation, each generating Gaussian had a diagonal covariance matrix with elements drawn iid from the uniform distribution on (0, \alpha]. The results are shown in Fig. 1 (left) for \alpha = 10 (top), for \alpha = 1 (middle), and for \alpha = .1 (bottom). Fig. 2 shows the average eigenvalues.

[Fig. 1. Comparison of entropy estimators averaged over 10,000 iid runs of each simulation. Each panel plots the log of mean-squared estimation error versus the number of randomly drawn samples for the estimators ML, Bayesian, UMVU, NN, Inv. Wishart Bayesian, and Data-dependent Inv. Wishart Bayesian. Left column: diagonal covariance with elements ∼ U(0, 10], U(0, 1], U(0, .1]; right column: full covariance R^T R with elements of R ∼ N(0, 10^2), N(0, 1^2), N(0, .1^2).]

In the second simulation, each generating Gaussian had a full covariance matrix R^T R, where each of the 20 \times 20 elements of R was drawn iid from a normal distribution N(0, \alpha^2). The results are shown in Fig. 1 (right) for \alpha = 10 (top), for \alpha = 1 (middle), and for \alpha = .1 (bottom). Fig. 2 shows the average eigenvalues.

For each n and each of the six cases, the simulation was run 10,000 times and the results averaged to produce the plots.
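One cell of this experiment can be reproduced in miniature (a NumPy/SciPy stand-in for the paper's Matlab setup, scaled down to d = 5, n = 30, and 200 runs for speed; all names are our own). It compares the mean-squared error of the maximum likelihood plug-in estimate against \hat{h}_{\mathrm{Bayesian}} of (4) for diagonal covariances with approximately U(0, 1] elements:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
d, n, runs, alpha = 5, 30, 200, 1.0

def entropy_from_cov(cov):
    """Differential entropy (1) of a Gaussian with the given covariance."""
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * d + 0.5 * d * np.log(2 * np.pi) + 0.5 * logdet

mse_ml = mse_bayes = 0.0
for _ in range(runs):
    # random diagonal covariance, elements approximately U(0, alpha]
    cov = np.diag(rng.uniform(1e-3, alpha, size=d))
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    h_true = entropy_from_cov(cov)
    # maximum likelihood plug-in: substitute the ML covariance S/n into (1)
    h_ml = entropy_from_cov(np.cov(X, rowvar=False, bias=True))
    # Bayesian estimate (4)
    Xc = X - X.mean(axis=0)
    _, logdet_S = np.linalg.slogdet(Xc.T @ Xc)
    i = np.arange(1, d + 1)
    h_b = 0.5 * d + 0.5 * d * np.log(np.pi) + 0.5 * logdet_S \
        - 0.5 * digamma((n - i) / 2.0).sum()
    mse_ml += (h_ml - h_true) ** 2 / runs
    mse_bayes += (h_b - h_true) ** 2 / runs
```

With these settings the unbiased Bayesian estimate should achieve a lower mean-squared error than the negatively biased ML plug-in, consistent with the behavior reported in Fig. 1.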

[Fig. 2. Average log ranked eigenvalues for the first simulation (top: diagonal covariance) and the second simulation (bottom: full covariance). Each panel plots average log eigenvalue versus eigenvalue rank.]

For n > d the three Bayesian estimates are equivalent and perform consistently better than \hat{h}_{\mathrm{UMVU}}, the maximum likelihood estimate, or the nearest-neighbor estimate. The maximum likelihood estimator is always the worst parametric estimator.

Throughout the simulations, the nearest-neighbor estimate makes the least use of additional samples, improving its estimate only slowly. This is reasonable because the nearest-neighbor estimator does not explicitly use the information that the true distribution is Gaussian.

Given very few samples, the nearest-neighbor estimator is the best performer for two of the full covariance cases. This suggests that different prior parameter settings could be more effective when there are few samples, perhaps a prior that adds more bias.

As expected, in general the data-dependent \hat{h}_{\mathrm{iWBayesian}} achieves lower error than \hat{h}_{\mathrm{iWBayesian}} with B = I, sometimes significantly better, as in the case of true diagonal covariance with elements drawn from U(0, .1], shown in Fig. 1 (bottom).

V. CONCLUSIONS

A data-dependent approach to Bayesian entropy estimation was given that minimizes expected Bregman divergence and performs consistently well compared to other possible estimators for the high-dimensional, limited-data case that n \le d.

REFERENCES

[1] T. Cover and J. Thomas, Elements of Information Theory. United States of America: John Wiley and Sons, 1991.
[2] A. Hero and O. Michel, "Asymptotic theory of greedy approximations to minimal k-point random graphs," IEEE Trans. Information Theory, vol. 45, pp. 1921–1939, 1999.
[3] J. Costa and A. Hero, "Geodesic entropic graphs for dimension and entropy estimation in manifold learning," IEEE Trans. Signal Processing, vol. 52, no. 8, pp. 2210–2221, 2004.
[4] J. Beirlant, E. Dudewicz, L. Györfi, and E. van der Meulen, "Nonparametric entropy estimation: An overview," International Journal of Mathematical and Statistical Sciences, vol. 6, pp. 17–39, 1997.
[5] N. Misra, H. Singh, and E. Demchuk, "Estimation of the entropy of a multivariate normal distribution," Journal of Multivariate Analysis, vol. 92, pp. 324–342, 2005.
[6] N. A. Ahmed and D. V. Gokhale, "Entropy expressions and their estimators for multivariate distributions," IEEE Trans. Information Theory, pp. 688–692, 1989.
[7] D. Endres and P. Földiák, "Bayesian bin distribution inference and mutual information," IEEE Trans. Information Theory, pp. 3766–3779, 2005.
[8] M. Nilsson and W. B. Kleijn, "On the estimation of differential entropy from data located on embedded manifolds," IEEE Trans. Information Theory, vol. 53, no. 7, pp. 2330–2341, 2007.
[9] L. F. Kozachenko and N. N. Leonenko, "Sample estimate of entropy of a random vector," Problems in Information Transmission, vol. 23, no. 1, pp. 95–101, 1987.
[10] A. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Processing Magazine, pp. 85–95, 2002.
[11] J. Beardwood, J. H. Halton, and J. M. Hammersley, "The shortest path through many points," Proc. Cambridge Philosophical Society, vol. 55, pp. 299–327, 1959.
[12] S. Srivastava, M. R. Gupta, and B. A. Frigyik, "Bayesian quadratic discriminant analysis," Journal of Machine Learning Research, vol. 8, pp. 1287–1314, 2007.
[13] M. Bilodeau and D. Brenner, Theory of Multivariate Statistics. New York: Springer Texts in Statistics, 1999.
[14] D. G. Keehn, "A note on learning for Gaussian properties," IEEE Trans. Information Theory, vol. 11, pp. 126–132, 1965.
[15] P. J. Brown, T. Fearn, and M. S. Haque, "Discrimination with many variables," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1320–1329, 1999.
[16] A. Banerjee, X. Guo, and H. Wang, "On the optimality of conditional expectation as a Bregman predictor," IEEE Trans. Information Theory, vol. 51, no. 7, pp. 2664–2669, 2005.
[17] B. A. Frigyik, S. Srivastava, and M. R. Gupta, "Functional Bregman divergence and Bayesian estimation of distributions," arXiv preprint cs.IT/0611123, 2006.