
# Bayesian Estimation of the Entropy of the Multivariate Gaussian

Authors:
• IBM Research, New Delhi

## Abstract

Estimating the entropy of a Gaussian distribution from samples drawn from the distribution is a difficult problem when the number of samples is smaller than the number of dimensions. A new Bayesian entropy estimator is proposed using an inverted Wishart distribution and a data-dependent prior that handles the small-sample case. Experiments for six different cases show that the proposed estimator provides good performance for the small-sample case compared to the standard nearest-neighbor entropy estimator. Additionally, it is shown that the Bayesian estimate formed by taking the expected entropy minimizes expected Bregman divergence.
Santosh Srivastava
Fred Hutchinson Cancer Research Center
Seattle, WA 98109, USA
Email: ssrivast@fhcrc.org
Maya R. Gupta
Department of Electrical Engineering
University of Washington
Seattle, WA 98195, USA
Email: gupta@ee.washington.edu
I. INTRODUCTION
Entropy is a useful description of the predictability of a distribution, and is used in many applications of coding, machine learning, signal processing, communications, and chemistry [1]–[5]. In practice, many continuous generating distributions are modeled as Gaussian distributions. One reason for this is the central limit theorem; another is that the Gaussian is the maximum entropy distribution given an empirical mean and covariance, and as such is a least assumptive model. For the multivariate Gaussian distribution, the entropy depends on the covariance only through its log determinant; specifically, the differential entropy of a $d$-dimensional random vector $X$ drawn from the Gaussian $N(\mu, \Sigma)$ is

$$h(X) = -\int N(x) \ln N(x)\,dx = \frac{d}{2} + \frac{d \ln(2\pi)}{2} + \frac{\ln |\Sigma|}{2}. \tag{1}$$
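As a minimal sketch of (1), the following Python computes the differential entropy from a known covariance matrix. It is illustrative only: the function names `log_det_spd` and `gaussian_entropy` are assumptions introduced here, and the log determinant is taken from a hand-written Cholesky factorization so that the example needs only the standard library.

```python
import math

def log_det_spd(cov):
    """Log determinant of a symmetric positive definite matrix (list of lists),
    computed from its Cholesky factor L as ln|A| = 2 * sum(ln L_ii)."""
    d = len(cov)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(d))

def gaussian_entropy(cov):
    """Differential entropy in nats of a Gaussian with covariance cov, as in (1)."""
    d = len(cov)
    return 0.5 * d + 0.5 * d * math.log(2.0 * math.pi) + 0.5 * log_det_spd(cov)

h_identity = gaussian_entropy([[1.0, 0.0], [0.0, 1.0]])
h_correlated = gaussian_entropy([[2.0, 1.0], [1.0, 2.0]])
```

The mean does not appear because the entropy in (1) does not depend on $\mu$. The Cholesky step raises an error for a singular matrix, which is exactly the situation a plug-in estimate runs into when the covariance estimate is rank deficient.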
In this paper we consider the practical problem of estimating the entropy of a generating normal distribution $N$ given samples $x_1, x_2, \ldots, x_n$ that are assumed to be realizations of random vectors $X_1, X_2, \ldots, X_n$ drawn iid from $N(\mu, \Sigma)$. In particular, we focus on the difficult limited-data case $n \le d$, which occurs often in practice with high-dimensional data.
One approach to estimating $h(X)$ is to first estimate the Gaussian distribution, for example with maximum likelihood (ML) estimates of $\mu$ and $\Sigma$, and then plug the covariance estimate into (1) [6]. Such estimates are usually infeasible when $n \le d$. The ML estimate is also negatively biased, so it underestimates the entropy on average [5]. We believe it is more effective, and creates fewer numerical problems, to estimate a single value for the entropy directly rather than to first estimate the entire covariance matrix in order to produce the scalar entropy estimate. To this end, we propose a data-dependent Bayesian estimate using an inverted Wishart prior. The choice of prior in Bayesian estimation can dramatically change the results, but there is little practical guidance for choosing priors. The main contributions of this work are showing that the inverted Wishart prior enables estimates for $n \le d$, and that using a rough data-dependent estimate as a parameter to the inverted Wishart prior yields a more robust estimator than ignoring the data when creating the prior.
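A short supporting argument for why the plug-in estimate breaks down when $n \le d$ is sketched below. It assumes the scatter matrix about the sample mean, $S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$, and the ML covariance estimate $S/n$. The centered samples sum to zero, so $S$ is a sum of $n$ rank-one terms with one linear dependency among them:

```latex
\sum_{i=1}^{n} (x_i - \bar{x}) = 0
\;\Longrightarrow\;
\operatorname{rank}(S) \le n - 1,
\qquad\text{so for } n \le d:\quad
|S| = 0
\quad\text{and}\quad
\ln \left| \tfrac{1}{n} S \right| = -\infty .
```

The plug-in entropy from (1) is therefore not finite in this regime, whatever the true covariance is.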
II. RELATED WORK
First we review related work in parametric entropy estimation, then in nonparametric estimation.

Ahmed and Gokhale investigated uniformly minimum variance unbiased (UMVU) entropy estimators for parametric distributions [6]. They showed that the UMVU entropy estimate for the Gaussian is

$$\hat{h}_{\mathrm{UMVU}} = \frac{d \ln \pi}{2} + \frac{\ln |S|}{2} - \frac{1}{2} \sum_{i=1}^{d} \psi\left(\frac{n + 1 - i}{2}\right), \tag{2}$$

where $S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$, and the digamma function is defined as $\psi(z) = \frac{d}{dz} \ln \Gamma(z)$, where $\Gamma$ is the standard gamma function.
Bayesian entropy estimation for the multivariate normal was first proposed in 2005 by Misra et al. [5]. They form an entropy estimate by substituting $\widehat{\ln |\Sigma|}$ for $\ln |\Sigma|$ in (1), where $\widehat{\ln |\Sigma|}$ minimizes the squared-error risk of the entropy estimate. That is, $\widehat{\ln |\Sigma|}$ solves

$$\arg\min_{\delta \in \mathbb{R}} E_{\mu, \Sigma}\left[ \left(\delta - \ln |\Sigma|\right)^2 \right], \tag{3}$$

where here $\Sigma$ denotes a random covariance matrix and $\mu$ a random vector, and the expectation is taken with respect to the posterior. We will denote by $\tilde{\mu}$ and $\tilde{\Sigma}$ respectively realizations of the random $\mu$ and $\Sigma$. Misra et al. consider different priors with support over the set of symmetric positive definite $\tilde{\Sigma}$. They show that given the prior $p(\tilde{\mu}, \tilde{\Sigma}) = 1 / |\tilde{\Sigma}|^{\frac{d+1}{2}}$, the solution to (3) yields the Bayesian entropy estimate

$$\hat{h}_{\mathrm{Bayesian}} = \frac{d \ln \pi}{2} + \frac{\ln |S|}{2} - \frac{1}{2} \sum_{i=1}^{d} \psi\left(\frac{n - i}{2}\right). \tag{4}$$

For discrete random variables, a Bayesian approach that uses binning for estimating functionals such as entropy was proposed the same year as Misra et al.'s work [7].
It is interesting to note that the estimates $\hat{h}_{\mathrm{UMVU}}$ and $\hat{h}_{\mathrm{Bayesian}}$ were derived from different perspectives, but differ only slightly in the digamma argument. Misra et al. show that $\hat{h}_{\mathrm{Bayesian}}$, like $\hat{h}_{\mathrm{UMVU}}$, is an unbiased entropy estimate. They also show that $\hat{h}_{\mathrm{Bayesian}}$ is dominated by a Stein-type estimator,

$$\hat{h}_{\mathrm{Stein}} = \ln |S + n \bar{x} \bar{x}^T| - c_1,$$

where $c_1$ is a function of $d$ and $n$. Further, they show that $\hat{h}_{\mathrm{Bayesian}}$ is dominated by a Brewster-Zidek-type estimator,

$$\hat{h}_{\mathrm{BZ}} = \ln |S + n \bar{x} \bar{x}^T| - c_2,$$

where $c_2$ is a function of $|S|$ and $\bar{x} \bar{x}^T$ that requires calculating the ratio of two definite integrals, stated in full in (4.3) of [5]. Misra et al. found that in simulated numerical experiments their Stein-type and Brewster-Zidek-type estimators achieved only roughly 6% improvement over the simpler Bayesian estimate $\hat{h}_{\mathrm{Bayesian}}$, and thus they recommend using the computationally much simpler $\hat{h}_{\mathrm{Bayesian}}$ in applications.
There are two practical problems with the previously proposed parametric estimators. First, the estimates given by (2), (4), and the other estimators proposed by Misra et al. require calculating the determinant of $S$ or $S + n \bar{x} \bar{x}^T$, which can be numerically infeasible if there are few samples. Second, the Bayesian estimate $\hat{h}_{\mathrm{Bayesian}}$ uses the digamma function of $(n - d)/2$, which requires $n > d$ samples so that the digamma has a positive argument, and similarly $\hat{h}_{\mathrm{UMVU}}$ uses the digamma of $(n - d + 1)/2$, which requires $n \ge d$ samples. Thus, although the knowledge that one is estimating the entropy of a Gaussian should be of use, for the $n \le d$ case one must turn to nonparametric entropy estimators.
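The sample-size conditions above can be checked mechanically. The sketch below lists the digamma arguments that appear in (2) and (4) and tests whether all of them are positive; the helper names and the `shift` parameter are illustrative assumptions, not notation from the cited papers.

```python
def digamma_arguments(n, d, shift):
    """Arguments (n + shift - i) / 2 for i = 1..d of the digamma sum."""
    return [(n + shift - i) / 2.0 for i in range(1, d + 1)]

def is_computable(n, d, shift):
    """True when every digamma argument is strictly positive."""
    return min(digamma_arguments(n, d, shift)) > 0.0

d = 20
# shift = 1 corresponds to the UMVU estimate (2), shift = 0 to the Bayesian estimate (4).
umvu_ok = {n: is_computable(n, d, shift=1) for n in (19, 20, 21)}
bayes_ok = {n: is_computable(n, d, shift=0) for n in (19, 20, 21)}
```

This only covers the digamma terms. Both estimates also need $\ln |S|$ to be finite, which is a separate requirement on the number of samples.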
A thorough review of work in nonparametric entropy estimation up to 1997 was written by Beirlant et al. [4], including density estimation approaches, sample-spacing approaches, and nearest-neighbor estimators. Recently, Nilsson and Kleijn showed that the high-rate quantization approximations of Zador and Gray can be used to estimate Rényi entropy, and that the limiting case of Shannon entropy produces a nearest-neighbor estimate that depends on the number of quantization cells [8]. The special case of their nearest-neighbor estimate that best validates the high-rate quantization assumptions is when the number of quantization cells is as large as possible. They show that this special case produces the nearest-neighbor differential entropy estimator originally proposed by Kozachenko and Leonenko in 1987 [9]:

$$\hat{h}_{\mathrm{NN}} = \frac{d}{n} \sum_{i=1}^{n} \ln \|x_i - x_{i,1}\|_2 + \ln(n - 1) + \gamma + \ln V_d, \tag{5}$$

where $x_{i,1}$ is $x_i$'s nearest neighbor in the sample set, $\gamma$ is the Euler-Mascheroni constant, and $V_d$ is the volume of the $d$-dimensional hypersphere with unit radius, $V_d = \pi^{d/2} / \Gamma(1 + d/2)$. A problem with this approach is that in practice data samples may not be in general position; for example, image pixel data are usually quantized to 8 bits or 10 bits. Thus, it can happen in practice that two samples have exactly the same measured value, so that $\|x_i - x_{i,1}\|$ is zero for some $i$ and the entropy estimate is ill-defined. Though there are various fixes, such as pre-dithering the quantized data, it is not clear what effect these fixes could have on the estimated entropy.
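A minimal sketch of the nearest-neighbor estimate (5) follows, using a brute-force neighbor search and only the Python standard library. The function name is an assumption made for illustration, and raising an error on duplicate samples is one possible way to surface the ill-defined case described above, not a fix proposed in the cited work.

```python
import math

EULER_GAMMA = 0.5772156649015329

def nearest_neighbor_entropy(samples):
    """Nearest-neighbor entropy estimate (5) for a list of d-dimensional points."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    d = len(samples[0])
    log_dist_sum = 0.0
    for i, x in enumerate(samples):
        # Euclidean distance from x to its nearest neighbor among the other samples.
        nearest = min(math.dist(x, y) for j, y in enumerate(samples) if j != i)
        if nearest == 0.0:
            raise ValueError("duplicate samples: ln(0) makes the estimate ill-defined")
        log_dist_sum += math.log(nearest)
    # ln V_d with V_d = pi^(d/2) / Gamma(1 + d/2), the volume of the unit ball.
    log_unit_ball = 0.5 * d * math.log(math.pi) - math.lgamma(1.0 + 0.5 * d)
    return (d / n) * log_dist_sum + math.log(n - 1) + EULER_GAMMA + log_unit_ball

h_nn = nearest_neighbor_entropy([(0.0,), (1.0,), (3.0,)])
```

For the three one-dimensional points in the example, the nearest-neighbor distances are 1, 1, and 2, and $V_1 = 2$, so the estimate reduces to $\tfrac{1}{3}\ln 2 + \ln 2 + \gamma + \ln 2$.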
A different approach is taken by Costa and Hero [2], [3], [10]; they use the Beardwood-Halton-Hammersley result that a function of the length of a minimum spanning graph converges to the Rényi entropy [11] to form an estimator based on the empirical length of a minimum spanning tree of the data. Unfortunately, how to use this approach to estimate Shannon entropy remains an open question.
III. BAYESIAN ESTIMATE WITH INVERTED WISHART PRIOR
We propose to estimate the entropy as $E_N[h(N)]$, where $N$ is a random Gaussian, and the prior $p(N)$ is an inverted Wishart distribution with scale parameter $q$ and parameter matrix $B$. We use a Fisher information metric to define a measure over the Riemannian manifold formed by the set of Gaussian distributions. These choices for prior and measure are very similar to the choices that we found worked very well for Bayesian quadratic discriminant analysis [12], and further details on this framework can be found in that work.

The resulting proposed inverted Wishart Bayesian entropy estimate is

$$\hat{h}_{\mathrm{iWBayesian}} = \frac{d \ln \pi}{2} + \frac{\ln |S + B|}{2} - \frac{1}{2} \sum_{i=1}^{d} \psi\left(\frac{n + q + 1 - i}{2}\right). \tag{6}$$
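Proof: To show that (6) is $E_N[h(N)]$, we will need to integrate

$$\int_{\tilde{\Sigma} > 0} \frac{\ln |\tilde{\Sigma}| \exp\left[-\operatorname{tr}(\tilde{\Sigma}^{-1} V)\right]}{|\tilde{\Sigma}|^{\frac{q}{2}}}\, d\tilde{\Sigma} \tag{7}$$

$$= E_{\Sigma}[\ln |\Sigma|]\, \frac{\Gamma_d\left(\frac{q - d - 1}{2}\right)}{|V|^{\frac{q - d - 1}{2}}}, \tag{8}$$

where $\Sigma$ is a random covariance matrix drawn from an inverted Wishart distribution with scale parameter $q$ and matrix parameter $2V$. Recall that for any matrix $2V$, $|\Sigma^{-1}| / |(2V)^{-1}| \sim \prod_{i=1}^{d} \chi^2_{q-d-i}$ [13, Corollary 7.3], where $\chi^2$ denotes the chi-squared random variable. Take the natural log of both sides and use the fact that $|A^{-1}| = |A|^{-1}$ to show that $\ln |\Sigma| \sim \ln |2V| - \sum_{i=1}^{d} \ln \chi^2_{q-d-i}$. Then after taking $E_{\Sigma}[\ln |\Sigma|]$, (8) becomes

$$\frac{\Gamma_d\left(\frac{q - d - 1}{2}\right)}{|V|^{\frac{q - d - 1}{2}}} \left( \ln |2V| - \sum_{i=1}^{d} E\left[\ln \chi^2_{q-d-i}\right] \right) = \frac{\Gamma_d\left(\frac{q - d - 1}{2}\right)}{|V|^{\frac{q - d - 1}{2}}} \left( \ln |V| - \sum_{i=1}^{d} \psi\left(\frac{q - d - i}{2}\right) \right), \tag{9}$$

where the second expression uses the property of the $\chi^2$ distribution that $E[\ln \chi^2_q] = \ln 2 + \psi\left(\frac{q}{2}\right)$.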
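The chi-squared property used in the last step can be sanity-checked by simulation. The sketch below is a Monte Carlo check under stated assumptions: it draws chi-squared variables as Gamma variables with shape $q/2$ and scale 2, and it compares against the two cases where the digamma value has a simple closed form, $\psi(1) = -\gamma$ and $\psi(2) = 1 - \gamma$. The function name, seed, and number of draws are arbitrary choices.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def mean_log_chi2(dof, num_draws, rng):
    """Monte Carlo estimate of E[ln chi^2_dof]; a chi-squared variable with
    dof degrees of freedom is a Gamma variable with shape dof/2 and scale 2."""
    total = 0.0
    for _ in range(num_draws):
        total += math.log(rng.gammavariate(dof / 2.0, 2.0))
    return total / num_draws

rng = random.Random(0)
# ln 2 + psi(dof / 2) for dof = 2 and dof = 4, using psi(1) = -gamma and psi(2) = 1 - gamma.
exact = {2: math.log(2.0) - EULER_GAMMA, 4: math.log(2.0) + 1.0 - EULER_GAMMA}
estimate = {dof: mean_log_chi2(dof, 50_000, rng) for dof in exact}
```

With this many draws the simulated means should agree with the closed-form values to about two decimal places.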
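Now we will use the above integral identity to find $\hat{h}_{\mathrm{iWBayesian}} = E_N[h(N)]$. Solving for $E_N[h(N)]$ only requires computing

$$E_N[\ln |\Sigma|] = \int_{\tilde{\Sigma}} \ln |\tilde{\Sigma}|\, p(\tilde{\Sigma} \mid x_1, x_2, \ldots, x_n)\, \frac{d\tilde{\Sigma}}{|\tilde{\Sigma}|^{\frac{d+2}{2}}},$$

where the term $1 / |\tilde{\Sigma}|^{(d+2)/2}$ results from the Fisher information metric, which converts the integral $E_N[h(N)]$ from an integral over the statistical manifold of Gaussians to an integral over covariance matrices, and the posterior $p(\tilde{\Sigma} \mid x_1, x_2, \ldots, x_n)$ is given in [12] such that

$$E_N[\ln |\Sigma|] = \left( \frac{|S + B|^{\frac{n+q}{2}}}{2^{\frac{(n+q)d}{2}}\, \Gamma_d\left(\frac{n+q}{2}\right)} \right) \cdot \left( \int_{\tilde{\Sigma} > 0} \frac{\ln |\tilde{\Sigma}| \exp\left[-\frac{1}{2} \operatorname{tr}\left(\tilde{\Sigma}^{-1} (S + B)\right)\right]}{|\tilde{\Sigma}|^{\frac{n+q+d+1}{2}}}\, d\tilde{\Sigma} \right)$$

$$= \frac{|S + B|^{\frac{n+q}{2}}}{2^{\frac{(n+q)d}{2}}\, \Gamma_d\left(\frac{n+q}{2}\right)} \cdot \frac{\Gamma_d\left(\frac{n+q}{2}\right)}{\left|\frac{S+B}{2}\right|^{\frac{n+q}{2}}} \cdot \left( \ln \left|\frac{S+B}{2}\right| - \sum_{i=1}^{d} \psi\left(\frac{n + q + 1 - i}{2}\right) \right) \tag{10}$$

$$= \ln |S + B| - d \ln 2 - \sum_{i=1}^{d} \psi\left(\frac{n + q + 1 - i}{2}\right),$$

where equation (10) follows by using the fact that (7) is given by (9) for $V = (S + B)/2$, with the exponent $q$ in (7) taken to be $n + q + d + 1$. Then replacing $\ln |\Sigma|$ in (1) with the computed $E_N[\ln |\Sigma|]$ produces the estimator (6).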
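The final step of the derivation can be sketched numerically. The code below evaluates the closed form obtained for $E_N[\ln |\Sigma|]$ and then substitutes it for $\ln |\Sigma|$ in (1), carrying the constant terms of (1) over unchanged. All function names are assumptions introduced for illustration, and the digamma routine is a simple recurrence plus asymptotic series for positive arguments written for this example, not the Matlab built-in used in the experiments.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x upward with psi(x) = psi(x + 1) - 1/x,
    then apply the asymptotic series, accurate to about 1e-8 for x >= 6."""
    if x <= 0.0:
        raise ValueError("digamma argument must be positive")
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def expected_log_det(log_det_s_plus_b, n, q, d):
    """Posterior mean of ln|Sigma|: ln|S+B| - d ln 2 - sum_i psi((n+q+1-i)/2)."""
    psi_sum = sum(digamma((n + q + 1 - i) / 2.0) for i in range(1, d + 1))
    return log_det_s_plus_b - d * math.log(2.0) - psi_sum

def entropy_from_log_det(log_det, d):
    """Substitute a value for ln|Sigma| into the entropy formula (1)."""
    return 0.5 * d + 0.5 * d * math.log(2.0 * math.pi) + 0.5 * log_det

# Hypothetical one-dimensional case: d = 1, n = 3, q = 0 and S + B = 4.
posterior_log_det = expected_log_det(math.log(4.0), n=3, q=0, d=1)
entropy_estimate = entropy_from_log_det(posterior_log_det, d=1)
```

In the one-dimensional example the only digamma term is $\psi(3/2) = 2 - \gamma - 2 \ln 2$, so the posterior mean of $\ln \sigma^2$ is $3 \ln 2 - 2 + \gamma$. The function takes $\ln |S + B|$ as an input rather than the matrix itself, so how the log determinant is computed is left open.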
A. Choice of Prior Parameters q and B
The inverted Wishart distribution is a unimodal prior that gives maximum a priori probability to Gaussians with $\Sigma = B/q$. Previous work using the inverted Wishart for Bayesian estimation of Gaussians has used the identity matrix for $B$ [14], or a scaled identity where the scale factor was learned by cross-validation given labeled training data (for classification) [15]. Using $B = I$ sets the maximum of the prior at $I/q$, regardless of whether the data are measured in nanometers or trillions of dollars. To a rough approximation, the prior regularizes the likelihood towards the prior maximum at $B/q$, and thus the bias added by using the prior with $B/q = I$ can be ill-suited to the problem. Instead, setting $B/q$ to be a rough estimate of the covariance can add bias that is more appropriate for the problem. For example, we have shown that using $B = q\,\mathrm{diag}(S)$ can work well when estimating Gaussians for classification by Bayesian quadratic discriminant analysis [12]. For entropy estimation, $B = q\,\mathrm{diag}(S)/n$ works excellently if the true covariance is diagonal, but can perform poorly if the true covariance is a full covariance because the determinant of $B = \mathrm{diag}(S)/n$ can be significantly higher than the determinant of $S$, biasing the entropy estimate to be too high. An optimal choice of $B$ when there is no prior information about $S$ remains an open question; we propose using

$$B = \frac{q \ln\left(\mathrm{diag}(S) + 1\right)}{n}, \tag{11}$$

for $n \le d$, and for $n > d$ we let $B$ be the $d \times d$ matrix of zeros.
The scalar prior parameter $q$ changes the peakedness of the inverted Wishart prior, with higher $q$ corresponding to a more peaked prior, and thus higher bias. For the entropy problem there is only one number to estimate, and thus we believe that as little bias as possible should be used. To achieve this, we
set $q = \min(d - n, 1)$. Letting $q = -1$ when $n > d$ and using the zero matrix for $B$ in this range makes $\hat{h}_{\mathrm{iWBayesian}}$ equivalent to $\hat{h}_{\mathrm{Bayesian}}$ when $n > d$.
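To make the data-dependent prior matrix in (11) concrete, the sketch below builds its diagonal from the diagonal of $S$ for a given $q$ and $n$. Two things here are assumptions rather than statements from the text: the logarithm is applied entrywise to the diagonal of $S$, and the function names and example numbers are hypothetical.

```python
import math

def data_dependent_prior_diagonal(scatter_diagonal, q, n):
    """Diagonal entries of B in (11): q * ln(S_ii + 1) / n.
    Assumption: the logarithm acts entrywise on the diagonal of S."""
    return [q * math.log(s_ii + 1.0) / n for s_ii in scatter_diagonal]

def prior_mode_diagonal(prior_diagonal, q):
    """Diagonal of B / q, the covariance at which the prior is maximal."""
    return [b_ii / q for b_ii in prior_diagonal]

# Hypothetical diagonal of S for d = 3 features on very different scales.
scatter_diagonal = [math.e - 1.0, math.e ** 2 - 1.0, math.e ** 6 - 1.0]
B_diag = data_dependent_prior_diagonal(scatter_diagonal, q=1, n=2)
mode_diag = prior_mode_diagonal(B_diag, q=1)
```

The example illustrates the scale point made above: the prior maximum $B/q$ follows the measured scale of each feature instead of being pinned to the identity, while the logarithm compresses large diagonal entries.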
B. Bregman Divergences and Bayesian Estimation
Misra et al. showed that $E_{\Sigma}[\ln |\Sigma|]$ minimizes the expected squared-error loss as stated in (3). Here we show that $E_N[h(N)]$ minimizes the expected loss for any Bregman divergence. The Bregman divergences are a class of loss functions that include squared error and relative entropy [16], [17].

Lemma: The mean entropy with respect to uncertainty in the generating distribution, $E_N[h(N)]$, solves

$$\arg\min_{\hat{h} \in \mathbb{R}} E_N\left[ d_{\phi}\left(h(N), \hat{h}\right) \right],$$

where $d_{\phi}(a, b) = \phi(a) - \phi(b) - \phi'(b)(a - b)$ for $a, b \in \mathbb{R}$ is any Bregman divergence with strictly convex $\phi$.

Proof: Set the first derivative to zero:

$$0 = \frac{d}{d\hat{h}} E_N\left[ \phi(h(N)) - \phi(\hat{h}) - \frac{d}{d\hat{h}} \phi(\hat{h}) \left(h(N) - \hat{h}\right) \right] = -E_N\left[ \frac{d^2}{d\hat{h}^2} \phi(\hat{h}) \left(h(N) - \hat{h}\right) \right].$$

Because $\phi$ is strictly convex, $\frac{d^2}{d\hat{h}^2} \phi(\hat{h}) > 0$, and thus it must be that $E_N[h(N) - \hat{h}] = 0$, and thus by linearity the minimizer is $\hat{h} = E_N[h(N)]$.
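The lemma can be illustrated numerically with a Bregman divergence other than squared error. The sketch below uses $\phi(x) = e^x$, which is strictly convex, and a small set of made-up entropy values treated as equally likely; it compares the expected divergence at the mean with the expected divergence at nearby estimates. The names and numbers are hypothetical and the check is an illustration, not a proof.

```python
import math

def bregman(phi, dphi, a, b):
    """Bregman divergence d_phi(a, b) = phi(a) - phi(b) - phi'(b) (a - b)."""
    return phi(a) - phi(b) - dphi(b) * (a - b)

def expected_divergence(phi, dphi, values, estimate):
    """Average divergence between equally weighted values and a single estimate."""
    return sum(bregman(phi, dphi, v, estimate) for v in values) / len(values)

# Hypothetical entropies h(N) of four equally likely Gaussians.
entropies = [0.5, 1.0, 2.5, 4.0]
mean_entropy = sum(entropies) / len(entropies)

# phi(x) = exp(x) is strictly convex and is its own derivative.
loss_at_mean = expected_divergence(math.exp, math.exp, entropies, mean_entropy)
loss_nearby = [
    expected_divergence(math.exp, math.exp, entropies, mean_entropy + step)
    for step in (-0.5, -0.1, 0.1, 0.5)
]
```

Every entry of `loss_nearby` exceeds `loss_at_mean`, as the lemma predicts for the estimate in the second argument of the divergence.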
IV. EXPERIMENTS
We compare the proposed inverted Wishart Bayesian estimator $\hat{h}_{\mathrm{iWBayesian}}$ with the data-dependent $B$ given in (11) to $\hat{h}_{\mathrm{iWBayesian}}$ with $B = I$, to the nearest-neighbor estimator given in (5), to the maximum likelihood estimator formed by replacing $\mu$ and $\Sigma$ in (1) by maximum likelihood estimates of $\mu$ and $\Sigma$, to $\hat{h}_{\mathrm{UMVU}}$ given in (2), and to $\hat{h}_{\mathrm{Bayesian}}$ given in (4). All results were computed with Matlab 7.0. For the digamma function and Euler-Mascheroni constant we used the corresponding built-in Matlab commands.

Simulations were run for a fixed dimension of $d = 20$ with a varying number of iid samples $n$ drawn from random Gaussians. For $n \le d$ it was not possible to calculate the digamma functions for $\hat{h}_{\mathrm{Bayesian}}$ and $\hat{h}_{\mathrm{UMVU}}$ or $|\Sigma|$, thus $\hat{h}_{\mathrm{Bayesian}}$, $\hat{h}_{\mathrm{UMVU}}$, and the maximum likelihood estimates are only reported for $n > d$. The nearest-neighbor estimate and the inverted Wishart Bayesian estimates are compared down to $n = 2$ samples.
In the first simulation, each generating Gaussian had a diagonal covariance matrix with elements drawn iid from the uniform distribution on $(0, \alpha]$. The results are shown in Fig. 1 (left) for $\alpha = 10$ (top), for $\alpha = 1$ (middle), and for $\alpha = .1$ (bottom). Fig. 2 shows the average eigenvalues.

In the second simulation, each generating Gaussian had a full covariance matrix $R^T R$, where each of the $20 \times 20$ elements of $R$ was drawn iid from a normal distribution $N(0, \alpha^2)$. The results are shown in Fig. 1 (right) for $\alpha = 10$ (top), for $\alpha = 1$ (middle), and for $\alpha = .1$ (bottom). Fig. 2 shows the average eigenvalues.

For each $n$ and each of the six cases, the simulation was run 10,000 times and the results averaged to produce the plots.

Fig. 1. Comparison of entropy estimators averaged over 10,000 iid runs of each simulation. Each panel plots the log of the mean-squared estimation error against the number of randomly drawn samples (0 to 30) for the ML, Bayesian, UMVU, NN, inverted Wishart Bayesian, and data-dependent inverted Wishart Bayesian estimators. Left column: diagonal covariance with elements drawn from $U(0, 10]$ (top), $U(0, 1]$ (middle), and $U(0, .1]$ (bottom). Right column: full covariance $R^T R$ with elements of $R$ drawn from $N(0, 10^2)$ (top), $N(0, 1^2)$ (middle), and $N(0, .1^2)$ (bottom).
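The two ways of drawing a generating covariance described for the simulations can be sketched as follows. This is a standard-library Python illustration, not the Matlab code used for the reported results; the function names and the seed are assumptions.

```python
import random

def diagonal_covariance(d, alpha, rng):
    """Diagonal covariance with entries drawn iid from the uniform distribution on (0, alpha]."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    diag = [alpha * (1.0 - rng.random()) for _ in range(d)]
    return [[diag[i] if i == j else 0.0 for j in range(d)] for i in range(d)]

def full_covariance(d, alpha, rng):
    """Full covariance R^T R with the d x d entries of R drawn iid from N(0, alpha^2)."""
    R = [[rng.gauss(0.0, alpha) for _ in range(d)] for _ in range(d)]
    return [[sum(R[k][i] * R[k][j] for k in range(d)) for j in range(d)] for i in range(d)]

rng = random.Random(0)
cov_diag = diagonal_covariance(20, 10.0, rng)   # first simulation, alpha = 10
cov_full = full_covariance(20, 0.1, rng)        # second simulation, alpha = .1
```

Samples for a given $n$ would then be drawn from a Gaussian with one of these covariances, and the estimation error averaged over repeated runs.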
Fig. 2. Average log ranked eigenvalues for the first simulation (top, diagonal covariance) and second simulation (bottom, full covariance). Each panel plots the average log eigenvalue against the eigenvalue rank (0 to 20).
For $n > d$ the three Bayesian estimates are equivalent and perform consistently better than $\hat{h}_{\mathrm{UMVU}}$, the maximum likelihood estimate, or the nearest-neighbor estimate. The maximum likelihood estimator is always the worst parametric estimator. Throughout the simulations, the nearest-neighbor estimate makes the least use of additional samples, improving its estimate only slowly. This is reasonable because the nearest-neighbor estimator does not explicitly use the information that the true distribution is Gaussian.

Given very few samples, the nearest-neighbor estimator is the best performer for two of the full covariance cases. This suggests that different prior parameter settings could be more effective when there are few samples, perhaps a prior that adds more bias.

As expected, in general the data-dependent $\hat{h}_{\mathrm{iWBayesian}}$ achieves lower error than $\hat{h}_{\mathrm{iWBayesian}}$ with $B = I$, sometimes significantly lower, as in the case of true diagonal covariance with elements drawn from $U(0, .1]$, shown in Fig. 1 (bottom left).
V. CONCLUSIONS
A data-dependent approach to Bayesian entropy estimation was given that minimizes expected Bregman divergence and performs consistently well compared to other possible estimators for the high-dimensional, limited-data case $n \le d$.
REFERENCES
[1] T. Cover and J. Thomas, Elements of Information Theory. United States of America: John Wiley and Sons, 1991.
[2] A. Hero and O. Michel, “Asymptotic theory of greedy approximations to minimal k-point random graphs,” IEEE Trans. Information Theory, vol. 45, pp. 1921–1939, 1999.
[3] J. Costa and A. Hero, “Geodesic entropic graphs for dimension and entropy estimation in manifold learning,” IEEE Trans. Signal Processing, vol. 52, no. 8, pp. 2210–2221, 2004.
[4] J. Beirlant, E. Dudewicz, L. Györfi, and E. van der Meulen, “Nonparametric entropy estimation: An overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, pp. 17–39, 1997.
[5] N. Misra, H. Singh, and E. Demchuk, “Estimation of the entropy of a multivariate normal distribution,” Journal of Multivariate Analysis, vol. 92, pp. 324–342, 2005.
[6] N. A. Ahmed and D. V. Gokhale, “Entropy expressions and their estimators for multivariate distributions,” IEEE Trans. Information Theory, pp. 688–692, 1989.
[7] D. Endres and P. Földiák, “Bayesian bin distribution inference and mutual information,” IEEE Trans. Information Theory, pp. 3766–3779, 2005.
[8] M. Nilsson and W. B. Kleijn, “On the estimation of differential entropy from data located on embedded manifolds,” IEEE Trans. on Information Theory, vol. 53, no. 7, pp. 2330–2341, 2007.
[9] L. F. Kozachenko and N. N. Leonenko, “Sample estimate of entropy of a random vector,” Problems in Information Transmission, vol. 23, no. 1, pp. 95–101, 1987.
[10] A. Hero, B. Ma, O. Michel, and J. Gorman, “Applications of entropic spanning graphs,” IEEE Signal Processing Magazine, pp. 85–95, 2002.
[11] J. Beardwood, J. H. Halton, and J. M. Hammersley, “The shortest path through many points,” Proc. Cambridge Philo. Soc., vol. 55, pp. 299–327, 1959.
[12] S. Srivastava, M. R. Gupta, and B. A. Frigyik, “Bayesian quadratic discriminant analysis,” Journal of Machine Learning Research, vol. 8, pp. 1287–1314, 2007.
[13] M. Bilodeau and D. Brenner, Theory of Multivariate Statistics. New York: Springer Texts in Statistics, 1999.
[14] D. G. Keehn, “A note on learning for Gaussian properties,” IEEE Trans. on Information Theory, vol. 11, pp. 126–132, 1965.
[15] P. J. Brown, T. Fearn, and M. S. Haque, “Discrimination with many variables,” Journal of the American Statistical Association, vol. 94, no. 448, pp. 1320–1329, 1999.
[16] A. Banerjee, X. Guo, and H. Wang, “On the optimality of conditional expectation as a Bregman predictor,” IEEE Trans. on Information Theory, vol. 51, no. 7, pp. 2664–2669, 2005.
[17] B. A. Frigyik, S. Srivastava, and M. R. Gupta, “Functional Bregman divergence and Bayesian estimation of distributions,” arXiv preprint cs.IT/0611123, 2006.
... The differential entropy has a wide range of applications in many areas including coding, machine learning, signal processing, communications, biosciences and chemistry. See [14,15,16,17,18]. For example, in molecular biosciences, the evaluation of entropy of a molecular system is important for understanding its thermodynamic properties. ...
... In the conventional fixed dimensional case, estimation of the differential entropy has been considered by using both Bayesian and frequentist methods. See, for example, [18,14,21]. A Bayesian estimator was proposed in [14] using the inverse Wishart prior which works without the restriction that dimension is smaller than the sample size. ...
... See, for example, [18,14,21]. A Bayesian estimator was proposed in [14] using the inverse Wishart prior which works without the restriction that dimension is smaller than the sample size. However, how to choose good parameter values for the inverse Wishart prior remains an open question when the population covariance matrix is nondiagonal. ...
Article
Differential entropy and log determinant of the covariance matrix of a multivariate Gaussian distribution have many applications in coding, communications, signal processing and statistical inference. In this paper we consider in the high dimensional setting optimal estimation of the differential entropy and the log-determinant of the covariance matrix. We first establish a central limit theorem for the log determinant of the sample covariance matrix in the high dimensional setting where the dimension $p(n)$ can grow with the sample size $n$. An estimator of the differential entropy and the log determinant is then considered. Optimal rate of convergence is obtained. It is shown that in the case $p(n)/n \rightarrow 0$ the estimator is asymptotically sharp minimax. The ultra-high dimensional setting where $p(n) > n$ is also discussed.
... To illustrate this "curse of dimensionality" paradigm, we present one of the important consequences of this deviation in the study of the sample generalized variance log(|Σ|) appearing in many statistical methods (Quadratic Discriminant Analysis (Tharwat, 2016), hypothesis tests in multivariate statistic (Anderson et al., 1958), differential entropy in probability and information theory (Srivastava & Gupta, 2008), etc). These aspects are closely related to one of the topics covered in this thesis (Chapter 3). ...
Thesis
Machine Learning (ML) has been quite successful to solve many real-world applications going from supervised to unsupervised tasks due to the development of powerful algorithms (Support Vector Machine (SVM), Deep Neural Network, Spectral Clustering, etc). These algorithms are based on optimization schemes motivated by low dimensional intuitions which collapse in high dimension, a phenomenon known as the "curse of dimensionality''. Nonetheless, by assuming the data dimension and their number to be both large and comparable, Random Matrix Theory (RMT) provides a systematic approach to assess the (statistical) behavior of these large learning systems, to properly understand and improve them when applied to large dimensional data. Previous random matrix analyses (cf. Mai & Couillet, 2018 ; Liao & Couillet, 2019 ; Deng et al., 2019) have shown that asymptotic performances of most machine learning and signal processing methods depend only on first and second-order statistics (means and covariance matrices of the data). This makes covariance matrices extremely rich objects that need to be "well treated and understood". The thesis demonstrates first how poorly naive covariance matrix processing can destroy machine learning algorithms by introducing biases that are difficult to clean, whereas consistent random-matrix estimation of the functionals of interest avoids biases. We then exemplify how means and covariance matrix statistics of the data are sufficient (through simple functionals) to handle the statistical behavior of even quite involved algorithms of modern interest, such as multi-task and transfer learning methods. The large dimensional analysis allows furthermore for an improvement of multi-task and transfer learning schemes.
... Estimating differential entropy has long been an active area in Information Theory [22][23] [10] [8]. In principle, the differential entropy of multivariate Gaussian distribution can be represented by the logarithm determinant of its covariance matrix. ...
Full-text available
Preprint
Understanding the informative behaviour of deep neural networks is challenged by misused estimators and the complexity of network structure, which leads to inconsistent observations and diversified interpretation. Here we propose the LogDet estimator -- a reliable matrix-based entropy estimator that approximates Shannon differential entropy. We construct informative measurements based on LogDet estimator, verify our method with comparable experiments and utilize it to analyse neural network behaviour. Our results demonstrate the LogDet estimator overcomes the drawbacks that emerge from highly diverse and degenerated distribution thus is reliable to estimate entropy in neural networks. The Network analysis results also find a functional distinction between shallow and deeper layers, which can help understand the compression phenomenon in the Information bottleneck theory of neural networks.
... On the Bayesian side, [38] and [24] suggested a Bayes estimator for log-determinant of the covariance matrix of the multivariate normal. They proposed using the inverse-Wishart prior and showed that the posterior mean minimizes expected Bregman divergence. ...
Full-text available
Article
We obtain the optimal Bayesian minimax rate for the unconstrained large covariance matrix of multivariate normal sample with mean zero, when both the sample size, n, and the dimension, p, of the covariance matrix tend to infinity. Traditionally the posterior convergence rate is used to compare the frequentist asymptotic performance of priors, but defining the optimality with it is elusive. We propose a new decision theoretic framework for prior selection and define Bayesian minimax rate. Under the proposed framework, we obtain the optimal Bayesian minimax rate for the spectral norm for all rates of p. We also considered Frobenius norm, Bregman divergence and squared log-determinant loss and obtain the optimal Bayesian minimax rate under certain rate conditions on p. A simulation study is conducted to support the theoretical results.
... Knuth et al. (2006) developed tools to describe the degree of uncertainty in entropy estimates due to the number of bins. Despite these advances, the estimation of entropy remains an open problem (Nilsson and Kleijn 2007, Srivastava and Gupta 2008, Larson 2010. Calculation of entropy from flow data introduces a second aspect of discretisation, i.e. discrete sampling of flow and truncation of the flow measurement. ...
Full-text available
Article
This paper explores the use of entropy-based measures in catchment hydrology, and provides an importance-weighted numerical descriptor of the flow duration curve. Although entropy theory is being applied in a wide spectrum of areas (including environmental and water resources), artefacts arising from the discrete, under-sampled and uncertain nature of hydrological data are rarely acknowledged, and have not been adequately explored. Here, we examine challenges to extracting hydrologically meaningful entropy measures from a flow signal; the effect of binning resolution on calculation of entropy is investigated, along with artefacts caused by 1) emphasis of information theoretic measures towards flow ranges having more data (statistically dominant information), and 2) effects of discharge measurement truncation errors. We introduce an importance-weighted entropy-based measure to counter the tendency of common binning approaches to over-emphasise information contained in the low flows which dominate the record. The measure uses a novel binning method, and overcomes artefacts due to data resolution and under-sampling. Our analysis reveals a fundamental problem with the extraction of information at high flows, due to the lack of statistically significant samples in this range. By separating the flow duration curve into segments, our approach constrains the computed entropy to better respect distributional properties over the data range. When used as an objective function for model calibration, this approach constrains high flow predictions as well as the commonly used Nash-Sutcliffe efficiency, but provides much better predictions of low flow behaviour.
Full-text available
Article
Current approaches explore bacterial genes that change transcriptionally upon stress exposure as diagnostics to predict antibiotic sensitivity. However, transcriptional changes are often specific to a species or antibiotic, limiting implementation to known settings only. While a generalizable approach, predicting bacterial fitness independent of strain, species or type of stress, would eliminate such limitations, it is unclear whether a stress-response can be universally captured. By generating a multi-stress and species RNA-Seq and experimental evolution dataset, we highlight the strengths and limitations of existing gene-panel based methods. Subsequently, we build a generalizable method around the observation that global transcriptional disorder seems to be a common, low-fitness, stress response. We quantify this disorder using entropy, which is a specific measure of randomness, and find that in low fitness cases increasing entropy and transcriptional disorder results from a loss of regulatory gene-dependencies. Using entropy as a single feature, we show that fitness and quantitative antibiotic sensitivity predictions can be made that generalize well beyond training data. Furthermore, we validate entropy-based predictions in 7 species under antibiotic and non-antibiotic conditions. By demonstrating the feasibility of universal predictions of bacterial fitness, this work establishes the fundamentals for potentially new approaches in infectious disease diagnostics. Bacterial transcriptomic data have been used to predict antibiotic susceptibility in a species- or antibiotic-specific manner. Here, the authors show that global transcriptional disorder is a common stress response in bacteria with low fitness, and present a general approach that can predict bacterial fitness independently of species or type of stress.
Article
It is well known that when the dimension of the data becomes very large, the sample covariance matrix will not be a good estimator of the population covariance matrix . One typical consequence of such is that the estimated eigenvalues from will be distorted. Many existing methods tried to solve the problem, and examples of which include regularizing by thresholding or banding. In this paper, we estimate by maximizing the likelihood using a new penalization on the matrix logarithm of (denoted by ) of the form: , where is the -th eigenvalue of . This penalty aims at shrinking the estimated eigenvalues of toward the mean eigenvalue . The merits of our method are that it guarantees to be non-negative definite and is computational efficient. The simulation study and applications on portfolio optimization and classification of genomic data show that the proposed method outperforms existing methods.
Article
Maximum entropy models have become popular statistical models in neuroscience and other areas in biology, and can be useful tools for obtaining estimates of mutual information in biological systems. However, maximum entropy models fit to small data sets can be subject to sampling bias; i.e. the true entropy of the data can be severely underestimated. Here we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We show that if the data is generated by a distribution that lies in the model class, the bias is equal to the number of parameters divided by twice the number of observations. However, in practice, the true distribution is usually outside the model class, and we show here that this misspecification can lead to much larger bias. We provide a perturbative approximation of the maximally expected bias when the true model is out of model class, and we illustrate our results using numerical simulations of an Ising model; i.e. the second-order maximum entropy distribution on binary data.
Article
Here the concept of Doppler information geometry is summarized and introduced to evaluate the richness of Doppler velocity components of radar signals and applied for wake turbulence monitoring. With the methods of information geometry, we consider all the Toeplitz Hermitian positive definite covariance matrices of order n as a manifold Ω_n. Geometries of covariance matrices based on two kinds of radar data models are presented. Finally, a radar detector based on Doppler entropy assessment is analyzed and applied for wake turbulence monitoring. This advanced Doppler processing chain is also implemented by CUDA codes for GPU parallel computation.
Article
Many statistical methods for discriminant analysis do not adapt well or easily to situations where the number of variables is large, possibly even exceeding the number of cases in the training set. We explore a variety of methods for providing robust identification of future samples in this situation. We develop a range of flexible Bayesian methods, and primarily a new hierarchical covariance compromise method, akin to regularized discriminant analysis. Although the methods are much more widely applicable, the motivating problem was that of discriminating between groups of samples on the basis of their near-infrared spectra. Here the ability of the Bayesian methods to take account of continuity of the spectra may be beneficial. The spectra may consist of absorbances or reflectances at as many as 1,000 wavelengths, and yet there may be only tens or hundreds of training samples in which both sample spectrum and group identity are known. Such problems arise in the food and pharmaceutical industries; for example, authentication of foods (e.g., detecting the adulteration of orange juice) and identification of pharmaceutical ingredients. Our illustrating example concerns the discrimination of 39 microbiological taxa and 8 aggregate genera. Simulations also illustrate the effectiveness of the hierarchical Bayes covariance method. We discuss a number of scoring rules, both local and global, for judging the fit of data to the Bayesian models, and adopt a cross-classificatory approach for estimating hyperparameters.
Article
Our object in writing this book is to present the main results of the modern theory of multivariate statistics to an audience of advanced students who would appreciate a concise and mathematically rigorous treatment of that material. It is intended for use as a textbook by students taking a first graduate course in the subject, as well as for the general reference of interested research workers who will find, in a readable form, developments from recently published work on certain broad topics not otherwise easily accessible, as, for instance, robust inference (using adjusted likelihood ratio tests) and the use of the bootstrap in a multivariate setting. The references contain over 150 entries post-1982. The main development of the text is supplemented by over 135 problems, most of which are original with the authors. A minimum background expected of the reader would include at least two courses in mathematical statistics, and certainly some exposure to the calculus of several variables together with the descriptive geometry of linear
Article
An overview is given of the several methods in use for the nonparametric estimation of the differential entropy of a continuous random variable. The properties of various methods are compared. Several applications are given such as tests for goodness-of-fit, parameter estimation, quantization theory and spectral estimation.
Article
Let X_n = {x_1, ..., x_n} be an i.i.d. sample having multivariate distribution P. We derive a.s. limits for the power weighted edge weight function of greedy approximations to a class of minimal graphs spanning k of the n samples. The class includes minimal k-point graphs constructed by the partitioning method of Ravi, Sundaram, Marathe, Rosenkrantz and Ravi [43] where the edge weight function satisfies the quasi-additive property of Redmond and Yukich [45]. In particular this includes greedy approximations to the k-point minimal spanning tree (k-MST), Steiner tree (k-ST), and the traveling salesman problem (k-TSP). An expression for the influence function of the minimal weight function is given which characterizes the asymptotic sensitivity of the graph weight to perturbations in the underlying distribution. The influence function takes a form which indicates that the k-point minimal graph in d > 1 dimensions has robustness properties in ℝ^d which are analogous to those of rank ...
Article
We prove that the length of the shortest closed path through n points in a bounded plane region of area v is ‘almost always’ asymptotically proportional to √(nv) for large n; and we extend this result to bounded Lebesgue sets in k–dimensional Euclidean space. The constants of proportionality depend only upon the dimensionality of the space, and are independent of the shape of the region. We give numerical bounds for these constants for various values of k; and we estimate the constant in the particular case k = 2. The results are relevant to the travelling-salesman problem, Steiner's street network problem, and the Loberman—Weinberger wiring problem. They have possible generalizations in the direction of Plateau's problem and Douglas' problem.
Chapter
Information theory answers two fundamental questions in communication theory: what is the ultimate data compression (answer: the entropy H), and what is the ultimate transmission rate of communication (answer: the channel capacity C). For this reason some consider information theory to be a subset of communication theory. We will argue that it is much more. Indeed, it has fundamental contributions to make in statistical physics (thermodynamics), computer science (Kolmogorov complexity or algorithmic complexity), statistical inference (Occam's Razor: “The simplest explanation is best”) and to probability and statistics (error rates for optimal hypothesis testing and estimation). The relationship of information theory to other fields is discussed. Information theory intersects physics (statistical mechanics), mathematics (probability theory), electrical engineering (communication theory) and computer science (algorithmic complexity). We describe these areas of intersection in detail.