
# On Sub-Gaussian Concentration of Missing Mass


## Abstract

Statistical inference on the missing mass aims to estimate the weight of elements *not observed* during sampling. Since the pioneering work of Good and Turing, the problem has been studied in many areas including statistical linguistics, ecology, and machine learning. Proving the sub-gaussian behavior of the missing mass has been notoriously hard, and a number of complicated arguments have been proposed: logarithmic Sobolev inequalities, thermodynamic approaches, and information-theoretic transportation methods. Prior works have argued that the difficulty is inherent and that classical tools are inadequate. We show that this common belief is false: all that is needed to establish the sub-gaussian concentration is the classical inequality of Bernstein. The educational value of this work lies in demonstrating this inequality in its full generality, which is not well recognized by researchers.
Maciej Skorski
University of Luxembourg
maciej.skorski@uni.lu
Keywords: Missing Mass · Measure Concentration · Heterogenic Bernstein's Inequality
1 Introduction
1.1 Background
The missing mass problem is about estimating properties of elements that have
not occurred in a sample, as illustrated in Figure 1. The problem goes back
to the work of Good and Turing [13,14] on attacking Enigma codes. It was
later studied in statistical theory [23] and in several applied disciplines such
as ecology [25,7,8,6], quantitative linguistics [11,20,12], archaeology [21], network
design [4,5], information theory [26,27], and bio-molecular modeling [16,15].
To state the problem, let $\{X_1, \ldots, X_n\}$ be an iid sample from a discrete
distribution $X$, and denote $p_i = \Pr[X = i]$. Then the missing mass is defined as
$$M \triangleq \sum_{i \in \mathrm{supp}(X)} p_i \, \mathbb{I}(i \notin \{X_1, \ldots, X_n\}). \tag{1}$$
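As a minimal sketch of definition (1), the snippet below evaluates the missing mass of a given sample. The function name `missing_mass` and the three-point distribution are illustrative assumptions, not part of the paper.

```python
import random

def missing_mass(probs, sample):
    """Total probability of the outcomes that never appear in `sample`.

    `probs` maps each outcome i to p_i = Pr[X = i], as in definition (1).
    """
    seen = set(sample)
    return sum(p for i, p in probs.items() if i not in seen)

# Hypothetical three-point distribution, chosen only for illustration.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

# Deterministic example: "c" was never drawn, so M equals p_c.
print(missing_mass(probs, ["a", "b", "a", "a"]))

# Random example: draw n = 5 iid samples and evaluate M on them.
rng = random.Random(0)
sample = rng.choices(list(probs), weights=list(probs.values()), k=5)
print(sample, missing_mass(probs, sample))
```

The missing mass depends on the sample only through the set of observed outcomes, which is why a set lookup suffices.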
The fundamental problem is to establish the concentration properties of (1),
that is, to prove that $M \approx \mathbb{E}[M]$ with high probability. However, the missing mass
is a sum of (weakly correlated) components of different magnitudes. This made
the task very challenging for prior works, which developed fairly complicated
arguments. This paper tackles the following challenge:

Obtain sub-gaussian property for missing mass by classical inequalities.

Fig. 1: The true and empirical frequency from 100 samples, for the distribution
Poiss(10). The value of 12 has not occurred in the sample (see Appendix A).
In response to this challenge, we demonstrate that Bernstein's inequality is
sufficient to prove the sub-gaussian concentration of the missing mass [18]:

Theorem 1. For some constant $K > 0$ and any $\epsilon > 0$ it holds that:
$$\max\{\Pr[M - \mathbb{E}M < -\epsilon],\ \Pr[M - \mathbb{E}M > \epsilon]\} \leqslant e^{-K n \epsilon^2}. \tag{2}$$
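The statement of Theorem 1 can be illustrated with a small Monte Carlo experiment. The sketch below is only an empirical illustration under assumed parameters (a uniform distribution on 50 symbols, 1000 trials, and the helper names `expected_missing_mass` and `tail_frequency`); it does not estimate the constant $K$.

```python
import random

def expected_missing_mass(probs, n):
    """Exact E[M] = sum_i p_i * (1 - p_i)**n."""
    return sum(p * (1 - p) ** n for p in probs)

def tail_frequency(probs, n, eps, trials=1000, seed=0):
    """Fraction of simulated samples of size n with |M - E[M]| > eps."""
    rng = random.Random(seed)
    symbols = range(len(probs))
    mean = expected_missing_mass(probs, n)
    exceed = 0
    for _ in range(trials):
        seen = set(rng.choices(symbols, weights=probs, k=n))
        m = sum(p for i, p in enumerate(probs) if i not in seen)
        exceed += abs(m - mean) > eps
    return exceed / trials

probs = [1 / 50] * 50  # hypothetical uniform distribution on 50 symbols
for n in (50, 100, 200, 400):
    print(n, tail_frequency(probs, n, eps=0.1))
```

For a fixed deviation, the observed frequencies should shrink quickly as the sample size grows, in line with the exponent $n\epsilon^2$ in (2).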
1.2 Related Work
Up to a constant, this is the best gaussian-like concentration we can have, and the best
oblivious exponential tail bound (that is, when no structural information about $X$
is used)¹. The result was first proved in [19] by a somewhat intricate approach.
The authors argued in the subsequent work [18] that the standard inequalities
of Hoeffding, Angluin-Valiant, Bernstein and Bennett are inadequate to obtain
the theorem; as a remedy they developed a statistical thermodynamic approach,
continued later in [17]. The work [2] took another path, using sharp log-Sobolev
inequalities for Bernoulli distributions. Finally, [22] gave yet another alternative
argument utilizing transportation inequalities from information theory.

¹ There exist distribution-dependent bounds [1] with sub-gamma tails; with no specific
information about the distribution they are not better than the sub-gaussian bound.
2 Results
2.1 Proof with Bernstein’s Inequality
To prove Theorem 1, we use just the classical Bernstein inequality with "heterogenic"
variances [3,9]. Surprisingly, this inequality has never been used in
prior works on the sub-gaussian concentration of the missing mass. We believe
that our alternative proof is of interest and of much educational value, given the
widespread belief in the insufficiency of classical inequalities.
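Lemma 1 (General Bernstein Inequality). Suppose that independent random
variables $\{Z_i\}_i$ satisfy the following Bernstein condition
$$\mathbb{E}\big[|Z_i - \mathbb{E}[Z_i]|^d\big] \leqslant \frac{d!\,\sigma_i^2\, c^{d-2}}{2} \tag{3}$$
for each integer $d \geqslant 2$ with some positive constants $\sigma_i$ and $c$. Then for $Z = \sum_i Z_i$
and $\sigma^2 = \sum_i \sigma_i^2$ it holds that:
$$\mathbb{E}[\exp(t(Z - \mathbb{E}[Z]))] \leqslant \exp\left(\frac{\sigma^2 t^2}{2(1 - c|t|)}\right), \quad |t| < 1/c. \tag{4}$$
In particular, the following tail bound is valid for any $\epsilon > 0$:
$$\max\{\Pr[Z - \mathbb{E}[Z] < -\epsilon],\ \Pr[Z - \mathbb{E}[Z] > \epsilon]\} \leqslant \exp\left(-\frac{\epsilon^2}{2\sigma^2 + 2c\epsilon}\right). \tag{5}$$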
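For completeness, here is one standard way to pass from the moment generating function bound (4) to the tail bound (5), via the Chernoff method. The choice of $t$ below is a convenient one rather than the optimal one, and the derivation is a sketch for the upper tail with $0 < t < 1/c$; the lower tail follows by applying the same argument to $-Z_i$, which satisfy the same condition (3).

```latex
\begin{align*}
% Markov's inequality applied to exp(t (Z - E[Z])), then the bound (4):
\Pr[Z - \mathbb{E}[Z] > \epsilon]
  &\leqslant e^{-t\epsilon}\, \mathbb{E}\big[e^{t(Z - \mathbb{E}[Z])}\big]
   \leqslant \exp\Big(-t\epsilon + \frac{\sigma^2 t^2}{2(1 - ct)}\Big). \\
% Choose t = eps / (sigma^2 + c eps), so that 0 < t < 1/c and
% 1 - ct = sigma^2 / (sigma^2 + c eps):
-t\epsilon + \frac{\sigma^2 t^2}{2(1 - ct)}
  &= -\frac{\epsilon^2}{\sigma^2 + c\epsilon}
     + \frac{\epsilon^2}{2(\sigma^2 + c\epsilon)}
   = -\frac{\epsilon^2}{2\sigma^2 + 2c\epsilon}.
\end{align*}
```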
3 Proof of Theorem 1
3.1 Step 1: Establishing Negative Dependency
Denote $M_i \triangleq p_i\, \mathbb{I}(i \notin \{X_1, \ldots, X_n\})$, so that $M = \sum_i M_i$. These random
variables are correlated, but one can observe that the dependency is negative [10].
Roughly speaking, this means that we can apply standard concentration bounds
developed for independent random variables. The intuition is that negative
dependency can only help in probability concentration.

This fact is considered standard in the analysis of the missing mass problem
[18,2], but we sketch the argument for the reader's convenience.

Let $\xi_{i,k} = \mathbb{I}(X_k = i)$ (that is, the $i$-th element occurs in the $k$-th sample).
Then $\{\xi_{i,k}\}_i$, with fixed $k$, are 0-1 random variables with the property
that $\sum_i \xi_{i,k} = 1$, hence they are negatively dependent (the "0-1 law", see [10]).
Furthermore, the collections $\{\xi_{i,k}\}_i$ are independent for different $k$, so the whole
collection $\{\xi_{i,k}\}_{i,k}$ is negatively dependent (the "augmentation law", see [10]).
Since $M_i = p_i \cdot (1 - \max_k \xi_{i,k})$, we have that $M_i = f_i((\xi_{i,k})_{i,k})$ for coordinate-wise
decreasing functions $f_i$, each depending only on the variables with first index $i$;
thus the $M_i$ are negatively dependent ("co-monotone transforms preserve negative
dependency", see [10]). Applying again the property of
co-monotone transforms, we see that $\{M_i - \mathbb{E}[M_i]\}_i$ are negatively dependent.
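A simple consequence of negative dependency is that distinct components are negatively correlated, which can be checked in closed form: both symbols $i \neq j$ are unseen with probability $(1 - p_i - p_j)^n$. The sketch below evaluates the exact covariances for a hypothetical four-point distribution. Note that negative covariance is weaker than the negative dependency established above, so this is a sanity check rather than a proof.

```python
from itertools import combinations

def covariance(p_i, p_j, n):
    """Exact Cov(M_i, M_j) for two distinct symbols with probabilities p_i, p_j."""
    # E[M_i * M_j]: both symbols are missing from all n draws.
    joint = p_i * p_j * (1 - p_i - p_j) ** n
    # E[M_i] * E[M_j]: each symbol is missing from all n draws, separately.
    product = p_i * (1 - p_i) ** n * p_j * (1 - p_j) ** n
    return joint - product

probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical distribution
n = 10
for (i, p_i), (j, p_j) in combinations(enumerate(probs), 2):
    print(i, j, covariance(p_i, p_j, n))
```

The sign follows from $(1 - p_i)(1 - p_j) \geqslant 1 - p_i - p_j$, so every printed value is non-positive.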
3.2 Step 2: Majorizing by IID Sum
Let $Z_i$ have the same marginal distributions as $M_i$, but be independent. Then
$$\mathbb{E}\Big[f\Big(\sum_i (M_i - \mathbb{E}[M_i])\Big)\Big] \leqslant \mathbb{E}\Big[f\Big(\sum_i (Z_i - \mathbb{E}[Z_i])\Big)\Big] \tag{6}$$
holds for any convex real function $f$. This is the well-known convex majorization
property of negatively-dependent random variables [24].
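Inequality (6) can be verified exactly on a tiny instance for the convex function $f(x) = \exp(tx)$, which is the case used later in the proof. The sketch below enumerates all samples of a hypothetical three-point distribution with $n = 4$; the function names and parameters are assumptions made for illustration.

```python
from itertools import product
from math import exp, prod

def mgf_missing_mass(probs, n, t):
    """Exact E[exp(t * (M - E[M]))] by enumerating all len(probs)**n samples."""
    mean = sum(p * (1 - p) ** n for p in probs)
    total = 0.0
    for sample in product(range(len(probs)), repeat=n):
        weight = prod(probs[x] for x in sample)  # probability of this sample
        seen = set(sample)
        m = sum(p for i, p in enumerate(probs) if i not in seen)
        total += weight * exp(t * (m - mean))
    return total

def mgf_independent(probs, n, t):
    """Same quantity for independent Z_i with the marginals of M_i."""
    result = 1.0
    for p in probs:
        q = (1 - p) ** n  # Pr[Z_i = p_i]; otherwise Z_i = 0, and E[Z_i] = p * q
        result *= q * exp(t * (p - p * q)) + (1 - q) * exp(t * (0 - p * q))
    return result

probs = [0.5, 0.3, 0.2]  # hypothetical distribution
n = 4
for t in (-3.0, -1.0, 1.0, 3.0):
    print(t, mgf_missing_mass(probs, n, t), mgf_independent(probs, n, t))
```

In each printed row the first value (dependent components) should not exceed the second (independent components), for both signs of $t$.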
3.3 Step 3: Bernstein’s Condition
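By the definition of $M_i$:
$$\mathbb{E}[M_i^d] = p_i^d (1 - p_i)^n. \tag{7}$$
We will use the following fact: the expression $z^a (1 - z)^b$ with fixed $a, b > 0$
for $z \in (0,1)$ is maximized at $z = \frac{a}{a+b}$ (to avoid the derivative test, one can
notice that the expression is proportional to the density of the beta distribution
with parameters $a + 1, b + 1$, whose mode is at $\frac{a}{a+b}$). Writing
$p_i^d (1 - p_i)^n = p_i \cdot p_i (1 - p_i)^{n/2} \cdot p_i^{d-2} (1 - p_i)^{n/2}$
for $d \geqslant 2$ and applying this fact twice we obtain:
$$\mathbb{E}[M_i^d] \leqslant p_i \cdot O(1/n) \cdot O(d/n)^{d-2}. \tag{8}$$
Since $\mathbb{E}[|M_i - \mathbb{E}[M_i]|^d] \leqslant 2^d\, \mathbb{E}[|M_i|^d]$, we also have
$$\mathbb{E}[|M_i - \mathbb{E}[M_i]|^d] \leqslant p_i \cdot O(1/n) \cdot O(d/n)^{d-2}, \tag{9}$$
with the constant under $O(\cdot)$ changed by a factor of 2. Because
$d^{d-2} \leqslant d^d \leqslant e^d\, d!$, we have $O(d/n)^{d-2} \leqslant O(1) \cdot d! \cdot O(1/n)^{d-2}$.
This proves that $M_i$, and hence also $Z_i$, satisfies the Bernstein condition with
$$\sigma_i^2 = O(p_i/n), \qquad c = O(1/n). \tag{10}$$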
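The bound (8) can be made concrete. Applying the maximization fact with $(a, b) = (1, n/2)$ gives $p_i (1 - p_i)^{n/2} \leqslant 2/n$, and with $(a, b) = (d-2, n/2)$ gives $p_i^{d-2} (1 - p_i)^{n/2} \leqslant (2d/n)^{d-2}$. The explicit constants are assumptions of this sketch (the paper only states $O(\cdot)$ bounds); the code checks the resulting inequality numerically on a grid.

```python
def moment(p, n, d):
    """E[M_i^d] = p^d * (1 - p)^n for a symbol of probability p, as in (7)."""
    return p ** d * (1 - p) ** n

def moment_bound(p, n, d):
    """p * (2/n) * (2d/n)^(d-2): one explicit instance of the bound (8)."""
    return p * (2 / n) * (2 * d / n) ** (d - 2)

grid = [k / 1000 for k in range(1, 1000)]  # p ranging over (0, 1)
worst = max(
    moment(p, n, d) / moment_bound(p, n, d)
    for n in (10, 100, 1000)
    for d in range(2, 8)
    for p in grid
)
print(worst)  # largest observed ratio; the bound holds on the grid if it is at most 1
```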
3.4 Step 4: Bernstein’s Inequality
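Let $Z = \sum_i Z_i$. From the previous step and Lemma 1 we obtain:
$$\mathbb{E}\exp\big(t \cdot (Z - \mathbb{E}[Z])\big) \leqslant \exp\left(\frac{t^2 \sigma^2}{2(1 - c|t|)}\right), \quad |t| < 1/c, \tag{11}$$
with $\sigma^2 = \sum_i \sigma_i^2 = O(1/n) \sum_i p_i = O(1/n)$, since $\sum_i p_i = 1$, and $c = O(1/n)$.

By (6), applied to the convex function $f(x) = \exp(tx)$, this also holds when $Z$ is
replaced by $M$. Thus, the bound from
Lemma 1 holds for $M$ with $\sigma^2 = O(1/n)$ and $c = O(1/n)$. Observe that for the
missing mass $0 \leqslant M \leqslant 1$, so the bound in Theorem 1 trivially holds when $\epsilon > 1$.
When $0 < \epsilon \leqslant 1$ we have $\sigma^2 + c\epsilon \leqslant \sigma^2 + c = O(1/n)$. Thus we get
$$\Pr[\pm(M - \mathbb{E}[M]) > \epsilon] \leqslant \exp(-\Omega(n\epsilon^2)), \tag{12}$$
which finishes the proof.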
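To see how the constant $K$ of Theorem 1 arises, the last step can be written with explicit constants. The names $C_1$ and $C_2$ are hypothetical: they stand for the constants hidden in $\sigma^2 = O(1/n)$ and $c = O(1/n)$, which the paper does not specify.

```latex
\begin{align*}
% Assume sigma^2 <= C_1 / n and c <= C_2 / n, and let 0 < eps <= 1.
2\sigma^2 + 2c\epsilon
  &\leqslant \frac{2C_1}{n} + \frac{2C_2\,\epsilon}{n}
   \leqslant \frac{2(C_1 + C_2)}{n}, \\
% Plugging this into the tail bound (5) of Lemma 1:
\exp\Big(-\frac{\epsilon^2}{2\sigma^2 + 2c\epsilon}\Big)
  &\leqslant \exp\Big(-\frac{n\epsilon^2}{2(C_1 + C_2)}\Big),
  \qquad\text{so one may take } K = \frac{1}{2(C_1 + C_2)}.
\end{align*}
```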
4 Conclusion
We have shown how to obtain sub-gaussian concentration for the missing mass,
using a classical inequality. This solves the challenge set by prior works.
References
1. Ben-Hamou, A., Boucheron, S., Ohannessian, M.I., et al.: Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 23(1), 249–287 (2017)
2. Berend, D., Kontorovich, A.: On the concentration of the missing mass. Electron. Commun. Probab. 18, 7 pp. (2013)
3. Bernstein, S.: The theory of probabilities (1946)
4. Budianu, C., Tong, L.: Estimation of the number of operating sensors in sensor network. In: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. vol. 2, pp. 1728–1732. IEEE (2003)
5. Budianu, C., Tong, L.: Good-Turing estimation of the number of operating sensors: a large deviations analysis. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. vol. 2, pp. ii–1029. IEEE (2004)
6. Chao, A., Colwell, R.K., Chiu, C.H., Townsend, D.: Seen once or more than once: Applying Good–Turing theory to estimate species richness using only unique observations and a species list. Methods in Ecology and Evolution 8(10), 1221–1232 (2017)
7. Chao, A., Shen, T.J.: Nonparametric estimation of Shannon's index of diversity when there are unseen species in sample. Environmental and Ecological Statistics 10(4), 429–443 (2003)
8. Chao, A., Wang, Y., Jost, L.: Entropy and the species accumulation curve: a novel entropy estimator via discovery rates of new species. Methods in Ecology and Evolution 4(11), 1091–1100 (2013)
9. Craig, C.C.: On the Tchebychef inequality of Bernstein. The Annals of Mathematical Statistics 4(2), 94–102 (1933)
10. Dubhashi, D.P., Ranjan, D.: Balls and bins: A study in negative dependence. BRICS Report Series 3(25) (1996)
11. Efron, B., Thisted, R.: Estimating the number of unseen species: How many words did Shakespeare know? Biometrika 63(3), 435–447 (1976)
12. Gale, W.A., Sampson, G.: Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics 2(3), 217–237 (1995)
13. Good, I.J.: The population frequencies of species and the estimation of population parameters. Biometrika 40(3-4), 237–264 (1953)
14. Good, I.J.: Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. J Stat Comput Simul 66(2) (2000)
15. Koukos, P.I., Glykos, N.M.: On the application of Good-Turing statistics to quantify convergence of biomolecular simulations. Journal of Chemical Information and Modeling 54(1), 209–217 (2014)
16. Mao, C.X., Lindsay, B.G.: A Poisson model for the coverage problem with a genomic application. Biometrika 89(3), 669–682 (2002)
17. Maurer, A., et al.: Thermodynamics and concentration. Bernoulli 18(2), 434–454 (2012)
18. McAllester, D., Ortiz, L.: Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research 4(Oct), 895–911 (2003)
19. McAllester, D.A., Schapire, R.E.: On the convergence rate of Good-Turing estimators. In: COLT. pp. 1–6 (2000)
20. McNeil, D.R.: Estimating an author's vocabulary. Journal of the American Statistical Association 68(341), 92–96 (1973)
21. Myrberg Burström, N.: A tale of buried treasure, some good estimations, and golden unicorns: The numismatic connections of Alan Turing. (2015)
22. Raginsky, M., Sason, I.: Concentration of measure inequalities in information theory, communications and coding. arXiv preprint arXiv:1212.4663 (2012)
23. Robbins, H.E., et al.: Estimating the total probability of the unobserved outcomes of an experiment. The Annals of Mathematical Statistics 39(1), 256–257 (1968)
24. Shao, Q.M.: A comparison theorem on moment inequalities between negatively associated and independent random variables. J. Theor. Probab 13(2), 343–356 (2000)
25. Shen, T.J., Chao, A., Lin, C.F.: Predicting the number of new species in further taxonomic sampling. Ecology 84(3), 798–804 (2003)
26. Vu, V.Q., Yu, B., Kass, R.E.: Coverage-adjusted entropy estimation. Statistics in Medicine 26(21), 4039–4060 (2007)
27. Zhang, Z.: Entropy estimation in Turing's perspective. Neural Computation 24(5), 1368–1389 (2012)
A Missing Mass Experiment
    from scipy.stats import poisson
    from matplotlib import pyplot as plt
    import numpy as np

    np.random.seed(0)
    n = 100      # sample size
    mu = 10      # mean of the Poisson distribution
    x_miss = 12  # value that should be missing from the sample

    # find the missing mass event: samples of size n that never hit x_miss
    hit = False
    while not hit:
        X = poisson(mu=mu).rvs(size=(100000, n))
        idxs = (X == x_miss).sum(1) == 0
        # X[idxs][2] below needs at least three such samples
        hit = idxs.sum() >= 3

    # plot empirical distribution
    heights, bins = np.histogram(X[idxs][2], bins=np.arange(1, 20))
    heights = heights / sum(heights)
    bins = bins[:-1]  # left bin edges, one per histogram bar
    plt.figure(figsize=(10, 10))
    plt.bar(bins, heights, label='empirical frequency')

    # plot true distribution, basic style
    heights = poisson(mu=mu).pmf(bins)
    plt.plot(bins, heights, color='orange', label='true frequency')
    plt.legend()
    plt.show()