Algorithms for Gaussian Bandwidth
Selection in Kernel Density Estimators
José Miguel Leiva Murillo and Antonio Artés Rodríguez
Department of Signal Theory and Communications,
Universidad Carlos III de Madrid
E-mail: {leiva,antonio}@ieee.org.
Abstract
In this paper we study the classical statistical problem of choosing an appropriate bandwidth for Kernel Density Estimators. For the special case of the Gaussian kernel, two algorithms are proposed: one for a spherical covariance matrix and one for the general case. These methods avoid the unsatisfactory procedure of tuning the bandwidth while evaluating the likelihood, which is impractical with multivariate data in the general case. The convergence conditions are provided together with the proposed algorithms. We measure the accuracy of the obtained models by a set of classification experiments.
1 Introduction
A Kernel Density Estimator (KDE) is a non-parametric Probability Density Function (PDF) model that consists of a linear combination of kernel functions centered on the training data $\{x_i\}_{i=1,\ldots,N}$, i.e.:

$$\hat p_\theta(x) = \frac{1}{N}\sum_{i=1}^{N} k_\theta(x - x_i) \qquad (1)$$

where $k_\theta(x)$ is the kernel function, which must be unitary, i.e. $\int k_\theta(x)\,dx = 1$, and $x \in \mathbb{R}^D$. Although KDEs are commonly considered non-parametric models, the kernel function is characterized by a bandwidth that determines the accuracy of the model: $\hat p_\theta(x) = \hat p(x\mid\theta)$. Kernels that are too narrow or too wide lead to overfitted or underfitted models, respectively.
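For reference, the following is a minimal sketch of the estimator in Eq. (1) with a spherical Gaussian kernel, assuming NumPy; the names `gaussian_kernel` and `kde_eval` are ours, not from the paper:

```python
import numpy as np

def gaussian_kernel(u, sigma2):
    """Unit-integral Gaussian kernel on R^D with spherical covariance sigma2 * I."""
    D = u.shape[-1]
    norm = (2 * np.pi * sigma2) ** (-D / 2)
    return norm * np.exp(-0.5 * np.sum(u * u, axis=-1) / sigma2)

def kde_eval(x, X, sigma2):
    """Evaluate the KDE of Eq. (1) at a point x, given training data X (N x D)."""
    return np.mean(gaussian_kernel(x - X, sigma2))
```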
Classical bandwidth selection methods have mainly focused on the unidimensional case. In [1], several first- and second-generation methods are compiled. Some examples of first-generation criteria are the Mean Square Error (MSE), the Mean Integrated Squared Error (MISE), and the asymptotic MISE (AMISE) [1], [2]. Second-generation methods include plug-in techniques and bootstrap methods. The Kullback-Leibler divergence has also been considered [3].
We are interested in the Maximum-Likelihood (ML) criterion. Cross-validation allows us to apply the ML criterion so that a model built from $N-1$ samples is evaluated on the left-out point. The model evaluated on each training sample has the form:

$$\hat p_\theta(x_i) = \frac{1}{N-1}\sum_{\substack{j=1 \\ j\neq i}}^{N} G(x_i - x_j \mid \theta) \qquad (2)$$
where we make explicit the use of a Gaussian kernel. This framework was first proposed in [4] and later studied by other authors [2], [5]. However, these studies lack a closed optimization procedure, so the bandwidth $\sigma^2$ is obtained by a greedy search over its possible values. Besides, the multivariate case is only considered in these previous works under a spherical kernel assumption. In this paper, we propose two algorithms that overcome these difficulties.
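For concreteness, the leave-one-out log-likelihood built from Eq. (2) can be computed as in the following sketch for the spherical Gaussian kernel (NumPy assumed; the function name is ours):

```python
import numpy as np

def loo_log_likelihood(X, sigma2):
    """Leave-one-out log-likelihood from Eq. (2), spherical Gaussian kernel."""
    N, D = X.shape
    # Pairwise squared distances d2[i, j] = ||x_i - x_j||^2.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = (2 * np.pi * sigma2) ** (-D / 2) * np.exp(-0.5 * d2 / sigma2)
    np.fill_diagonal(G, 0.0)           # exclude the j = i term
    p_loo = G.sum(axis=1) / (N - 1)    # \hat p_theta(x_i), Eq. (2)
    return np.sum(np.log(p_loo))
```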
In a multidimensional Gaussian kernel, the set of parameters consists of the covariance matrix of the Gaussian. In the following, we consider two different degrees of complexity assumed for this matrix: a spherical shape, so that $C = \sigma^2 I_D$ (only one parameter to adjust), and an unconstrained kernel, in which a general form is considered for $C$, with $D(D+1)/2$ parameters.
Sections 2 and 3 describe the bandwidth optimization for the two cases mentioned, as presented in [6], and establish their convergence conditions. Some classification experiments are presented in Section 4 to measure the accuracy of the models. Section 5 closes the paper with the most important conclusions.
2 The spherical case
The expression for the kernel function is, for the spherical case:

$$G_{ij}(\sigma^2) = G(x_i - x_j \mid \sigma^2) = (2\pi)^{-D/2}\,\sigma^{-D}\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
We want to find the $\sigma$ that maximizes the log-likelihood $\log L(X\mid\sigma^2) = \sum_i \log\hat p_\theta(x_i)$. The derivative of this likelihood is:

$$\frac{\partial}{\partial\sigma}\log L(X\mid\sigma^2) = \frac{1}{N-1}\sum_i \frac{1}{\hat p(x_i)}\sum_{j\neq i}\left(\frac{\|x_i - x_j\|^2}{\sigma^3} - \frac{D}{\sigma}\right) G_{ij}(\sigma^2)$$
We now search for the point that makes the derivative null:

$$\sum_i \frac{1}{\hat p(x_i)}\sum_{j\neq i}\frac{\|x_i - x_j\|^2}{\sigma^3}\, G_{ij}(\sigma^2) \;=\; \sum_i \frac{1}{\hat p(x_i)}\,\frac{D}{\sigma}\sum_{j\neq i} G_{ij}(\sigma^2) \;=\; \frac{N(N-1)D}{\sigma}$$

The second equality follows from the fact that, by definition, $\sum_{j\neq i} G_{ij} = (N-1)\,\hat p(x_i)$. Then we obtain the following fixed-point algorithm:
$$\sigma^2_{t+1} = \frac{1}{N(N-1)D}\sum_i \frac{1}{\hat p_t(x_i)}\sum_{j\neq i}\|x_i - x_j\|^2\, G_{ij}(\sigma^2_t) \qquad (3)$$

where $\hat p_t$ denotes the KDE obtained in iteration $t$, i.e. the one that makes use of the bandwidth $\sigma^2_t$.
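For illustration, a direct implementation of the iteration (3) might look as follows (a minimal sketch assuming NumPy; the function name and stopping rule are ours). The starting value is taken at the lower end of the bracketing interval established by Theorem 1 below:

```python
import numpy as np

def spherical_bandwidth(X, n_iter=100, tol=1e-8):
    """Fixed-point iteration of Eq. (3) for the spherical bandwidth sigma^2."""
    N, D = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Initialize at d2_NN / D, the lower end of the interval in Theorem 1.
    np.fill_diagonal(d2, np.inf)
    sigma2 = np.mean(d2.min(axis=1)) / D
    np.fill_diagonal(d2, 0.0)
    for _ in range(n_iter):
        G = np.exp(-0.5 * d2 / sigma2)   # (2 pi sigma^2)^(-D/2) cancels in the ratio
        np.fill_diagonal(G, 0.0)
        p = G.sum(axis=1)                # proportional to (N - 1) p_t(x_i)
        sigma2_new = np.sum((G * d2).sum(axis=1) / p) / (N * D)
        converged = abs(sigma2_new - sigma2) < tol * sigma2
        sigma2 = sigma2_new
        if converged:
            break
    return sigma2
```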
We prove the convergence of the algorithm in (3) by means of the following convergence theorem:

Theorem 1. There is a fixed point in the interval $\left(\frac{d^2_{NN}}{D},\, \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right)$, where $d^2_{NN}$ is the mean squared distance to the nearest neighbor and $\Sigma_x$ is the covariance matrix of $x$. Besides, the fixed point is unique and the algorithm converges to it in the mentioned interval if the following condition holds:

$$\frac{1}{2\sigma^4 N(N-1)^2 D}\sum_i \frac{1}{\hat p(x_i)^2}\sum_{j\neq i}\sum_{k\neq i,j}\left(d^2_{ij} - d^2_{ik}\right)^2 \exp\left(-\frac{d^2_{ij} + d^2_{ik}}{2\sigma^2}\right) < 1 \qquad (4)$$

where $d^2_{ij} = \|x_i - x_j\|^2$.
Proof 1. Let $g(\sigma^2)$ denote the right-hand side of (3), whose fixed point $\sigma^2 = g(\sigma^2)$ is to be obtained. The proof of the existence of the fixed point is based on the search for an interval $(a, b)$ such that $a < g(\sigma^2) < b$ if $\sigma^2 \in (a, b)$.

In order to demonstrate that the interval $\left(\frac{d^2_{NN}}{D},\, \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right)$ satisfies that condition, we need to prove these three facts:

1. $g\!\left(\frac{d^2_{NN}}{D}\right) > \frac{d^2_{NN}}{D}$;

2. $g\!\left(\frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right) < \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}$;

3. $g(\sigma^2)$ is monotonic in the interval.

In this way we are guaranteed that there is at least one crossing point between the function $g(\sigma^2)$ and the line $g(\sigma^2) = \sigma^2$.
To prove the first point, we rewrite (3) as:

$$g(\sigma^2) = \frac{1}{ND}\sum_i\sum_{j\neq i} d^2_{ij}\,\frac{1}{1 + \sum_{k\neq i,j}\exp\left(\frac{d^2_{ij} - d^2_{ik}}{2\sigma^2}\right)} \qquad (5)$$
The limit at $0$ is given by:

$$\lim_{\sigma^2\to 0} g(\sigma^2) = \frac{1}{ND}\sum_i \min_{j\neq i} d^2_{ij} = \frac{d^2_{NN}}{D}$$

because, in the limit, every term of the sum in (5) vanishes except those for which $d^2_{ij} < d^2_{ik}\ \forall k\neq j$, i.e. those where $j$ is the nearest neighbor of $i$. The first point is thus proved, because $\frac{d^2_{NN}}{D}$ is the minimum value that $g(\sigma^2)$ can reach.
To prove the second point, we take the limit at infinity:

$$\lim_{\sigma^2\to\infty} g(\sigma^2) = \lim_{\sigma^2\to\infty}\frac{1}{ND}\sum_i\sum_{j\neq i} d^2_{ij}\,\frac{\exp\left(-\frac{d^2_{ij}}{2\sigma^2}\right)}{\sum_{k\neq i}\exp\left(-\frac{d^2_{ik}}{2\sigma^2}\right)} = \frac{1}{ND}\sum_i\sum_{j\neq i} d^2_{ij}\,\frac{1}{N-1}$$
The sum of distances may be expressed in terms of expectations, as:

$$\frac{1}{N(N-1)}\sum_i\sum_{j\neq i} d^2_{ij} = E_{i,j}\{(x_i - x_j)^T(x_i - x_j)\} = 2E_i\{x_i^T x_i\} - 2\mu_x^T\mu_x$$

According to a property of linear algebra, if $\mu_x = E_x\{x\}$ and $\Sigma_x = E_x\{xx^T\} - \mu_x\mu_x^T$, then $E\{x^Tx\} = \mathrm{tr}\{\Sigma_x\} + \mu_x^T\mu_x$, so that:

$$2E\{x^Tx\} - 2\mu_x^T\mu_x = 2\,\mathrm{tr}\{\Sigma_x\}$$
We obtain:

$$\lim_{\sigma^2\to\infty} g(\sigma^2) = \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}$$

The second point is then proved, since the maximum value of $g(\sigma^2)$ is reached at infinity.
To demonstrate the last point, we compute the derivative $g'(\sigma^2)$ and check that it is positive:

$$\frac{dg(\sigma^2)}{d\sigma^2} = \frac{1}{2\sigma^4 ND}\sum_i\sum_{j\neq i} d^2_{ij}\,\frac{\sum_{k\neq i}\left(d^2_{ij} - d^2_{ik}\right)\exp\left(-\frac{d^2_{ij}+d^2_{ik}}{2\sigma^2}\right)}{\left(\sum_{l\neq i}\exp\left(-\frac{d^2_{il}}{2\sigma^2}\right)\right)^2}$$

$$= \frac{1}{2\sigma^4 N(N-1)^2 D}\sum_i \frac{1}{\hat p(x_i)^2}\sum_{j\neq i}\sum_{k\neq i,j}\left(d^2_{ij} - d^2_{ik}\right)^2\exp\left(-\frac{d^2_{ij}+d^2_{ik}}{2\sigma^2}\right) \;\geq\; 0 \qquad (6)$$
The existence of a unique fixed point is then proved. To demonstrate the convergence of the algorithm in such an interval, we need to check the condition $|g'(\sigma^2)| < 1$ [7]. In that case, we are guaranteed that only one crossing point between $g(\sigma^2)$ and the line $g(\sigma^2) = \sigma^2$ exists. The convergence condition (4) states that the value of (6) is less than 1.
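As a practical complement, the derivative (6) can be evaluated numerically to verify the convergence condition (4) at a candidate bandwidth. The sketch below (NumPy assumed; the function name is ours) uses the first line of (6), where the kernel normalization constants cancel in the ratio:

```python
import numpy as np

def g_prime(X, sigma2):
    """Numerical value of g'(sigma^2) from the first line of Eq. (6)."""
    N, D = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    E = np.exp(-0.5 * d2 / sigma2)
    np.fill_diagonal(E, 0.0)
    S = E.sum(axis=1)    # sum over l != i of exp(-d2_il / (2 sigma^2))
    # sum_j sum_k d2_ij (d2_ij - d2_ik) E_ij E_ik, rewritten per row i as
    # (sum_j d2_ij^2 E_ij)(sum_k E_ik) - (sum_j d2_ij E_ij)^2
    num = (d2 ** 2 * E).sum(axis=1) * S - ((d2 * E).sum(axis=1)) ** 2
    return np.sum(num / S ** 2) / (2 * sigma2 ** 2 * N * D)

# Convergence check at a fixed point sigma2_star (hypothetical variable):
# assert abs(g_prime(X, sigma2_star)) < 1
```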
3 The unconstrained case
The general expression for a Gaussian kernel is:

$$G_{ij}(C) = |2\pi C|^{-1/2}\exp\left(-\frac{1}{2}(x_i - x_j)^T C^{-1}(x_i - x_j)\right)$$

and its derivative w.r.t. $C$ is:

$$\frac{\partial}{\partial C}\,G_{ij}(C) = \frac{1}{2}\left(C^{-1}(x_i - x_j)(x_i - x_j)^T - I\right)C^{-1}\, G_{ij}(C)$$
As in the previous case, we take the derivative of the log-likelihood and set it equal to zero:

$$\sum_i \frac{1}{\hat p(x_i)}\,\frac{1}{N-1}\sum_{j\neq i}\frac{1}{2}\,C^{-1}(x_i - x_j)(x_i - x_j)^T C^{-1}\, G_{ij} \;=\; \sum_i \frac{1}{\hat p(x_i)}\,\frac{1}{N-1}\sum_{j\neq i}\frac{1}{2}\,C^{-1}\, G_{ij}$$
By multiplying both sides by $C$, both on the right and on the left, we obtain:

$$\sum_i \frac{1}{\hat p(x_i)}\sum_{j\neq i}(x_i - x_j)(x_i - x_j)^T\, G_{ij} \;=\; C\,\sum_i \frac{1}{\hat p(x_i)}\sum_{j\neq i} G_{ij}$$
After some simplifications, as in the spherical case, we reach the following fixed-point algorithm:

$$C_{t+1} = \frac{1}{N(N-1)}\sum_i \frac{1}{\hat p_t(x_i)}\sum_{j\neq i}(x_i - x_j)(x_i - x_j)^T\, G_{ij}(C_t) \qquad (7)$$
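A sketch of the iteration (7) under the same assumptions (NumPy; names ours) follows. It materializes all pairwise outer products, which costs $O(N^2 D^2)$ memory and is fine only for illustration; the small ridge at initialization is our own guard against the near-singularity discussed at the end of this section:

```python
import numpy as np

def full_bandwidth(X, n_iter=100, tol=1e-8):
    """Fixed-point iteration of Eq. (7) for the full kernel covariance C."""
    N, D = X.shape
    diff = X[:, None, :] - X[None, :, :]              # (N, N, D): x_i - x_j
    outer = diff[..., :, None] * diff[..., None, :]   # (N, N, D, D) outer products
    C = np.cov(X, rowvar=False) + 1e-6 * np.eye(D)    # start at the data covariance
    for _ in range(n_iter):
        Cinv = np.linalg.inv(C)
        # Mahalanobis distances; the |2 pi C|^(-1/2) factor cancels in the ratio.
        m = np.einsum('ijd,de,ije->ij', diff, Cinv, diff)
        G = np.exp(-0.5 * m)
        np.fill_diagonal(G, 0.0)
        p = G.sum(axis=1)                             # proportional to (N - 1) p_t(x_i)
        C_new = np.einsum('ij,ijde->de', G / p[:, None], outer) / N
        converged = np.linalg.norm(C_new - C) < tol * np.linalg.norm(C)
        C = C_new
        if converged:
            break
    return C
```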
The expression in (7) suggests a relationship with the Expectation-Maximization (EM) solution for Gaussian Mixture Models (GMM). A GMM is a PDF estimator given by the expression $\hat p(x) = \sum_{k=1}^{K}\alpha_k\, G(x\mid \mu_k, C_k)$. The weights of the $K$ components of the mixture are given by the $\alpha_k$, and each Gaussian is characterized by its mean vector $\mu_k$ and its covariance matrix $C_k$. The solution provided by the EM algorithm consists of an iterative procedure in which the parameters at step $t$ are obtained from the ones at step $t-1$. To do so, a matrix of auxiliary variables is used, $r_{ki} = p(k\mid x_i)$, expressing the probability that the sample belongs to the $k$-th component of the mixture. These probabilities must satisfy $\sum_k r_{ki} = 1$. When the covariance matrix is shared by all the components, the EM solution establishes the following updating rule for it at step $t$:

$$C^t = \frac{1}{N}\sum_k\sum_i r^t_{ki}\,(x_i - \mu^t_k)(x_i - \mu^t_k)^T \qquad (8)$$
where the $r^t_{ki}$ and $\mu^t_k$ are also iteratively updated. Note that our KDE model can be considered a special case of a GMM in which: i) there are as many components as samples ($K = N$), with equal weights $\alpha_k = 1/N$; ii) the mean vectors are fixed: $\mu_k = x_k$; iii) the covariance matrix is the same for all the components; and iv) $r_{ki} = 0$ if $k = i$ and $r_{ki} = 1/(N-1)$ if $k \neq i$.

With these particularizations, the updating rule in (8) becomes equal to the one given by the iteration in (7).
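This equivalence can be checked numerically. In the sketch below (NumPy assumed, synthetic data), we read the leave-one-out responsibilities as $r_{ki} = G_{ki}/((N-1)\hat p(x_i))$ for $k \neq i$ and $r_{ii} = 0$, which satisfy $\sum_k r_{ki} = 1$; under this reading, which is our own interpretation, one tied-covariance EM step (8) coincides with one step of (7):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
N, D = X.shape
C = np.cov(X, rowvar=False)

diff = X[:, None, :] - X[None, :, :]    # diff[i, j] = x_i - x_j
Cinv = np.linalg.inv(C)
G = np.exp(-0.5 * np.einsum('ijd,de,ije->ij', diff, Cinv, diff))
np.fill_diagonal(G, 0.0)                # leave-one-out: no self-component
p = G.sum(axis=1)                       # proportional to (N - 1) p_hat(x_i); G is symmetric

# One step of the fixed-point iteration (7); the |2 pi C|^(-1/2) factors cancel.
C_fp = np.einsum('ij,ijd,ije->de', G / p[:, None], diff, diff) / N

# Tied-covariance EM step (8) with mu_k = x_k and r_ki = G_ki / sum_k' G_k'i.
# The outer product (x_i - x_k)(x_i - x_k)^T equals diff[k, i] diff[k, i]^T.
r = G / p[None, :]
C_em = np.einsum('ki,kid,kie->de', r, diff, diff) / N

assert np.allclose(C_fp, C_em)          # (8) reproduces (7) under this reading
```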
The EM algorithm guarantees the monotonic increase of the likelihood and hence its convergence to a local maximum, as proved in the literature [8]. The algorithm given in (7) is subject to the same conditions, so its convergence is also guaranteed. However, in situations in which $N$ is comparable to or smaller than $D$, empirical covariance matrices are close to singularity, so numerical problems may arise, as in GMM design.
4 Application to Parzen classification
We have tested the performance of the obtained models on a set of public classification problems from [9]. To do so, we apply the Parzen classifier, which implements the simple Bayes criterion:

$$\hat y = \arg\max_{l}\ \hat p_{\theta_l}(x\mid c_l)$$

with per-class spherical (S-KDE) and unconstrained (U-KDE) models $\hat p_{\theta_l}(x\mid c_l)$ optimized according to the proposed method, where $c_l$ denotes each of the $L$ classes considered. We have compared these results with the ones obtained by other classification methods, namely K-Nearest-Neighbors (KNN, with K = 1) and the one-versus-the-rest Support Vector Machine (SVM) with an RBF kernel. The results are shown in Table 1.
Data       Train   Test   L    D    S-KDE   U-KDE   KNN     SVM
Pima       738     -      2    8    71.22   75.13   73.18   76.47
Wine       178     -      3    13   75.84   99.44   76.97   100
Landsat    4435    2000   6    36   89.45   86.10   90.60   90.90
Optdigits  3823    1797   10   64   97.89   93.54   94.38   98.22
Letter     16000   4000   26   16   95.23   92.77   95.20   97.55

Table 1: Classification accuracy (%) on some public datasets. Leave-one-out accuracy is provided when there is no separate test set.
The most remarkable conclusion from these results is that either S-KDE or U-KDE provides, in each case, a classification performance close to the SVM's. Which of S-KDE and U-KDE performs better is closely related, in each case, to the dimension of the data relative to the number of samples: in the datasets with higher dimensionality, the performance of S-KDE is higher due to its lower risk of overfitting. Parzen classifiers have not enjoyed the popularity of other methods, mainly due to the difficulty of obtaining a reliable bandwidth for the kernel. However, in this experiment we have shown that the bandwidth chosen by our algorithm provides a classification performance close to that of a state-of-the-art classifier such as the SVM.
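A minimal sketch of the Parzen classification rule used above, built on per-class spherical KDEs, is shown below (NumPy assumed; the function name is ours, and the per-class bandwidths would come from the fixed-point iteration of Eq. (3)):

```python
import numpy as np

def parzen_predict(x, class_data, class_sigma2):
    """Bayes rule y_hat = argmax_l p_hat(x | c_l), one spherical KDE per class."""
    scores = []
    for X_l, s2 in zip(class_data, class_sigma2):
        N, D = X_l.shape
        d2 = np.sum((x - X_l) ** 2, axis=1)
        scores.append(np.mean((2 * np.pi * s2) ** (-D / 2) * np.exp(-0.5 * d2 / s2)))
    return int(np.argmax(scores))

# Usage sketch: class_sigma2 would hold one bandwidth per class, e.g. obtained
# as spherical_bandwidth(X_l) with the earlier sketch of Eq. (3).
```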
5 Conclusions
We have presented two algorithms for the optimization of the likelihood in the bandwidth selection problem for KDE models. Unlike previous results in the literature, the methods tackle the multivariate case in a natural way, for which we provide solutions based on both spherical and complete (unconstrained) Gaussian kernels. The convergence conditions have been described for both algorithms. Through a set of experiments, we have shown that the models obtained are accurate enough to provide good classification results. This demonstrates that the models do not overfit the data, even in problems involving a high number of variables.
References
[1] M. C. Jones, J. S. Marron, and S. J. Sheather, “A brief survey of bandwidth selection for density estimation,” Journal of the American Statistical Association, vol. 91, no. 433, pp. 401–407, 1996.
[2] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, 1986.
[3] A. Bowman, “An alternative method of cross-validation for the smoothing of
density estimates,” Biometrika, vol. 71, pp. 353–360, 1984.
[4] R. Duin, “On the choice of smoothing parameters for Parzen estimators of
probability density functions,” IEEE Trans. on Computers, vol. 25, no. 11,
1976.
[5] P. Hall, “Cross-validation in density estimation,” Biometrika, vol. 69, no. 2, pp.
383–390, 1982.
[6] J.M. Leiva-Murillo and A. Artés-Rodríguez, “A fixed-point algorithm for finding the optimal covariance matrix in kernel density modeling,” in International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, 2006.
[7] R. Fletcher, Practical Methods of Optimization (2nd Edition), John Wiley &
Sons, New York, 1995.
[8] G. McLachlan, The EM algorithm and extensions, John Wiley & Sons, New
York, 1997.
[9] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, “UCI repository of machine learning databases,” Tech. Rep., Univ. of California, Dept. ICS, 1998.