Algorithms for Gaussian Bandwidth
Selection in Kernel Density Estimators
José Miguel Leiva Murillo and Antonio Artés Rodríguez
Department of Signal Theory and Communications,
Universidad Carlos III de Madrid
E-mail: {leiva,antonio}@ieee.org.
Abstract
In this paper we study the classical statistical problem of choosing an appropriate bandwidth for Kernel Density Estimators. For the special case of the Gaussian kernel, two algorithms are proposed, one for a spherical covariance matrix and one for the general (unconstrained) case. These methods avoid the unsatisfactory procedure of tuning the bandwidth while evaluating the likelihood, which is impractical with multivariate data in the general case. The convergence conditions are provided together with the proposed algorithms. We measure the accuracy of the resulting models in a set of classification experiments.
1 Introduction
A Kernel Density Estimator (KDE) is a non-parametric Probability Density Function (PDF) model that consists of a linear combination of kernel functions centered on the training data $\{x_i\}_{i=1,\ldots,N}$, i.e.:
$$\hat{p}_\theta(x) = \frac{1}{N}\sum_{i=1}^{N} k_\theta(x - x_i) \qquad (1)$$
where $k_\theta(x)$ is the kernel function, which must integrate to one, i.e. $\int k_\theta(x)\,dx = 1$, and $x \in \mathbb{R}^D$. Although KDEs are commonly considered non-parametric models, the kernel function is characterized by a bandwidth that determines the accuracy of the model: $\hat{p}_\theta(x) = \hat{p}(x \mid \theta)$. Kernels that are too narrow or too wide lead to overfitted or underfitted models, respectively.
Classical bandwidth selection methods have mainly focused on the unidimensional
case. In [1], several first- and second-generation methods are compiled. Some examples of first-generation criteria are the Mean Squared Error (MSE), the Mean Integrated Squared Error (MISE), and the asymptotic MISE (AMISE) [1], [2]. Second-
generation methods include plug-in techniques and bootstrap methods. Kullback-
Leibler divergence has also been considered [3].
We are interested in the Maximum-Likelihood (ML) criterion. Cross-validation allows us to apply the ML criterion so that a model built from $N-1$ samples is evaluated on the point left out. The model evaluated on each training sample has the form:
$$\hat{p}_\theta(x_i) = \frac{1}{N-1}\sum_{\substack{j=1 \\ j\neq i}}^{N} G(x_i - x_j \mid \theta) \qquad (2)$$
where we make explicit the use of a Gaussian kernel. This framework was first proposed in [4] and later studied by other authors [2], [5]. However, these studies lack a closed optimization procedure, so that the bandwidth $\sigma^2$ is obtained by a greedy search over its possible values. Moreover, the multivariate case is only considered in these previous works under a spherical kernel assumption. In this paper, we propose two algorithms that overcome these difficulties.
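As an illustration, the following is a minimal numpy sketch of the leave-one-out log-likelihood in (2) for a spherical Gaussian kernel; evaluating it on a grid of candidate values of $\sigma^2$ is precisely the kind of tuning that the algorithms proposed below avoid. The vectorised implementation details are illustrative choices, not part of the method.

```python
import numpy as np

def loo_log_likelihood(X, sigma2):
    """Leave-one-out log-likelihood of a spherical Gaussian KDE, as in (2).

    X: (N, D) array of samples; sigma2: squared bandwidth (a scalar).
    """
    N, D = X.shape
    # Pairwise squared distances ||x_i - x_j||^2.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Spherical Gaussian kernel values G(x_i - x_j | sigma^2).
    K = np.exp(-d2 / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (D / 2.0)
    np.fill_diagonal(K, 0.0)            # exclude j = i (leave-one-out)
    p = K.sum(axis=1) / (N - 1)         # hat{p}_theta(x_i) as in (2)
    return np.sum(np.log(p))
```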
In a multidimensional Gaussian kernel, the set of parameters consists of the covariance matrix of the Gaussian. In the following, we consider two different degrees of complexity for this matrix: a spherical shape, so that $C = \sigma^2 I_D$ (only one parameter to adjust), and an unconstrained kernel, in which a general form is considered for $C$, with $D(D+1)/2$ parameters.
Sections 2 and 3 describe the bandwidth optimization for the two cases mentioned above, as presented in [6], and establish the convergence conditions. Some classification
experiments are presented in Section 4 to measure the accuracy of the models.
Section 5 closes the paper with the most important conclusions.
2 The spherical case
The expression for the kernel function is, for the spherical case:
$$G_{ij}(\sigma^2) = G(x_i - x_j \mid \sigma^2) = (2\pi)^{-D/2}\sigma^{-D}\exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$$
We want to find the $\sigma$ that maximizes the log-likelihood $\log L(X \mid \sigma^2) = \sum_i \log \hat{p}_\theta(x_i)$. The derivative of this likelihood is:
$$\nabla_\sigma \log L(X \mid \sigma^2) = \frac{1}{N-1}\sum_i \frac{1}{\hat{p}(x_i)}\sum_{j\neq i}\left(\frac{\|x_i - x_j\|^2}{\sigma^3} - \frac{D}{\sigma}\right)G_{ij}(\sigma^2)$$
We now search for the point that makes the derivative null:
$$\sum_i \frac{1}{\hat{p}(x_i)}\sum_{j\neq i}\frac{\|x_i - x_j\|^2}{\sigma^3}G_{ij}(\sigma^2) = \sum_i \frac{1}{\hat{p}(x_i)}\frac{D}{\sigma}\sum_{j\neq i}G_{ij}(\sigma^2) = \frac{N(N-1)D}{\sigma}$$
The second equality follows from the fact that, by definition, $\sum_{j\neq i}G_{ij} = (N-1)\hat{p}(x_i)$. We then obtain the following fixed-point algorithm:
$$\sigma^2_{t+1} = \frac{1}{N(N-1)D}\sum_i \frac{1}{\hat{p}_t(x_i)}\sum_{j\neq i}\|x_i - x_j\|^2\,G_{ij}(\sigma^2_t) \qquad (3)$$
where $\hat{p}_t$ denotes the KDE obtained in iteration $t$, i.e. the one that uses the width $\sigma^2_t$.
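A minimal numpy sketch of the fixed-point iteration in (3) follows; the initial value and the stopping rule are illustrative choices rather than part of the algorithm itself.

```python
import numpy as np

def spherical_bandwidth(X, n_iter=100, tol=1e-8):
    """Fixed-point iteration (3) for the spherical bandwidth sigma^2.

    A sketch of the update rule; the initialisation and stopping rule
    below are illustrative choices.
    """
    N, D = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Start near the upper end of the interval of Theorem 1: the mean
    # squared pairwise distance over D is approximately 2 tr{Sigma_x} / D.
    sigma2 = d2.sum() / (N * (N - 1) * D)
    for _ in range(n_iter):
        G = np.exp(-d2 / (2.0 * sigma2))        # kernel up to its normalisation,
        np.fill_diagonal(G, 0.0)                # which cancels in the ratio below
        W = G / G.sum(axis=1, keepdims=True)    # G_ij / sum_{l != i} G_il
        new_sigma2 = np.sum(W * d2) / (N * D)   # update (3) after the cancellation
        if abs(new_sigma2 - sigma2) < tol * sigma2:
            sigma2 = new_sigma2
            break
        sigma2 = new_sigma2
    return sigma2
```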
We prove the convergence of the algorithm in (3) by means of the following convergence theorem:
Theorem 1. There is a fixed point in the interval $\left(\frac{d^2_{NN}}{D}, \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right)$, where $d^2_{NN}$ is the mean squared distance to the nearest neighbor and $\Sigma_x$ is the covariance matrix of $x$. Moreover, the fixed point is unique and the algorithm converges to it within that interval if the following condition holds:
$$\frac{1}{2\sigma^4 N(N-1)^2 D}\sum_i \frac{1}{\hat{p}(x_i)^2}\sum_{j\neq i}\sum_{k\neq i,j}(d^2_{ij} - d^2_{ik})^2\exp\left(-\frac{d^2_{ij}+d^2_{ik}}{2\sigma^2}\right) < 1 \qquad (4)$$
Proof 1. Let $g(\sigma^2)$ denote the right-hand side of (3), whose fixed point $\sigma^2 = g(\sigma^2)$ is to be obtained. The proof of the existence of the fixed point is based on finding an interval $(a, b)$ such that $a < g(\sigma^2) < b$ if $\sigma^2 \in (a, b)$.
In order to show that the interval $\left(\frac{d^2_{NN}}{D}, \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right)$ satisfies that condition, we need to prove the following three facts:
1. $g\left(\frac{d^2_{NN}}{D}\right) > \frac{d^2_{NN}}{D}$
2. $g\left(\frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}\right) < \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}$
3. $g(\sigma^2)$ is monotonic in the interval.
This way we are guaranteed that there is at least one crossing point between the function $g(\sigma^2)$ and the line $y = \sigma^2$.
To prove the first point, we rewrite (3) as:
$$g(\sigma^2) = \frac{1}{ND}\sum_i\sum_{j\neq i} d^2_{ij}\,\frac{1}{1 + \sum_{k\neq i,j}\exp\left(\frac{d^2_{ij}-d^2_{ik}}{2\sigma^2}\right)} \qquad (5)$$
The limit at $0$ is given by:
$$\lim_{\sigma^2\to 0} g(\sigma^2) = \frac{1}{ND}\sum_i \min_{j\neq i} d^2_{ij} = \frac{d^2_{NN}}{D}$$
because, as $\sigma^2 \to 0$, the terms of the sum in (5) vanish except those for which $d^2_{ij} < d^2_{ik}$ for all $k \neq j$, i.e. except the nearest-neighbor term of each $x_i$, whose denominator tends to one. The first point is thus proved, because $\frac{d^2_{NN}}{D}$ is the minimum value that $g(\sigma^2)$ can reach.
To prove the second point, we take the limit at infinity:
$$\lim_{\sigma^2\to\infty} g(\sigma^2) = \lim_{\sigma^2\to\infty}\frac{1}{ND}\sum_i\sum_{j\neq i} d^2_{ij}\,\frac{\exp\left(-\frac{d^2_{ij}}{2\sigma^2}\right)}{\sum_{k\neq i}\exp\left(-\frac{d^2_{ik}}{2\sigma^2}\right)} = \frac{1}{ND}\sum_i\sum_{j\neq i} d^2_{ij}\,\frac{1}{N-1}$$
The sum of distances may be expressed in terms of expectations as:
$$\frac{1}{N(N-1)}\sum_i\sum_{j\neq i} d^2_{ij} = E_{i,j}\{(x_i-x_j)^T(x_i-x_j)\} = 2E_i\{x_i^T x_i\} - 2\mu_x^T\mu_x$$
By the standard identity $E\{x^Tx\} = \mathrm{tr}\{\Sigma_x\} + \mu_x^T\mu_x$, where $\mu_x = E_x\{x\}$ and $\Sigma_x = E_x\{xx^T\} - \mu_x\mu_x^T$, we have:
$$2E\{x^Tx\} - 2\mu_x^T\mu_x = 2\,\mathrm{tr}\{\Sigma_x\}$$
We obtain:
$$\lim_{\sigma^2\to\infty} g(\sigma^2) = \frac{2\,\mathrm{tr}\{\Sigma_x\}}{D}$$
The second point is then proved, since the maximum value of $g(\sigma^2)$ is reached in the limit $\sigma^2 \to \infty$.
To prove the last point, we compute the derivative $g'(\sigma^2)$ and verify that it is non-negative:
$$\frac{dg(\sigma^2)}{d\sigma^2} = \frac{1}{2\sigma^4 ND}\sum_i\sum_{j\neq i}\frac{d^2_{ij}\sum_{k\neq i}(d^2_{ij}-d^2_{ik})\exp\left(-\frac{d^2_{ij}+d^2_{ik}}{2\sigma^2}\right)}{\left(\sum_{l\neq i}\exp\left(-\frac{d^2_{il}}{2\sigma^2}\right)\right)^2}$$
$$= \frac{1}{2\sigma^4 N(N-1)^2 D}\sum_i\frac{1}{\hat{p}(x_i)^2}\sum_{j\neq i}\sum_{k\neq i,j}(d^2_{ij}-d^2_{ik})^2\exp\left(-\frac{d^2_{ij}+d^2_{ik}}{2\sigma^2}\right) \geq 0 \qquad (6)$$
The existence of a fixed point in the interval is thus proved. To demonstrate the uniqueness of the fixed point and the convergence of the algorithm within that interval, we need to verify the condition $|g'(\sigma^2)| < 1$ [7]. In that case, we are also guaranteed that there is only one crossing point between $g(\sigma^2)$ and the line $y = \sigma^2$. The convergence condition (4) states precisely that the value of (6) is less than 1.
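As a practical aside, the derivative $g'(\sigma^2)$ can be evaluated directly from the first form of (6), which makes the convergence condition easy to check numerically on a given dataset; the helper below is an illustrative sketch.

```python
import numpy as np

def g_prime(X, sigma2):
    """Evaluate dg/d(sigma^2) via the first form of (6).

    Iteration (3) converges on the interval of Theorem 1 when
    |g'(sigma^2)| < 1, which is what condition (4) expresses.
    """
    N, D = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    E = np.exp(-d2 / (2.0 * sigma2))
    np.fill_diagonal(E, 0.0)            # restrict all inner sums to indices != i
    S = E.sum(axis=1)                   # S_i = sum_{l != i} exp(-d_il^2 / (2 sigma^2))
    T = (d2 * E).sum(axis=1)            # T_i = sum_{k != i} d_ik^2 exp(-d_ik^2 / (2 sigma^2))
    # For each (i, j): d_ij^2 * sum_{k != i} (d_ij^2 - d_ik^2) e_ij e_ik
    #                = d_ij^2 * e_ij * (d_ij^2 * S_i - T_i)
    num = d2 * E * (d2 * S[:, None] - T[:, None])
    return np.sum(num / S[:, None] ** 2) / (2.0 * sigma2 ** 2 * N * D)
```

At the value returned by the iteration of (3), a result below 1 is consistent with condition (4) holding at the fixed point.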
3 The unconstrained case
The general expression for a Gaussian kernel is:
$$G_{ij}(C) = |2\pi C|^{-1/2}\exp\left(-\frac{1}{2}(x_i-x_j)^T C^{-1}(x_i-x_j)\right)$$
and its derivative w.r.t. $C$ is:
$$\nabla_C G_{ij}(C) = \frac{1}{2}\left(C^{-1}(x_i-x_j)(x_i-x_j)^T - I\right)C^{-1}G_{ij}(C)$$
As in the previous case, we take the derivative of the log-likelihood and set it equal to zero:
$$\sum_i\frac{1}{\hat{p}(x_i)}\frac{1}{N-1}\sum_{j\neq i}\frac{1}{2}C^{-1}(x_i-x_j)(x_i-x_j)^T C^{-1}G_{ij} = \sum_i\frac{1}{\hat{p}(x_i)}\frac{1}{N-1}\sum_{j\neq i}\frac{1}{2}C^{-1}G_{ij}$$
By multiplying both sides by $C$ on the left and on the right, we obtain:
$$\sum_i\frac{1}{\hat{p}(x_i)}\sum_{j\neq i}(x_i-x_j)(x_i-x_j)^T G_{ij} = C\sum_i\frac{1}{\hat{p}(x_i)}\sum_{j\neq i}G_{ij}$$
After some simplifications, as in the spherical case, we reach the following fixed-point algorithm:
$$C_{t+1} = \frac{1}{N(N-1)}\sum_i\frac{1}{\hat{p}_t(x_i)}\sum_{j\neq i}(x_i-x_j)(x_i-x_j)^T G_{ij}(C_t) \qquad (7)$$
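The iteration in (7) can be implemented directly; a minimal numpy sketch follows. The initialisation and the optional ridge term, included with an eye on the near-singular covariances discussed at the end of this section, are illustrative choices rather than part of the algorithm itself.

```python
import numpy as np

def unconstrained_bandwidth(X, n_iter=100, ridge=0.0):
    """Fixed-point iteration (7) for a full (unconstrained) kernel covariance C.

    The initial C (sample covariance) and the optional `ridge` term, which
    guards against near-singular covariances when N is close to D, are
    illustrative additions.
    """
    N, D = X.shape
    diffs = X[:, None, :] - X[None, :, :]               # (N, N, D) pairwise differences
    C = np.cov(X, rowvar=False) + ridge * np.eye(D)     # illustrative starting point
    for _ in range(n_iter):
        Cinv = np.linalg.inv(C)
        # Mahalanobis terms (x_i - x_j)^T C^{-1} (x_i - x_j); the determinant
        # factor of the kernel cancels in the normalised weights W below.
        m2 = np.einsum('ijd,de,ije->ij', diffs, Cinv, diffs)
        G = np.exp(-0.5 * m2)
        np.fill_diagonal(G, 0.0)
        W = G / G.sum(axis=1, keepdims=True)            # G_ij / sum_{l != i} G_il
        # Update (7): C = (1/N) sum_i sum_{j != i} W_ij (x_i - x_j)(x_i - x_j)^T
        C = np.einsum('ij,ijd,ije->de', W, diffs, diffs) / N + ridge * np.eye(D)
    return C
```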
The expression in (7) suggests a relationship with the Expectation-Maximization (EM) result for Gaussian Mixture Models (GMM). A GMM is a PDF estimator given by the expression $\hat{p}(x) = \sum_{k=1}^{K}\alpha_k G(x \mid \mu_k, C_k)$. The weights of the $K$ components of the mixture are given by the $\alpha_k$, and each Gaussian is characterized by its mean vector $\mu_k$ and its covariance matrix $C_k$. The solution provided by the EM algorithm consists of an iterative procedure in which the parameters at step $t$ are obtained from the ones at step $t-1$. To do so, a matrix of auxiliary variables is used, $r_{ki} = p(k \mid x_i)$, expressing the probability that sample $x_i$ belongs to the $k$-th component of the mixture. These probabilities must satisfy $\sum_k r_{ki} = 1$. For a covariance matrix shared by all components, the EM solution establishes the following updating rule at step $t$:
$$C^t = \frac{1}{N}\sum_k\sum_i r^t_{ki}\,(x_i-\mu^t_k)(x_i-\mu^t_k)^T \qquad (8)$$
where the $r^t_{ki}$ and $\mu^t_k$ are also iteratively updated. Note that our KDE model can be considered as a special case of a GMM where i) there are as many mixture components as samples ($K = N$), with equal weights $\alpha_k = 1/N$; ii) the mean vectors are fixed: $\mu_k = x_k$; iii) the covariance matrix is the same for all components; and iv) $r_{ki} = 0$ if $k = i$ and $r_{ki} = 1/(N-1)$ if $k \neq i$.
With these particularizations, the updating rule in (8) becomes equal to the one given by the iteration in (7).
The EM algorithm guarantees a monotonic increase of the likelihood and hence convergence to a local maximum, as proved in the literature [8]. The algorithm given in (7) is subject to the same conditions, so its convergence is also guaranteed. However, in situations in which $N \approx D$, the empirical covariance matrices are close to singular, so numerical problems may arise, as in GMM design.
4 Application to Parzen classification
We have tested the performance of the obtained models on a set of public classification problems from [9]. To do so, we apply the Parzen classifier, which implements the simple Bayes criterion:
$$\hat{y} = \arg\max_l \hat{p}_{\theta_l}(x \mid c_l)$$
with per-class spherical (S-KDE) and unconstrained (U-KDE) models $\hat{p}_{\theta_l}(x \mid c_l)$ optimized according to the proposed method, where $c_l$ denotes each of the $L$ classes considered. We have compared these results with the ones obtained by other classification methods, namely K-Nearest-Neighbors (KNN, with K=1) and the one-versus-the-rest Support Vector Machine (SVM) with an RBF kernel. The results are shown in Table 1.
Data       Train   Test   L    D    S-KDE   U-KDE   KNN     SVM
Pima       738     -      2    8    71.22   75.13   73.18   76.47
Wine       178     -      3    13   75.84   99.44   76.97   100
Landsat    4435    2000   6    36   89.45   86.10   90.60   90.90
Optdigits  3823    1797   10   64   97.89   93.54   94.38   98.22
Letter     16000   4000   26   16   95.23   92.77   95.20   97.55
Table 1: Classification performance (accuracy, %) on some public datasets. Leave-one-out accuracy is reported when there is no separate test set.
The most remarkable conclusion from these results is that either S-KDE or U-KDE provides, in each case, a classification performance close to the SVM's. Which of S-KDE and U-KDE performs better is closely related, in each case, to the dimension of the data relative to the number of samples: in the datasets with higher dimensionality, the performance of S-KDE is higher, due to its lower risk of overfitting. Parzen classifiers have not enjoyed the popularity of other methods, mainly due to the difficulty of obtaining a reliable bandwidth for the kernel. However, in this experiment we have shown that the bandwidth chosen by our algorithm provides a classification performance close to that of a state-of-the-art classifier such as the SVM.
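To make the classification rule concrete, the following is a minimal sketch of a Parzen classifier built from per-class spherical KDEs (the S-KDE setting) with bandwidths chosen by the algorithm of Section 2; the data-handling conventions are illustrative choices.

```python
import numpy as np

def parzen_classify(X_train, y_train, X_test, bandwidths):
    """Parzen classifier with one spherical KDE per class (S-KDE setting).

    `bandwidths` maps each class label to its sigma^2, e.g. obtained by the
    fixed-point iteration of Section 2. Class priors are ignored, matching
    the arg-max rule of Section 4.
    """
    classes = np.unique(y_train)
    scores = np.empty((X_test.shape[0], classes.size))
    for idx, c in enumerate(classes):
        Xc = X_train[y_train == c]
        D = Xc.shape[1]
        s2 = bandwidths[c]
        d2 = np.sum((X_test[:, None, :] - Xc[None, :, :]) ** 2, axis=-1)
        K = np.exp(-d2 / (2.0 * s2)) / (2.0 * np.pi * s2) ** (D / 2.0)
        scores[:, idx] = K.mean(axis=1)      # hat{p}_{theta_l}(x | c_l)
    return classes[np.argmax(scores, axis=1)]
```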
5 Conclusions
We have presented two algorithms for the optimization of the likelihood in the bandwidth selection problem for KDE models. Unlike previous results in the literature, the methods tackle the multivariate case in a natural way, for which we provide solutions based on both spherical and full (unconstrained) Gaussian kernels. The convergence conditions have been described for both algorithms. Through a set of experiments, we have shown that the models obtained are accurate enough to provide good classification results. This demonstrates that the models do not overfit the data, even in problems involving a high number of variables.
References
[1] M. C. Jones, J. S. Marron, and S. J. Sheather, “A brief survey of bandwidth se-
lection for density estimation,” Journal of the American Statistical Association,
vol. 91, no. 433, pp. 401–407, 1996.
[2] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, 1986.
[3] A. Bowman, “An alternative method of cross-validation for the smoothing of
density estimates,” Biometrika, vol. 71, pp. 353–360, 1984.
[4] R. Duin, “On the choice of smoothing parameters for Parzen estimators of
probability density functions,” IEEE Trans. on Computers, vol. 25, no. 11,
1976.
[5] P. Hall, “Cross-validation in density estimation,” Biometrika, vol. 69, no. 2, pp.
383–390, 1982.
[6] J. M. Leiva-Murillo and A. Artés-Rodríguez, “A fixed-point algorithm for finding the optimal covariance matrix in kernel density modeling,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France, 2006.
[7] R. Fletcher, Practical Methods of Optimization (2nd Edition), John Wiley &
Sons, New York, 1995.
[8] G. McLachlan, The EM algorithm and extensions, John Wiley & Sons, New
York, 1997.
[9] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, “UCI repository of machine
learning databases,” Tech. Rep., Univ. of California, Dept. ICS, 1998.