NORGES TEKNISK-NATURVITENSKAPELIGE
UNIVERSITET
Density Estimation Using the Sinc Kernel
by
Ingrid K. Glad, Nils Lid Hjort and Nikolai Ushakov
PREPRINT
STATISTICS NO. 2/2007
NORWEGIAN UNIVERSITY OF SCIENCE AND
TECHNOLOGY
TRONDHEIM, NORWAY
This report has URL http://www.math.ntnu.no/preprint/statistics/2007/S2-2007.ps
Nikolai Ushakov has homepage: http://www.math.ntnu.no/ushakov
E-mail: ushakov@stat.ntnu.no
Address: Department of Mathematical Sciences, Norwegian University of Science and Technology, N-7491
Trondheim, Norway.
Density Estimation Using the Sinc Kernel
Ingrid K. Glad, Nils Lid Hjort
Department of Mathematics, University of Oslo, Norway
and
Nikolai Ushakov
Department of Mathematical Sciences
Norwegian University of Science and Technology
Trondheim, Norway
Abstract
This paper deals with the kernel density estimator based on the so-called sinc (or Fourier integral) kernel K(x) = (πx)^{-1} sin x. We study in detail both asymptotic and finite sample properties of this estimator. It is shown that, contrary to widespread opinion, the sinc estimator is superior to other estimators in many respects: it is more accurate for quite moderate values of the sample size, has better asymptotics in the non-smooth case (when the density to be estimated has only a first derivative), is more convenient for bandwidth selection, etc.
Key words: Kernel estimation, Sinc kernel, Fourier integral kernel, Mean integrated squared error,
Superkernels, Empirical characteristic function, Finite samples, Inequalities
Contents
1. Introduction
2. The MISE of the estimator
3. Comparison of the exact MISE of the sinc estimator and conventional estimators (examples)
4. Asymptotic superiority of the sinc estimator to conventional kernel estimators
5. Comparison of the sinc estimator with superkernel estimators
6. Bandwidth selection
7. Uniform consistency and estimation of the mode
8. Inequalities
9. Sinc estimator of derivatives
1. Introduction
Let X_1, ..., X_n be independent and identically distributed random variables with a common probability density function f(x). We consider the problem of estimating f(x) nonparametrically. One of the most popular methods is the kernel estimator

    f_n(x; h) = (1/n) Σ_{i=1}^{n} K_h(x − X_i),    (1.1)

where K_h(x) = h^{-1} K(h^{-1} x), K(x) is a kernel function (usually symmetric) and h is the smoothing parameter (bandwidth).
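As a concrete illustration (not part of the original paper), a minimal numpy sketch of the estimator (1.1) with a Gaussian kernel; the function name and interface are hypothetical:

    import numpy as np

    def kde(x, data, h):
        """Conventional kernel estimator (1.1) with the Gaussian kernel K(u) = exp(-u^2/2)/sqrt(2 pi)."""
        x = np.asarray(x, dtype=float)[:, None]        # evaluation points, shape (len(x), 1)
        data = np.asarray(data, dtype=float)[None, :]  # sample X_1, ..., X_n, shape (1, n)
        u = (x - data) / h
        K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi) # Gaussian kernel values K((x - X_i)/h)
        return K.mean(axis=1) / h                      # (1/n) sum_i K_h(x - X_i)

For example, kde(np.linspace(-3, 3, 61), np.random.normal(size=100), h=0.4) evaluates the estimate on a grid.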
Typically, K(x) is taken to be a probability density with at least a couple of finite moments; this ensures that f_n(x; h) itself becomes a density function, and methods of Taylor expansion and so on make it possible to analyse its behaviour to a satisfactory degree. Recent monographs dealing with general aspects of kernel density estimators include Wand and Jones (1995) and Fan and Gijbels (1996).
However, the present paper deals with a non-standard choice of K(x), namely the so-called sinc kernel

    K_s(x) = sin x / (πx)

with the Fourier transform (characteristic function)

    ψ_s(t) = 1 for |t| ≤ 1,   ψ_s(t) = 0 for |t| > 1.

The sinc kernel is not in L_1, that is, its absolute value has an infinite integral, but it is square integrable, and in addition it is integrable in the sense of the Cauchy principal value, with

    v.p. ∫_{−∞}^{∞} K_s(x) dx = 1,

in which

    v.p. ∫_{−∞}^{∞} = lim_{T→∞} lim_{ε→0} [ ∫_{−T}^{−ε} + ∫_{ε}^{T} ].

(In the following we will omit integration limits when the integral is to be taken over the full real line.) Correspondingly, we have

    ψ_s(t) = v.p. ∫ e^{itx} K_s(x) dx.

Sometimes the sinc kernel is defined as K(x) = sin(πx)/(πx), with the Fourier transform

    ψ(t) = 1 for |t| ≤ π,   ψ(t) = 0 for |t| > π.

Both functions sin x/(πx) and sin(πx)/(πx) integrate to one in the sense of the principal value, and the difference is only in the scale parameter.
The sinc kernel is a “non-conventional” kernel: it takes negative values and is not integrable in the ordinary sense (we will say that a kernel is conventional if it is a probability density function, i.e. it is non-negative and integrates to one; kernel estimators based on conventional kernels will be called conventional estimators). Correspondingly, realizations of the kernel estimator based on the sinc kernel (we will call it the sinc estimator; in some works it is called the FIE, Fourier integral estimator, see for example Davis, 1975 and Davis, 1977) are not probability density functions. This defect, however, can easily be corrected without loss of performance (see Glad et al. (2003)).
The sinc estimator has excellent asymptotic properties compared to conventional kernel estimators when the density to be estimated is smooth, i.e. has several derivatives, see Davis (1975) and Davis (1977). If, for example, the density to be estimated is an analytic function of a certain type, then the mean integrated squared error of the sinc estimator decreases as n^{−1} as n → ∞, while no conventional estimator can provide a rate of convergence better than n^{−4/5}. It is believed, however, that the performance of the sinc estimator is good, roughly speaking, only for very large n, and only for very smooth f(x). In addition, it is believed that even for very large n and very smooth f(x), the sinc estimator is inferior to kernel estimators based on so-called superkernels: non-conventional kernels whose Fourier transform is continuous and equals 1 in some neighbourhood of the origin. In this work, we try to show that these beliefs are unjust.
In Section 3, we present examples demonstrating that the sinc estimator is more accurate than conventional estimators for quite moderate values of the sample size. In Section 4, we prove that the sinc estimator is asymptotically superior to conventional estimators (it has a strictly better order of consistency) even when f(x) has only one derivative. In Section 5, we compare the sinc estimator with a superkernel estimator and show that the sinc estimator has better properties; in particular, it is more accurate. In Section 6, we consider the problem of bandwidth selection. This problem is solved more easily and more effectively for the sinc estimator than for other estimators. The problem of estimation of the mode is studied in Section 7. Some useful inequalities for the MISE of the sinc estimator are obtained in Section 8. The inequalities also show that the performance of the sinc estimator is good not only for large sample sizes but also for moderate and even small ones. In Section 9, we study the problem of estimating derivatives. Here the sinc estimator is especially effective compared with other kernel estimators.
2. The MISE of the estimator
Let f̂_n(x) be an estimator of f(x) associated with the sample X_1, ..., X_n. The customary performance criterion for density estimators is the mean integrated squared error (MISE), which is defined as

    MISE(f̂_n(x)) = E ∫ [f̂_n(x) − f(x)]² dx.

In practice one seeks methods to minimise the MISE function. The MISE is the sum of the integrated squared bias (denote it by B(f̂_n(x))) and the integrated variance (denote it by V(f̂_n(x))) of the estimator.
Denote the characteristic function of the random variables X_j by ϕ(t) and the empirical characteristic function associated with the sample X_1, ..., X_n by ϕ_n(t):

    ϕ(t) = E e^{itX_j},    ϕ_n(t) = (1/n) Σ_{j=1}^{n} e^{itX_j}.
Then the sinc estimator (in the rest of the paper we denote it by f_n(x; h)) is

    f_n(x; h) = (1/(πn)) Σ_{j=1}^{n} sin[(x − X_j)/h] / (x − X_j),

and its characteristic function equals ϕ_n(t) ψ_s(ht) = ϕ_n(t) I_{[−1/h, 1/h]}(t), where, as usual, I_A(t) denotes the indicator of the set A.
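For concreteness, here is a minimal numpy sketch of the sinc estimator (not from the paper; the helper name is hypothetical). Note that sin[(x − X_j)/h]/(x − X_j) equals 1/h at x = X_j, a case that np.sinc handles automatically:

    import numpy as np

    def sinc_kde(x, data, h):
        """Sinc (Fourier integral) estimator f_n(x; h) = (1/(pi n)) sum_j sin((x - X_j)/h)/(x - X_j)."""
        x = np.asarray(x, dtype=float)[:, None]
        data = np.asarray(data, dtype=float)[None, :]
        u = (x - data) / h
        kern = np.sinc(u / np.pi)              # sin(u)/u, equal to 1 at u = 0 (np.sinc(z) = sin(pi z)/(pi z))
        return kern.mean(axis=1) / (np.pi * h)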
For a real-valued function g(x) we will use the following notation, provided the integrals exist:

    µ_k(g) = ∫ |x|^k g(x) dx,  k = 0, 1, 2, ...,    R(g) = ∫ g²(x) dx.
The following lemma will be frequently used in the work.

Lemma 2.1. For the sinc estimator,

    B(f_n(x; h)) = (1/(2π)) ∫_{|t| > 1/h} |ϕ(t)|² dt,    (2.1)

    V(f_n(x; h)) = (1/n) · (1/(2π)) ∫_{−1/h}^{1/h} (1 − |ϕ(t)|²) dt,    (2.2)

and

    MISE(f_n(x; h)) = (1/(2π)) ∫_{|t| > 1/h} |ϕ(t)|² dt + (1/n) · (1/(2π)) ∫_{−1/h}^{1/h} (1 − |ϕ(t)|²) dt,    (2.3)

where ϕ(t) is the characteristic function of the density to be estimated.
Corollary.

    MISE(f_n(x; h)) = 1/(πnh) + R(f) − (1 + 1/n) (1/π) ∫_0^{1/h} |ϕ(t)|² dt.    (2.4)

Equalities (2.1)–(2.4) can be found, for example, in Davis (1977).
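When |ϕ(t)|² is available in closed form, formula (2.3) can be evaluated numerically; a short sketch assuming numpy and scipy (the helper name is hypothetical):

    import numpy as np
    from scipy.integrate import quad

    def exact_mise_sinc(phi_sq, h, n):
        """Exact MISE (2.3) of the sinc estimator for a known |phi(t)|^2 (an even function of t)."""
        bias_half, _ = quad(phi_sq, 1.0 / h, np.inf)                  # int_{1/h}^inf |phi|^2 dt
        var_half, _ = quad(lambda t: 1.0 - phi_sq(t), 0.0, 1.0 / h)   # int_0^{1/h} (1 - |phi|^2) dt
        return bias_half / np.pi + var_half / (np.pi * n)             # uses symmetry |phi(-t)| = |phi(t)|

    # Example: standard normal target, |phi(t)|^2 = exp(-t^2)
    mise = exact_mise_sinc(lambda t: np.exp(-t * t), h=0.4, n=100)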
Using Lemma 2.1, one can make some general remarks concerning the sinc estimator. It is well known that in the case of a kernel estimator with a conventional kernel, the necessary and sufficient condition for consistency is h → 0 and nh → ∞ as n → ∞. In the case of the sinc estimator, this condition is also sufficient, but sometimes not necessary. If the characteristic function of the density to be estimated vanishes outside some interval containing the origin, ϕ(t) = 0 for |t| > T, then the necessary condition is milder: the sinc estimator is consistent if lim sup_{n→∞} h < 1/T. This circumstance was pointed out in a number of works, see for example Davis (1977) or Ibragimov and Khas'minskii (1982).
The second remark concerns the problem of selection of the smoothing parameter. For a given n, the minimum of the MISE, as a function of h, can be non-unique. This means that there may exist several different optimal values of the bandwidth h. This problem is considered in more detail in Section 6.
3. Comparison of the exact MISE of the sinc estimator and conventional estimators
(examples)
In this section, the exact MISE of the sinc estimator is compared with that of estimators based on
some conventional kernels.
3.1. Normal distribution. Consider the standard normal density

    f(x) = (1/√(2π)) e^{−x²/2}.

Let f_n(x; h) be (as above) the sinc estimator of f(x), and denote the kernel estimator of f(x) based on the normal kernel by f_n^{(norm)}(x; h). In this subsection, we compare the performance of f_n(x; h) and f_n^{(norm)}(x; h) for several finite values of the sample size.
MISE(f
n
) =
1
π
"
1
1 +
1
n
Φ
2
h
!
+
1
n
1
h
π
+
1
2
#
and
MISE(f
(norm)
n
) =
1
2
π
1 2
r
2
2 + h
2
+
1
1 + h
2
+
1
nh
1
n
1 + h
2
!
,
where Φ(x) is the standard normal distribution function. Values of inf
h>0
MISE(f
n
) and
inf
h>0
MISE(f
(norm)
n
) (and their ratio) are given in Table 1 for the sample size n = 40, 45, 50,
100 and 1000
Table 1
n sinc normal sinc/normal
40 0.010141 0.01009 1.005
45 0.009203 0.009327 0.987
50 0.008436 0.00869 0.971
100 0.004699 0.005411 0.868
1000 0.000611 0.00103 0.593
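The two closed-form expressions above are easy to minimize numerically; the following sketch (assuming numpy and scipy, with an arbitrary search interval for h) can be used to reproduce values of the kind reported in Table 1:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize_scalar

    def mise_sinc_normal(h, n):
        # Exact MISE of the sinc estimator for the standard normal target
        return (1.0 / np.sqrt(np.pi)) * (1.0 - (1.0 + 1.0 / n) * norm.cdf(np.sqrt(2.0) / h)
                                         + (1.0 / n) * (1.0 / (h * np.sqrt(np.pi)) + 0.5))

    def mise_gauss_normal(h, n):
        # Exact MISE of the Gaussian-kernel estimator for the standard normal target
        return (1.0 / (2.0 * np.sqrt(np.pi))) * (1.0 - 2.0 * np.sqrt(2.0 / (2.0 + h**2))
                                                 + 1.0 / np.sqrt(1.0 + h**2)
                                                 + 1.0 / (n * h) - 1.0 / (n * np.sqrt(1.0 + h**2)))

    for n in (40, 45, 50, 100, 1000):
        m_sinc = minimize_scalar(mise_sinc_normal, bounds=(0.05, 2.0), method="bounded", args=(n,)).fun
        m_norm = minimize_scalar(mise_gauss_normal, bounds=(0.05, 2.0), method="bounded", args=(n,)).fun
        print(n, m_sinc, m_norm, m_sinc / m_norm)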
The table shows that, under an appropriate choice of the smoothing parameter for both estimators, the sinc estimator is slightly less accurate (by only 0.5%) than the estimator based on the normal kernel when n = 40, but it becomes better already for n = 45. For n = 100, the sinc estimator is about 15% more accurate than the normal one. For large sample sizes (n ≥ 1000), the sinc estimator becomes several times better than the normal one (almost two times for n = 1000).
One more advantage of the sinc estimator is that MISE(f_n(x; h)) ≤ inf_{h>0} MISE(f_n^{(norm)}(x; h)) for a wide interval of values of the smoothing parameter h (0.4 < h < 0.53 for n = 100 and 0.25 < h < 0.46 for n = 1000). This means that, even if h is chosen quite far from its optimal value for the sinc estimator, this estimator is still better than the normal estimator under the optimal choice of h.
3.2. Cauchy distribution. Now consider the density

    f(x) = 1/(π(1 + x²))

(Cauchy distribution) with the characteristic function

    ϕ(t) = e^{−|t|}.

Then the MISE of the sinc estimator is

    MISE(f_n) = (1/π) [ (1/2)(1 + 1/n) e^{−2/h} + (1/n)(1/h − 1/2) ].

For the sake of simplicity we consider the conventional estimator with the Cauchy kernel

    K(x) = 1/(π(1 + x²)).

Denote this estimator by f_n^{(Cauchy)}(x; h). Its MISE is found explicitly:

    MISE(f_n^{(Cauchy)}) = (1/π) [ 1/2 − 2/(2 + h) + 1/(2(1 + h)) + 1/(2nh) − 1/(2n(1 + h)) ].
Results are presented in Table 2. They are very similar to those in the previous subsection. As in the normal case, the sinc estimator is superior to the conventional estimator for n ≥ 45 and becomes several times more accurate for n > 1000.
Table 2
n sinc Cauchy sinc/Cauchy
40 0.014776 0.014553 1.015
45 0.013542 0.01363 0.993
50 0.012516 0.012845 0.974
100 0.007346 0.00863 0.851
1000 0.0011 0.002126 0.517
4. Asymptotic superiority of the sinc estimator to conventional kernel estimators
In this section, we find conditions under which the sinc estimator f_n(x; h) has a strictly better order of consistency than conventional kernel estimators. Let K(x) be a square integrable conventional kernel (a square integrable probability density) and f̂_n(x; h) the corresponding estimator. It is known that if the density to be estimated, f(x), is at least three times differentiable, then

    MISE(f_n(x; h)) / MISE(f̂_n(x; h)) → 0  as  h → 0, nh → ∞.    (4.1)
It turns out that (4.1) holds under much broader conditions and can take place even when f(x) has only a first derivative.
Since MISE(f_n(x; h)) = MISE(f̂_n(x; h)) = ∞ if f(x) is not square integrable, in the remainder of this section we suppose that it is square integrable. In addition, without loss of generality, we suppose that all conventional kernels under consideration are symmetric.
Denote the characteristic function of K(x) by ψ(t). For the integrated squared bias and integrated variance of f̂_n(x; h), the following representations hold (see for example Watson and Leadbetter, 1963):

    B(f̂_n(x; h)) = (1/(2π)) ∫ |ϕ(t)|² (1 − ψ(ht))² dt,    (4.2)

    V(f̂_n(x; h)) = (1/n) · (1/(2π)) ∫ (1 − |ϕ(t)|²) (ψ(ht))² dt.    (4.3)
Lemma 4.1. If

    |ϕ(t)| = o(1/t^{5/2})  as  t → ∞,    (4.4)

then, under an appropriate rescaling of the conventional kernel K(x),

    B(f_n(x; h)) = o(B(f̂_n(x; h)))  as  h → 0.    (4.5)

Proof. There exist two positive numbers ε and c such that

    ψ(t) ≤ 1 − εt²  for  |t| ≤ c,    (4.6)

see Loève (1977). Without loss of generality one can suppose that c = 1 (otherwise a rescaling of K(x) can be used). Rewrite (4.6) in the form

    (1 − ψ(ht))² ≥ ε²h⁴t⁴,  |t| ≤ 1/h.    (4.7)

Put λ = 1/h,

    R_1(λ) = (1/λ⁴) ∫_0^{λ} t⁴ |ϕ(t)|² dt,

and

    R_2(λ) = ∫_{λ}^{∞} |ϕ(t)|² dt.

Note that (4.4) implies

    |ϕ(λ)|² = o( (1/λ⁵) ∫_0^{λ} t⁴ |ϕ(t)|² dt )  as  λ → ∞.    (4.8)

Making use of (4.7), (4.8) and the representations (2.1), (2.2), (4.2) and (4.3), we obtain

    ∞ = lim_{λ→∞} ε² [ (4/λ⁵) ∫_0^{λ} t⁴ |ϕ(t)|² dt / |ϕ(λ)|² − 1 ] = lim_{λ→∞} ε² R′_1(λ)/R′_2(λ) = lim_{λ→∞} ε² R_1(λ)/R_2(λ) =

    = lim_{h→0} ε² h⁴ ∫_0^{1/h} t⁴ |ϕ(t)|² dt / ∫_{1/h}^{∞} |ϕ(t)|² dt ≤ lim_{h→0} ∫_0^{1/h} |ϕ(t)|² (1 − ψ(ht))² dt / ∫_{1/h}^{∞} |ϕ(t)|² dt ≤

    ≤ lim_{h→0} ∫_0^{∞} |ϕ(t)|² (1 − ψ(ht))² dt / ∫_{1/h}^{∞} |ϕ(t)|² dt = lim_{h→0} B(f̂_n(x; h)) / B(f_n(x; h)),

which implies (4.5).
Theorem 4.1. Let

    |ϕ(t)| = o(1/t^{5/2})  as  t → ∞.

Then

    inf_{h>0} MISE(f_n(x; h)) = o( inf_{h>0} MISE(f̂_n(x; h)) )  as  n → ∞.    (4.9)

Proof. The main asymptotic term of the integrated variance of both estimators, conventional and sinc, has the form c/(nh); therefore, due to Lemma 4.1 (K(x) is scaled so that (4.5) holds),

    MISE(f̂_n(x; h)) ∼ g(h) + c_1/(nh)    (4.10)

and

    MISE(f_n(x; h)) ∼ g(h)ε(h) + c_2/(nh),    (4.11)

where g(h) → 0 and ε(h) → 0 as h → 0. (4.10) and (4.11) evidently imply (4.9).

Thus, under condition (4.4), the sinc estimator has a strictly better order of consistency than any conventional kernel estimator. Condition (4.4) can be satisfied even when f(x) has only a first derivative. For example, the density corresponding to the characteristic function (sin t/t)³, that is, the convolution of three standard uniform densities, does not have a second derivative at x = ±3, while |(sin t/t)³| ≤ |t|^{−3} = o(|t|^{−5/2}).
5. Comparison of the sinc estimator with superkernel estimators
A superkernel is defined as a non-conventional kernel K(x) whose Fourier transform has the form

    ψ(t) = 1 for |t| ≤ ∆,   ψ(t) = g(t) for ∆ ≤ |t| ≤ c∆,   ψ(t) = 0 for |t| > c∆,

where g(t) is a real-valued, even function satisfying the inequality |g(t)| ≤ 1 and chosen in such a way that ψ(t) is continuous. Superkernels were studied by Devroye (1992) in the one-dimensional case and by Politis and Romano (1999) in the multidimensional case. The sinc kernel can be considered as a limit case of superkernels as c → 1. Estimators based on superkernels have the same order of consistency as the sinc estimator: both are kernels of “infinite order”. Superkernels have one advantage compared to the sinc kernel: the corresponding estimates are integrable, although this advantage is rather technical than essential. Advantages of the sinc kernel compared to superkernels are simplicity (both for theoretical analysis and practical use) and a better solution of the problem of bandwidth selection.
Politis and Romano (1999) state that “c = 1 is a bad choice”, that is, that the sinc estimator is worse than superkernel estimators. In this section we compare the accuracy of the sinc estimator with that of the superkernel estimator most recommended by Politis and Romano (1999), namely, that for which c = 2 and g(t) is linear on the interval ∆ < t < 2∆. Thus

    ψ(t) = 1 for |t| ≤ ∆,   ψ(t) = 2 − |t|/∆ for ∆ ≤ |t| ≤ 2∆,   ψ(t) = 0 for |t| > 2∆.

The estimator based on this superkernel is denoted in this section by f̂_n(x; h). The sinc estimator is denoted, as earlier, by f_n(x; h). The value of ∆ does not play any role, therefore, for convenience, we suppose that ∆ < 1.
Now we compare the MISE of the sinc estimator and the superkernel estimator under consideration for a broad class of underlying distributions. We consider densities whose characteristic function ϕ(t) satisfies the condition

    |ϕ(t)|² = a/|t|^m  for  |t| > c,

where a, c and m are real positive constants, m > 3. Without loss of generality suppose that a = 1. Then, for sufficiently small h, namely for h < ∆/c, the integrated squared bias of both estimators can be calculated explicitly. Some elementary algebra shows that

    B(f_n(x; h)) = h^{m−1} / (π(m − 1))    (5.1)

and

    B(f̂_n(x; h)) = (h^{m−1} / (π∆^{m−1})) [ 1/(m − 1) − 2(2^{m−2} − 1)/((m − 2) 2^{m−2}) + (2^{m−3} − 1)/((m − 3) 2^{m−3}) ].    (5.2)
For the integrated variance of the estimators, the following equalities hold when h < ∆/c (again after some elementary algebra):

    πn V(f_n(x; h)) = 1/h − ∫_0^{∆/h} |ϕ(t)|² dt − (h^{m−1}/∆^{m−1}) (1 − ∆^{m−1})/(m − 1)    (5.3)

and

    πn V(f̂_n(x; h)) = 4∆/(3h) − ∫_0^{∆/h} |ϕ(t)|² dt − (h^{m−1}/∆^{m−1}) [ 4(2^{m−1} − 1)/((m − 1) 2^{m−1}) − 4(2^{m−2} − 1)/((m − 2) 2^{m−2}) + (2^{m−3} − 1)/((m − 3) 2^{m−3}) ].    (5.4)

Once again, ∆ does not play any role, so let us take it equal to 3/4 (then the main asymptotic term of the integrated variance of the superkernel estimator coincides with that of the sinc estimator, and the comparison becomes easier). For each m ≥ 4,

    4(2^{m−1} − 1)/((m − 1) 2^{m−1}) − 4(2^{m−2} − 1)/((m − 2) 2^{m−2}) + (2^{m−3} − 1)/((m − 3) 2^{m−3}) < (1 − (3/4)^{m−1})/(m − 1),

and therefore the right hand side of (5.4) is greater than the right hand side of (5.3); that is, the integrated variance of the sinc estimator is less than that of the superkernel estimator uniformly in h.
Consider the right hand sides of (5.1) and (5.2). For m ≤ 17,

    B(f_n(x; h)) > B(f̂_n(x; h)),

while for m ≥ 18,

    B(f_n(x; h)) < B(f̂_n(x; h))

uniformly in h, and the ratio B(f_n(x; h))/B(f̂_n(x; h)) decreases with m and tends to zero very fast as m tends to infinity (if, for example, m > 30, then B(f_n(x; h)) is more than ten times smaller than B(f̂_n(x; h))).
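This threshold is easy to check numerically; a small sketch (with ∆ = 3/4 as above, so that h^{m−1}/∆^{m−1} = (4/3)^{m−1} h^{m−1} multiplies the bracket in (5.2)):

    def bias_coefficients(m):
        """Coefficients of h^(m-1)/pi in (5.1) and (5.2), taking Delta = 3/4."""
        sinc = 1.0 / (m - 1)
        bracket = (1.0 / (m - 1)
                   - 2.0 * (2**(m - 2) - 1) / ((m - 2) * 2**(m - 2))
                   + (2**(m - 3) - 1) / ((m - 3) * 2**(m - 3)))
        superkernel = (4.0 / 3.0)**(m - 1) * bracket
        return sinc, superkernel

    for m in range(4, 25):
        s, sk = bias_coefficients(m)
        print(m, "sinc bias larger" if s > sk else "superkernel bias larger")

The printed comparison switches from "sinc bias larger" to "superkernel bias larger" between m = 17 and m = 18.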
Thus, for more or less smooth underlying densities (m = 17 approximately corresponds to the case when f(x) has five derivatives), the sinc estimator is more accurate than the superkernel estimator under consideration. For really smooth (many times differentiable) densities it is much more accurate. Moreover, both the integrated variance and the integrated squared bias of the sinc estimator are then smaller than those of the superkernel estimator uniformly in h. In the non-smooth case (the fifth derivative of f(x) does not exist), the accuracy of the two estimators is approximately the same: the integrated variance of the sinc estimator is smaller for each h, while its integrated squared bias is greater for each h. In that case the estimators can be compared only under the condition that the bandwidth is chosen to be optimal for each of them.
Summing up, and taking into account other advantages of the sinc estimator such as simplicity, a better solution of the bandwidth selection problem, etc., we must conclude that the sinc estimator is preferable to the superkernel estimator considered.
6. Bandwidth selection
The representation of the MISE given by Lemma 2.1 suggests relatively simple rules for selecting the smoothing parameter h. Due to the Corollary of Lemma 2.1,

    MISE(f_n(x; h)) = 1/(πnh) + R(f) − (1 + 1/n) (1/π) ∫_0^{1/h} |ϕ(t)|² dt.

Put δ = 1/h. Then

    ∂/∂δ MISE(f_n(x; h)) = 1/(πn) − (1 + 1/n) (1/π) |ϕ(δ)|²

and

    ∂²/∂δ² MISE(f_n(x; h)) = −(1 + 1/n) (1/π) ∂/∂δ |ϕ(δ)|².

Therefore the optimal δ (the δ minimizing the MISE) must be a root of the equation

    |ϕ(δ)| = 1/√(n + 1),

and the optimal bandwidth h_opt is a solution of

    |ϕ(1/h)| = 1/√(n + 1).    (6.1)
6.1 Normal rule

According to the normal scale rule, the bandwidth is selected so that it minimizes the MISE when the underlying distribution is normal with variance σ², where the unknown σ² is replaced by some estimator of it. For the sinc estimator this rule is as follows. For a normal underlying distribution, formula (6.1) becomes

    e^{−σ²/(2h²)} = 1/√(n + 1),

therefore the optimal value of h (if σ is known) is

    h_opt = σ/√(ln(n + 1)),

and the normal rule bandwidth is

    h_norm = σ̂/√(ln(n + 1)),

where σ̂ is some estimator of σ, for example the empirical standard deviation:

    σ̂ = [ (1/n) Σ_{j=1}^{n} (X_j − X̄)² ]^{1/2}.
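In code, the normal reference rule for the sinc estimator is a one-liner (a sketch assuming numpy; ddof=0 matches the divisor 1/n used above):

    import numpy as np

    def sinc_bandwidth_normal_rule(data):
        """Normal rule bandwidth h_norm = sigma_hat / sqrt(ln(n + 1)) for the sinc estimator."""
        data = np.asarray(data, dtype=float)
        sigma_hat = data.std(ddof=0)                     # empirical standard deviation (divisor n)
        return sigma_hat / np.sqrt(np.log(data.size + 1.0))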
6.2 Method based on the empirical characteristic function

Note that equation (6.1) always has a solution (since |ϕ(t)| → 0 as t → ∞) but it may be non-unique. Consider all solutions of equation (6.1) for which |ϕ(δ − 0)| > 1/√(n + 1) and |ϕ(δ + 0)| < 1/√(n + 1). Denote them by δ_1, ..., δ_m (suppose for simplicity that δ_1 < δ_2 < ... < δ_m). Since |ϕ(δ)| decreases in some neighbourhood of each δ_i,

    ∂/∂δ |ϕ(δ)|² |_{δ=δ_i} < 0

and therefore

    ∂²/∂δ² MISE(f_n(x; h)) |_{δ=δ_i} > 0.

Thus each δ_i is a local minimum of the MISE.
The global minimum can be found by computing and comparing the MISE at h = 1/δ_1, ..., 1/δ_m. This does not lead to large computational expense because, if one uses (2.3) for the computation, the first integral on the right hand side of (2.3) for δ = δ_2, ..., δ_m is a part of this integral for δ = δ_1, while the second integral for δ = δ_1, ..., δ_{m−1} is a part of this integral for δ = δ_m.
The characteristic function ϕ(t) is, however, unknown, so the procedure described above is applied to the empirical characteristic function ϕ_n(t) instead of ϕ(t). Here one must take into account that ϕ_n(t) is an almost periodic function, and the equation

    |ϕ_n(δ)| = 1/√(n + 1)

has infinitely many roots. Therefore, since

    lim_{n→∞} sup_{|t| ≤ T_n} |ϕ_n(t) − ϕ(t)| = 0

(see Csörgő and Totik, 1983), where T_n → ∞ and log T_n = o(n) as n → ∞, it suffices to consider only roots on the interval [0, e^n]. Of course it is not necessary to calculate ϕ_n(t) on such a wide interval: all roots of this interval are contained in the interval [0, √n] and, in practice, in a much shorter interval.
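One possible implementation of this search (a hedged sketch, not the authors' code) scans a grid on [0, √n] for points where |ϕ_n| crosses the level 1/√(n + 1) from above and returns the corresponding candidate bandwidths h = 1/δ_i; the final choice among the candidates would then compare the corresponding (estimated) MISE values as described above:

    import numpy as np

    def candidate_bandwidths(data, grid_points=5000):
        """Candidate bandwidths 1/delta_i from downcrossings of |phi_n| at the level 1/sqrt(n+1)."""
        data = np.asarray(data, dtype=float)
        n = data.size
        level = 1.0 / np.sqrt(n + 1.0)
        t = np.linspace(1e-6, np.sqrt(n), grid_points)              # search interval [0, sqrt(n)]
        ecf = np.abs(np.exp(1j * np.outer(t, data)).mean(axis=1))   # |phi_n(t)| on the grid (fine for moderate n)
        down = (ecf[:-1] > level) & (ecf[1:] <= level)              # downcrossings of the level
        deltas = t[1:][down]
        if deltas.size == 0:                                        # fall back to the normal rule of Section 6.1
            return np.array([data.std(ddof=0) / np.sqrt(np.log(n + 1.0))])
        return 1.0 / deltas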
7. Uniform consistency and estimation of the mode
In this section we prove that the sinc estimator is uniformly consistent: it converges (in probability)
to the true density function uniformly over the whole real line, and that the mode of the sinc estimator
is a consistent estimator of the mode of the true density function. We formulate and prove results in
the simplest form, leaving possible generalizations to the reader.
Let K(x) be a symmetric, differentiable, conventional kernel with finite variance σ². Suppose also that its derivative has finite total variation, which we denote by v. Denote the characteristic function of K(x) by ψ(t) and the kernel estimator based on K(x) by f̂_n(x; h). As before, f_n(x; h) denotes the sinc estimator.
Lemma 7.1. Let the characteristic function ϕ(t) of the underlying distribution be integrable:

    ∫ |ϕ(t)| dt < ∞.

Then

    sup_x |f_n(x; h) − f̂_n(x; h)| → 0 a.s.  as  n → ∞, h → 0, nh → ∞.

Proof.

    sup_x |f_n(x; h) − f̂_n(x; h)| ≤ (1/(2π)) [ ∫ |ψ(ht)| · |ϕ_n(t) − ϕ(t)| dt + ∫ |ϕ(t)| · |ψ(ht) − I_{[−1/h, 1/h]}(t)| dt + ∫_{−1/h}^{1/h} |ϕ_n(t) − ϕ(t)| dt ].
Now we prove that each of the three integrals in the square brackets tends to zero as n → ∞, h → 0, nh → ∞. Denote these integrals by I_1, I_2 and I_3, respectively. Then

    I_1 ≤ ∫_{|t| ≤ n²} |ϕ_n(t) − ϕ(t)| dt + 4 ∫_{n²}^{∞} |ψ(ht)| dt.

The first integral on the right hand side almost surely converges to zero as n → ∞ due to Theorem 1 of Csörgő and Totik (1983). To estimate the second integral, we use the inequality

    |ψ(t)| ≤ v/t²,

which holds for all t, see Ushakov and Ushakov (2000). Making use of this inequality, we obtain

    ∫_{n²}^{∞} |ψ(ht)| dt ≤ v/(n²h²) → 0  as  nh → ∞.

So, I_1 → 0 a.s. as n → ∞, nh → ∞.
    I_2 ≤ 2 ∫_0^{1/√h} [1 − ψ(ht)] dt + 4 ∫_{1/√h}^{∞} |ϕ(t)| dt.

The second integral tends to zero as h → 0 because |ϕ(t)| is integrable. To estimate the first integral, we use the inequality

    ψ(t) ≥ 1 − σ²t²/2,

which holds for all t, see Ushakov (1999). Making use of this inequality, we obtain

    ∫_0^{1/√h} [1 − ψ(ht)] dt ≤ (σ²/6) √h → 0  as  h → 0.

Thus I_2 → 0 as h → 0.
Finally, if nh → ∞, then 1/h ≤ cn for some constant c and all sufficiently large n, and therefore

    I_3 ≤ ∫_{|t| ≤ cn} |ϕ_n(t) − ϕ(t)| dt → 0 a.s.  as  n → ∞

due to the mentioned theorem of Csörgő and Totik (1983).

Remark. The condition of the lemma (ϕ is integrable) implies that f(x) is uniformly continuous, but it is a little more restrictive. It is satisfied, for example, when f(x) is differentiable and f′(x) has finite total variation.
Theorem 7.1. Let ϕ(t) be integrable. Then

    sup_x |f_n(x; h) − f(x)| → 0 in probability  as  n → ∞, h → 0, nh² → ∞.    (7.1)

Proof. Let K(x) be any conventional kernel satisfying the conditions of both Lemma 7.1 of this section and Theorem 3A of Parzen (1962). Then, due to Theorem 3A of Parzen (1962),

    sup_x |f̂_n(x; h) − f(x)| → 0 in probability  as  n → ∞, h → 0, nh² → ∞,    (7.2)

and, due to Lemma 7.1,

    sup_x |f_n(x; h) − f̂_n(x; h)| → 0 in probability  as  n → ∞, h → 0, nh² → ∞.    (7.3)

(7.2) and (7.3) evidently imply (7.1).
Denote a mode of f(x) by θ. Suppose it is unique. Let θ_n be a mode of the sinc estimate.

Theorem 7.2. Let ϕ(t) be integrable. Then

    θ_n → θ in probability  as  n → ∞, h → 0, nh² → ∞.

The proof of the theorem coincides with that of the second part of Theorem 3A of Parzen (1962).
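Theorem 7.2 suggests estimating the mode by maximizing the sinc estimate; a self-contained grid-search sketch (assuming numpy; the grid range and resolution are arbitrary choices):

    import numpy as np

    def sinc_mode_estimate(data, h, grid_points=2001):
        """Mode of the sinc estimate f_n(x; h), located by a grid search."""
        data = np.asarray(data, dtype=float)
        pad = 3.0 * h
        grid = np.linspace(data.min() - pad, data.max() + pad, grid_points)
        u = (grid[:, None] - data[None, :]) / h
        fn = np.sinc(u / np.pi).mean(axis=1) / (np.pi * h)   # sinc estimate on the grid
        return grid[np.argmax(fn)]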
8. Inequalities
In this section, we derive some upper bounds for the MISE of the sinc estimator. In addition to their practical importance (evaluation of the sample size sufficient for achieving a given accuracy, etc.), these inequalities throw more light on the properties of the sinc estimator (especially for finite samples) and make its comparison with other estimators easier.
We define the 0-th derivative of a function as the function itself: f^{(0)}(x) = f(x), as usual.
We need below the following form of the Parseval equality: let f(x) be an m times differentiable probability density function (m ≥ 0), let its m-th derivative f^{(m)}(x) be square integrable, and let ϕ(t) be the corresponding characteristic function. Then

    ∫ (f^{(m)}(x))² dx = (1/(2π)) ∫ t^{2m} |ϕ(t)|² dt.    (8.1)
Theorem 8.1. Let f(x) be m times differentiable (m ≥ 0), and let its m-th derivative be square integrable. Then

    MISE(f_n(x; h)) < ε(h) h^{2m} R(f^{(m)}) + 1/(πnh),    (8.2)

where ε(h) ≤ 1 for all h and ε(h) → 0 as h → 0.
Proof. Estimate the first summand on the right hand side of (2.3). Making use of (8.1), we obtain

    (1/(2π)) ∫_{|t| > 1/h} |ϕ(t)|² dt = h^{2m} (1/(2π)) ∫_{|t| > 1/h} (1/h)^{2m} |ϕ(t)|² dt ≤ h^{2m} (1/(2π)) ∫_{|t| > 1/h} t^{2m} |ϕ(t)|² dt =

    = h^{2m} (1/(2π)) [ ∫ t^{2m} |ϕ(t)|² dt − ∫_{−1/h}^{1/h} t^{2m} |ϕ(t)|² dt ] = h^{2m} (1/(2π)) ∫ t^{2m} |ϕ(t)|² dt [ 1 − ∫_{−1/h}^{1/h} t^{2m} |ϕ(t)|² dt / ∫ t^{2m} |ϕ(t)|² dt ] =

    = ε(h) h^{2m} ∫ (f^{(m)}(x))² dx = ε(h) h^{2m} R(f^{(m)}),

where

    ε(h) = 1 − ∫_{−1/h}^{1/h} t^{2m} |ϕ(t)|² dt / ∫ t^{2m} |ϕ(t)|² dt

evidently satisfies the conditions of the theorem: ε(h) ≤ 1 and ε(h) → 0 as h → 0.
For the second summand on the right hand side of (2.3) we have

    (1/n) · (1/(2π)) ∫_{−1/h}^{1/h} (1 − |ϕ(t)|²) dt < (1/n) · (1/(2π)) ∫_{−1/h}^{1/h} dt = 1/(πnh).

Thus we finally obtain (8.2).
Corollary 1. Let the conditions of Theorem 8.1 be satisfied. Then

    MISE(f_n(x; h)) < h^{2m} R(f^{(m)}) + 1/(πnh).    (8.3)

Putting in (8.3)

    h = ( 1/(2πnm R(f^{(m)})) )^{1/(2m+1)}

(this h minimizes the right hand side of (8.3)), we get

Corollary 2. Let the conditions of Theorem 8.1 be satisfied. Then

    inf_{h>0} MISE(f_n(x; h)) < (1 + 2m) (2πm)^{−2m/(2m+1)} ( R(f^{(m)}) )^{1/(2m+1)} n^{−2m/(2m+1)}.

Corollary 3. Let the conditions of Theorem 8.1 be satisfied. Then

    inf_{h>0} MISE(f_n(x; h)) = o( n^{−2m/(2m+1)} ),  n → ∞.
Corollary 4. If f(x) is two times differentiable, and its second derivative is square integrable, then

    inf_{h>0} MISE(f_n(x; h)) = o(n^{−4/5}),  n → ∞,

and

    inf_{h>0} MISE(f_n(x; h)) < (5/(4π)) (4π R(f″))^{1/5} n^{−4/5}.
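As a small numerical illustration (not in the paper): for the standard normal density one has R(f″) = 3/(8√π), so the bound of Corollary 4 can be evaluated directly; for n = 1000 it gives roughly 0.0019, comfortably above the exact optimal value 0.000611 reported in Table 1. A sketch assuming numpy:

    import numpy as np

    def corollary4_bound(n, r2):
        """Upper bound of Corollary 4: (5/(4 pi)) (4 pi R(f''))^(1/5) n^(-4/5)."""
        return 5.0 / (4.0 * np.pi) * (4.0 * np.pi * r2)**0.2 * n**(-0.8)

    r2_normal = 3.0 / (8.0 * np.sqrt(np.pi))    # R(f'') for the standard normal density
    print(corollary4_bound(1000, r2_normal))    # about 0.0019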
To obtain more sensitive and accurate estimates we use one more characteristic of the density to be estimated: its total variation (or the total variation of its derivatives). For a function g(x), we denote its total variation by Vr(g). In the general case, the two conditions, finiteness of the total variation and square integrability, are not comparable: a function may have finite variation but not be square integrable, and vice versa. But for densities, square integrability is milder than finiteness of the total variation: a density having bounded variation is square integrable. Thus finiteness of the total variation of a density (or of its derivatives) is a slightly more restrictive condition than its square integrability. On the other hand, the assumption that some derivative of the density to be estimated has finite total variation (together with the use of the sinc kernel) allows one to improve the order of decrease of the MISE. For example, if the density is two times differentiable and Vr(f″) < ∞, then MISE(f_n(x; h)) = O(n^{−5/6}), n → ∞, instead of o(n^{−4/5}) (see below).
Theorem 8.2. Let f(x) be m times differentiable (m ≥ 0), and let its m-th derivative have finite total variation. Then

    MISE(f_n(x; h)) ≤ h^{2m+1} (Vr(f^{(m)}))² / ((2m + 1)π) + 1/(πnh).    (8.4)
Proof. For all t,

    |ϕ(t)| ≤ Vr(f^{(m)}) / |t|^{m+1},

see Ushakov and Ushakov (2000). Making use of this inequality, estimate the first summand on the right hand side of (2.3):

    (1/(2π)) ∫_{|t| > 1/h} |ϕ(t)|² dt ≤ ( (Vr(f^{(m)}))² / π ) ∫_{1/h}^{∞} dt / t^{2m+2} = ( (Vr(f^{(m)}))² / ((2m + 1)π) ) h^{2m+1}.

For the second summand on the right hand side of (2.3) we have (see the proof of Theorem 8.1)

    (1/n) · (1/(2π)) ∫_{−1/h}^{1/h} (1 − |ϕ(t)|²) dt < 1/(πnh).

So, we obtain (8.4).
Corollary. Let the conditions of Theorem 8.2 be satisfied. Then

    inf_{h>0} MISE(f_n(x; h)) ≤ ( 2(m + 1)/((2m + 1)π) ) ( Vr(f^{(m)}) )^{1/(m+1)} n^{−(2m+1)/(2m+2)}.

For example, if m = 2, we get

    inf_{h>0} MISE(f_n(x; h)) ≤ (6/(5π)) ( Vr(f″) )^{1/3} n^{−5/6}.
Following Watson and Leadbetter (1963) and Davis (1975), we will say that a characteristic function ϕ(t) decreases exponentially with degree α and coefficient ρ (ρ > 0, 0 < α ≤ 2) if

    |ϕ(t)| ≤ A e^{−ρ|t|^α}    (8.5)

for some constant A. Davis (1975, Theorem 4.1) proved that if the characteristic function of the density to be estimated satisfies (8.5), then

    lim_{n→∞} h e^{ρ/h^α} |B(f_n(x; h))| = 0.
The next theorem makes this more precise.
Theorem 8.3. Let

    (1/(2π)) ∫ e^{ρ|t|^α} |ϕ(t)|² dt = C < ∞.

Then

    MISE(f_n(x; h)) ≤ ε(h) C e^{−ρ/h^α} + 1/(πnh),    (8.6)

where 0 < ε(h) < 1 and ε(h) → 0 as h → 0.

The proof is similar to that of Theorem 8.1. For the first summand on the right hand side of (2.3), we have

    (1/(2π)) ∫_{|t| > 1/h} |ϕ(t)|² dt < e^{−ρ/h^α} (1/(2π)) ∫_{|t| > 1/h} e^{ρ|t|^α} |ϕ(t)|² dt = ε(h) C e^{−ρ/h^α},

where

    ε(h) = 1 − ∫_{−1/h}^{1/h} e^{ρ|t|^α} |ϕ(t)|² dt / ∫ e^{ρ|t|^α} |ϕ(t)|² dt.
It is difficult to find explicitly the h minimizing the right hand side of (8.6); therefore we take an h for which the right hand side of (8.6) has a simple form. Namely, put

    h = ( (1/ρ) ln n )^{−1/α};

then

    MISE(f_n(x; h)) < ( C + (ln n)^{1/α} / (π ρ^{1/α}) ) (1/n) < ( C + 1/(π ρ^{1/α}) ) (ln n)^{1/α} / n,

provided n > 2. If, for example, f(x) is the standard normal density, then (taking α = 2 and ρ = 1/2)

    MISE(f_n(x; h)) < ( 1/√(2π) + √2/π ) √(ln n) / n.
Corollary. Let the characteristic function ϕ(t) of f(x) decrease exponentially with degree α and coefficient ρ (i.e. satisfy (8.5)). Then

    lim_{n→∞} e^{cρ/h^α} |B(f_n(x; h))| = 0

for any c < 2.
9. Estimation of derivatives
The sinc estimator is especially superior to conventional estimators when a derivative of f(x) is estimated. Let f(x) be r times differentiable. Suppose that one needs to estimate the r-th derivative f^{(r)}(x). A natural way is to estimate f^{(r)}(x) by the r-th derivative of a kernel estimator of f(x), provided that the kernel is r times differentiable. So, let

    f̂_n(x; h) = (1/(nh)) Σ_{j=1}^{n} K( (x − X_j)/h )

be a kernel estimator of f(x), and suppose that K^{(r)}(x) exists. Then the estimator

    f̂_n^{(r)}(x; h) = (1/(n h^{r+1})) Σ_{j=1}^{n} K^{(r)}( (x − X_j)/h )

is used for estimation of f^{(r)}(x).
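For the sinc kernel, the same construction can be carried out in the Fourier domain: the Fourier transform of the sinc estimate is ϕ_n(t) on [−1/h, 1/h] and zero outside, so its r-th derivative is obtained by inverting the truncated transform with an extra factor (−it)^r. A numerical sketch (assuming numpy and the inversion convention f(x) = (1/(2π)) ∫ ϕ(t) e^{−itx} dt; the quadrature grid size is an arbitrary choice):

    import numpy as np

    def sinc_derivative_estimate(x, data, h, r, n_t=4096):
        """r-th derivative of the sinc estimate at a scalar point x, via the truncated inverse transform."""
        data = np.asarray(data, dtype=float)
        t = np.linspace(-1.0 / h, 1.0 / h, n_t)                    # frequencies kept by the sinc kernel
        ecf = np.exp(1j * np.outer(t, data)).mean(axis=1)          # empirical characteristic function phi_n(t)
        integrand = (-1j * t)**r * ecf * np.exp(-1j * t * x)       # (-it)^r phi_n(t) e^{-itx}
        return np.real(np.trapz(integrand, t)) / (2.0 * np.pi)

With r = 0 this reduces to the sinc estimator itself, since (1/(2π)) ∫_{−1/h}^{1/h} e^{−it(x−X_j)} dt = sin[(x − X_j)/h]/(π(x − X_j)).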
If K(x) is a conventional kernel, then the MISE of the estimator f̂_n^{(r)}(x; h) is represented in the form (provided that f(x) has r + 2 derivatives and K(x) has a finite second moment)

    MISE(f̂_n^{(r)}(x; h)) = (1/4) h⁴ µ_2(K)² R(f^{(r+2)}) + (1/(n h^{2r+1})) R(K^{(r)}) + o( h⁴ + 1/(n h^{2r+1}) ),  h → 0.

Therefore the optimal MISE is of order n^{−4/(2r+5)}, i.e. the rate becomes slower for higher values of r, and the difficulty increases, see Wand and Jones (1995) and Stone (1982). In this section, we show that the sinc estimator is free of this difficulty: the estimator f_n^{(r)}(x; h) of f^{(r)}(x) has almost the same order of consistency as the estimator f_n(x; h) of f(x), even for large values of r (of course, provided the density to be estimated is smooth enough). First we formulate an analog of Lemma 2.1 for derivatives.
Lemma 9.1. For the sinc estimator,

    MISE(f_n^{(r)}(x; h)) = (1/(2π)) ∫_{|t| > 1/h} t^{2r} |ϕ(t)|² dt + (1/n) · (1/(2π)) ∫_{−1/h}^{1/h} t^{2r} (1 − |ϕ(t)|²) dt,    (9.1)

where ϕ(t) is the characteristic function of the density to be estimated.
Proof. Due to the Parseval–Plancherel identity we have

    MISE(f_n^{(r)}(x; h)) = E ∫ ( f_n^{(r)}(x; h) − f^{(r)}(x) )² dx = (1/(2π)) E ∫ t^{2r} | ϕ_n(t) I_{[−1/h, 1/h]}(t) − ϕ(t) |² dt =

    = (1/(2π)) ∫ t^{2r} E[ ( ϕ_n(t) I_{[−1/h, 1/h]}(t) − ϕ(t) ) ( \overline{ϕ_n(t)} I_{[−1/h, 1/h]}(t) − \overline{ϕ(t)} ) ] dt =

    = (1/(2π)) ∫_{−1/h}^{1/h} t^{2r} E|ϕ_n(t)|² dt − (1/(2π)) ∫_{−1/h}^{1/h} t^{2r} ( ϕ(t) E\overline{ϕ_n(t)} + \overline{ϕ(t)} Eϕ_n(t) ) dt + (1/(2π)) ∫ t^{2r} |ϕ(t)|² dt.    (9.2)

It is easy to see that

    E ϕ_n(t) = ϕ(t),    (9.3)

    E \overline{ϕ_n(t)} = \overline{ϕ(t)},    (9.4)

and

    E |ϕ_n(t)|² = E | (1/n) Σ_{j=1}^{n} e^{itX_j} |² = E [ (1/n) Σ_{j=1}^{n} e^{itX_j} · (1/n) Σ_{k=1}^{n} e^{−itX_k} ] = (1/n²) [ n + Σ_{j≠k} E e^{it(X_j − X_k)} ] = 1/n + (1 − 1/n) |ϕ(t)|².    (9.5)

Substituting (9.3)–(9.5) into the right hand side of (9.2), we obtain (9.1).
Making use of Lemma 9.1 we obtain the following analogs of the theorems of Section 8.

Theorem 9.1. Let f(x) be r + m times differentiable (r, m ≥ 0), and let its (r + m)-th derivative be square integrable. Then

    MISE(f_n^{(r)}(x; h)) < ε(h) h^{2m} R(f^{(r+m)}) + 1/(π(2r + 1) n h^{2r+1}),

where ε(h) ≤ 1 for all h and ε(h) → 0 as h → 0.

Corollary. Let the conditions of Theorem 9.1 be satisfied. Then

    inf_{h>0} MISE(f_n^{(r)}(x; h)) ≤ C_{m,r} ( R(f^{(r+m)}) )^{(2r+1)/(2r+2m+1)} n^{−2m/(2r+2m+1)},

where

    C_{m,r} = (2πm)^{−2m/(2r+2m+1)} + (2πm)^{(2r+1)/(2r+2m+1)} / (π(2r + 1)).
Theorem 9.2. Let

    (1/(2π)) ∫ t^{2r} e^{ρ|t|^α} |ϕ(t)|² dt = C < ∞.

Then

    MISE(f_n^{(r)}(x; h)) ≤ ε(h) C e^{−ρ/h^α} + 1/(π n h^{2r+1}),

where 0 < ε(h) < 1 and ε(h) → 0 as h → 0.

Corollary. Let the conditions of Theorem 9.2 be satisfied. Then

    inf_{h>0} MISE(f_n^{(r)}(x; h)) < ( C + (ln n)^{(2r+1)/α} / (π ρ^{(2r+1)/α}) ) (1/n).
Theorem 9.3. Let the characteristic function ϕ(t) of f(x) satisfy the following condition: there exists T > 0 such that ϕ(t) = 0 for |t| > T. Then, if

    h ≤ 1/T,

then

    MISE(f_n^{(r)}(x; h)) ≤ 1/(π n h^{2r+1}).

In particular, if h = const = 1/T, then

    MISE(f_n^{(r)}(x; h)) ≤ T^{2r+1} / (πn).

The proofs of Theorems 9.1–9.3 are similar to those of the theorems of Section 8, and we therefore leave them to the reader.
References

Csörgő, S. and Totik, V. (1983). On how long interval is the empirical characteristic function uniformly consistent? Acta Sci. Math., 45, 141-149.
Davis, K.B. (1975). Mean square error properties of density estimates. Ann. Statist., 3, no. 4, 1025-1030.
Davis, K.B. (1977). Mean integrated square error properties of density estimates. Ann. Statist., 5, no. 3, 530-535.
Devroye, L. (1992). A note on the usefulness of superkernels in density estimates. Ann. Statist., 20, 2037-2056.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability. Chapman and Hall, London.
Glad, I.K., Hjort, N.L. and Ushakov, N.G. (2003). Correction of density estimators that are not densities. Scand. J. Statist., 30, no. 2, 415-427.
Ibragimov, I.A. and Khas'minskii, R.Z. (1982). Estimation of distribution density belonging to a class of entire functions. Theory Probab. Applic., 27, no. 3, 551-562.
Loève, M. (1977). Probability Theory. Springer, Berlin.
Parzen, E. (1962). On estimation of a probability density function and its mode. Ann. Math. Statist., 33, 1065-1076.
Politis, D.N. and Romano, J.P. (1999). Multivariate density estimation with general flat-top kernels of infinite order. J. Multiv. Anal., 68, 1-25.
Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10, no. 4, 1040-1053.
Ushakov, N.G. (1999). Selected Topics in Characteristic Functions. VSP, Utrecht.
Ushakov, V.G. and Ushakov, N.G. (2000). Some inequalities for characteristic functions of densities with bounded variation. Moscow Univ. Comput. Math. Cybernet., no. 3, 45-52.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.
Watson, G.S. and Leadbetter, M.R. (1963). On the estimation of the probability density, I. Ann. Math. Statist., 34, 480-491.