NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET

Density Estimation Using the Sinc Kernel

by

Ingrid K. Glad, Nils Lid Hjort and Nikolai Ushakov

PREPRINT
STATISTICS NO. 2/2007

NORWEGIAN UNIVERSITY OF SCIENCE AND TECHNOLOGY
TRONDHEIM, NORWAY

This report has URL http://www.math.ntnu.no/preprint/statistics/2007/S2-2007.ps
Nikolai Ushakov has homepage: http://www.math.ntnu.no/~ushakov
E-mail: ushakov@stat.ntnu.no
Address: Department of Mathematical Sciences, Norwegian University of Science and Technology, N-7491 Trondheim, Norway.
Density Estimation Using the Sinc Kernel
Ingrid K. Glad, Nils Lid Hjort
Department of Mathematics, University of Oslo, Norway
and
Nikolai Ushakov
Department of Mathematical Sciences
Norwegian University of Science and Technology
Trondheim, Norway
Abstract
This paper deals with the kernel density estimator based on the so-called sinc (or Fourier integral) kernel $K(x) = (\pi x)^{-1} \sin x$. We study in detail both asymptotic and finite sample properties of this estimator. It is shown that, contrary to widespread opinion, the sinc estimator is superior to other estimators in many respects: it is more accurate for quite moderate values of the sample size, has better asymptotics in the non-smooth case (when the density to be estimated has only a first derivative), and is more convenient for bandwidth selection.
Key words: Kernel estimation, Sinc kernel, Fourier integral kernel, Mean integrated squared error,
Superkernels, Empirical characteristic function, Finite samples, Inequalities
Contents
1. Introduction
2. The MISE of the estimator
3. Comparison of the exact MISE of the sinc estimator and conventional estimators (examples)
4. Asymptotic superiority of the sinc estimator to conventional kernel estimators
5. Comparison of the sinc estimator with superkernel estimators
6. Bandwidth selection
7. Uniform consistency and estimation of the mode
8. Inequalities
9. Sinc estimator of derivatives
1. Introduction
Let $X_1, \ldots, X_n$ be independent and identically distributed random variables with common probability density function $f(x)$. We consider the problem of estimating $f(x)$ nonparametrically. One of the most popular methods is the kernel estimator
$$ f_n(x; h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i), \qquad (1.1) $$
where $K_h(x) = h^{-1} K(h^{-1} x)$, $K(x)$ is a kernel function (usually symmetric) and $h$ is the smoothing parameter (bandwidth).
Typically, $K(x)$ is taken to be a probability density with at least a couple of finite moments; this ensures that $f_n(x; h)$ itself becomes a density function, and methods of Taylor expansion and so on make it possible to analyse its behaviour to a satisfactory degree. Recent monographs dealing with general aspects of kernel density estimators include Wand and Jones (1995) and Fan and Gijbels (1996).
However, the present paper deals with a non-standard choice for $K(x)$, namely the so-called sinc kernel
$$ K_s(x) = \frac{\sin x}{\pi x} $$
with the Fourier transform (characteristic function)
$$ \psi_s(t) = \begin{cases} 1 & \text{for } |t| \le 1, \\ 0 & \text{for } |t| > 1. \end{cases} $$
The sinc kernel is not in $L_1$, that is, its absolute value has infinite integral, but it is square integrable, and in addition it is integrable in the sense of the Cauchy principal value with
$$ \mathrm{v.p.} \int_{-\infty}^{\infty} K_s(x)\,dx = 1, $$
in which
$$ \mathrm{v.p.} \int_{-\infty}^{\infty} = \lim_{T \to \infty} \lim_{\varepsilon \to 0} \left[ \int_{-T}^{-\varepsilon} + \int_{\varepsilon}^{T} \right]. $$
(In the following we will omit integration limits when the integral is to be taken over the full real line.) Correspondingly, we have
$$ \psi_s(t) = \mathrm{v.p.} \int e^{itx} K_s(x)\,dx. $$
Sometimes the sinc kernel is defined as $K(x) = \sin(\pi x)/(\pi x)$, with the Fourier transform
$$ \psi(t) = \begin{cases} 1 & \text{for } |t| \le \pi, \\ 0 & \text{for } |t| > \pi. \end{cases} $$
Both functions $\sin x/(\pi x)$ and $\sin(\pi x)/(\pi x)$ integrate to one in the sense of the principal value, and the difference is only in the scale parameter.
The sinc kernel is a "non-conventional" kernel: it takes negative values and is not integrable in the ordinary sense (we will say that a kernel is conventional if it is a probability density function, i.e. it is non-negative and integrates to one; kernel estimators based on conventional kernels will be called conventional estimators). Accordingly, realizations of the kernel estimator based on the sinc kernel (we will call it the sinc estimator; in some works it is called the FIE, Fourier integral estimator, see for example Davis, 1975 and Davis, 1977) are not probability density functions. This defect, however, can easily be corrected without loss of performance (see Glad et al., 2003).
The sinc estimator has excellent asymptotic properties compared to conventional kernel estimators when the density to be estimated is smooth, i.e. has several derivatives; see Davis (1975) and Davis (1977). If, for example, the density to be estimated is an analytic function of a certain type, then the mean integrated squared error of the sinc estimator decreases as $n^{-1}$ as $n \to \infty$, while no conventional estimator can achieve a rate of convergence better than $n^{-4/5}$. It is believed, however, that the performance of the sinc estimator is good, roughly speaking, only for very large $n$, and only for very smooth $f(x)$. In addition, it is believed that even for very large $n$ and very smooth $f(x)$, the sinc estimator is inferior to kernel estimators based on so-called superkernels: non-conventional kernels whose Fourier transform is continuous and equals 1 in some neighbourhood of the origin. In this work, we try to show that these beliefs are unjust.
In Section 3, we present examples demonstrating that the sinc estimator is more accurate than conventional estimators for quite moderate values of the sample size. In Section 4, we prove that the sinc estimator is asymptotically superior to conventional estimators (has a strictly better order of consistency) even when $f(x)$ has only one derivative. In Section 5, we compare the sinc estimator with a superkernel estimator and show that the sinc estimator has better properties; in particular, it is more accurate. In Section 6, we consider the problem of bandwidth selection. This problem is solved more easily and more effectively for the sinc estimator than for other estimators. The problem of estimation of the mode is studied in Section 7. Some useful inequalities for the MISE of the sinc estimator are obtained in Section 8. The inequalities also show that the performance of the sinc estimator is good not only for large sample sizes but for moderate and even small ones too. In Section 9, we study the problem of estimating derivatives. Here the sinc estimator is especially effective compared with other kernel estimators.
2. The MISE of the estimator
Let $\hat{f}_n(x)$ be an estimator of $f(x)$ associated with the sample $X_1, \ldots, X_n$. The customary performance criterion for density estimators is the mean integrated squared error (MISE), which is defined as
$$ \mathrm{MISE}(\hat{f}_n(x)) = \mathrm{E} \int [\hat{f}_n(x) - f(x)]^2\,dx. $$
In practice one seeks methods to minimise the MISE function. The MISE is the sum of the integrated squared bias (denote it by $B(\hat{f}_n(x))$) and the integrated variance (denote it by $V(\hat{f}_n(x))$) of the estimator.
Denote the characteristic function of the random variables $X_j$ by $\varphi(t)$ and the empirical characteristic function associated with the sample $X_1, \ldots, X_n$ by $\varphi_n(t)$:
$$ \varphi(t) = \mathrm{E}\, e^{itX_j}, \qquad \varphi_n(t) = \frac{1}{n} \sum_{j=1}^{n} e^{itX_j}. $$
Then the sinc estimator (in the rest of the paper, we denote it by $f_n(x; h)$) is
$$ f_n(x; h) = \frac{1}{\pi n} \sum_{j=1}^{n} \frac{\sin[(x - X_j)/h]}{x - X_j}, $$
and its characteristic function equals $\varphi_n(t)\psi_s(ht) = \varphi_n(t) I_{[-1/h, 1/h]}(t)$, where, as usual, $I_A(t)$ denotes the indicator of the set $A$.
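For readers who want to experiment, the estimator can be coded directly from this sum. The following minimal sketch in Python is our own illustration (function names and the example data are ours, not the paper's); it uses the fact that numpy's sinc function is $\sin(\pi z)/(\pi z)$.

```python
import numpy as np

def sinc_estimator(x, data, h):
    """Sinc kernel density estimate:
    f_n(x; h) = (1/(pi*n)) * sum_j sin((x - X_j)/h) / (x - X_j)."""
    x = np.atleast_1d(x)
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h          # (x - X_j) / h
    # np.sinc(z) = sin(pi*z)/(pi*z), so sin(u)/u = np.sinc(u/pi);
    # this also handles the removable singularity at u = 0.
    return np.mean(np.sinc(u / np.pi), axis=1) / (np.pi * h)

# Example: n = 100 draws from a standard normal.
rng = np.random.default_rng(0)
data = rng.standard_normal(100)
grid = np.linspace(-4.0, 4.0, 201)
fhat = sinc_estimator(grid, data, h=0.4)
```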
For a real valued function $g(x)$ we will use the following notation, provided the integrals exist:
$$ \mu_k(g) = \int |x|^k g(x)\,dx, \quad k = 0, 1, 2, \ldots, \qquad R(g) = \int g^2(x)\,dx. $$
The following lemma will be used frequently in what follows.
Lemma 2.1. For the sinc estimator,
$$ B(f_n(x; h)) = \frac{1}{2\pi} \int_{|t|>1/h} |\varphi(t)|^2\,dt, \qquad (2.1) $$
$$ V(f_n(x; h)) = \frac{1}{n} \cdot \frac{1}{2\pi} \int_{-1/h}^{1/h} (1 - |\varphi(t)|^2)\,dt, \qquad (2.2) $$
and
$$ \mathrm{MISE}(f_n(x; h)) = \frac{1}{2\pi} \int_{|t|>1/h} |\varphi(t)|^2\,dt + \frac{1}{n} \cdot \frac{1}{2\pi} \int_{-1/h}^{1/h} (1 - |\varphi(t)|^2)\,dt, \qquad (2.3) $$
where $\varphi(t)$ is the characteristic function of the density to be estimated.
Corollary.
$$ \mathrm{MISE}(f_n(x; h)) = \frac{1}{\pi n h} + R(f) - \left(1 + \frac{1}{n}\right) \frac{1}{\pi} \int_0^{1/h} |\varphi(t)|^2\,dt. \qquad (2.4) $$
Equalities (2.1)–(2.4) can be found for example in Davis (1977).
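Since (2.3) expresses the MISE entirely through $|\varphi(t)|^2$, it is straightforward to evaluate numerically for any distribution whose characteristic function is known. The sketch below is our own illustration (the function names and the choice of integration routine are ours); the standard normal, with $|\varphi(t)|^2 = e^{-t^2}$, serves as an example.

```python
import numpy as np
from scipy.integrate import quad

def sinc_mise(phi_sq, n, h):
    """Exact MISE of the sinc estimator, formula (2.3);
    phi_sq(t) must return |phi(t)|^2 (an even function)."""
    # Integrated squared bias: (1/2pi) * integral over |t| > 1/h,
    # i.e. (1/pi) * integral from 1/h to infinity by symmetry.
    bias2 = quad(phi_sq, 1.0 / h, np.inf)[0] / np.pi
    # Integrated variance: (1/n)(1/pi) * integral_0^{1/h} (1 - |phi|^2).
    var = quad(lambda t: 1.0 - phi_sq(t), 0.0, 1.0 / h)[0] / (np.pi * n)
    return bias2 + var

# Standard normal density: |phi(t)|^2 = exp(-t^2).
print(sinc_mise(lambda t: np.exp(-t * t), n=100, h=0.4))
```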
Using Lemma 2.1, one can make some general remarks concerning the sinc estimator. It is well known that in the case of a kernel estimator with a conventional kernel, the necessary and sufficient condition for consistency is $h \to 0$ and $nh \to \infty$ as $n \to \infty$. In the case of the sinc estimator, this condition is also sufficient, but sometimes not necessary. If the characteristic function of the density to be estimated vanishes outside some interval containing the origin, $\varphi(t) = 0$ for $|t| > T$, then the necessary condition is milder: the sinc estimator is consistent if $\limsup_{n \to \infty} h < 1/T$. This circumstance was pointed out in a number of works, see for example Davis (1977) or Ibragimov and Khas'minskii (1982).

The second remark concerns the problem of selecting the smoothing parameter. For a given $n$, the minimum of the MISE function, as a function of $h$, can be non-unique. This means that there may exist several different optimal values of the bandwidth $h$. This problem is considered in more detail in Section 6.
3. Comparison of the exact MISE of the sinc estimator and conventional estimators (examples)
In this section, the exact MISE of the sinc estimator is compared with that of estimators based on
some conventional kernels.
3.1. Normal distribution. Consider the standard normal density
$$ f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}. $$
Let $f_n(x; h)$ be (as above) the sinc estimator of $f(x)$, and denote the kernel estimator of $f(x)$ based on the normal kernel by $f_n^{(\mathrm{norm})}(x; h)$. In this subsection, we compare the performance of $f_n(x; h)$ and $f_n^{(\mathrm{norm})}(x; h)$ for several finite values of the sample size.
The MISE of the estimators $f_n(x; h)$ and $f_n^{(\mathrm{norm})}(x; h)$ can be found explicitly:
$$ \mathrm{MISE}(f_n) = \frac{1}{\sqrt{\pi}} \left[ 1 - \left(1 + \frac{1}{n}\right) \Phi\!\left(\frac{\sqrt{2}}{h}\right) + \frac{1}{n}\left(\frac{1}{h\sqrt{\pi}} + \frac{1}{2}\right) \right] $$
and
$$ \mathrm{MISE}(f_n^{(\mathrm{norm})}) = \frac{1}{2\sqrt{\pi}} \left( 1 - 2\sqrt{\frac{2}{2 + h^2}} + \frac{1}{\sqrt{1 + h^2}} + \frac{1}{nh} - \frac{1}{n\sqrt{1 + h^2}} \right), $$
where $\Phi(x)$ is the standard normal distribution function. Values of $\inf_{h>0} \mathrm{MISE}(f_n)$ and $\inf_{h>0} \mathrm{MISE}(f_n^{(\mathrm{norm})})$ (and their ratio) are given in Table 1 for the sample sizes $n = 40, 45, 50, 100$ and $1000$.
Table 1
n sinc normal sinc/normal
40 0.010141 0.01009 1.005
45 0.009203 0.009327 0.987
50 0.008436 0.00869 0.971
100 0.004699 0.005411 0.868
1000 0.000611 0.00103 0.593
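The entries of Table 1 can be checked by minimising the two closed-form MISE expressions above numerically; a short script of our own doing so:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def mise_sinc(h, n):
    # Exact MISE of the sinc estimator, standard normal density.
    return (1 - (1 + 1/n) * norm.cdf(np.sqrt(2) / h)
            + (1/n) * (1/(h*np.sqrt(np.pi)) + 0.5)) / np.sqrt(np.pi)

def mise_norm(h, n):
    # Exact MISE of the normal-kernel estimator, same density.
    return (1 - 2*np.sqrt(2/(2 + h**2)) + 1/np.sqrt(1 + h**2)
            + 1/(n*h) - 1/(n*np.sqrt(1 + h**2))) / (2*np.sqrt(np.pi))

for n in (40, 45, 50, 100, 1000):
    s = minimize_scalar(mise_sinc, bounds=(0.05, 2.0), method='bounded', args=(n,)).fun
    m = minimize_scalar(mise_norm, bounds=(0.05, 2.0), method='bounded', args=(n,)).fun
    print(n, round(s, 6), round(m, 6), round(s/m, 3))
```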
The table shows that, under an appropriate choice of the smoothing parameter for both estimators, the sinc estimator is less accurate (but only very slightly, by 0.5%) than the estimator based on the normal kernel when $n = 40$, but it becomes better already for $n = 45$. For $n = 100$, the sinc estimator is about 15% more accurate than the normal one. For large sample sizes the sinc estimator becomes several times better than the normal one (almost two times already for $n = 1000$).

One more advantage of the sinc estimator is that $\mathrm{MISE}(f_n(x; h)) \le \inf_{h>0} \mathrm{MISE}(f_n^{(\mathrm{norm})}(x; h))$ for a wide interval of values of the smoothing parameter $h$ ($0.4 < h < 0.53$ for $n = 100$ and $0.25 < h < 0.46$ for $n = 1000$). This means that, even if $h$ is chosen quite far from its optimal value for the sinc estimator, that estimator is still better than the normal estimator under the optimal choice of $h$.
3.2. Cauchy distribution. Now consider the density
$$ f(x) = \frac{1}{\pi(1 + x^2)} $$
(Cauchy distribution) with the characteristic function
$$ \varphi(t) = e^{-|t|}. $$
Then the MISE of the sinc estimator is
$$ \mathrm{MISE}(f_n) = \frac{1}{\pi} \left[ \frac{1}{2}\left(1 + \frac{1}{n}\right) e^{-2/h} + \frac{1}{n}\left(\frac{1}{h} - \frac{1}{2}\right) \right]. $$
For the sake of simplicity we consider the conventional estimator with the Cauchy kernel
$$ K(x) = \frac{1}{\pi(1 + x^2)}. $$
Denote this estimator by $f_n^{(\mathrm{Cauchy})}(x; h)$. Its MISE is found explicitly:
$$ \mathrm{MISE}(f_n^{(\mathrm{Cauchy})}) = \frac{1}{\pi} \left[ \frac{1}{2} - \frac{2}{2 + h} + \frac{1}{2(1 + h)} + \frac{1}{2nh} - \frac{1}{2n(1 + h)} \right]. $$
Results are presented in Table 2. They are very similar to those in the previous subsection. As in the normal case, the sinc estimator is superior to the conventional estimator for $n \ge 45$ and becomes several times more accurate for $n > 1000$.
Table 2
n sinc Cauchy sinc/Cauchy
40 0.014776 0.014553 1.015
45 0.013542 0.01363 0.993
50 0.012516 0.012845 0.974
100 0.007346 0.00863 0.851
1000 0.0011 0.002126 0.517
4. Asymptotic superiority of the sinc estimator to conventional kernel estimators
In this section, we find conditions under which the sinc estimator $f_n(x; h)$ has a strictly better order of consistency than conventional kernel estimators. Let $K(x)$ be a square integrable conventional kernel (a square integrable probability density) and let $\hat{f}_n(x; h)$ be the corresponding estimator. It is known that if the density to be estimated $f(x)$ is at least three times differentiable, then
$$ \frac{\mathrm{MISE}(f_n(x; h))}{\mathrm{MISE}(\hat{f}_n(x; h))} \to 0 \quad \text{as } h \to 0,\ nh \to \infty. \qquad (4.1) $$
It turns out that (4.1) holds under much broader conditions and can take place even when $f(x)$ has only the first derivative.

Since $\mathrm{MISE}(f_n(x; h)) = \mathrm{MISE}(\hat{f}_n(x; h)) = \infty$ if $f(x)$ is not square integrable, we suppose in the remainder of this section that it is square integrable. In addition, without loss of generality, we suppose that all conventional kernels under consideration are symmetric.
Denote the characteristic function of $K(x)$ by $\psi(t)$. For the integrated squared bias and integrated variance of $\hat{f}_n(x; h)$, the following representations hold (see for example Watson and Leadbetter, 1963):
$$ B(\hat{f}_n(x; h)) = \frac{1}{2\pi} \int |\varphi(t)|^2 (1 - \psi(ht))^2\,dt, \qquad (4.2) $$
$$ V(\hat{f}_n(x; h)) = \frac{1}{n} \cdot \frac{1}{2\pi} \int (1 - |\varphi(t)|^2)(\psi(ht))^2\,dt. \qquad (4.3) $$
Lemma 4.1. If
$$ |\varphi(t)| = o\!\left(\frac{1}{t^{5/2}}\right) \quad \text{as } t \to \infty, \qquad (4.4) $$
then, under appropriate rescaling of the conventional kernel $K(x)$,
$$ B(f_n(x; h)) = o(B(\hat{f}_n(x; h))) \quad \text{as } h \to 0. \qquad (4.5) $$
Proof. There exist two positive numbers $\varepsilon$ and $c$ such that
$$ \psi(t) \le 1 - \varepsilon t^2 \quad \text{for } |t| \le c, \qquad (4.6) $$
see Loeve (1977). Without loss of generality one can suppose that $c = 1$ (otherwise a rescaling of $K(x)$ can be used). Rewrite (4.6) in the form
$$ (1 - \psi(ht))^2 \ge \varepsilon^2 h^4 t^4, \quad |t| \le 1/h. \qquad (4.7) $$
Put $\lambda = 1/h$,
$$ R_1(\lambda) = \frac{1}{\lambda^4} \int_0^{\lambda} t^4 |\varphi(t)|^2\,dt \qquad \text{and} \qquad R_2(\lambda) = \frac{1}{\lambda^4} \int_{\lambda}^{\infty} |\varphi(t)|^2\,dt. $$
Note that (4.4) implies
$$ |\varphi(\lambda)|^2 = o\!\left( \frac{1}{\lambda^5} \int_0^{\lambda} t^4 |\varphi(t)|^2\,dt \right) \quad \text{as } \lambda \to \infty. \qquad (4.8) $$
Making use of (4.7), (4.8) and representations (2.1), (2.2), (4.2), and (4.3), we obtain
$$ \infty = \lim_{\lambda \to \infty} \varepsilon^2 \left[ \frac{\frac{4}{\lambda^5} \int_0^{\lambda} t^4 |\varphi(t)|^2\,dt}{|\varphi(\lambda)|^2} - 1 \right] = \lim_{\lambda \to \infty} \varepsilon^2 \frac{R_1'(\lambda)}{R_2'(\lambda)} = \lim_{\lambda \to \infty} \varepsilon^2 \frac{R_1(\lambda)}{R_2(\lambda)} $$
$$ = \lim_{h \to 0} \frac{\varepsilon^2 h^4 \int_0^{1/h} t^4 |\varphi(t)|^2\,dt}{\int_{1/h}^{\infty} |\varphi(t)|^2\,dt} \le \lim_{h \to 0} \frac{\int_0^{1/h} |\varphi(t)|^2 (1 - \psi(ht))^2\,dt}{\int_{1/h}^{\infty} |\varphi(t)|^2\,dt} \le \lim_{h \to 0} \frac{\int_0^{\infty} |\varphi(t)|^2 (1 - \psi(ht))^2\,dt}{\int_{1/h}^{\infty} |\varphi(t)|^2\,dt} = \lim_{h \to 0} \frac{B(\hat{f}_n(x; h))}{B(f_n(x; h))}, $$
which implies (4.5).
Theorem 4.1. Let
$$ |\varphi(t)| = o\!\left(\frac{1}{t^{5/2}}\right) \quad \text{as } t \to \infty. $$
Then
$$ \inf_{h>0} \mathrm{MISE}(f_n(x; h)) = o\!\left( \inf_{h>0} \mathrm{MISE}(\hat{f}_n(x; h)) \right) \quad \text{as } n \to \infty. \qquad (4.9) $$
Proof. The main asymptotic term of the integrated variance of both estimators, conventional and sinc, has the form $c/(nh)$; therefore, due to Lemma 4.1 ($K(x)$ is scaled so that (4.5) holds),
$$ \mathrm{MISE}(\hat{f}_n(x; h)) \sim g(h) + \frac{c_1}{nh} \qquad (4.10) $$
and
$$ \mathrm{MISE}(f_n(x; h)) \sim g(h)\varepsilon(h) + \frac{c_2}{nh}, \qquad (4.11) $$
where $g(h) \to 0$ and $\varepsilon(h) \to 0$ as $h \to 0$. (4.10) and (4.11) evidently imply (4.9).
Thus, under condition (4.4), the sinc estimator has a strictly better order of consistency than any conventional kernel estimator. Condition (4.4) can be satisfied even when $f(x)$ has only the first derivative. For example, the density corresponding to the characteristic function $(\sin t/t)^3$, that is, the convolution of three standard uniform densities, does not have a second derivative at $x = \pm 3$.
5. Comparison of the sinc estimator with superkernel estimators
A superkernel is defined as a non-conventional kernel $K(x)$ whose Fourier transform has the form
$$ \psi(t) = \begin{cases} 1 & \text{for } |t| \le \Delta, \\ g(t) & \text{for } \Delta \le |t| \le c\Delta, \\ 0 & \text{for } |t| > c\Delta, \end{cases} $$
where $g(t)$ is a real-valued, even function satisfying the inequality $|g(t)| \le 1$ and chosen in such a way that $\psi(t)$ is continuous. Superkernels were studied by Devroye (1992) in the one-dimensional case and by Politis and Romano (1999) in the multidimensional case. The sinc kernel can be considered as a limit case of superkernels as $c \to 1$. Estimators based on superkernels have the same order of consistency as the sinc estimator: both are kernels of "infinite order". Superkernels have one advantage compared to the sinc kernel: the corresponding estimates are integrable, although this advantage is technical rather than essential. Advantages of the sinc kernel compared to superkernels are simplicity (both for theoretical analysis and practical use) and a better solution of the problem of bandwidth selection.
Politis and Romano (1999) state that "c = 1 is a bad choice", that is, that the sinc estimator is worse than superkernel estimators. In this section we compare the accuracy of the sinc estimator with that of the superkernel estimator most recommended by Politis and Romano (1999), namely the one for which $c = 2$ and $g(t)$ is linear on the interval $\Delta < t < 2\Delta$. Thus
$$ \psi(t) = \begin{cases} 1 & \text{for } |t| \le \Delta, \\ 2 - \dfrac{|t|}{\Delta} & \text{for } \Delta \le |t| \le 2\Delta, \\ 0 & \text{for } |t| > 2\Delta. \end{cases} $$
The estimator based on this superkernel is denoted in this section by $\hat{f}_n(x; h)$. The sinc estimator is denoted, as earlier, by $f_n(x; h)$. The value of $\Delta$ does not play any role; therefore, for convenience, we suppose that $\Delta < 1$.
Now we compare the MISE of the sinc estimator and the superkernel estimator under consideration for a broad class of underlying distributions. We consider densities whose characteristic function $\varphi(t)$ satisfies the condition
$$ |\varphi(t)|^2 = \frac{a}{|t|^m} \quad \text{for } |t| > c, $$
where $a$, $c$ and $m$ are real positive constants, $m > 3$. Without loss of generality suppose that $a = 1$. Then, for sufficiently small $h$, namely for $h < \Delta/c$, the integrated squared bias of both estimators can be calculated explicitly. Some elementary algebra shows that
$$ B(f_n(x; h)) = \frac{h^{m-1}}{\pi(m-1)} \qquad (5.1) $$
and
$$ B(\hat{f}_n(x; h)) = \frac{h^{m-1}}{\pi \Delta^{m-1}} \left[ \frac{1}{m-1} - \frac{2(2^{m-2} - 1)}{(m-2)2^{m-2}} + \frac{2^{m-3} - 1}{(m-3)2^{m-3}} \right]. \qquad (5.2) $$
For the integrated variance of the estimators, the following equalities hold when $h < \Delta/c$ (also after some elementary algebra):
$$ \pi n V(f_n(x; h)) = \frac{1}{h} - \int_0^{\Delta/h} |\varphi(t)|^2\,dt - \frac{h^{m-1}}{\Delta^{m-1}} \cdot \frac{1 - \Delta^{m-1}}{m-1} \qquad (5.3) $$
and
$$ \pi n V(\hat{f}_n(x; h)) = \frac{4\Delta}{3h} - \int_0^{\Delta/h} |\varphi(t)|^2\,dt - \frac{h^{m-1}}{\Delta^{m-1}} \left[ \frac{4(2^{m-1} - 1)}{(m-1)2^{m-1}} - \frac{4(2^{m-2} - 1)}{(m-2)2^{m-2}} + \frac{2^{m-3} - 1}{(m-3)2^{m-3}} \right]. \qquad (5.4) $$
Once again, $\Delta$ does not play any role, so let us take it to be equal to $3/4$ (then the main asymptotic term of the integrated variance of the superkernel estimator coincides with that of the sinc estimator, and the comparison becomes easier). For each $m \ge 4$,
$$ \frac{4(2^{m-1} - 1)}{(m-1)2^{m-1}} - \frac{4(2^{m-2} - 1)}{(m-2)2^{m-2}} + \frac{2^{m-3} - 1}{(m-3)2^{m-3}} < \frac{1 - (3/4)^{m-1}}{m-1}, $$
and therefore the right hand side of (5.4) is greater than the right hand side of (5.3); that is, the integrated variance of the sinc estimator is less than that of the superkernel estimator uniformly in $h$.
Consider the right hand sides of (5.1) and (5.2). For $m \le 17$,
$$ B(f_n(x; h)) > B(\hat{f}_n(x; h)), $$
while for $m \ge 18$,
$$ B(f_n(x; h)) < B(\hat{f}_n(x; h)) $$
uniformly in $h$, and the ratio $B(f_n(x; h))/B(\hat{f}_n(x; h))$ decreases with $m$ and tends to zero very fast as $m$ tends to infinity (if, for example, $m > 30$, then $B(f_n(x; h))$ is more than ten times smaller than $B(\hat{f}_n(x; h))$).
Thus, for more or less smooth underlying densities ($m = 17$ approximately corresponds to the case when $f(x)$ has five derivatives), the sinc estimator is more accurate than the superkernel estimator under consideration. For really smooth (many times differentiable) densities it is much more accurate. Moreover, both the integrated variance and the integrated squared bias of the sinc estimator are then smaller than those of the superkernel estimator uniformly in $h$. In the non-smooth case (the fifth derivative of $f(x)$ does not exist), the accuracy of the two estimators is approximately the same: the integrated variance of the sinc estimator is smaller for each $h$, while its integrated squared bias is greater for each $h$. In this case the estimators can be compared only under the condition that the bandwidth is chosen to be optimal for each of them.
Summing up, and taking into account other advantages of the sinc estimator, such as simplicity and a better solution of the bandwidth selection problem, we must conclude that the sinc estimator is preferable to the considered superkernel estimator.
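The claimed $m \le 17$ versus $m \ge 18$ boundary, and the speed at which the bias ratio falls, are easy to verify numerically from (5.1) and (5.2). A small check of our own (assuming, as in the variance comparison above, that $\Delta = 3/4$):

```python
import numpy as np

def superkernel_bracket(m):
    # Bracketed factor in (5.2).
    return (1/(m - 1)
            - 2*(2**(m - 2) - 1)/((m - 2)*2**(m - 2))
            + (2**(m - 3) - 1)/((m - 3)*2**(m - 3)))

def bias_ratio(m, delta=0.75):
    # B(sinc)/B(superkernel) from (5.1) and (5.2); the h^(m-1) factors cancel.
    return (delta**(m - 1)/(m - 1)) / superkernel_bracket(m)

for m in (10, 17, 18, 25, 31):
    print(m, bias_ratio(m))   # > 1 up to m = 17, < 1 from m = 18 on
```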
6. Bandwidth selection
The representation of the MISE given by Lemma 2.1 suggests relatively simple rules for selecting the smoothing parameter $h$. Due to the Corollary of Lemma 2.1,
$$ \mathrm{MISE}(f_n(x; h)) = \frac{1}{\pi n h} + R(f) - \left(1 + \frac{1}{n}\right) \frac{1}{\pi} \int_0^{1/h} |\varphi(t)|^2\,dt. $$
Put $\delta = 1/h$. Then
$$ \frac{\partial}{\partial \delta} \mathrm{MISE}(f_n(x; h)) = \frac{1}{\pi n} - \left(1 + \frac{1}{n}\right) \frac{1}{\pi} |\varphi(\delta)|^2 $$
and
$$ \frac{\partial^2}{\partial \delta^2} \mathrm{MISE}(f_n(x; h)) = -\left(1 + \frac{1}{n}\right) \frac{1}{\pi} \frac{\partial}{\partial \delta} |\varphi(\delta)|^2. $$
Therefore the optimal $\delta$ (the $\delta$ minimizing the MISE) must be a root of the equation
$$ |\varphi(\delta)| = \frac{1}{\sqrt{n + 1}}, $$
and the optimal bandwidth $h_{\mathrm{opt}}$ is a solution of
$$ |\varphi(1/h)| = \frac{1}{\sqrt{n + 1}}. \qquad (6.1) $$
6.1 Normal rule

According to the normal scale rule, the bandwidth is selected so that it minimizes the MISE when the underlying distribution is normal with variance $\sigma^2$, the unknown $\sigma^2$ being replaced by some estimator of it. For the sinc estimator this rule is as follows. For a normal underlying distribution, formula (6.1) becomes
$$ e^{-\sigma^2/(2h^2)} = \frac{1}{\sqrt{n + 1}}, $$
therefore the optimal value of $h$ (if $\sigma$ is known) is
$$ h_{\mathrm{opt}} = \frac{\sigma}{\sqrt{\ln(n + 1)}}, $$
and the normal rule bandwidth is
$$ h_{\mathrm{norm}} = \frac{\hat{\sigma}}{\sqrt{\ln(n + 1)}}, $$
where $\hat{\sigma}$ is some estimator of $\sigma$, for example the empirical standard deviation:
$$ \hat{\sigma} = \left[ \frac{1}{n} \sum_{j=1}^{n} (X_j - \bar{X})^2 \right]^{1/2}. $$
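In code the normal rule is essentially a one-liner; the following sketch (the naming is ours) is the whole procedure.

```python
import numpy as np

def normal_rule_bandwidth(data):
    """Normal scale rule for the sinc estimator:
    h_norm = sigma_hat / sqrt(ln(n + 1))."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    sigma_hat = np.std(data)        # the 1/n empirical standard deviation
    return sigma_hat / np.sqrt(np.log(n + 1))

rng = np.random.default_rng(1)
data = rng.standard_normal(200)
h = normal_rule_bandwidth(data)     # roughly 1/sqrt(ln 201) here
```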
6.2 Method based on the empirical characteristic function
Note that equation (6.1) always has a solution (since $|\varphi(t)| \to 0$ as $t \to \infty$), but the solution may be non-unique. Consider all solutions of equation (6.1) for which $|\varphi(\delta - 0)| > 1/\sqrt{n+1}$ and $|\varphi(\delta + 0)| < 1/\sqrt{n+1}$. Denote them by $\delta_1, \ldots, \delta_m$ (suppose for simplicity that $\delta_1 < \delta_2 < \ldots < \delta_m$). Since $|\varphi(\delta)|$ decreases in some neighbourhood of each $\delta_i$,
$$ \left. \frac{\partial}{\partial \delta} |\varphi(\delta)|^2 \right|_{\delta=\delta_i} < 0, $$
and therefore
$$ \left. \frac{\partial^2}{\partial \delta^2} \mathrm{MISE}(f_n(x; h)) \right|_{\delta=\delta_i} > 0. $$
Thus each $\delta_i$ is a local minimum of the MISE.
The global minimum can be found by computing and comparing the MISE at $h = 1/\delta_1, \ldots, 1/\delta_m$. This does not lead to large computational expense because, if one uses (2.3) for the computation, the first integral on the right hand side of (2.3) for $\delta = \delta_2, \ldots, \delta_m$ is a part of this integral for $\delta = \delta_1$, while the second integral for $\delta = \delta_1, \ldots, \delta_{m-1}$ is a part of this integral for $\delta = \delta_m$.
The characteristic function $\varphi(t)$, however, is of course unknown; therefore the procedure described above is applied to the empirical characteristic function $\varphi_n(t)$ instead of $\varphi(t)$. Here one must take into account that $\varphi_n(t)$ is an almost periodic function, so the equation
$$ |\varphi_n(\delta)| = \frac{1}{\sqrt{n + 1}} $$
has infinitely many roots. Therefore, since
$$ \lim_{n \to \infty} \sup_{|t| \le T_n} |\varphi_n(t) - \varphi(t)| = 0 $$
(see Csörgő and Totik, 1983), where $T_n \to \infty$ and $\log T_n = o(n)$ as $n \to \infty$, it suffices to consider only roots in the interval $[0, e^n]$. Of course it is not necessary to calculate $\varphi_n(t)$ on such a wide interval: all relevant roots are contained in the interval $[0, \sqrt{n}]$ and, in practice, in a much shorter one.
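A direct numerical version of this selection rule is easy to write down. The sketch below is our own illustration: it scans a grid of $\delta$ values, locates the downcrossings of $|\varphi_n(\delta)|$ through $1/\sqrt{n+1}$, and compares the candidates through the MISE criterion obtained by substituting $|\varphi_n|^2$ for $|\varphi|^2$ in (2.4) (the $R(f)$ term is constant in $h$ and can be dropped). The grid range and resolution are arbitrary choices.

```python
import numpy as np

def ecf_bandwidth(data, num=2000):
    """Bandwidth for the sinc estimator via the empirical characteristic
    function: find downcrossings of |phi_n(delta)| through 1/sqrt(n+1)
    and compare the estimated MISE at each of them."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    delta = np.linspace(1e-3, np.sqrt(n), num)   # relevant roots lie in [0, sqrt(n)]
    ecf = np.exp(1j * delta[:, None] * data[None, :]).mean(axis=1)
    mod = np.abs(ecf)
    thr = 1.0 / np.sqrt(n + 1)
    idx = np.where((mod[:-1] > thr) & (mod[1:] <= thr))[0]   # downcrossings
    if len(idx) == 0:
        return 1.0 / delta[-1]
    # MISE criterion from (2.4), dropping the h-independent R(f) term:
    # delta/(pi*n) - (1 + 1/n)*(1/pi)*integral_0^delta |phi_n(t)|^2 dt.
    cum = np.cumsum(mod**2) * (delta[1] - delta[0])
    crit = delta[idx] / (np.pi * n) - (1 + 1/n) / np.pi * cum[idx]
    return 1.0 / delta[idx[np.argmin(crit)]]
```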
7. Uniform consistency and estimation of the mode
In this section we prove that the sinc estimator is uniformly consistent: it converges (in probability) to the true density function uniformly over the whole real line, and that the mode of the sinc estimator is a consistent estimator of the mode of the true density function. We formulate and prove the results in the simplest form, leaving possible generalizations to the reader.

Let $K(x)$ be a symmetric, differentiable, conventional kernel with finite variance $\sigma^2$. Suppose also that its derivative has finite total variation, which we denote by $v$. Denote the characteristic function of $K(x)$ by $\psi(t)$ and the kernel estimator based on $K(x)$ by $\hat{f}_n(x; h)$. As before, $f_n(x; h)$ denotes the sinc estimator.
Lemma 7.1. Let the characteristic function $\varphi(t)$ of the underlying distribution be integrable:
$$ \int |\varphi(t)|\,dt < \infty. $$
Then
$$ \sup_x |f_n(x; h) - \hat{f}_n(x; h)| \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty,\ h \to 0,\ nh \to \infty. $$
Proof.
$$ \sup_x |f_n(x; h) - \hat{f}_n(x; h)| \le \frac{1}{2\pi} \left[ \int |\psi(ht)| \cdot |\varphi_n(t) - \varphi(t)|\,dt + \int |\varphi(t)| \cdot |\psi(ht) - I_{[-1/h,1/h]}(t)|\,dt + \int_{-1/h}^{1/h} |\varphi_n(t) - \varphi(t)|\,dt \right]. $$
Now we prove that each of the three integrals in the square brackets tends to zero as $n \to \infty$, $h \to 0$, $nh \to \infty$. Denote these integrals by $I_1$, $I_2$ and $I_3$, respectively. Then
$$ I_1 \le \int_{|t| \le n^2} |\varphi_n(t) - \varphi(t)|\,dt + 4 \int_{n^2}^{\infty} |\psi(ht)|\,dt. $$
The first integral on the right hand side almost surely converges to zero as $n \to \infty$ due to Theorem 1 of Csörgő and Totik (1983). To estimate the second integral, we use the inequality
$$ |\psi(t)| \le \frac{v}{|t|^2}, $$
which holds for all $t$; see Ushakov and Ushakov (2000). Making use of this inequality, we obtain
$$ \int_{n^2}^{\infty} |\psi(ht)|\,dt \le \frac{v}{n^2 h^2} \to 0 \quad \text{as } nh \to \infty. $$
So $I_1 \xrightarrow{\text{a.s.}} 0$ as $n \to \infty$, $nh \to \infty$.
$$ I_2 \le 2 \int_0^{1/\sqrt{h}} [1 - \psi(ht)]\,dt + 4 \int_{1/\sqrt{h}}^{\infty} |\varphi(t)|\,dt. $$
The second integral tends to zero as $h \to 0$ because $|\varphi(t)|$ is integrable. To estimate the first integral, we use the inequality
$$ \psi(t) \ge 1 - \frac{\sigma^2 t^2}{2}, $$
which holds for all $t$; see Ushakov (1999). Making use of this inequality, we obtain
$$ \int_0^{1/\sqrt{h}} [1 - \psi(ht)]\,dt \le \frac{\sigma^2 \sqrt{h}}{6} \to 0 \quad \text{as } h \to 0. $$
Thus $I_2 \to 0$ as $h \to 0$.
Finally, if $nh \to \infty$, then $1/h \le cn$ for some constant $c$, and therefore
$$ I_3 \le \int_{|t| \le cn} |\varphi_n(t) - \varphi(t)|\,dt \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty, $$
due to the mentioned theorem of Csörgő and Totik (1983).
Remark. The condition of the lemma ($\varphi$ is integrable) implies that $f(x)$ is uniformly continuous, but it is a little more restrictive. It is satisfied, for example, when $f(x)$ is differentiable and $f'(x)$ has finite total variation.
Theorem 7.1. Let $\varphi(t)$ be integrable. Then
$$ \sup_x |f_n(x; h) - f(x)| \xrightarrow{P} 0 \quad \text{as } n \to \infty,\ h \to 0,\ nh^2 \to \infty. \qquad (7.1) $$
Proof. Let $K(x)$ be any conventional kernel satisfying the conditions of both Lemma 7.1 of this section and Theorem 3A of Parzen (1962). Then, due to Theorem 3A of Parzen (1962),
$$ \sup_x |\hat{f}_n(x; h) - f(x)| \xrightarrow{P} 0 \quad \text{as } n \to \infty,\ h \to 0,\ nh^2 \to \infty, \qquad (7.2) $$
and, due to Lemma 7.1,
$$ \sup_x |f_n(x; h) - \hat{f}_n(x; h)| \xrightarrow{P} 0 \quad \text{as } n \to \infty,\ h \to 0,\ nh^2 \to \infty. \qquad (7.3) $$
(7.2) and (7.3) evidently imply (7.1).
Denote the mode of $f(x)$ by $\theta$ and suppose it is unique. Let $\theta_n$ be a mode of the sinc estimate.

Theorem 7.2. Let $\varphi(t)$ be integrable. Then
$$ \theta_n \xrightarrow{P} \theta \quad \text{as } n \to \infty,\ h \to 0,\ nh^2 \to \infty. $$
The proof of the theorem coincides with that of the second part of Theorem 3A of Parzen (1962).
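In practice the mode estimator $\theta_n$ is simply a maximiser of the sinc estimate, which can be located on a grid; a minimal self-contained sketch of our own (the grid centre and width are arbitrary choices):

```python
import numpy as np

def sinc_mode(data, h, half_width=4.0, num=1001):
    """Mode estimate: argmax of the sinc density estimate on a grid
    centred at the sample median."""
    data = np.asarray(data, dtype=float)
    centre = np.median(data)
    grid = np.linspace(centre - half_width, centre + half_width, num)
    u = (grid[:, None] - data[None, :]) / h
    fhat = np.mean(np.sinc(u / np.pi), axis=1) / (np.pi * h)
    return grid[np.argmax(fhat)]
```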
8. Inequalities
In this section, we derive some upper bounds for the MISE of the sinc estimator. In addition to their practical importance (evaluation of the sample size sufficient for achieving a given accuracy, etc.), these inequalities throw more light onto properties of the sinc estimator (especially for finite samples) and make comparisons of this estimator with other estimators easier.
We define the 0-th derivative of a function as the function itself: $f^{(0)}(x) = f(x)$, as usual. Below we need the following form of the Parseval equality: let $f(x)$ be an $m$ times differentiable probability density function ($m \ge 0$), let its $m$-th derivative $f^{(m)}(x)$ be square integrable, and let $\varphi(t)$ be the corresponding characteristic function. Then
$$ \int (f^{(m)}(x))^2\,dx = \frac{1}{2\pi} \int t^{2m} |\varphi(t)|^2\,dt. \qquad (8.1) $$
Theorem 8.1. Let $f(x)$ be $m$ times differentiable ($m \ge 0$), with its $m$-th derivative square integrable. Then
$$ \mathrm{MISE}(f_n(x; h)) < \varepsilon(h) h^{2m} R(f^{(m)}) + \frac{1}{\pi n h}, \qquad (8.2) $$
where $\varepsilon(h) \le 1$ for all $h$ and $\varepsilon(h) \to 0$ as $h \to 0$.
Proof. We estimate the first summand on the right hand side of (2.3). Making use of (8.1), we obtain
$$ \frac{1}{2\pi} \int_{|t|>1/h} |\varphi(t)|^2\,dt = h^{2m} \frac{1}{2\pi} \int_{|t|>1/h} (1/h)^{2m} |\varphi(t)|^2\,dt \le h^{2m} \frac{1}{2\pi} \int_{|t|>1/h} t^{2m} |\varphi(t)|^2\,dt $$
$$ = h^{2m} \frac{1}{2\pi} \int t^{2m} |\varphi(t)|^2\,dt - h^{2m} \frac{1}{2\pi} \int_{-1/h}^{1/h} t^{2m} |\varphi(t)|^2\,dt = h^{2m} \frac{1}{2\pi} \int t^{2m} |\varphi(t)|^2\,dt \left( 1 - \frac{\int_{-1/h}^{1/h} t^{2m} |\varphi(t)|^2\,dt}{\int t^{2m} |\varphi(t)|^2\,dt} \right) $$
$$ = \varepsilon(h) h^{2m} \int (f^{(m)}(x))^2\,dx = \varepsilon(h) h^{2m} R(f^{(m)}), $$
where
$$ \varepsilon(h) = 1 - \frac{\int_{-1/h}^{1/h} t^{2m} |\varphi(t)|^2\,dt}{\int t^{2m} |\varphi(t)|^2\,dt} $$
evidently satisfies the conditions of the theorem: $\varepsilon(h) \le 1$ and $\varepsilon(h) \to 0$ as $h \to 0$.

For the second summand on the right hand side of (2.3) we have
$$ \frac{1}{n} \cdot \frac{1}{2\pi} \int_{-1/h}^{1/h} (1 - |\varphi(t)|^2)\,dt < \frac{1}{n} \cdot \frac{1}{2\pi} \int_{-1/h}^{1/h} dt = \frac{1}{\pi n h}. $$
Thus we finally obtain (8.2).
Corollary 1. Let the conditions of Theorem 8.1 be satisfied. Then
$$ \mathrm{MISE}(f_n(x; h)) < h^{2m} R(f^{(m)}) + \frac{1}{\pi n h}. \qquad (8.3) $$
Putting in (8.3)
$$ h = \left( \frac{1}{2\pi n m R(f^{(m)})} \right)^{\frac{1}{2m+1}} $$
(this $h$ minimizes the right hand side of (8.3)), we get

Corollary 2. Let the conditions of Theorem 8.1 be satisfied. Then
$$ \inf_{h>0} \mathrm{MISE}(f_n(x; h)) < \frac{1 + 2m}{(2\pi m)^{(2m)/(2m+1)}} \left( R(f^{(m)}) \right)^{1/(2m+1)} n^{-\frac{2m}{2m+1}}. $$
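As a quick numerical sanity check (ours, not the paper's), one can compare the bound of Corollary 2 with $m = 2$ against the exact normal-case MISE from Section 3, using the standard fact that $R(f'') = 3/(8\sqrt{\pi})$ for the standard normal density:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def corollary2_bound(n, m, r_fm):
    # (1 + 2m) * (2 pi m)^(-2m/(2m+1)) * R(f^(m))^(1/(2m+1)) * n^(-2m/(2m+1))
    p = 2*m / (2*m + 1)
    return (1 + 2*m) * (2*np.pi*m)**(-p) * r_fm**(1/(2*m + 1)) * n**(-p)

def exact_mise(h, n):
    # Exact sinc-estimator MISE for the standard normal (Section 3.1).
    return (1 - (1 + 1/n) * norm.cdf(np.sqrt(2)/h)
            + (1/n) * (1/(h*np.sqrt(np.pi)) + 0.5)) / np.sqrt(np.pi)

r_f2 = 3 / (8 * np.sqrt(np.pi))
for n in (100, 1000):
    exact = minimize_scalar(exact_mise, bounds=(0.05, 2.0),
                            method='bounded', args=(n,)).fun
    print(n, exact, corollary2_bound(n, 2, r_f2))   # the bound holds with room to spare
```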
Corollary 3. Let the conditions of Theorem 8.1 be satisfied. Then
$$ \inf_{h>0} \mathrm{MISE}(f_n(x; h)) = o\!\left( n^{-2m/(2m+1)} \right), \quad n \to \infty. $$
Corollary 4. If $f(x)$ is two times differentiable, and its second derivative is square integrable, then
$$ \inf_{h>0} \mathrm{MISE}(f_n(x; h)) = o(n^{-4/5}), \quad n \to \infty, $$
and
$$ \inf_{h>0} \mathrm{MISE}(f_n(x; h)) < \frac{5}{4\pi} \left( 4\pi R(f'') \right)^{1/5} n^{-4/5}. $$
To obtain more sensitive and accurate estimates we use one more characteristic of the density to be estimated: its total variation (or the total variation of its derivatives). For a function $g(x)$, we denote its total variation by $Vr(g)$. In the general case, the two conditions, finiteness of the total variation and square integrability, are not comparable: a function may have finite variation but not be square integrable, and vice versa. But for densities, square integrability is milder than finiteness of the total variation: a density having bounded variation is square integrable. Thus finiteness of the total variation of a density (or its derivatives) is a slightly more restrictive condition than its square integrability. On the other hand, the assumption that some derivative of the density to be estimated has finite total variation (together with the use of the sinc kernel) allows one to improve the order of decrease of the MISE. For example, if the density is two times differentiable and $Vr(f'') < \infty$, then $\mathrm{MISE}(f_n(x; h)) = O(n^{-5/6})$, $n \to \infty$, instead of $o(n^{-4/5})$ (see below).
Theorem 8.2. Let $f(x)$ be $m$ times differentiable ($m \ge 0$), with its $m$-th derivative of finite total variation. Then
$$ \mathrm{MISE}(f_n(x; h)) \le h^{2m+1} \frac{(Vr(f^{(m)}))^2}{(2m+1)\pi} + \frac{1}{\pi n h}. \qquad (8.4) $$
Proof. For all $t$,
$$ |\varphi(t)| \le \frac{Vr(f^{(m)})}{|t|^{m+1}}, $$
see Ushakov and Ushakov (2000). Making use of this inequality, we estimate the first summand on the right hand side of (2.3):
$$ \frac{1}{2\pi} \int_{|t|>1/h} |\varphi(t)|^2\,dt \le \frac{(Vr(f^{(m)}))^2}{\pi} \int_{1/h}^{\infty} \frac{dt}{t^{2m+2}} = \frac{(Vr(f^{(m)}))^2}{(2m+1)\pi} h^{2m+1}. $$
For the second summand on the right hand side of (2.3) we have (see the proof of Theorem 8.1)
$$ \frac{1}{n} \cdot \frac{1}{2\pi} \int_{-1/h}^{1/h} (1 - |\varphi(t)|^2)\,dt < \frac{1}{\pi n h}. $$
So we obtain (8.4).
Corollary. Let the conditions of Theorem 8.2 be satisfied. Then
$$ \inf_{h>0} \mathrm{MISE}(f_n(x; h)) \le \frac{2(m+1)}{(2m+1)\pi} \left( Vr(f^{(m)}) \right)^{1/(m+1)} n^{-(2m+1)/(2m+2)}. $$
For example, if $m = 2$, we get
$$ \inf_{h>0} \mathrm{MISE}(f_n(x; h)) \le \frac{6}{5\pi} \left( Vr(f'') \right)^{1/3} n^{-5/6}. $$
Following Watson and Leadbetter (1963) and Davis (1975), we will say that a characteristic function $\varphi(t)$ decreases exponentially with degree $\alpha$ and coefficient $\rho$ ($\rho > 0$, $0 < \alpha \le 2$) if
$$ |\varphi(t)| \le A e^{-\rho |t|^{\alpha}} \qquad (8.5) $$
for some constant $A$. Davis (1975, Theorem 4.1) proved that if the characteristic function of the density to be estimated satisfies (8.5), then
$$ \lim_{n \to \infty} h e^{\rho/h^{\alpha}} |B(f_n(x; h))| = 0. $$
The next theorem makes this more precise.
Theorem 8.3. Let
$$ \frac{1}{2\pi} \int e^{\rho |t|^{\alpha}} |\varphi(t)|^2\,dt = C < \infty. $$
Then
$$ \mathrm{MISE}(f_n(x; h)) \le \varepsilon(h) C e^{-\rho/h^{\alpha}} + \frac{1}{\pi n h}, \qquad (8.6) $$
where $0 < \varepsilon(h) < 1$ and $\varepsilon(h) \to 0$ as $h \to 0$.
The proof is similar to that of Theorem 8.1. For the first summand on the right hand side of (2.3), we have
$$ \frac{1}{2\pi} \int_{|t|>1/h} |\varphi(t)|^2\,dt < e^{-\rho/h^{\alpha}} \frac{1}{2\pi} \int_{|t|>1/h} e^{\rho |t|^{\alpha}} |\varphi(t)|^2\,dt = \varepsilon(h) C e^{-\rho/h^{\alpha}}, $$
where
$$ \varepsilon(h) = 1 - \frac{\int_{-1/h}^{1/h} e^{\rho |t|^{\alpha}} |\varphi(t)|^2\,dt}{\int e^{\rho |t|^{\alpha}} |\varphi(t)|^2\,dt}. $$
It is difficult to find explicitly the $h$ minimizing the right hand side of (8.6); we therefore take the $h$ for which the right hand side of (8.6) has a simple form. Namely, put
$$ h = \left( \frac{1}{\rho} \ln n \right)^{-1/\alpha}; $$
then
$$ \mathrm{MISE}(f_n(x; h)) < \left( C + \frac{(\ln n)^{1/\alpha}}{\pi \rho^{1/\alpha}} \right) \frac{1}{n} < \left( C + \frac{1}{\pi \rho^{1/\alpha}} \right) \frac{(\ln n)^{1/\alpha}}{n}, $$
provided $n > 2$. If, for example, $f(x)$ is the standard normal density, then
$$ \mathrm{MISE}(f_n(x; h)) < \left( \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2}}{\pi} \right) \frac{\sqrt{\ln n}}{n}. $$
Corollary. Let the characteristic function $\varphi(t)$ of $f(x)$ decrease exponentially with degree $\alpha$ and coefficient $\rho$ (i.e. satisfy (8.5)). Then
$$ \lim_{n \to \infty} e^{c\rho/h^{\alpha}} |B(f_n(x; h))| = 0 $$
for any $c < 2$.
9. Estimation of derivatives
The sinc estimator is especially superior to conventional estimators when a derivative of $f(x)$ is estimated. Let $f(x)$ be $r$ times differentiable, and suppose that one needs to estimate the $r$-th derivative $f^{(r)}(x)$. A natural way is to estimate $f^{(r)}(x)$ by the $r$-th derivative of a kernel estimator of $f(x)$, provided that the kernel is $r$ times differentiable. So, let
$$ \hat{f}_n(x; h) = \frac{1}{nh} \sum_{j=1}^{n} K\!\left( \frac{x - X_j}{h} \right) $$
be a kernel estimator of $f(x)$, and suppose $K^{(r)}(x)$ exists. Then the estimator
$$ \hat{f}_n^{(r)}(x; h) = \frac{1}{nh^{r+1}} \sum_{j=1}^{n} K^{(r)}\!\left( \frac{x - X_j}{h} \right) $$
is used for estimation of $f^{(r)}(x)$.
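For the sinc kernel the same recipe applies with derivatives of $K_s$; for instance, $K_s'(x) = (x\cos x - \sin x)/(\pi x^2)$, with $K_s'(0) = 0$ since $K_s$ is even. A minimal sketch of our own for the case $r = 1$:

```python
import numpy as np

def sinc_kernel_deriv(u):
    """K_s'(u) = (u*cos(u) - sin(u)) / (pi*u^2), with K_s'(0) = 0."""
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    nz = np.abs(u) > 1e-8                 # avoid the removable singularity
    un = u[nz]
    out[nz] = (un * np.cos(un) - np.sin(un)) / (np.pi * un**2)
    return out

def sinc_density_deriv(x, data, h):
    """Estimate f'(x) by (1/(n h^2)) * sum_j K_s'((x - X_j)/h)."""
    x = np.atleast_1d(x)
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h
    return np.mean(sinc_kernel_deriv(u), axis=1) / h**2
```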
If $K(x)$ is a conventional kernel, then the MISE of the estimator $\hat{f}_n^{(r)}(x; h)$ can be represented in the form (provided that $f(x)$ has $r + 2$ derivatives and $K(x)$ has finite second moment)
$$ \mathrm{MISE}(\hat{f}_n^{(r)}(x; h)) = \frac{1}{4} h^4 (\mu_2(K))^2 R(f^{(r+2)}) + \frac{1}{nh^{2r+1}} R(K^{(r)}) + o\!\left( h^4 + \frac{1}{nh^{2r+1}} \right), \quad h \to 0. $$
Therefore the optimal MISE is of order $n^{-4/(2r+5)}$, i.e. the rate becomes slower for higher values of $r$, and the difficulty increases; see Wand and Jones (1995) and Stone (1982). In this section, we show that the sinc estimator is free of this difficulty: the estimator $f_n^{(r)}(x; h)$ of $f^{(r)}(x)$ has almost the same order of consistency as the estimator $f_n(x; h)$ of $f(x)$, even for large values of $r$ (of course, if the density to be estimated is smooth enough). First we formulate an analog of Lemma 2.1 for derivatives.
for derivatives.
Lemma 9.1. For the sinc estimator,
MISE(f
(r)
n
(x; h)) =
1
2π
Z
|t|>1/h
t
2r
|ϕ(t)|
2
dt +
1
n
·
1
2π
Z
1/h
−1/h
t
2r
(1 − |ϕ(t)|
2
)dt. (9.1)
where ϕ(t) is the characteristic function of the density to be estimated.
Proof. Due to the Parseval-Plancherel identity we have
$$ \mathrm{MISE}(f_n^{(r)}(x; h)) = \mathrm{E} \int (f_n^{(r)}(x; h) - f^{(r)}(x))^2\,dx = \frac{1}{2\pi} \mathrm{E} \int t^{2r} |\varphi_n(t) I_{[-1/h,1/h]}(t) - \varphi(t)|^2\,dt $$
$$ = \frac{1}{2\pi} \int t^{2r} \mathrm{E} \left[ \left( \varphi_n(t) I_{[-1/h,1/h]}(t) - \varphi(t) \right) \overline{\left( \varphi_n(t) I_{[-1/h,1/h]}(t) - \varphi(t) \right)} \right] dt $$
$$ = \frac{1}{2\pi} \int_{-1/h}^{1/h} t^{2r} \mathrm{E} |\varphi_n(t)|^2\,dt - \frac{1}{2\pi} \int_{-1/h}^{1/h} t^{2r} \left( \overline{\varphi(t)}\, \mathrm{E}\varphi_n(t) + \varphi(t)\, \mathrm{E}\overline{\varphi_n(t)} \right) dt + \frac{1}{2\pi} \int t^{2r} |\varphi(t)|^2\,dt. \qquad (9.2) $$
It is easy to see that
$$ \mathrm{E}\varphi_n(t) = \varphi(t), \qquad (9.3) $$
$$ \mathrm{E}\overline{\varphi_n(t)} = \overline{\varphi(t)}, \qquad (9.4) $$
and
$$ \mathrm{E}|\varphi_n(t)|^2 = \mathrm{E} \left| \frac{1}{n} \sum_{j=1}^{n} e^{itX_j} \right|^2 = \mathrm{E} \left[ \frac{1}{n} \sum_{j=1}^{n} e^{itX_j} \cdot \frac{1}{n} \sum_{k=1}^{n} e^{-itX_k} \right] = \frac{1}{n^2} \left[ n + \sum_{j \ne k} \mathrm{E}\, e^{it(X_j - X_k)} \right] = \frac{1}{n} + \left(1 - \frac{1}{n}\right) |\varphi(t)|^2. \qquad (9.5) $$
Substituting (9.3)-(9.5) into the right hand side of (9.2), we obtain (9.1).
Making use of Lemma 9.1 we obtain the following analogs of the theorems of Section 8.

Theorem 9.1. Let $f(x)$ be $r + m$ times differentiable ($r, m \ge 0$), with its $(r+m)$-th derivative square integrable. Then
$$ \mathrm{MISE}(f_n^{(r)}(x; h)) < \varepsilon(h) h^{2m} R(f^{(r+m)}) + \frac{1}{\pi(2r+1) n h^{2r+1}}, $$
where $\varepsilon(h) \le 1$ for all $h$ and $\varepsilon(h) \to 0$ as $h \to 0$.

Corollary. Let the conditions of Theorem 9.1 be satisfied. Then
$$ \inf_{h>0} \mathrm{MISE}(f_n^{(r)}(x; h)) \le C_{m,r} \left( R(f^{(r+m)}) \right)^{(2r+1)/(2r+2m+1)} n^{-2m/(2r+2m+1)}, $$
where
$$ C_{m,r} = (2\pi m)^{-2m/(2r+2m+1)} + \frac{(2\pi m)^{(2r+1)/(2r+2m+1)}}{\pi(2r+1)}. $$
Theorem 9.2. Let
$$ \frac{1}{2\pi} \int t^{2r} e^{\rho|t|^{\alpha}} |\varphi(t)|^2\,dt = C < \infty. $$
Then
$$ \mathrm{MISE}(f_n^{(r)}(x; h)) \le \varepsilon(h) C e^{-\rho/h^{\alpha}} + \frac{1}{\pi n h^{2r+1}}, $$
where $0 < \varepsilon(h) < 1$ and $\varepsilon(h) \to 0$ as $h \to 0$.

Corollary. Let the conditions of Theorem 9.2 be satisfied. Then
$$ \inf_{h>0} \mathrm{MISE}(f_n^{(r)}(x; h)) < \left( C + \frac{(\ln n)^{(2r+1)/\alpha}}{\pi \rho^{(2r+1)/\alpha}} \right) \frac{1}{n}. $$
Theorem 9.3. Let the characteristic function $\varphi(t)$ of $f(x)$ satisfy the following condition: there exists $T > 0$ such that $\varphi(t) = 0$ for $|t| > T$. If
$$ h \le \frac{1}{T}, $$
then
$$ \mathrm{MISE}(f_n^{(r)}(x; h)) \le \frac{1}{\pi n h^{2r+1}}. $$
In particular, if $h = \mathrm{const} = 1/T$, then
$$ \mathrm{MISE}(f_n^{(r)}(x; h)) \le \frac{T^{2r+1}}{\pi n}. $$
Proofs of Theorems 9.1-9.3 are similar to those of the theorems of Section 8; we therefore leave them to the reader.
References
Csörgő, S. and Totik, V. (1983). On how long interval is the empirical characteristic function uniformly consistent? Acta Sci. Math., 45, 141-149.

Davis, K.B. (1975). Mean square error properties of density estimates. Ann. Statist., 3, no. 4, 1025-1030.

Davis, K.B. (1977). Mean integrated square error properties of density estimates. Ann. Statist., 5, no. 3, 530-535.

Devroye, L. (1992). A note on the usefulness of superkernels in density estimates. Ann. Statist., 20, 2037-2056.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability. Chapman and Hall, London.

Glad, I.K., Hjort, N.L. and Ushakov, N.G. (2003). Correction of density estimators that are not densities. Scand. J. Statist., 30, no. 2, 415-427.

Ibragimov, I.A. and Khas'minskii, R.Z. (1982). Estimation of distribution density belonging to a class of entire functions. Theory Probab. Applic., 27, no. 3, 551-562.

Loeve, M. (1977). Probability Theory. Springer, Berlin.

Parzen, E. (1962). On estimation of a probability density function and its mode. Ann. Math. Statist., 33, 1065-1076.

Politis, D.N. and Romano, J.P. (1999). Multivariate density estimation with general flat-top kernels of infinite order. J. Multiv. Anal., 68, 1-25.

Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10, no. 4, 1040-1053.

Ushakov, N.G. (1999). Selected Topics in Characteristic Functions. VSP, Utrecht.

Ushakov, V.G. and Ushakov, N.G. (2000). Some inequalities for characteristic functions of densities with bounded variation. Moscow Univ. Comput. Math. Cybernet., no. 3, 45-52.

Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.

Watson, G.S. and Leadbetter, M.R. (1963). On the estimation of the probability density, I. Ann. Math. Statist., 34, 480-491.