
LOG-DENSITY DECONVOLUTION

BY WAVELET THRESHOLDING

Jérémie Bigot & Sébastien Van Bellegem

March 2007

Abstract

This paper proposes a new wavelet-based method for deconvolving a density. The estimator combines the ideas of nonlinear wavelet thresholding with periodised Meyer wavelets and estimation by information projection. It is guaranteed to be in the class of density functions; in particular, it is positive everywhere by construction. The theoretical optimality of the estimator is established in terms of the rate of convergence of the Kullback-Leibler discrepancy over Besov classes. Finite sample properties are investigated in detail and show the excellent practical performance of the estimator compared with other recently introduced estimators.

Keywords: Deconvolution, Wavelet thresholding, Adaptive estimation, Information projection, Kullback-Leibler divergence, Besov space

AMS classifications: Primary 62G07; secondary 42C40, 41A29

Affiliations

JÉRÉMIE BIGOT, Institut de Mathématiques de Toulouse, Laboratoire de Statistique et Probabilités,

Université Paul Sabatier, F-31062 Toulouse Cedex 9, France, Jeremie.Bigot@math.ups-tlse.fr

SÉBASTIEN VAN BELLEGEM, Institut de statistique and CORE, Université catholique de Louvain, Voie

du Roman Pays, 20, B-1348 Louvain-la-Neuve, Belgium, vanbellegem@stat.ucl.ac.be

Acknowledgements

This work was supported by the IAP research network nr P5/24 of the Belgian Government (Belgian Science Policy). We gratefully acknowledge Yves Rozenholc for providing the Matlab code to compute the model selection estimator, Marc Raimondo for providing the Matlab code for translation invariant deconvolution, and Anestis Antoniadis for helpful comments and suggestions.

1 Introduction

Density deconvolution is a widely studied statistical problem that is encountered in many applied situations. This problem arises when the probability density of a random variable $X$ has to be estimated from an independent and identically distributed (iid) sample contaminated by some independent additive noise. Namely, the observations at hand, denoted by $Y_i$ for $i = 1,\ldots,n$, are such that
$$Y_i = X_i + \epsilon_i, \qquad i = 1,\ldots,n,$$
where the $X_i$ are iid variables with unknown density $f_X$, and the added variables $\epsilon_i$ model the contamination by some noise. The number $n$ represents the sample size, and the contamination variables $\epsilon_i$ are supposed iid with a known density function $f_\epsilon$ and independent of the $X_i$'s. In this setting, the density function $f_Y$ of the observed sample $Y_i$ can be written as the convolution of the density of interest $f_X$ and the density of the additive noise $f_\epsilon$, i.e.
$$f_Y(y) = f_X \star f_\epsilon(y) := \int f_X(u)\, f_\epsilon(y-u)\, du, \qquad y \in \mathbb{R}. \qquad (1.1)$$
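To fix ideas, the observation model can be simulated in a few lines. The sketch below is our own hypothetical Python illustration (not from the paper); the Gaussian signal and Laplace noise choices mirror the simulation design of Section 5.

```python
import numpy as np

def simulate_convolution_model(n, sigma_eps, rng=None):
    """Draw Y_i = X_i + eps_i with X ~ N(0.5, 0.1^2) and Laplace noise.

    The Laplace scale b = sigma_eps / sqrt(2) gives noise variance sigma_eps^2.
    Only the contaminated sample Y_i is returned, as only it is observed.
    """
    rng = np.random.default_rng(rng)
    x = rng.normal(0.5, 0.1, size=n)                          # unobserved X_i
    eps = rng.laplace(0.0, sigma_eps / np.sqrt(2), size=n)    # noise from f_eps
    return x + eps

y = simulate_convolution_model(500, sigma_eps=0.05, rng=0)
```

The deconvolution problem is then to recover the density of the unobserved $X_i$ from `y` alone, using the known noise density.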

In data analysis, density estimation from noisy samples plays a fundamental role. Applications can be found in communication theory (e.g. Masry, 2003), experimental physics (e.g. Kosarev et al., 2003) or econometrics (e.g. Postel-Vinay and Robin, 2002), to name but a few. The problem of estimating the probability density $f_X$ relates to classical nonparametric methods of estimation, but the indirect observation of the data leads to different optimality properties, for instance in terms of rates of convergence.

Among the nonparametric methods of deconvolution, standard methods recently studied in the statistical literature include estimation by model selection (e.g. Comte, Rozenholc and Taupin, 2006b), wavelet thresholding (e.g. Fan and Koo, 2002), kernel smoothing (e.g. Carroll and Hall, 1988), spline deconvolution (e.g. Koo, 1999) and spectral cut-off (e.g. Carrasco and Florens, 2002). However, a problem frequently encountered with most of these techniques is that the proposed estimator is not everywhere positive, and therefore is not a valid probability density.


The goal of this paper is to present an estimator that is optimal in terms of asymptotic rates of convergence and that benefits from good finite sample properties. Furthermore, the proposed estimator is automatically a valid density, in particular because it is guaranteed to be positive. The proposed solution uses wavelet thresholding combined with information projection techniques, and is computationally simple.

The advantage of wavelet methods is their ability to estimate local features of the density, such as peaks or local discontinuities. In particular, they can estimate irregular functions (in Besov spaces) with optimal rates of convergence. Wavelet methods for deconvolution have received special attention in the recent literature. Optimality of the nonlinear wavelet estimator has been established in Fan and Koo (2002), but the given estimator is not computable since it depends on an integral in the frequency domain that cannot be calculated in practice. The estimator we propose below, in addition to being a valid density, is fully computable, as it only involves finite sums in finite samples. Other recent wavelet estimators for deconvolution problems include the work of Johnstone, Kerkyacharian, Picard and Raimondo (2004) and De Canditiis and Pensky (2006); see also the references therein.

Our estimator combines wavelet thresholding with an information projection that guarantees the solution is positive. This technique was studied by Barron and Sheu (1991) for the approximation of density functions by sequences of exponential families. An extension of this method to linear inverse problems was studied by Koo and Chung (1998) using expansions in Fourier series. In the special case of Poisson inverse problems, Antoniadis and Bigot (2006) combined this technique with estimation by wavelet expansions.

It is well known that the difficulty of the deconvolution problem is quantified by the smoothness of the noise density $f_\epsilon$. If $f^Y_\ell$, $f^X_\ell$ and $f^\epsilon_\ell$ denote the Fourier coefficients of the densities $f_Y$, $f_X$ and $f_\epsilon$ respectively, then the convolution equation (1.1) is equivalent to $f^Y_\ell = f^X_\ell \cdot f^\epsilon_\ell$. Depending on how fast the Fourier coefficients $f^\epsilon_\ell$ tend to zero, the reconstruction of $f^X_\ell$ will be more or less accurate. This phenomenon was systematically studied by Fan (1991), who introduced the following two types of assumption on the smoothness of $f_\epsilon$.

Assumption 1.1 Ordinary smooth convolution: the Fourier coefficients of $f_\epsilon$ decay in a polynomial fashion, i.e. there exist a constant $C$ and a real $\nu > 0$ such that $|f^\epsilon_\ell| \sim C |\ell|^{-\nu}$.


Assumption 1.2 Super smooth convolution: the Fourier coefficients of $f_\epsilon$ are such that
$$d_1 |\ell|^{\nu_0} \exp(-|\ell|^{\nu}/d_0) \leq |f^\epsilon_\ell| \leq d_2 |\ell|^{\nu_1} \exp(-|\ell|^{\nu}/d_0) \quad \text{as } |\ell| \to \infty,$$
where $d_0, d_1, d_2, \nu, \nu_0, \nu_1$ are some positive constants.

In this paper, we also consider these two smoothness assumptions. The optimal rates of convergence we can expect from a linear or a nonlinear wavelet estimator depend on these smoothness assumptions and are well studied in the literature. To summarize, we know from the work of Pensky and Vidakovic (1999) and Fan and Koo (2002) that for ordinary smooth convolution both linear and nonlinear wavelet estimators achieve the optimal rate of convergence. This rate is of polynomial order in the sample size $n$. However, no adaptive linear estimator is optimal, and only well-calibrated nonlinear wavelet estimators are adaptive. For the case of super smooth convolution, the optimal rate of convergence is only of logarithmic order in the sample size, and there is no difference between the rates of convergence of linear and nonlinear estimators. These results are recalled in Section 3 below. It is worth mentioning that the estimators we define in this paper achieve these optimal rates of convergence.

The next section recalls some general results on wavelet approximation and the definition of the Meyer wavelet used for deconvolution. Then Section 3 defines the linear and nonlinear wavelet estimators by information projection. The (optimal) rates of convergence of the proposed estimators are stated in Section 4. The loss function we consider to calculate these rates is the Kullback-Leibler divergence. Due to the aforementioned difference with the wavelet estimator of Fan and Koo (2002), their technique of proof is very different from the proof presented in this paper. Our proof is actually based on a combination of the maxiset theorem in Johnstone et al. (2004) for hard thresholding wavelet estimators and other results on the Kullback-Leibler divergence by Csiszár (1975) and Barron and Sheu (1991).

The fruitful combination of wavelet thresholding and information projection is not new; it was proposed in Antoniadis and Bigot (2006), who were concerned with the estimation of the intensity of a Poisson process using nonlinear wavelet Galerkin methods. The following development is different, as it deals with the estimation of log-densities and is specific to the case of deconvolution with the use of periodised Meyer wavelets. The proof techniques also contrast with Antoniadis and Bigot (2006), who followed the Gaussian approximation technique developed in Donoho, Johnstone, Kerkyacharian and Picard (1995). That approach is however not applicable with periodised Meyer wavelets.

Section 5 addresses the practical issues of the proposed estimation procedure. We

compare the performance of the proposed estimator with two of the most recent

techniques for density deconvolution. The first is deconvolution via cosine series

studied by Hall and Qiu (2005), and the second is the model selection approach of

Comte, Rozenholc and Taupin (2006a). While the estimator by model selection showed significant small sample improvements over most of the standard techniques of deconvolution, the proposed wavelet-based estimator by information projection outperforms the results of Comte et al. (2006a).

We conclude the paper with a technical appendix, which contains the proofs of the main theorems and in which we adapt some results of Barron and Sheu (1991) to the case of estimation by information projection using periodised Meyer wavelets.

2 Meyer wavelets for deconvolution

In this paper, we assume that the support of $f_X$ is compact and included in $[0,1]$. Of course, this is not an assumption that would hold in many practical applications; it is mainly made for mathematical convenience, to define more easily the estimation of $f_X$ by functions in an exponential family based on a finite linear combination of basis functions (see the next section). The support of $f_\epsilon$, however, can be unbounded, so the support of $f_Y$ is in general unbounded.¹

Wavelet systems provide unconditional bases for Besov spaces. Using wavelets, one can express whether or not $f_X$ belongs to a Besov space by a simple requirement on the absolute values of the wavelet coefficients of $f_X$. More precisely, assume that $(\phi,\psi)$ denote some scaling and wavelet functions that have enough regularity and vanishing moments. If $\sigma = s + 1/2 - 1/p \geq 0$, define the norm $\|\cdot\|_{s,p,q}$ by
$$\|f_X\|^q_{s,p,q} = \sum_{j=0}^{\infty} \left( 2^{j\sigma p} \sum_{k=0}^{2^j-1} |\langle f_X, \psi_{j,k}\rangle|^p \right)^{q/p}.$$

It can be shown (Meyer, 1992) that this norm is equivalent to the norm of the traditional Besov space $B^s_{p,q}$. The estimator we shall define in the next section is based on the wavelet decomposition of functions in $L^2([0,1])$ using periodised Meyer wavelets (see Meyer (1992) and Johnstone et al. (2004) for further details). Let $(\phi,\psi)$ be the periodised Meyer scaling and wavelet function respectively. Scaling and wavelet functions at scale $j$ (i.e. resolution level $2^j$) will be denoted by $\phi_\lambda$ and $\psi_\lambda$, where the index $\lambda$ summarizes both the usual scale and space parameters $j$ and $k$ (e.g. $\lambda = (j,k)$ and $\psi_{j,k} = 2^{j/2}\psi(2^j \cdot - k)$). The notation $|\lambda| = j$ will be used to denote a wavelet at scale $j$, while $|\lambda| < j$ denotes some wavelet at scale $j'$, with $0 \leq j' < j$.

¹ The case where the support of $f_X$ is included in $[0,T]$ is handled by adapting the Fourier transform (the corresponding exponential orthogonal system is $\exp(-i 2\pi x \ell / T)$).

For any function $g$ of $L^2([0,1])$, its wavelet decomposition can be written as
$$g = \sum_{|\lambda| = j_0} c_\lambda \phi_\lambda + \sum_{j = j_0}^{\infty} \sum_{|\lambda| = j} \beta_\lambda \psi_\lambda,$$
where $c_\lambda = \langle g, \phi_\lambda\rangle = \int_0^1 g(u)\phi_\lambda(u)\,du$, $\beta_\lambda = \langle g, \psi_\lambda\rangle = \int_0^1 g(u)\psi_\lambda(u)\,du$, and $j_0$ denotes the usual coarse level of resolution. One reason for using Meyer wavelets in the context of deconvolution is that they are band-limited, i.e. their Fourier transforms have compact support. In particular, the support of $\mathcal{F}\phi$ is $[-4\pi/3, 4\pi/3]$ and the support of $\mathcal{F}\psi$ is $[-8\pi/3, -2\pi/3] \cup [2\pi/3, 8\pi/3]$, where $\mathcal{F}f$ denotes the Fourier transform of a function $f$. Let $e_\ell(x) = e^{2\pi i \ell x}$, $\ell \in \mathbb{Z}$, and denote by $f_\ell = \langle f, e_\ell\rangle = \int_0^1 f(u) e^{-2\pi i \ell u}\,du$ the Fourier coefficients of a function $f \in L^2([0,1])$. Then, if we denote the Fourier coefficients of $\psi_\lambda$ by $\psi^\lambda_\ell = \langle \psi_\lambda, e_\ell\rangle$, we obtain by Plancherel's identity that
$$\beta_\lambda = \langle f_X, \psi_\lambda\rangle = \sum_\ell f^X_\ell \psi^\lambda_\ell.$$

Given that the Meyer wavelets $\psi_\lambda$ are band-limited, the above sum only involves a finite number of terms. Now, if we denote by $f^\epsilon_\ell = E(e^{-2\pi i \ell \epsilon_1})$ the characteristic function of the $\epsilon_j$'s and by $f^Y_\ell = E(e^{-2\pi i \ell Y_1})$ the characteristic function of the $Y_j$'s, we have by independence of $X_1$ and $\epsilon_1$ that
$$f^Y_\ell = E(e^{-2\pi i \ell Y_1}) = E(e^{-2\pi i \ell \epsilon_1})\, E(e^{-2\pi i \ell X_1}) = f^\epsilon_\ell f^X_\ell.$$
An unbiased estimator of $\beta_\lambda$ is thus given by
$$\hat\beta_\lambda = \sum_\ell \left( \frac{\psi^\lambda_\ell}{f^\epsilon_\ell} \right) \left( \frac{1}{n} \sum_{j=1}^{n} \exp(-2\pi i \ell Y_j) \right), \qquad (2.1)$$

provided that the $f^\epsilon_\ell$'s are non-zero and have a sufficiently smooth decay as $\ell$ tends to infinity. In (2.1), $n^{-1}\sum_{j=1}^n \exp(-2\pi i \ell Y_j)$ is simply the discrete Fourier transform of the observations $Y_1,\ldots,Y_n$.

We define the estimators $\hat c_\lambda$ of the scaling coefficients $c_\lambda$ analogously, with $\phi$ in place of $\psi$. From these estimators, we construct in the next section our estimator of the unknown density function $f_X$.

3 Estimation by information projection

3.1 Linear and nonlinear wavelet estimators

Based on the wavelet estimators $\hat c_\lambda$ and $\hat\beta_\lambda$, several estimators of the unknown density $f_X$ have been studied. First of all, the linear estimator is
$$\hat f^L_X = \sum_{|\lambda|=j_0} \hat c_\lambda \phi_\lambda + \sum_{j=j_0}^{j_1} \sum_{|\lambda|=j} \hat\beta_\lambda \psi_\lambda.$$
This estimator was first studied by Pensky and Vidakovic (1999), who showed that for an appropriate scale $j_1$ it achieves the optimal rate of convergence among the class of linear estimators. In the ordinary smooth situation (Assumption 1.1), the choice of $j_1$ is such that $2^{j_1} \approx n^{1/(2s+2\nu+1)}$ if $f_X$ belongs to the Sobolev space $H^s$. Note that this choice is not adaptive, because $j_1$ depends on the unknown smoothness class of $f_X$. For super smooth convolution (Assumption 1.2), the optimal and adaptive choice is $2^{j_1} \approx (\ln n)^{1/\nu}$; see Pensky and Vidakovic (1999) or Fan and Koo (2002).

Contrary to the linear estimator, there exist adaptive nonlinear estimators by wavelet thresholding that can achieve the optimal rate of convergence. To simplify the notation, we write in the following $(\psi_\lambda)_{|\lambda|=j_0-1}$ for the scaling functions $(\phi_\lambda)_{|\lambda|=j_0}$. The nonlinear estimator by hard thresholding is then defined by
$$\hat f^h_X = \sum_{j=j_0-1}^{j_1} \sum_{|\lambda|=j} \delta^h_{\tau_{j,n}}(\hat\beta_\lambda)\, \psi_\lambda$$
with threshold function $\delta^h_{\tau_{j,n}}(x) = x\, \mathbb{1}\{|x| \geq \tau_{j,n}\}$, and the nonlinear estimator by soft thresholding is defined by
$$\hat f^s_X = \sum_{j=j_0-1}^{j_1} \sum_{|\lambda|=j} \delta^s_{\tau_{j,n}}(\hat\beta_\lambda)\, \psi_\lambda,$$
where $\delta^s_{\tau_{j,n}}(x) = \mathrm{sign}(x)(|x| - \tau_{j,n})_+$. These estimators depend on the coarse level of approximation $j_0$, the high-frequency cut-off $j_1$ and the threshold $\tau_{j,n}$, which may depend on the resolution level $j$. The estimators $\hat f^h_X$ and $\hat f^s_X$ have already been proposed as estimators of $f_X$. An optimal adaptive estimator is obtained with appropriate choices of the scales $j_0$, $j_1$ and the threshold. One possible calibration for an adaptive estimator in ordinary smooth deconvolution is $2^{j_1} \approx n^{1/(2\nu+1)}$ and $\tau_{j,n} \approx 2^{\nu j}/\sqrt{n}$ (Pensky and Vidakovic, 1999). The choice $\tau_{j,n} \approx 2^{\nu j}\sqrt{j/n}$ was also considered (Fan and Koo, 2002).

Since all of these estimators are orthogonal series estimators, they do not in general belong to the space of valid densities. In particular, they are not necessarily positive everywhere. In the next section, we modify the linear and nonlinear estimators using a projection step that guarantees positivity.
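For concreteness, the two thresholding rules can be sketched in Python (our own illustration; the function names `hard_threshold` and `soft_threshold` are ours, not the paper's):

```python
import numpy as np

def hard_threshold(beta, thr):
    """Hard thresholding: keep a coefficient unchanged if |beta| >= thr, else kill it."""
    beta = np.asarray(beta, dtype=float)
    return beta * (np.abs(beta) >= thr)

def soft_threshold(beta, thr):
    """Soft thresholding: shrink each coefficient towards zero by thr."""
    beta = np.asarray(beta, dtype=float)
    return np.sign(beta) * np.maximum(np.abs(beta) - thr, 0.0)
```

Hard thresholding preserves the size of the retained coefficients, while soft thresholding also shrinks them, which typically yields smoother reconstructions (cf. Section 5).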

3.2 Information projection to guarantee positivity

Let $j \geq 0$. If $\theta$ denotes a vector in $\mathbb{R}^{2^j}$, then $\theta_\lambda$ denotes its $\lambda$-th component. The wavelet-based exponential family $\mathcal{E}_j$ at scale $j$ is defined as the set of functions
$$\mathcal{E}_j = \left\{ f_{j,\theta}(\cdot) = \exp\Big( \sum_{|\lambda|<j} \theta_\lambda \psi_\lambda(\cdot) - C_j(\theta) \Big),\ \theta = (\theta_\lambda)_{|\lambda|<j} \in \mathbb{R}^{2^j} \right\},$$
where
$$C_j(\theta) = \log \int_0^1 \exp\Big( \sum_{|\lambda|<j} \theta_\lambda \psi_\lambda(x) \Big)\, dx.$$
The constant $C_j(\theta)$ guarantees that $f_{j,\theta}$ integrates to one on $[0,1]$ and is thus a probability density function.

Following Csiszár (1975), the density function $f_{j,\theta}$ in the exponential family $\mathcal{E}_j$ that is closest to the true density $f_X$ in the Kullback-Leibler sense is characterized as the unique density function in the family for which
$$\langle f_{j,\theta}, \psi_\lambda \rangle = \langle f_X, \psi_\lambda \rangle, \quad \text{for all } |\lambda| < j.$$
It seems therefore natural to estimate the unknown density function $f_X$ by searching for some $\hat\theta_n \in \mathbb{R}^{2^j}$ such that
$$\langle f_{j,\hat\theta_n}, \psi_\lambda \rangle = \sum_\ell \left( \frac{\psi^\lambda_\ell}{f^\epsilon_\ell} \right) \left( \frac{1}{n} \sum_{j=1}^{n} \exp(-2\pi i \ell Y_j) \right) := \hat\alpha_\lambda, \quad \text{for all } |\lambda| < j. \qquad (3.1)$$


Note that the notation $\hat\alpha_\lambda$ is used to denote both the estimators $\hat c_\lambda$ of the scaling coefficients and the estimators $\hat\beta_\lambda$ of the wavelet coefficients.

The positive linear and nonlinear wavelet estimators are then defined as follows:

• The positive linear wavelet estimator is $f_{j_1,\hat\theta_n}$ such that $\langle f_{j_1,\hat\theta_n}, \psi_\lambda\rangle = \hat\alpha_\lambda$ for all $|\lambda| < j_1$.

• The positive nonlinear estimator with hard thresholding is $f^h_{j_1,\hat\theta_n}$ such that $\langle f^h_{j_1,\hat\theta_n}, \psi_\lambda\rangle = \delta^h_{\tau_{j,n}}(\hat\alpha_\lambda)$ for all $|\lambda| < j_1$.

• The positive nonlinear estimator with soft thresholding is $f^s_{j_1,\hat\theta_n}$ such that $\langle f^s_{j_1,\hat\theta_n}, \psi_\lambda\rangle = \delta^s_{\tau_{j,n}}(\hat\alpha_\lambda)$ for all $|\lambda| < j_1$.

The existence of these estimators is questionable. This issue is addressed in the next section and in the technical appendix. We also derive in the next section the rates of convergence of the estimators.

4 Asymptotic optimality of the estimators

To calculate the rates of convergence of the estimators, we use the loss function given by the Kullback-Leibler discrepancy between two probability density functions $p$ and $q$:
$$\Delta(p; q) = \int_0^1 p(x) \log\left( \frac{p(x)}{q(x)} \right) dx.$$
Let $M$ be some fixed constant and let $\mathcal{F}^s_{p,q}(M)$ denote the set of density functions
$$\mathcal{F}^s_{p,q}(M) = \left\{ f \in L^2[0,1] \text{ is a p.d.f. such that, for } g = \log f,\ \|g\|^q_{s,p,q} \leq M \right\}.$$
Note that assuming $f \in \mathcal{F}^s_{p,q}(M)$ implies that $f$ is strictly positive.
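Since the rates below are stated for this discrepancy, it may help to see it evaluated numerically. The following sketch is our own Python illustration, approximating $\Delta(p;q)$ with a midpoint rule on $[0,1]$:

```python
import numpy as np

def kl_discrepancy(p, q, m=10_000):
    """Approximate Delta(p;q) = int_0^1 p(x) log(p(x)/q(x)) dx by a Riemann sum.

    p and q are vectorized density functions on [0,1]; q must be positive.
    Points where p vanishes contribute zero (the 0 * log 0 = 0 convention).
    """
    t = (np.arange(m) + 0.5) / m          # midpoint grid on [0,1]
    pt, qt = p(t), q(t)
    mask = pt > 0
    return float(np.sum(pt[mask] * np.log(pt[mask] / qt[mask])) / m)
```

As expected, $\Delta(p;p) = 0$ and $\Delta(p;q) \geq 0$ in general; note that the discrepancy is not symmetric in $p$ and $q$.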

4.1 Linear estimation

The following theorem is the general result on the nonadaptive information projection

estimator of the unknown density function.


Theorem 4.1 Assume $f_X \in \mathcal{F}^s_{2,2}(M)$ with $s > 1$, and suppose that the convolution kernel $f_\epsilon$ satisfies Assumption 1.1 (ordinary smooth convolution). Let $j(n)$ be such that $2^{-j(n)} \approx n^{-1/(2s+2\nu+1)}$. Then the information projection estimator $f_{j(n),\hat\theta_n}$ exists and satisfies
$$E\,\Delta\big( f_X;\ f_{j(n),\hat\theta_n} \big) = O\left( n^{-\frac{2s}{2s+2\nu+1}} \right).$$

The above estimator therefore converges at the optimal rate for densities in $\mathcal{F}^s_{2,2}(M)$. However, this estimator is not adaptive, since the choice of $j(n)$ depends on the unknown smoothness class of the function $f_X$. Moreover, the result is only suited for smooth functions (as $\mathcal{F}^s_{2,2}(M)$ corresponds to a Sobolev space of order $s$) and does not attain the optimal rates when, for example, $g = \log(f_X)$ has singularities. In the next section, we therefore propose another estimator based on an appropriate nonlinear thresholding procedure.

4.2 Non-linear estimation

Fan and Koo (2002) show that, when the error is supersmooth, optimal rates of convergence are only of logarithmic order in the sample size. In this case, while the linear wavelet estimators cannot be optimal, nonlinear estimators do not provide much gain for estimating functions in Besov spaces. For this reason, we only consider the ordinary smooth case in the following.

In nonlinear estimation, we need to define an appropriate thresholding of the estimated coefficients $\hat\alpha_\lambda$. This threshold is level-dependent and takes the form $\tau_{j,n} = \eta \tau_j \sqrt{(\log n)/n}$ with
$$\tau_j = 2^{j\nu}, \qquad (4.1)$$
for some constant $\eta > 0$. The size of the exponential family used for the estimation depends on the high-frequency cut-off $j_1$, which is typically related to the ill-posedness $\nu$ of the inverse problem, e.g. $2^{j_1} \approx n^{1/2\nu}$ as in Antoniadis and Bigot (2006) or $2^{j_1} = O\big( (n/\log(n))^{1/(2\nu+1)} \big)$ as in Johnstone et al. (2004).

The following theorem indicates that the expected Kullback-Leibler discrepancy of the positive nonlinear estimator by hard thresholding achieves the optimal rate of convergence, provided that the finest resolution level $j_1$ is an appropriate function of the degree of smoothness $\nu$ of the convolution.


Theorem 4.2 Assume that $f \in \mathcal{F}^s_{p,q}(M)$, and suppose that the convolution kernel $f_\epsilon$ satisfies Assumption 1.1 with $\nu > 0$ (ordinary smooth convolution). Suppose that the following conditions hold:
$$1 \leq p < 2, \qquad 0 \leq q \leq \max\big( (4\nu+2)/(2s+2\nu+1),\ 4\nu/(2s+2\nu-2/p+1) \big), \qquad (4.2)$$
$$s \geq 1/2 + 1/(4\nu), \quad s \geq 1/p + 1/2, \quad \nu \geq 1/2, \quad s \geq (2\nu+1)(1/p-1/2), \quad s \geq (2\nu+1)/(2\nu-1). \qquad (4.3)$$
Then the hard thresholding estimator described above attains the minimax rate (up to logarithmic factors)
$$E\,\Delta\big( f;\ f^h_{j_1(n),\hat\theta_n} \big) = O\left( \left( \frac{\log n}{n} \right)^{2s/(2s+2\nu+1)} \right),$$
provided that $2^{j_1(n)} = O\big( (n/\log(n))^{1/(2\nu+2)} \big)$.

The proof of Theorem 4.2 is to be found in the appendix. It is based on a combination of the maxiset theorem in Johnstone et al. (2004) for hard thresholding wavelet estimators and other results on the Kullback-Leibler divergence by Csiszár (1975) and Barron and Sheu (1991). The space $\mathcal{F}^s_{p,q}(M)$ with $1 \leq p < 2$ can model piecewise smooth functions with local irregularities such as peaks or discontinuities. The above theorem shows that our information projection estimate based on hard thresholding is adaptive and near optimal for such irregular functions. Note that in Johnstone et al. (2004), the case $1/p - 1/2 - \nu \leq s < (2\nu+1)(1/p-1/2)$ is also considered, for which a different rate of convergence is derived. This is known as the 'elbow' phenomenon, which has been commonly observed in direct models and was recently noticed by Johnstone et al. (2004) for deconvolution problems; for simplicity we have not considered this case. Conditions (4.2) and (4.3) are used to guarantee the existence of the information projection estimates and to obtain near optimal rates of convergence (see the proof of Theorem 4.2 and the technical lemmas in the Appendix). Finally, note that our condition on the high-frequency cut-off $j_1$ differs from the condition given in Antoniadis and Bigot (2006) or Johnstone et al. (2004).


5 Simulations

In this section we report the results of simulations and compare our procedure with other deconvolution methods recently introduced in the literature.

Given a density $f_X$ with variance $\sigma^2_X$ and a noise density $f_\epsilon$ with variance $\sigma^2_\epsilon$, we generate observations $Y_i$, $i = 1,\ldots,n$, from the additive model $Y_i = X_i + \epsilon_i$, where the $X_i$ (resp. $\epsilon_i$) are independent realisations from $f_X$ (resp. $f_\epsilon$). Important quantities in the simulations are the sample size $n$ and the root signal-to-noise ratio defined by $\mathrm{s2n} := \sigma_X / \sigma_\epsilon$.

For the sake of conciseness, we only consider for $f_\epsilon$ the Laplace density function, given by
$$f_\epsilon(x) = \frac{1}{\sqrt{2}\,\sigma_\epsilon} \exp\left( -\frac{\sqrt{2}\,|x|}{\sigma_\epsilon} \right), \qquad x \in \mathbb{R}.$$

The Fourier coefficients of this density are given by
$$f^\epsilon_\ell = \frac{1}{1 + 2\sigma^2_\epsilon \pi^2 \ell^2}, \qquad \ell = 0, \pm 1, \pm 2, \ldots$$
This noise density corresponds to the case of ordinary smooth deconvolution with $\nu = 2$.
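As a quick numerical check (our own Python sketch), these coefficients are easy to tabulate, and their polynomial decay of order $\nu = 2$ shows up as a ratio of roughly $4$ between coefficients at frequencies $\ell$ and $2\ell$:

```python
import numpy as np

def laplace_fourier_coeff(ell, sigma_eps):
    """Fourier coefficient f^eps_ell = 1 / (1 + 2 sigma_eps^2 pi^2 ell^2) of the Laplace density."""
    ell = np.asarray(ell, dtype=float)
    return 1.0 / (1.0 + 2.0 * sigma_eps**2 * np.pi**2 * ell**2)

# |f^eps_ell| ~ C |ell|^(-2): the ratio c(ell) / c(2 ell) tends to 4 for large ell
c = laplace_fourier_coeff(np.array([100, 200]), sigma_eps=0.05)
```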

As for the density of interest $f_X$, we consider the five following situations:

1. Uniform distribution: $f(x) = 5\, \mathbb{1}_{[0.4,0.6]}(x)$.

2. Exponential distribution: $f(x) = 10\, e^{-10(x-0.2)}\, \mathbb{1}_{[0.2,+\infty)}(x)$.

3. Laplace distribution: $f(x) = 10\, e^{-20|x-0.5|}$.

4. Gaussian distribution: $X \sim N(\mu, \sigma^2)$ with $\mu = 0.5$ and $\sigma = 0.1$.

5. MixtGauss distribution (mixture of two Gaussian variables): $X \sim \pi_1 N(\mu_1, \sigma^2_1) + \pi_2 N(\mu_2, \sigma^2_2)$ with $\pi_1 = 0.4$, $\pi_2 = 0.6$, $\mu_1 = 0.4$, $\mu_2 = 0.6$ and $\sigma_1 = \sigma_2 = 0.05$.

The five densities $f_X$ are displayed in Figure 5.1, where we can observe that they show various types of smoothness. The Uniform distribution is a piecewise constant function with two jumps, the Exponential distribution is a piecewise smooth function with a single jump, the Laplace density is a continuous function with a cusp at $x = 0.5$ and is thus non-differentiable at this point, whereas the Gaussian

[Figure 5.1 here: five panels (a)-(e) showing the test densities; axis detail omitted.]

Figure 5.1: Test densities: (a) Uniform, (b) Exponential, (c) Laplace, (d) Gaussian, (e) MixtGauss (mixture of two Gaussians).

and the MixtGauss densities are very smooth signals (analytic functions). Due to the excellent localization properties of wavelets for the reconstruction of irregular signals, it is expected that our wavelet-based estimator is well adapted to these types of irregularity. Although these distributions are not all compactly supported on $[0,1]$, they have been chosen so that their mass is essentially concentrated on $[0,1]$, and it is therefore very unlikely to have observations $X_i$ outside the interval $[0,1]$.

5.1 Computation of the estimators

In the following, we describe in detail the computation of the wavelet deconvolution estimator by information projection described in the previous sections. We also introduce two competitors: the estimator by model selection studied by Comte et al. (2006a) and the cosine series deconvolution of Hall and Qiu (2005). These two procedures have recently been introduced in the literature, and their finite sample properties have been well studied.

In all simulations, we used Matlab and the wavelet toolbox WaveLab (see Buckheit, Chen, Donoho and Johnstone, 1995).

5.1.1 Wavelet deconvolution

For $\ell = -n/2 + 1, \ldots, n/2$ we compute the coefficients
$$\hat f^X_\ell = n^{-1} \sum_{j=1}^{n} \exp(-2\pi i \ell Y_j) / f^\epsilon_\ell.$$
This gives an estimate of the Fourier coefficients of the unknown function $f_X$, and we then use the efficient algorithm of Kolaczyk (1994) to compute the Meyer wavelet coefficients of a discrete signal. This algorithm requires only $O(n(\log n)^2)$ operations to compute the empirical wavelet coefficients from a sample of size $n$.
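The first step above can be transcribed directly. The sketch below is our own Python rendering of the formula (the paper's implementation is in Matlab); `fourier_eps` is a stand-in for the known coefficients $f^\epsilon_\ell$, and the Kolaczyk wavelet step is not reproduced here.

```python
import numpy as np

def deconvolved_fourier_coeffs(y, fourier_eps):
    """Estimate hat f^X_ell = n^{-1} sum_j exp(-2 pi i ell Y_j) / f^eps_ell
    for ell = -n/2 + 1, ..., n/2, as in the formula above.

    `fourier_eps(ell)` returns the known Fourier coefficient f^eps_ell (non-zero).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    ells = np.arange(-n // 2 + 1, n // 2 + 1)
    empirical = np.exp(-2j * np.pi * ells[:, None] * y[None, :]).mean(axis=1)
    return ells, empirical / np.array([fourier_eps(l) for l in ells])
```

The Meyer wavelet coefficients would then be obtained from these Fourier coefficients via Kolaczyk's algorithm.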

According to Theorem 4.2, the high-frequency cut-off $j_1(n)$ must be chosen such that $2^{j_1} = O\big( (n/\log(n))^{1/(2\nu+2)} \big)$. In practice the optimal theoretical level $(2\nu+2)^{-1}\log_2(n)$ is too small, and in our simulations we have therefore investigated the choices $j_1 = 3$ to $j_1 = \log_2(n) - 1$. For any of these choices, the optimal theoretical level is always smaller than $j_1$, and introducing a higher level of resolution may only introduce some instability in our estimator (for instance when a large wavelet coefficient due to the noise at a fine scale is erroneously kept by the thresholding procedure). This behavior has also been noticed by Johnstone et al. (2004). As we shall see in the simulation results, the best empirical level $j_1$ depends on the amount of noise and is proportional to the signal-to-noise ratio.

For a nonlinear wavelet estimator, the results of Theorem 4.2 suggest taking a threshold of the form
$$\tau_{j,n} = \eta \tau_j \sqrt{(\log n)/n},$$

where $\eta$ is a tuning constant and $\tau_j = 2^{j\nu}$. Based on extensive simulations, we have found that the best results were obtained with the choice $\eta = \sqrt{2}$ rather than $\eta = 1$. In the context of Meyer wavelet-based deconvolution in a regression setting, Johnstone et al. (2004) use the same type of level-dependent thresholding, but their scale parameter depends on the noise distribution $f_\epsilon$ and on the support of the Meyer wavelet in the Fourier domain. It is given by
$$\tilde\tau_j = \left( \frac{1}{|C_j|} \sum_{\ell \in C_j} |f^\epsilon_\ell|^{-2} \right)^{1/2},$$


where $C_j$ denotes the set of non-zero Fourier coefficients $\psi^\lambda_\ell$ at scale $|\lambda| = j$ (recall that the Meyer wavelets are band-limited) and $|C_j| = 4\pi 2^j$ is the cardinality of $C_j$. As can be seen from the proof of Lemma A.6, the choice $\tau_j = 2^{j\nu}$ comes from the bound
$$\tilde\tau^2_j = \frac{1}{|C_j|} \sum_{\ell \in C_j} |f^\epsilon_\ell|^{-2} = O(2^{2j\nu})$$
under the assumption of ordinary smooth deconvolution. It is not clear whether the scale parameters $\tau_j$ and $\tilde\tau_j$ yield similar estimators. In our simulations, we have therefore chosen to compare the results obtained with the "theoretical" scale parameter $\tau_j$ and with the "distribution dependent" scale parameter $\tilde\tau_j$.

In our simulations we have found that better and smoother estimators were obtained by soft thresholding. Once we have computed the thresholded coefficients $\delta^s_{\tau_{j,n}}(\hat\alpha_\lambda)$ for all $|\lambda| < j_1$, it remains to compute the empirical version of the information projection estimate $f^s_{j_1,\hat\theta_n}$. To do so, we use a Newton-Raphson type algorithm as described in Antoniadis and Bigot (2006).
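The moment-matching step behind this projection can be illustrated on a toy basis. The sketch below is our own simplified Python stand-in for the actual algorithm of Antoniadis and Bigot (2006): it uses a small orthonormal basis evaluated on a grid (cosines in the test, in place of periodised Meyer wavelets) and runs Newton-Raphson iterations on the equations $\langle f_{j,\theta}, \psi_\lambda\rangle = \hat\alpha_\lambda$ of Section 3.

```python
import numpy as np

def info_projection(alpha, basis, n_iter=50):
    """Newton-Raphson search for theta such that <f_theta, psi_k> = alpha_k,
    where f_theta = exp(sum_k theta_k psi_k - C(theta)) is a density on [0,1].

    `basis` is a (K, m) array: K orthonormal basis functions on an m-point grid.
    Returns theta and the fitted density on the grid (positive by construction).
    """
    K, m = basis.shape
    theta = np.zeros(K)
    f = np.ones(m)
    for _ in range(n_iter):
        u = theta @ basis
        f = np.exp(u - np.log(np.mean(np.exp(u))))   # normalized density on the grid
        moments = (basis * f).mean(axis=1)            # <f_theta, psi_k>
        resid = moments - alpha
        if np.max(np.abs(resid)) < 1e-10:
            break
        # Jacobian of the moment map = covariance matrix of the basis under f_theta
        jac = (basis * f) @ basis.T / m - np.outer(moments, moments)
        theta = theta - np.linalg.solve(jac, resid)
    return theta, f
```

Because the fitted density is an exponential of a basis expansion, positivity is automatic, which is precisely the point of the information projection step.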

5.1.2 Density deconvolution via model selection

The adaptive density deconvolution estimator of Comte et al. (2006a) is based on penalized contrast minimization over a collection of models $S_m$, $m \in \mathcal{M}_n = \{1,\ldots,m_n\}$, where $S_m$ is the space of square integrable functions with Fourier transform supported in $[-l_m, l_m]$ with $l_m = m\Delta$, $\Delta > 0$. The adaptive estimator by model selection is therefore a band-limited function $\hat f \in S_{\hat m}$, where $\hat m$ is the model selected by minimizing an appropriate penalized criterion based on the $Y_i$'s and the Fourier transform of the error distribution $f_\epsilon$; see Comte et al. (2006a). Based on extensive simulations with various sample sizes and signal-to-noise ratios, Comte et al. (2006a) show that the model selection procedure performs very well in finite samples compared with the standard estimators. This estimator outperforms the kernel estimator, even when the bandwidth parameter is selected in a data-driven way. In consequence, we see this procedure as the most challenging competitor in our simulations.


5.1.3 Cosine series deconvolution

As an alternative competitor, we also consider the deconvolution estimator recently introduced by Hall and Qiu (2005). The estimator is based on the cosine-series expansion
$$\hat f(x) = 1 + \sum_{j=1}^{m} 2\hat a_j \cos(j\pi x),$$
where $\hat a_j$ is an estimator of the cosine coefficient $a_j = \int_0^1 f(x)\cos(j\pi x)\,dx$ and $m \geq 1$ is an integer defining a high-frequency cut-off. Since the Laplace density is symmetric about its mean $0$ (recall that this is our choice of error distribution for all the simulations), a simple estimator of the cosine coefficient $a_j$ is given by
$$\hat a_j = \frac{\hat b_j}{\alpha_j}\, \delta_{\tau_n}(|\hat b_j|),$$
where $\alpha_j = E(\cos(j\pi\epsilon_1))$, and $\delta_{\tau_n}(|\hat b_j|) = \mathbb{1}\{|\hat b_j| > \tau_n\}$ is a simple hard-thresholding rule with $\tau_n = C\sqrt{\log(n)/(2n)}$ and $C$ a tuning constant. Slight modifications of this thresholding rule are also considered in Hall and Qiu (2005), but these modifications have the same theoretical and empirical properties as those based on $\delta_{\tau_n}$. Moreover, the simulations carried out by Hall and Qiu (2005) show that the simple choice $m = n$ and $C = 2$ leads to satisfactory results and that there is not much to be gained by employing cross-validation to choose $m$ and $C$. So, in all the simulations presented in this paper we take $m = n$ and $C = 2$.
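The estimator above can be sketched in a few lines of Python (our own illustration). Since this chunk does not reproduce the definition of $\hat b_j$, we use the natural empirical plug-in $\hat b_j = n^{-1}\sum_i \cos(j\pi Y_i)$; read this as our assumption, not necessarily Hall and Qiu's exact formula.

```python
import numpy as np

def cosine_deconvolution(y, alpha, m, C=2.0):
    """Sketch of a cosine-series deconvolution estimator in the style of Hall-Qiu.

    alpha[j-1] = E cos(j*pi*eps_1) for j = 1..m, known from the noise law.
    Assumed plug-in b_j = mean(cos(j*pi*Y_i)); hard-threshold, then sum the series.
    Returns a vectorized function x -> f_hat(x).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    tau = C * np.sqrt(np.log(n) / (2 * n))
    j = np.arange(1, m + 1)
    b = np.cos(np.pi * j[:, None] * y[None, :]).mean(axis=1)
    a_hat = np.where(np.abs(b) > tau, b / alpha, 0.0)
    return lambda x: 1.0 + 2.0 * (
        a_hat[:, None] * np.cos(np.pi * j[:, None] * np.asarray(x)[None, :])
    ).sum(axis=0)
```

Note that, unlike the information projection estimator, nothing in this construction prevents the result from taking negative values.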

5.2 Results of the simulations

We present results for sample sizes $n = 128, 256$ and $512$ (respectively a small, moderate and large sample size) and $\mathrm{s2n} = 100, 10, 3$ (respectively a very low, a moderate and a high level of noise). Note that for $\mathrm{s2n} = 100$ the variance of the noise is extremely low, so we are very close to the direct density estimation problem with uncontaminated data. For each combination of these factors, we simulate 100 independent samples of size $n$, and for each sample the quality of an estimator $\hat f_n$ of the test density $f$ is measured by the empirical mean squared error (MSE) defined as
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=0}^{n-1} \big( \hat f_n(t_i) - f(t_i) \big)^2,$$
where $t_i = i/n$, $i = 0,\ldots,n-1$. In Figure 5.2, we illustrate the performance of each method and show typical reconstructions of the test densities $f_X$ for $n = 256$ and $\mathrm{s2n} = 10$. Note that for the sake of better visual quality, we only plot the positive part of the estimators.
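For completeness, the MSE criterion itself is a one-liner (our own Python sketch):

```python
import numpy as np

def empirical_mse(f_hat, f, n):
    """Empirical MSE of an estimate against the true density on the grid t_i = i/n."""
    t = np.arange(n) / n
    return float(np.mean((f_hat(t) - f(t)) ** 2))
```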

Our wavelet estimator is by construction a probability density function. It is therefore visually much more satisfactory than the model selection estimator and the cosine series estimator, which may take negative values. For the three non-smooth densities (Uniform, Exponential and Laplace distributions), the reconstruction of the singularities (discontinuities and cusp) of the signals is much better with our wavelet estimator. For the two smooth densities (Gauss and MixtGauss), the model selection estimator performs slightly better than the two other methods.

By inspecting the first column in Figure 5.2 we see that the wavelet estimator is

affected by pseudo-Gibbs phenomena. A possible remedy to this defect is to use

a translation invariant (TI) procedure such as the one suggested by Donoho and

Raimondo (2004) for Meyer wavelet-based deconvolution in a regression setting.

Their algorithm yields thresholded coefficients δ^s_{τj,n}(ˆαλ) (for all |λ| < j1) invariant to translation that can be used to calculate a TI information projection estimate. In

the second column of Figure 5.2 is displayed the TI version of the wavelet estimators

plotted in the first column. One can see that the TI estimators are visually much better

since they exhibit very small oscillations while preserving a good reconstruction of

the singularities of the non-smooth densities. However, from the overall simulations,

we have found that the TI version of our wavelet estimator does not yield significant

improvements in terms of MSE. Therefore we only present results for the comparison

between our wavelet estimator (non-TI version) and the two alternative methods by

Comte et al. (2006a) and Hall and Qiu (2005).
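The TI idea can be sketched generically as cycle spinning: shift the data, estimate, shift the estimate back, and average. The sketch below is our generic illustration under that reading, not the Donoho and Raimondo (2004) algorithm itself, which operates directly on the thresholded Meyer wavelet coefficients:

```python
import numpy as np

def translation_invariant(estimator, y, n_shifts=16):
    """Average a periodic density estimator over circular shifts.

    estimator : callable mapping a sample on [0, 1] to a function on [0, 1]
    y         : observed sample (periodised to [0, 1])
    Shifting the data by h, estimating, and shifting the estimate back
    removes the dependence on the origin and damps pseudo-Gibbs effects.
    """
    def f_ti(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        acc = np.zeros_like(t)
        for k in range(n_shifts):
            h = k / n_shifts
            f_h = estimator((y + h) % 1.0)   # estimate on shifted data
            acc += f_h((t + h) % 1.0)        # shift the estimate back
        return acc / n_shifts
    return f_ti
```

Averaging over shifts leaves a translation-invariant estimator unchanged; for a shift-dependent one it smooths out the oscillations tied to the origin of the wavelet grid.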

In Figures 5.3 to 5.7, we depict for each test density fX the boxplot of the MSE over the 100 replications. All combinations of sample sizes and signal-to-noise ratios are considered. For wavelet deconvolution, we give boxplots for each type of thresholding, either with the scale parameter τj (abbreviated as wavtheo) or ˜τj (abbreviated as wavemp). We also indicate the level j1 which gives the best result in terms of averaged MSE over the 100 simulations. As can be observed from these boxplots, our wavelet approach outperforms the other methods for all types of non-smooth densities fX. It confirms the superiority of wavelet-based positive estimators



Figure 5.2: Typical reconstructions for one realization of simulations by: wavelet thresholding with j1 = 5 and the "distribution dependent" scale parameter ˜τj (first column the non-TI version, second column the TI version), model selection (third column) and cosine series (fourth column) for the five test densities: Uniform, Exponential, Laplace, Gaussian and MixtGauss. The dotted lines show the true densities and the solid lines correspond to the various estimators (n = 256 and s2n = 10).


over those based on Fourier decompositions for the reconstruction of signals with local singularities. The wavelet thresholding with the scale parameter τj = 2^{jν} generally gives better results. For the Gaussian distribution, the wavelet approach with scale parameter ˜τj yields generally better results, in particular for a small sample size (n = 128). With sample sizes n = 256 or n = 512, the results obtained with the three methods are very similar. Finally, for the MixtGauss distribution, the wavelet approach is clearly better for n = 128, while the model selection procedure is slightly better than wavelet thresholding for n = 256 and n = 512. Note that the fine level j1 which gives the best results is generally quite low and depends on the signal-to-noise ratio. For almost all combinations of the factors, the choices j1 = 3,4 yield the best results. This observation is consistent with the condition of Theorem 4.2, which suggests a smaller j1 for ill-posed inverse problems than in the direct case. It also confirms that introducing higher levels of resolution does not necessarily improve the quality of the estimator.


[Figure 5.3: 3 × 3 grid of boxplots (methods Wavemp, Wavtheo, Modsel, Cos; y-axis: Values). Panels: (n=128, s2n=3, j1 emp=3, theo=3), (n=128, s2n=10, j1 emp=3, theo=3), (n=128, s2n=100, j1 emp=3, theo=3), (n=256, s2n=3, j1 emp=3, theo=3), (n=256, s2n=10, j1 emp=3, theo=3), (n=256, s2n=100, j1 emp=4, theo=3), (n=512, s2n=3, j1 emp=3, theo=3), (n=512, s2n=10, j1 emp=4, theo=3), (n=512, s2n=100, j1 emp=5, theo=3).]

Figure 5.3: Uniform distribution: graphical display (boxplots) of the MSE with 100 repetitions for each method and all combinations of the factors n and s2n.


[Figure 5.4: 3 × 3 grid of boxplots (methods Wavemp, Wavtheo, Modsel, Cos; y-axis: Values). Panels: (n=128, s2n=3, j1 emp=4, theo=3), (n=128, s2n=10, j1 emp=5, theo=3), (n=128, s2n=100, j1 emp=5, theo=3), (n=256, s2n=3, j1 emp=4, theo=3), (n=256, s2n=10, j1 emp=5, theo=3), (n=256, s2n=100, j1 emp=5, theo=3), (n=512, s2n=3, j1 emp=4, theo=3), (n=512, s2n=10, j1 emp=5, theo=3), (n=512, s2n=100, j1 emp=5, theo=7).]

Figure 5.4: Exponential distribution: graphical display (boxplots) of the MSE with 100 repetitions for each method and all combinations of the factors n and s2n.


[Figure 5.5: 3 × 3 grid of boxplots (methods Wavemp, Wavtheo, Modsel, Cos; y-axis: Values). Panels: (n=128, s2n=3, j1 emp=3, theo=3), (n=128, s2n=10, j1 emp=3, theo=3), (n=128, s2n=100, j1 emp=3, theo=3), (n=256, s2n=3, j1 emp=3, theo=3), (n=256, s2n=10, j1 emp=3, theo=6), (n=256, s2n=100, j1 emp=3, theo=3), (n=512, s2n=3, j1 emp=3, theo=3), (n=512, s2n=10, j1 emp=3, theo=3), (n=512, s2n=100, j1 emp=4, theo=3).]

Figure 5.5: Laplace distribution: graphical display (boxplots) of the MSE with 100 repetitions for each method and all combinations of the factors n and s2n.


[Figure 5.6: 3 × 3 grid of boxplots (methods Wavemp, Wavtheo, Modsel, Cos; y-axis: Values). Panels: (n=128, s2n=3, j1 emp=3, theo=3), (n=128, s2n=10, j1 emp=3, theo=5), (n=128, s2n=100, j1 emp=3, theo=5), (n=256, s2n=3, j1 emp=4, theo=3), (n=256, s2n=10, j1 emp=3, theo=3), (n=256, s2n=100, j1 emp=3, theo=6), (n=512, s2n=3, j1 emp=4, theo=3), (n=512, s2n=10, j1 emp=3, theo=3), (n=512, s2n=100, j1 emp=3, theo=3).]

Figure 5.6: Gaussian distribution: graphical display (boxplots) of the MSE with 100 repetitions for each method and all combinations of the factors n and s2n.


[Figure 5.7: 3 × 3 grid of boxplots (methods Wavemp, Wavtheo, Modsel, Cos; y-axis: Values). Panels: (n=128, s2n=3, j1 emp=5, theo=3), (n=128, s2n=10, j1 emp=4, theo=5), (n=128, s2n=100, j1 emp=3, theo=5), (n=256, s2n=3, j1 emp=5, theo=3), (n=256, s2n=10, j1 emp=4, theo=6), (n=256, s2n=100, j1 emp=4, theo=6), (n=512, s2n=3, j1 emp=4, theo=3), (n=512, s2n=10, j1 emp=4, theo=3), (n=512, s2n=100, j1 emp=4, theo=7).]

Figure 5.7: MixtGaussian distribution: graphical display (boxplots) of the MSE with 100 repetitions for each method and all combinations of the factors n and s2n.


A Appendix

For the reader's convenience, we start the technical section with a set of lemmas that will be used in the proofs of the main results. Section A.1 collects useful inequalities on the approximation by periodised Meyer wavelets. Section A.2 recalls the maxiset theorem that is invoked to prove the optimality of the nonlinear estimator. Finally, Section A.3 presents the proofs of all results of the paper.

A.1 Approximation lemmas with periodised Meyer wavelets

The estimation of a density function based on information projection has been studied by Barron and Sheu (1991). To apply this method in our context of density deconvolution using periodised Meyer wavelets, we need to recall and adapt a set of results that are useful to prove the optimality of our estimator.

The first lemma derives a Pythagorean-like identity for the Kullback-Leibler divergence onto Ej. This result is proved in Csiszár (1975).

Lemma A.1 Let α ∈ R^{2^j}. Assume that there exists some θ(α) ∈ R^{2^j} such that for all |λ| < j: ⟨f_{j,θ(α)}, ψλ⟩ = αλ. Then, for any density function f ∈ L2([0,1]) such that ⟨f, ψλ⟩ = αλ and for all θ ∈ R^{2^j}, the identity

∆(f; f_{j,θ}) = ∆(f; f_{j,θ(α)}) + ∆(f_{j,θ(α)}; f_{j,θ})

holds true.

Note that the divergence ∆(f; g) is strictly positive unless f = g almost everywhere. Therefore, the lemma of Csiszár (1975) implies that θ(α) (if it exists) uniquely minimizes ∆(f; f_{j,θ}) over θ ∈ R^{2^j}.

The next lemma is a key result which gives sufficient conditions for the existence of the vector θ(α) as defined in Lemma A.1. This lemma also relates distances between the densities in the parametric family to distances between the corresponding wavelet coefficients. Its proof relies upon a series of lemmas on bounds within exponential families for the Kullback-Leibler distance and can be found in Barron and Sheu (1991, Lemma 5).

Lemma A.2 Let θ0 ∈ R^{2^j}, α_{0,λ} = ⟨f_{j,θ0}, ψλ⟩ and let α ∈ R^{2^j} be a given vector. Let b = exp(‖log(f_{j,θ0})‖∞) and e = exp(1). If ‖α − α0‖2 ≤ 1/(2ebAj), then the solution θ(α) to

⟨f_{j,θ(α)}, ψλ⟩ = αλ for all |λ| < j

exists and satisfies:

‖θ(α) − θ0‖2 ≤ 2eb‖α − α0‖2, (A.1)

‖log(f_{j,θ(α0)}/f_{j,θ(α)})‖∞ ≤ 2ebAj‖α − α0‖2, (A.2)

∆(f_{j,θ(α0)}; f_{j,θ(α)}) ≤ 2eb‖α − α0‖2². (A.3)

We continue with a set of technical lemmas that are needed for the proof of our main results. These lemmas are an adaptation of similar results in Barron and Sheu (1991) or Antoniadis and Bigot (2006) to the case of periodised Meyer wavelets on L2([0,1]). We start with some definitions. For f ∈ F^s_{p,q}(M), let g = log(f) and define

Dj = ‖g − Pjg‖L2 and γj = ‖g − Pjg‖∞.

The Meyer scaling functions (φλ)_{|λ|=j} span a finite dimensional space Vj within a multiresolution hierarchy V0 ⊂ V1 ⊂ ... ⊂ L2([0,1]), such that dim(Vj) = 2^j (see e.g. Meyer, 1992). In the following results, we use the inequalities ‖φλ‖∞ = ‖φ‖∞ 2^{|λ|/2} and ‖ψλ‖∞ = ‖ψ‖∞ 2^{|λ|/2}, and assume that there exists some constant Aj < ∞ such that for all v ∈ Vj:

‖v‖∞ ≤ Aj‖v‖L2.

Lemma A.3 Let v ∈ Vj; then ‖v‖∞ ≤ C2^j‖v‖L2.

PROOF: Let v = ∑_{|λ|=j} βλψλ. By the Cauchy–Schwarz inequality and the fact that ‖ψλ‖∞ ≤ C2^{j/2}, we obtain, uniformly in x ∈ [0,1],

|v(x)|² ≤ ∑_{|λ|=j}|ψλ(x)|² ∑_{|λ|=j}|βλ|² ≤ C²2^{2j}‖βj‖2²,

which establishes the result. □

Lemma A.4 Assume that f ∈ F^s_{p,q}(M) with p ≤ 2. If s > 1/p + 1/2, then there exists a constant M1 such that

0 < 1/M1 ≤ f ≤ M1 < ∞.

PROOF: Let g = log(f) = ∑_{j=−1}^∞ ∑_{|λ|=j} βλψλ. Since ‖g‖_{B^s_{p,q}} ≤ M, we can write

‖βj‖p^p = ∑_{|λ|=j}|βλ|^p ≤ M2^{−jps′},


where s′ = s + (1/2 − 1/p). As p ≤ 2, we also get

‖βj‖2 ≤ ‖βj‖p ≤ C2^{−js′}. (A.4)

Therefore, Lemma A.3 implies

‖g‖∞ ≤ ∑_{j=−1}^∞ ‖∑_{|λ|=j} βλψλ‖∞ ≤ ∑_{j=0}^∞ C2^j‖βj‖2 ≤ C ∑_{j=0}^∞ 2^{j(1−s′)} ≤ C ∑_{j=0}^∞ 2^{−j(s−1/p−1/2)}.

Since s > 1/p + 1/2, ∑_{j=0}^∞ 2^{−j(s−1/p−1/2)} < ∞ and therefore there exists some constant M1 > 1 such that ‖g‖∞ = ‖log f‖∞ ≤ log M1. □

The next lemma derives bounds for Aj, Dj and γj.

Lemma A.5 The inequality

Aj ≤ C2^j

holds true. Moreover, assume that f ∈ F^s_{p,q}(M) with p ≤ 2. If s > 1/p + 1/2, then the inequalities

Dj ≤ C2^{−j(s+1/2−1/p)} and γj ≤ C2^{−j(s−1/p−1/2)}

hold true.

PROOF: The result for Aj immediately follows from Lemma A.3. Note that from equation (A.4),

Dj² = ∑_{j′≥j} ‖β_{j′}‖2² ≤ C ∑_{j′≥j} 2^{−2j′(s+1/2−1/p)} = O(2^{−2j(s+1/2−1/p)}).

By definition, γj = ‖g − Pjg‖∞ ≤ AjDj ≤ C2^{−j(s−1/p−1/2)}, which completes the proof. □

The following lemma controls the mean square error for ˆαn,λ − αλ, where

ˆαn,λ = ∑_l (ψλ_l/fǫ_l) (1/n) ∑_{j=1}^n e^{−2πilYj} and αλ = ∑_l (ψλ_l/fǫ_l) fY_l.

Lemma A.6 Assume that the Fourier coefficients of fY are such that |fY_l| ≤ C|l|^{−u} with u > 1. Then,

E(ˆαn,λ − αλ)² ≤ C 2^{2|λ|ν}/n.

PROOF: For |λ| = j, let Cj = {ℓ : ψλ_ℓ ≠ 0}. Since the Meyer wavelets are band-limited, Cj = {ℓ : 2^j ≤ |ℓ| ≤ 2^{j+r}} for some fixed r > 0. To simplify the notation, we shall assume that Cj = {ℓ : 2^j ≤ ℓ ≤ 2^{j+r}}, noticing that all the bounds below also hold for negative values of ℓ. Then, using the fact that under Assumption 1.1 |fǫ_ℓ| ∼ |ℓ|^{−ν}, that |ψλ_ℓ| ≤ C2^{−|λ|/2} and the independence of the Yi's, we get the bound

E(ˆαn,λ − αλ)² ≤ (C/n) 2^{2|λ|ν} 2^{−|λ|} ∑_{ℓ,ℓ′=2^{|λ|}}^{2^{|λ|+r}} E e^{−2πi(ℓ−ℓ′)Y1} ≤ C 2^{2|λ|ν}/n + (C/n) 2^{2|λ|ν} 2^{−|λ|} ∑_{ℓ≠ℓ′} fY_{ℓ−ℓ′}.

As |fY_ℓ| ≤ C|ℓ|^{−u} with u > 1, the double sum ∑_{ℓ≠ℓ′} fY_{ℓ−ℓ′} in the equation above is bounded, which yields the result. □


A.2 A maxiset theorem

We use the following maxiset theorem to guarantee the near optimality of our hard thresholding projection estimate. We recall the theorem below for the reader's convenience. Its proof can be found in Johnstone et al. (2004).

Theorem A.1 Assume that (ψλ)_{|λ|>j0−1} denotes the periodised Meyer wavelet basis of L2([0,1]). Let f ∈ B^s_{p,q} with wavelet coefficients denoted by βλ. For some estimators ˆβλ of the βλ's, define the function

ˆf^h = ∑_{j=j0−1}^{j1} ∑_{|λ|=j} δ^h_{τj,n}(ˆβλ) ψλ,

with τj,n = ητj √((log n)/n), τj = 2^{jν} and 2^{j1} = O((n/log(n))^{1/(2ν+1)}) for some constant η > 0. Then, if

p ≥ 1, s ≥ 1/p, s ≥ (2ν + 1)(1/p − 1/2), 0 ≤ q ≤ max((4ν + 2)/(2s + 2ν + 1), 4ν/(2s + 2ν − 2/p + 1)),

and if for η large enough there exist two constants C1 and C2 such that for all n ∈ N∗ and |λ| = j

E|ˆβλ − βλ|⁴ ≤ C1 τj²/n, (A.5)

P(|ˆβλ − βλ| ≥ ητj √((log n)/n)) ≤ C2 (log n/n)², (A.6)

then there exists a constant C such that for all n ∈ N∗

E‖ˆf^h − f‖²_{L2([0,1])} ≤ C (log n/n)^{2s/(2s+2ν+1)}.
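The rule δ^h_{τj,n} with level-dependent threshold τj,n = ητj √((log n)/n), τj = 2^{jν}, can be written compactly as follows (a sketch; the dictionary layout of the coefficients and the default η = 1 are our choices for illustration):

```python
import numpy as np

def hard_threshold_levels(coeffs, n, nu, eta=1.0):
    """Level-dependent hard thresholding:
    tau_{j,n} = eta * 2**(j*nu) * sqrt(log(n)/n); a coefficient at level j
    is kept unchanged if its magnitude exceeds tau_{j,n}, else set to 0.

    coeffs : dict mapping resolution level j to its coefficient array
    n      : sample size
    nu     : degree of ill-posedness of the convolution operator
    """
    out = {}
    for j, beta in coeffs.items():
        tau = eta * 2.0 ** (j * nu) * np.sqrt(np.log(n) / n)
        out[j] = np.where(np.abs(beta) > tau, beta, 0.0)  # keep or kill
    return out
```

The factor 2^{jν} makes the threshold grow with the resolution level, compensating for the noise amplification of the deconvolution at high frequencies.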

A.3 Proof of the main theorems

The proof of the two main theorems is based on a decomposition of the relative entropy between the true and the estimated density function into the sum of two terms which correspond to approximation error and estimation error (bias and variance in a familiar mean squared error analysis). This decomposition is given by

∆(fX; f_{j,ˆθn}) = ∆(fX; f_{j,θ∗j}) + ∆(f_{j,θ∗j}; f_{j,ˆθn}), (A.7)

where f_{j,θ∗j} denotes the closest function of Ej to the true density fX for the Kullback-Leibler divergence. This identity comes from the Pythagorean theorem derived in Csiszár (1975) and recalled in Lemma A.1. It allows in particular to write the risk E∆(fX; f_{j(n),ˆθn}) as the sum of an approximation error term ∆(fX; f_{j(n),θ∗j(n)}) and an estimation error term E∆(f_{j(n),θ∗j(n)}; f_{j(n),ˆθn}).

The control of the approximation error term is similar for the linear and the nonlinear estimators. Below, we first prove the existence and uniqueness of f_{j,θ∗j}. Based on some inequalities derived in Barron and Sheu (1991), we show that the approximation error is controlled by the norms ‖g − Pjg‖L2 and ‖g − Pjg‖∞, where g = log(fX) and Pjg = ∑_{|λ|<j}⟨g,ψλ⟩ψλ. Bounds for these norms were derived in Section A.1 above.

The control of the estimation error term differs for the linear and the nonlinear estimators. In the linear case, it simply relates to the control of the risk E‖ˆαn − α0‖2² (using Lemma A.6 from the technical appendix). Thus, the architecture of the proof follows the reasoning of Antoniadis and Bigot (2006), but the arguments differ for the control of the nonlinear term. In the nonlinear situation, Antoniadis and Bigot (2006) use the fact that their Poisson inverse problem is not too far from a usual Gaussian white noise model. This allows them to use standard results in the Gaussian setting on soft-thresholding estimators with a level-dependent threshold. This reasoning follows the technique initially proposed by Donoho, Johnstone, Kerkyacharian and Picard (1996, Section 6), and adapted to the case of Poisson inverse problems by Cavalier and Koo (2002) and Antoniadis and Bigot (2006). This argument is no longer valid for our deconvolution problem, because our estimator is constructed using periodised Meyer wavelets which, by definition, do not have appropriate vanishing moments. In this paper, for the nonlinear situation, we use some classical moment bounds (Rosenthal (1972)) and Bernstein's inequality to control the difference between the estimated wavelet coefficients and their true values, together with the maxiset theorem of Johnstone et al. (2004) that is recalled in Theorem A.1 in Section A.2.

Finally, note that in the rest of the Appendix, C denotes a generic constant whose value may change from line to line.

A.3.1 Proof of Theorem 4.1

This proof concerns the linear, non-adaptive estimator.

We first prove the existence of θ∗j. Let g = log(fX) = ∑_{j=−1}^∞ ∑_{|λ|=j} βλψλ, and for all |λ| < j define the wavelet coefficients αj,λ = ⟨exp(Pjg), ψλ⟩ and αλ = ⟨fX, ψλ⟩. Bessel's inequality gives ‖αj − α‖2² ≤ ‖fX − exp(Pjg)‖²_{L2}. Therefore, Lemma A.4 implies

‖αj − α‖2² ≤ M1 ∫ (fX − exp(Pjg))²/fX dµ.


Now, using Lemma 2 of Barron and Sheu (1991), we can write

‖αj − α‖2² ≤ M1 e^{2‖g−Pjg‖∞} ∫ fX(g − Pjg)² dµ ≤ M1² e^{2γj} Dj²,

where Dj = ‖g − Pjg‖L2 and γj = ‖g − Pjg‖∞. Define κj = 2M1² e^{2γj+1} Dj Aj. Lemma A.2 with θ0,λ = βλ, αλ = ⟨fX, ψλ⟩ for all |λ| < j and b = exp{‖log(exp(Pjg))‖∞} implies that θ∗j = θ(α) exists provided that M1 e^{γj} Dj ≤ (2ebAj)^{−1}. This last condition is fulfilled if κj ≤ 1 because ‖log(exp(Pjg))‖∞ ≤ log M1 + γj.

We now control the approximation error term ∆(fX; f_{j(n),θ∗j(n)}). From Lemma A.1 we can write ∆(fX; f_{j,θ∗j}) ≤ ∆(fX; exp(Pjg)). Thence, by Lemma 1 of Barron and Sheu (1991),

∆(fX; f_{j,θ∗j}) ≤ (1/2) exp(‖g − Pjg‖∞) ∫ fX(g − Pjg)² dµ ≤ (1/2) M1 e^{γj} Dj². (A.8)

Now let j(n) be such that 2^{j(n)} ≤ n^{1/2ν}. As fX ∈ F^s_{2,2}(M) with s > 1 by assumption, it follows from the bounds on Aj, Dj and γj given in Lemma A.5 that γj(n) → 0 as n → ∞, and so κj(n) = O(Aj(n) Dj(n)) = O(2^{−j(n)(s−1)}). Since κj(n) → 0 as n → ∞, equation (A.8) implies that for n sufficiently large there exists some θ∗j(n) such that ⟨fX, ψλ⟩ = ⟨f_{j(n),θ∗j(n)}, ψλ⟩ for all |λ| < j(n), which satisfies, for 2^{−j(n)} ≈ n^{−1/(2s+2ν+1)},

∆(fX; f_{j(n),θ∗j(n)}) = O(2^{−2j(n)s}) = O(n^{−2s/(2s+2ν+1)}). (A.9)

We now turn to the estimation error term. For all |λ| < j(n), define α_{0,λ} = ⟨fX, ψλ⟩ = ⟨f_{j,θ∗j}, ψλ⟩ and let ˆαn,λ = ∑_l (ψλ_l/fǫ_l) ∑_{j=1}^n exp(−2πilYj)/n. To prove the existence of a vector ˆθn ∈ R^{2^{j(n)}} such that

⟨f_{j,ˆθn}, ψλ⟩ = ˆαn,λ for all |λ| < j(n),

we need to control the term ‖ˆαn − α0‖2² = ∑_{|λ|<j(n)} (ˆαn,λ − α_{0,λ})² and then to apply Lemma A.2 with θ0 = θ∗j(n), α = ˆαn and b = exp{‖log(f_{j(n),θ∗j(n)})‖∞}. Given our assumptions on fX and fǫ, we have that |fY_l| ≤ C|l|^{−(s+ν)} with s + ν > 1, and we can therefore apply Lemma A.6 to obtain

E‖ˆαn − α0‖2² ≤ (C/n) 2^{j(n)(2ν+1)}.

Note that we have

‖log(f_{j(n),θ∗j(n)}/exp(Pj(n)g))‖∞ ≤ κj(n),

and so b ≤ M1 e^{κj(n)+γj(n)}. Hence, if we set δj(n) = 2M1 e^{κj(n)+γj(n)+1} Aj(n) 2^{j(n)(ν+1/2)}/√n, we can write δj(n) = O(2^{j(n)(ν+3/2)}/√n) = O(2^{−j(n)(s−1)}) → 0 as n → ∞. Hence, by Lemma A.2 we have that for n sufficiently large, ˆθn exists and is such that

E[∆(f_{j(n),θ∗j(n)}; f_{j(n),ˆθn})] = O(2^{j(n)(2ν+1)}/n) = O(n^{−2s/(2s+2ν+1)}), (A.10)

for 2^{−j(n)} ≈ n^{−1/(2s+2ν+1)}. The result of the theorem now follows from the control of the approximation and estimation error terms, using the identity (A.7). □

A.3.2 Proof of Theorem 4.2

We consider in this proof the nonlinear, adaptive estimator, and first control the approximation error term in a very similar way to the preceding proof. By proceeding as in the proof of Theorem 4.1, we easily show that for 1 ≤ p < 2 and s > 1/2 + 1/p,

∆(fX; f_{j1(n),θ∗j1(n)}) = O(2^{−2j1(n)(s−1/2−1/p)}),

using the notations from the proof of Theorem 4.1 for f_{j1(n),θ∗j1(n)}. Then, since 2^{j1(n)} = O((n/log(n))^{1/(2ν+1)}), we have that for n sufficiently large

∆(fX; f_{j1(n),θ∗j1(n)}) = O((log(n)/n)^{2(s−1/2−1/p)/(2ν+1)}).

Now, since by assumption s ≥ (2ν+1)(1/p−1/2), we obtain that s − 1/2 − 1/p ≥ 2sν/(2ν+1), and the condition s ≥ 1/2 + 1/(4ν) finally implies that 4sν/(2ν+1)² ≥ 2s/(2s+2ν+1), which yields the following near-optimal order of convergence for the approximation term:

∆(fX; f_{j1(n),θ∗j1(n)}) = O((log(n)/n)^{2s/(2s+2ν+1)}).

We can now consider the estimation error term. Define ˆαn,λ and αλ as in the proof of Theorem 4.1, and define

E‖δ^h_{τj,n}(ˆαn) − α0‖2² = ∑_{j0−1≤|λ|<j1(n)} E(δ^h_{τj,n}(ˆαn,λ) − αλ)²,

with τj,n = ητj √((log n)/n) and τj = 2^{jν}. To control the above sum, we use Theorem A.1, which gives the rate of convergence for the level-dependent hard thresholding estimator defined with the periodised Meyer wavelet basis. Given our conditions on p, q, s, ν, j1 and τj,n, we only have to check the conditions (A.5) and (A.6) in Theorem A.1 with ˆβλ = ˆαn,λ and βλ = αλ.


First, we recall the following result for moment bounds of i.i.d. variables (Rosenthal (1972)): let Z1,...,Zn be i.i.d. random variables with EZj = 0 and EZj² ≤ σ². Then, there exists cm such that if m ≥ 2,

E|(1/n) ∑_{j=1}^n Zj|^m ≤ cm (σ^m/n^{m/2} + E|Z1|^m/n^{m−1}). (A.11)

Recall that ˆαn,λ − αλ = (1/n) ∑_{j=1}^n ∑_l (ψλ_l/fǫ_l)(e^{−2πilYj} − fY_l). For |λ| = j, let Cj = {l : ψλ_l ≠ 0}. Since the Meyer wavelets are band-limited, Cj = {l : 2^j ≤ |l| ≤ 2^{j+r}} for some fixed r > 0. To simplify the notation, we shall assume that Cj = {l : 2^j ≤ l ≤ 2^{j+r}}, noticing that all the bounds below also hold for negative values of ℓ. Hence, we have that

ˆαn,λ − αλ = (1/n) ∑_{j=1}^n Zj,

where the Zj's are i.i.d. variables such that

Zj = ∑_{l=2^{|λ|}}^{2^{|λ|+r}} (ψλ_l/fǫ_l)(e^{−2πilYj} − fY_l).

Hence, EZj = 0. Given our assumptions on fX and fǫ, we have that |fY_l| ≤ C|l|^{−(s+ν)} with s + ν > 1/2. Hence, there exists a constant C such that for all l ∈ Z, |fY_l| ≤ C. Moreover, using the fact that under Assumption 1.1 |fǫ_l| ∼ |l|^{−ν}, and using |ψλ_l| ≤ C2^{−|λ|/2}, we obtain that

|Zj|² ≤ C 2^{2|λ|ν} 2^{−|λ|} ∑_{ℓ,ℓ′=2^{|λ|}}^{2^{|λ|+r}} (e^{−2πiℓYj} − fY_ℓ)(e^{−2πiℓ′Yj} − fY_{ℓ′}) ≤ C 2^{|λ|(2ν+1)},

which implies that EZj² ≤ C2^{j(2ν+1)} and EZj⁴ ≤ C2^{j(4ν+2)}. Now, if we apply inequality (A.11) with m = 4, for |λ| = j we obtain that

E|ˆαn,λ − αλ|⁴ ≤ C(2^{j(4ν+2)}/n² + 2^{j(4ν+2)}/n³).

As the thresholding parameter is such that τj² = 2^{2jν}, and given that for j ≤ j1(n) one has 2^{j(2ν+2)} = O(n/log(n)), we have that

2^{j(4ν+2)}/n² ≤ C τj²/n and 2^{j(4ν+2)}/n³ ≤ C τj²/n,

and therefore the inequality

E|ˆαn,λ − αλ|⁴ ≤ C τj²/n

holds true. This development proves that ˆαn,λ − αλ satisfies the condition (A.5).

Now, recall the standard Bernstein inequality: let Z1,...,Zn be i.i.d. random variables with EZj = 0, EZj² ≤ σ² and |Zj| ≤ ‖Z‖∞ < +∞; then for any λ > 0

P(|(1/n) ∑_{j=1}^n Zj| > λ) ≤ 2 exp(−nλ²/(2(σ² + ‖Z‖∞ λ/3))).

So, if we apply Bernstein's inequality with the Zj's as defined previously, we get the following bound for |λ| = j (for some constants C1 and C2):

P(|ˆαn,λ − αλ| > ητj √((log n)/n)) ≤ 2 exp(−η² log(n)/(2(C1/n + C2 η (log n/n)^{1/2}))) ≤ 2 exp(−Cη² log(n)).

Hence, for η large enough one has that for all n ≥ 1

P(|ˆαn,λ − αλ| > ητj √((log n)/n)) ≤ C (log n/n)²,

which proves that ˆαn,λ − αλ satisfies the condition (A.6). Hence, from Theorem A.1, we finally derive the following upper bound:

E‖δ^h_{τj,n}(ˆαn) − α0‖2² = O((log(n)/n)^{2s/(2s+2ν+1)}).

To prove the existence of the projection estimate f^h_{j1(n),ˆθn}, we proceed as in the proof of Theorem 4.1. For b ≤ M1 e^{κj(n)+γj(n)}, we define δj(n) = 2M1 e^{κj(n)+γj(n)+1} Aj(n) (log(n)/n)^{s/(2s+2ν+1)}. Thus, for 2^{j1(n)} = O((n/log(n))^{1/(2ν+1)}), we can write δj(n) = O((log(n)/n)^{s/(2s+2ν+1)−1/(2ν+1)}). Given our assumptions on s and ν, we have that s ≥ (2ν + 1)/(2ν − 1), which implies that δj(n) → 0 as n → ∞. Hence, by Lemma A.2 we have that for n sufficiently large, f^h_{j(n),ˆθn} exists and is such that

E[∆(f_{j(n),θ∗j(n)}; f^h_{j(n),ˆθn})] = O((log(n)/n)^{2s/(2s+2ν+1)}).

The result of the theorem now follows from the control of the approximation and estimation error terms, using the identity (A.7). □

References

Antoniadis, A. and Bigot, J. (2006). Poisson inverse problems. Ann. Statist., 34, 2132–2158.


Barron, A. R. and Sheu, C. H. (1991). Approximation of density functions by sequences

of exponential families. Ann. Statist., 19, 1347–1369.

Buckheit, J., Chen, S., Donoho, D. and Johnstone, I. (1995). Wavelab reference manual

(Tech. Rep.). Department of Statistics, Stanford University. (http://www-stat.

stanford.edu/software/wavelab)

Carrasco, M. and Florens, J.-P. (2002). Spectral method for deconvolving a density (Working Paper No. 138). Université de Toulouse I: IDEI.

Carroll, R. and Hall, P. (1988). Optimal rates of convergence for deconvolving a density. J. Amer. Statist. Assoc., 83, 1184–1186.

Cavalier, L. and Koo, J.-Y. (2002). Poisson intensity estimation for tomographic data

using a wavelet shrinkage approach. IEEE Trans. Information Theory, 48, 2794–

2802.

Comte, F., Rozenholc, Y. and Taupin, M.-L. (2006a). Finite sample penalization in

adaptive density deconvolution. J. Stat. Comput. Simul., to appear.

Comte, F., Rozenholc, Y. and Taupin, M.-L. (2006b). Penalized contrast estimator for

density deconvolution. Canad. J. Statist., 34, XXX.

Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization

problems. Ann. Probab., 3, 146–158.

De Canditiis, D. and Pensky, M. (2006). Simultaneous wavelet deconvolution in periodic setting. Scand. J. Statist., 33, 293–306.

Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1995). Wavelet

shrinkage: Asymptopia? J. Roy. Statist. Soc. Ser. B, 57, 301–369.

Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1996). Density

estimation by wavelet thresholding. Ann. Statist., 24, 508–539.

Donoho, D. L. and Raimondo, M. (2004). Translation invariant deconvolution in a

periodic setting. Int. J. Wavelets Multiresolut. Inf. Process., 4, 415–431.

Fan, J. (1991). On the optimal rate of convergence for nonparametric deconvolution

problems. Ann. Statist., 19, 1257–1272.

Fan, J. and Koo, J.-Y. (2002). Wavelet deconvolution. IEEE Trans. Inform. Theory, 48,

734–747.

Hall, P. and Qiu, P. (2005). Discrete-transform approach to deconvolution problems.

Biometrika, 92, 135–148.

Johnstone, I., Kerkyacharian, G., Picard, D. and Raimondo, M. (2004). Wavelet deconvolution in a periodic setting. J. Roy. Statist. Soc. Ser. B, 66, 547–573.

Kolaczyk, E. (1994). Wavelet methods for the inversion of certain homogeneous linear operators in the presence of noisy data. Ph.D. thesis, Department of Statistics, Stanford University, Stanford.

Koo, J.-Y. (1999). Logspline deconvolution in Besov space. Scand. J. Statist., 26, 73–86.

Koo, J.-Y. and Chung, H.-Y. (1998). Log-density estimation in linear inverse problems.

Ann. Statist., 26, 335–362.

Kosarev, E., Shul’man, A., Tarasov, M. and Lindstroem, T. (2003). Deconvolution

problems and superresolution in Hilbert-transform spectroscopy based on a.c.

Josephson effect. Comput. Phys. Comm., 151, 171–186.

Masry, E. (2003). Deconvolving multivariate kernel density estimates from contaminated associated observations. IEEE Trans. Inform. Theory, 49, 2941–2952.

Meyer, Y. (1992). Wavelets and operators. Cambridge: Cambridge University Press.

Pensky, M. and Vidakovic, B. (1999). Adaptive wavelet estimator for nonparametric

density deconvolution. Ann. Statist., 27, 2033–2053.

Postel-Vinay, F. and Robin, J.-M. (2002). Equilibrium wage dispersion with worker and

employer heterogeneity. Econometrica, 70, 2295–2350.

Rosenthal, H. P. (1972). On the span in Lp of sequences of independent random variables II. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. II, 149–167.
