
Efficient Estimation of Population-Level Summaries

in General Semiparametric Regression Models

Arnab MAITY, Yanyuan MA, and Raymond J. CARROLL

This article considers a wide class of semiparametric regression models in which interest focuses on population-level quantities that combine

both the parametric and the nonparametric parts of the model. Special cases in this approach include generalized partially linear models,

generalized partially linear single-index models, structural measurement error models, and many others. For estimating the parametric part

of the model efficiently, profile likelihood kernel estimation methods are well established in the literature. Here our focus is on estimating

general population-level quantities that combine the parametric and nonparametric parts of the model (e.g., population mean, probabilities,

etc.). We place this problem in a general context, provide a general kernel-based methodology, and derive the asymptotic distributions of

estimates of these population-level quantities, showing that in many cases the estimates are semiparametric efficient. For estimating the

population mean with no missing data, we show that the sample mean is semiparametric efficient for canonical exponential families, but not

in general. We apply the methods to a problem in nutritional epidemiology, where estimating the distribution of usual intake is of primary

interest and semiparametric methods are not available. Extensions to the case of missing response data are also discussed.

KEY WORDS: Generalized estimating equations; Kernel methods; Measurement error; Missing data; Nonparametric regression; Nutri-

tion; Partially linear model; Profile method; Semiparametric efficient score; Semiparametric information bound; Single-

index models.

1. INTRODUCTION

This article is about semiparametric regression models when

one is interested in estimating a population quantity such as

the mean, variance, and probabilities. The unique feature of

the problem is that the quantities of interest are functions of

both the parametric and the nonparametric parts of the model.

We will also allow for partially missing responses, but handling

such a modification is relatively easy. The main aim of the ar-

ticle is to estimate population quantities that involve both the

parametric and the nonparametric parts of the model and to do

so efficiently and in considerable generality.

We will construct estimators of these population-level quan-

tities that exploit the semiparametric structure of the problem,

derive their limiting distributions, and show in many cases that

the methods are semiparametric efficient. The work is motivated

by and illustrated with an important problem in nutritional

epidemiology, namely, estimating the distribution of usual in-

take for episodically consumed foods such as red meat.

A special simple case of our results is already established

in the literature (Wang, Linton, and Härdle 2004 and the refer-

ences therein), namely, the partially linear model

$$Y_i = X_i^{T}\beta_0 + \theta_0(Z_i) + \xi_i, \tag{1}$$

where $\theta_0(\cdot)$ is an unknown function and $\xi_i = \mathrm{Normal}(0,\sigma_0^2)$.

We allow the responses to be partially missing, important in

Arnab Maity is a Graduate Student (E-mail: amaity@stat.tamu.edu),

Yanyuan Ma is Assistant Professor (E-mail: ma@stat.tamu.edu), and Raymond

J. Carroll is Distinguished Professor (E-mail: carroll@stat.tamu.edu), Depart-

ment of Statistics, Texas A&M University, College Station, TX 77843. This

work was supported by grants from the National Cancer Institute (CA57030

for AM and RJC; CA74552 for YM) and by the Texas A&M Center for En-

vironmental and Rural Health via a grant from the National Institute of Envi-

ronmental Health Sciences (P30-ES09106). The authors are grateful to Janet

Tooze, Amy Subar, Victor Kipnis, and Douglas Midthune for introducing us to

the problem of episodically consumed foods and for allowing us to use their

data. The authors thank Naisyin Wang for reading the final manuscript and helping

us with replies to a referee. Part of the original work of the last two authors

occurred during a visit to the Centre of Excellence for Mathematics

and Statistics of Complex Systems at the Australian National University, whose

support they gratefully acknowledge. The authors especially wish to thank three

referees, the associate editor, and the joint editor for helping turn the original

submission into a publishable article. Their patience and many helpful sugges-

tions are very greatly appreciated.

cases where the response is difficult to measure but the pre-

dictors are not. Suppose that Y is partially missing and let

δ = 1 indicate that Y is observed, so that the observed data are

(δiYi,Xi,Zi,δi). Suppose further that Y is missing at random, so

that pr(δ = 1|Y,X,Z) = pr(δ = 1|X,Z).

Usually, of course, the main interest is in estimating $\beta_0$ efficiently. This is not the problem we discuss, because in our example the parameters $\beta_0$ are themselves of relatively minor interest. In their work, Wang et al. (2004) estimated the marginal mean $\kappa_0 = E(Y) = E\{X^{T}\beta_0 + \theta_0(Z)\}$. Note how this combines both the parametric and the nonparametric parts of the model. One of the results of Wang et al. is that if one uses only the complete data in which Y is observed and then fits the standard profile likelihood estimator to obtain $\hat\beta$ and $\hat\theta(\cdot,\hat\beta)$, it transpires that a semiparametric efficient estimator of the population mean $\kappa_0$ is $n^{-1}\sum_{i=1}^{n}\{X_i^{T}\hat\beta + \hat\theta(Z_i,\hat\beta)\}$. If there are no missing data, the sample mean is also semiparametric efficient.

Actually, quite a bit more is true even in this relatively simple Gaussian case. Let $B = (\beta^{T},\sigma^2)^{T}$ and let $\hat B$ and $\hat\theta(\cdot,\hat B)$ be the profile likelihood estimates in the complete data; see, for example, Severini and Wong (1992) for local constant estimation and Claeskens and Carroll (2007) for local linear estimation. Consider estimating any functional $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ for some function $F(\cdot)$ that is thrice continuously differentiable: This, of course, includes such quantities as the population mean and probabilities. Then one very special case of our results is that the semiparametric efficient estimate of $\kappa_0$ is just $\hat\kappa = n^{-1}\sum_{i=1}^{n}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}$.

In contrast to Wang et al. (2004), we deal with general semiparametric models and general population-level quantities. Thus, consider a semiparametric problem in which the log-likelihood function given (X,Z) is $L\{Y,X,\theta(Z),B\}$. If we define $L_B(\cdot)$ and $L_\theta(\cdot)$ to be derivatives of the log-likelihood with respect to B and $\theta(Z)$, we have the properties that $E[L_B\{Y,X,\theta_0(Z),B_0\}\mid X,Z] = 0$ and similarly for $L_\theta(\cdot)$. We use profile likelihood methods computed at the observed data. With missing data, this local linear kernel version of the profile likelihood method of Severini and Wong (1992) works

© 2007 American Statistical Association

Journal of the American Statistical Association

March 2007, Vol. 102, No. 477, Theory and Methods

DOI 10.1198/016214506000001103


as follows. Let $K(\cdot)$ be a smooth symmetric density function with bounded support, let h be a bandwidth, and let $K_h(z) = h^{-1}K(z/h)$. For any fixed B, let $(\hat\alpha_0,\hat\alpha_1)$ be the local likelihood estimator obtained by maximizing, in $(\alpha_0,\alpha_1)$,

$$\sum_{i=1}^{n}\delta_i K_h(Z_i - z)\,L\{Y_i,X_i,\alpha_0 + \alpha_1(Z_i - z),B\}, \tag{2}$$

and then setting $\hat\theta(z,B) = \hat\alpha_0$. The profile likelihood estimator of $B_0$ modified for missing responses is obtained by maximizing in B

$$\sum_{i=1}^{n}\delta_i L\{Y_i,X_i,\hat\theta(Z_i,B),B\}. \tag{3}$$

Our estimator of $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ is then

$$\hat\kappa = n^{-1}\sum_{i=1}^{n}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}. \tag{4}$$

We emphasize that the possibility of missing response data and finding a semiparametric efficient estimate of $B_0$ is not the focus of the article. Instead, the focus is on estimating quantities $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ that depend on both the parametric and the nonparametric parts of the model: This is a very different problem than simply estimating $B_0$. Previous work in the area considered only the partially linear model and only estimation of the population mean: Our work deals with general semiparametric models and general population-level quantities.

An outline of this article is as follows. In Section 2 we discuss the general semiparametric problem with log-likelihood $L\{Y,X,\theta(Z),B\}$ and a general goal of estimating $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$. We derive the limiting distribution of (4) and show that it is semiparametric efficient. We also discuss the general problem where the population quantity $\kappa_0$ of interest is the expectation of a function of Y alone and describe doubly robust estimators in this context.

In Section 3 we consider the class of generalized partially linear single-index models (Carroll, Fan, Gijbels, and Wand 1997). Single-index modeling (see Härdle and Stoker 1989 and Härdle, Hall, and Ichimura 1993) is an important means of dimension reduction, one that is finding increased use in this age of high-dimensional data. We develop methods for estimating population quantities in the generalized partially linear single-index modeling framework and show that the methods are semiparametric efficient.

Section 4 describes an example from nutritional epidemiology that motivated this work, namely, estimating the distribution of usual intake of episodically consumed foods such as red meat. The model used in this area is far more complex than the simple partially linear Gaussian model (1), and while the population mean is of some interest, of considerably more interest is the probability that usual intake exceeds thresholds. We will illustrate why in this context one cannot simply adopt the percentages of the observed responses that exceed a certain threshold.

Section 5 describes three issues of importance: (1) bandwidth selection (Sec. 5.1), (2) the efficiency and robustness of the sample mean when the population mean is of interest (Sec. 5.2), and (3) numerical and theoretical insights into the partially linear

model and the nature of our assumptions (Sec. 5.3). An inter-

esting special case is, of course, the partially linear model when

κ0is the population mean. For this problem, we show in Sec-

tion 5.2 that, with no missing data, the sample mean is semi-

parametric efficient for canonical exponential families but not

of course in general, thus extending and clarifying the results of

Wang et al. (2004) that were specific to the Gaussian case.

Section 6 gives concluding remarks and results. All technical

results are given in the Appendix.
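As a concrete illustration of the estimator defined in (2)-(4), the following minimal sketch treats only the simplest special case: the Gaussian partially linear model (1) with a scalar X, no missing responses, and F equal to the regression mean, so that the target is the population mean. In that case the local likelihood step reduces to local linear least squares and the profile step has a closed form. The function names, the Epanechnikov kernel, the simulated data, and the bandwidth constant are all assumptions made for illustration; this is not the authors' implementation.

```python
import math
import random


def epanechnikov(u):
    # A symmetric density with bounded support [-1, 1], as required of K.
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def local_linear(z0, z, r, h):
    # Kernel-weighted least squares of r on (1, Z - z0); the fitted
    # intercept plays the role of alpha0-hat in (2) for Gaussian errors.
    s0 = s1 = s2 = t0 = t1 = 0.0
    for zi, ri in zip(z, r):
        d = zi - z0
        w = epanechnikov(d / h) / h
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * ri
        t1 += w * d * ri
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


def smooth(z, r, h):
    # Evaluate the local linear fit at every observed Z_i.
    return [local_linear(z0, z, r, h) for z0 in z]


def profile_fit(y, x, z, h):
    # The smoother is linear, so theta-hat(z, beta) = S[y](z) - beta * S[x](z);
    # profiling over beta is least squares on the smoothed-out residuals.
    sy = smooth(z, y, h)
    sx = smooth(z, x, h)
    ry = [yi - si for yi, si in zip(y, sy)]
    rx = [xi - si for xi, si in zip(x, sx)]
    beta = sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)
    theta = [a - beta * b for a, b in zip(sy, sx)]
    # Plug-in estimate (4) with F{X, theta(Z), B} = X * beta + theta(Z).
    kappa = sum(xi * beta + ti for xi, ti in zip(x, theta)) / len(y)
    return beta, theta, kappa


def simulate(n=400, seed=0):
    # Hypothetical data: beta0 = 2 and theta0(z) = sin(2 * pi * z).
    rng = random.Random(seed)
    z = [rng.random() for _ in range(n)]
    x = [zi + rng.gauss(0.0, 1.0) for zi in z]
    y = [2.0 * xi + math.sin(2.0 * math.pi * zi) + rng.gauss(0.0, 0.5)
         for xi, zi in zip(x, z)]
    return y, x, z


y, x, z = simulate()
h = len(y) ** (-1.0 / 3.0)  # undersmoothed rate, so that n * h^4 -> 0
beta_hat, theta_hat, kappa_hat = profile_fit(y, x, z, h)
print(beta_hat, kappa_hat, sum(y) / len(y))
```

In this special case the plug-in estimate and the sample mean of Y should be close, in line with the statement that the sample mean is also efficient when there are no missing data.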

2. SEMIPARAMETRIC MODELS WITH

A SINGLE COMPONENT

2.1 Main Results

We benefit from the fact that the limiting expansions for $\hat B$ and $\hat\theta(\cdot)$ are essentially already well known, with the minor modification of incorporating the missing response indicators. Let f(z) be the density function of Z, which is assumed to have bounded support and to be positive on that support. Let $\Omega(z) = f(z)E\{\delta L_{\theta\theta}(\cdot)\mid Z = z\}$. Let $L_{i\theta}(\cdot) = L_\theta\{Y_i,X_i,\theta_0(Z_i),B_0\}$ and so on. Then it follows from standard results (see the Appendix for more discussion) that as a minor modification of the work of Severini and Wong (1992),

$$\hat\theta(z,\hat B) - \theta_0(z) = (h^2/2)\theta_0^{(2)}(z) - n^{-1}\sum_{i=1}^{n}\delta_i K_h(Z_i - z)L_{i\theta}(\cdot)/\Omega(z) + \theta_B(z,B_0)(\hat B - B_0) + o_p(n^{-1/2}), \tag{5}$$

$$\hat B - B_0 = M_1^{-1}n^{-1}\sum_{i=1}^{n}\delta_i\epsilon_i + o_p(n^{-1/2}), \tag{6}$$

where

$$\theta_B(z,B_0) = -E\{\delta L_{B\theta}(\cdot)\mid Z = z\}/E\{\delta L_{\theta\theta}(\cdot)\mid Z = z\}, \tag{7}$$

$$\epsilon_i = \{L_{iB}(\cdot) + L_{i\theta}(\cdot)\theta_B(Z_i,B_0)\},$$

$$M_1 = E(\delta\epsilon\epsilon^{T}) = -E[\delta\{L_{BB}(\cdot) + L_{B\theta}(\cdot)\theta_B^{T}(Z,B_0)\}], \tag{8}$$

and where, under regularity conditions, (5) is uniform in z. Conditions guaranteeing (6) are well known; see the Appendix.

Define

$$D_i(\cdot) = -L_{i\theta}(\cdot)E\{F_\theta(\cdot)\mid Z_i\}/E\{\delta L_{\theta\theta}(\cdot)\mid Z_i\},$$

$$M_2 = E\{F_B(\cdot) + F_\theta(\cdot)\theta_B(Z,B_0)\}.$$

In the Appendix we show the following result.

Theorem 1. Suppose that $nh^4 \to 0$ and that (5) and (6) hold, the former uniformly in z. Suppose also that Z has compact support, that its density is bounded away from 0 on that support, and that the kernel function also has a finite support. Then the estimator $\hat\kappa$ of $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ is semiparametric efficient in the sense of Newey (1990). In addition, as $n \to \infty$,

$$n^{1/2}(\hat\kappa - \kappa_0) = n^{-1/2}\sum_{i=1}^{n}\{F_i(\cdot) - \kappa_0 + M_2^{T}M_1^{-1}\delta_i\epsilon_i + \delta_i D_i(\cdot)\} + o_p(1) \tag{9}$$

$$\Rightarrow \mathrm{Normal}[0,\ E\{F(\cdot) - \kappa_0\}^2 + M_2^{T}M_1^{-1}M_2 + E\{\delta D^2(\cdot)\}]. \tag{10}$$

Remark 1. To obtain asymptotically correct inference

about κ0, there are two possible routes. The first is to use the

bootstrap: Whereas Chen, Linton, and Van Keilegom (2003)

only justified the bootstrap for estimating B0, we conjecture

that the bootstrap works for $\kappa_0$ as well. More formally, one requires

only a consistent estimate of the limiting variance in (10).

This is a straightforward exercise, although programming-intensive:

One merely replaces all the expectations by sums in that

expression and all the regression functions by kernel estimates.
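The bootstrap route can be sketched generically as follows. This is only an illustration of the resampling mechanics under the assumption that observations are iid rows and that the whole estimation procedure is wrapped in a single function; recall that the validity of the bootstrap for $\kappa_0$ is conjectured above, not proved. The names `bootstrap_se` and `estimator` are hypothetical, and the sample mean is used as a stand-in for the full semiparametric fit.

```python
import random
import statistics


def bootstrap_se(rows, estimator, n_boot=500, seed=0):
    # Nonparametric bootstrap: resample whole observations with replacement,
    # recompute the estimator each time, and report the standard deviation
    # of the replicates as a standard-error estimate.
    rng = random.Random(seed)
    n = len(rows)
    replicates = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(sample))
    return statistics.stdev(replicates)


# Stand-in estimator (the sample mean); in practice `estimator` would refit
# the bandwidth-dependent profile likelihood and return kappa-hat.
rows = [float(i) for i in range(100)]
se = bootstrap_se(rows, lambda s: sum(s) / len(s))
print(se)
```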

Remark 2. Our analysis of semiparametric efficiency in the

sense of Newey (1990) has this outline. We first assume path-

wise differentiability of κ; see Section A.2.2 for a definition.

Working with this assumption, we derive the semiparametric

efficient score. With this score in hand, we then prove pathwise

differentiability. Details are given in the Appendix.

Remark 3. With a slight modification using a device introduced to semiparametric methods by Bickel (1982), Theorem 1 also holds for estimated bandwidths. We confine our discussion to bandwidths of order $n^{-1/3}$; see Section 5.1.2 for a reason. Write such bandwidths as $h_n = cn^{-1/3}$, where, following Bickel, the values for c are allowed to take values in the set $U = a\{0,\pm 1,\pm 2,\ldots\}$, where a is an arbitrary small number. We discretize bandwidths so that they take on values $cn^{-1/3}$ with $c \in U$. Denote estimators as $\hat\kappa(h_n)$ and note that for an arbitrary $c_*$ and an arbitrary fixed, deterministic sequence $c_n \to c_0$ for finite $c_0$, Theorem 1 shows that $n^{1/2}\{\hat\kappa(c_n n^{-1/3}) - \hat\kappa(c_0 n^{-1/3})\} = o_p(1)$ and that $n^{1/2}\{\hat\kappa(c_0 n^{-1/3}) - \hat\kappa(c_* n^{-1/3})\} = o_p(1)$. Hence, it follows from Bickel (1982, p. 653, just after eq. 3.7) that if $\hat h_n = \hat c_n n^{-1/3}$, with $\hat c_n \in U$, is an estimated bandwidth with the property that $\hat h_n = O_p(n^{-1/3})$, then $n^{1/2}\{\hat\kappa(\hat c_n n^{-1/3}) - \hat\kappa(c_* n^{-1/3})\} = o_p(1)$. Hence, Theorem 1 holds for these estimated bandwidths.
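The discretization device in Remark 3 amounts to snapping the constant in an estimated bandwidth onto a fine grid. A minimal sketch follows; the function name, the default grid spacing `a`, and the choice to keep the grid index at least 1 (so that the bandwidth stays positive) are assumptions made for illustration only.

```python
def discretize_bandwidth(h_est, n, a=0.05):
    # Write h_est = c * n^(-1/3) and move c to the nearest multiple of a.
    scale = n ** (-1.0 / 3.0)
    c_est = h_est / scale
    # Keep the index >= 1 so the returned bandwidth is strictly positive
    # (an assumption of this sketch; the set U in the text also contains 0).
    k = max(1, round(c_est / a))
    return k * a * scale


h_disc = discretize_bandwidth(0.123, 1000, a=0.05)
print(h_disc)
```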

2.2 General Functions of the Response

and Double Robustness

It is important to consider estimation in problems where

κ0 can be constructed outside the model. Suppose that κ0=

E{G(Y)} and define F{X,θ0(Z),B0} = E{G(Y)|X,Z}. We will

discuss two estimators with the properties that (1) if there are

no missing response data, the semiparametric model is not used

and the estimator is consistent; and (2) under certain circum-

stances, the estimator is consistent if either the semiparametric

model is correct or if a model for the missing-data process is

correct.

Our motivating example discussed in Section 4 does not fall

into the category discussed in this section.

The two estimators are based on different constructions for

estimating the missing-data process. The first is based on a

nonparametric formulation for estimating pr(δ = 1|Z) = πmarg,

where the subscript indicates a marginal estimation of the prob-

ability that Y is observed. The second is based on a paramet-

ric formulation for estimating pr(δ = 1|Y,X,Z) = π(X,Z,ζ),

where ζ is an unknown parameter estimated by standard logis-

tic regression of δ on (X,Z).

The first estimator, similar to one defined by Wang et al.

(2004) and efficient in the Gaussian partially linear model, can

be constructed as follows. Estimate πmargby local linear logis-

tic regression of δ on Z, leading to the usual asymptotic expan-

sion

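$$\hat\pi_{\mathrm{marg}}(z) - \pi_{\mathrm{marg}}(z) = n^{-1}\sum_{j=1}^{n}\{\delta_j - \pi_{\mathrm{marg}}(Z_j)\}K_h(z - Z_j)/f_Z(z) + o_p(n^{-1/2}), \tag{11}$$

assuming that $nh^4 \to 0$. Then construct the estimator

$$\hat\kappa_{\mathrm{marg}} = n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}G(Y_i) + \left\{1 - \frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right].$$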

The estimator has two useful properties: (1) if there are no missing

data, it does not depend on the model and is, hence, consis-

tent for κ0; and (2) if observation of the response Y depends

only on Z, it is consistent even if the semiparametric model is

not correct.

In a similar vein, the second estimate, also similar to another

estimate of Wang et al. (2004), is given as

$$\hat\kappa = n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right].$$

This estimator has the double-robustness property that if either the parametric model $\pi(X,Z,\zeta)$ or the underlying semiparametric model for $\{B,\theta(\cdot)\}$ is correct, then $\hat\kappa$ is consistent and asymptotically normally distributed. Generally, the second terms in both $\hat\kappa_{\mathrm{marg}}$ and $\hat\kappa$ improve efficiency: They are also important for the double-robustness property of $\hat\kappa$.

If both models are correct, then the following results are obtained as a consequence of (5) and (6); see the Appendix for a sketch.

Lemma 1. Define

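$$M_{2,\mathrm{marg}} = E\left[\left\{1 - \frac{\delta}{\pi_{\mathrm{marg}}(Z)}\right\}\{F_B(\cdot) + F_\theta(\cdot)\theta_B(Z,B_0)\}^{T}\right],$$

$$D_{i,\mathrm{marg}}(\cdot) = -L_{i\theta}(\cdot)\,E\left[\left\{1 - \frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F_{i\theta}(\cdot)\,\Big|\,Z_i\right]\Big/E\{\delta L_{\theta\theta}(\cdot)\mid Z_i\}.$$

Then, to terms of order $o_p(1)$,

$$n^{1/2}(\hat\kappa_{\mathrm{marg}} - \kappa_0) \approx n^{-1/2}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F_i(\cdot) - \kappa_0\right] + M_{2,\mathrm{marg}}M_1^{-1}n^{-1/2}\sum_{i=1}^{n}\delta_i\epsilon_i + n^{-1/2}\sum_{i=1}^{n}\delta_i D_{i,\mathrm{marg}}(\cdot). \tag{12}$$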

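Lemma 2. Define $\pi_\zeta(X,Z,\zeta) = \partial\pi(X,Z,\zeta)/\partial\zeta$. Assume that $n^{1/2}(\hat\zeta - \zeta) = n^{-1/2}\sum_{i=1}^{n}\psi_{i\zeta}(\cdot) + o_p(1)$ with $E\{\psi_\zeta(\cdot)\mid X,Z\} = 0$. Then, to terms of order $o_p(1)$,

$$n^{1/2}(\hat\kappa - \kappa_0) \approx n^{-1/2}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\{G(Y_i) - \kappa_0\} + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}\{F_i(\cdot) - \kappa_0\}\right]. \tag{13}$$

Remark 4. The expansions (12) and (13) show that $\hat\kappa_{\mathrm{marg}}$ and $\hat\kappa$ are asymptotically normally distributed. One can show that the asymptotic variances are given as

$$V_{\kappa,\mathrm{marg}} = \mathrm{var}\left[\frac{\delta}{\pi_{\mathrm{marg}}(Z)}G(Y) + \left\{1 - \frac{\delta}{\pi_{\mathrm{marg}}(Z)}\right\}F(\cdot) + M_{2,\mathrm{marg}}M_1^{-1}\delta\epsilon + \delta D_{\mathrm{marg}}(\cdot)\right],$$

$$V_\kappa = \mathrm{var}\left[\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}F_i(\cdot)\right],$$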

respectively, from which estimates are readily derived.
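To make the second (doubly robust) estimator and its variance formula concrete, here is a minimal sketch that takes already-fitted quantities as inputs. It assumes that the fitted observation probabilities and the fitted values of F have been computed elsewhere; the function name and argument names are hypothetical. The standard error uses the sample variance of the summands, which corresponds to the expression for the variance of the second estimator in Remark 4 and therefore presumes that both models are correct.

```python
import math


def doubly_robust(delta, g_y, pi_hat, f_hat):
    # delta[i]  : 1 if Y_i is observed, 0 otherwise
    # g_y[i]    : G(Y_i); any placeholder value is fine when delta[i] == 0
    # pi_hat[i] : fitted probability that Y_i is observed
    # f_hat[i]  : fitted F{X_i, theta-hat(Z_i, B-hat), B-hat}
    terms = []
    for d, g, p, f in zip(delta, g_y, pi_hat, f_hat):
        w = d / p
        observed_part = w * g if d else 0.0
        terms.append(observed_part + (1.0 - w) * f)
    n = len(terms)
    kappa = sum(terms) / n
    # Sample variance of the summands estimates the asymptotic variance.
    variance = sum((t - kappa) ** 2 for t in terms) / (n - 1)
    return kappa, math.sqrt(variance / n)


kappa_hat, se_hat = doubly_robust(
    delta=[1, 0, 1, 0],
    g_y=[3.0, 0.0, 5.0, 0.0],
    pi_hat=[0.5, 0.5, 0.5, 0.5],
    f_hat=[2.0, 2.0, 4.0, 4.0],
)
print(kappa_hat, se_hat)
```

With no missing data (every indicator equal to 1 and every fitted probability equal to 1), the weights on the model-based term vanish and the function returns the sample mean of G(Y), matching property (1) stated earlier in this section.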

Finally, we note that Claeskens and Carroll (2007) showed

that in general likelihood problems, if there is an omitted co-

variate, then under contiguous alternatives the effect on estima-

tors is to add an asymptotic bias, without changing the asymp-

totic variance.

3. SINGLE–INDEX MODELS

One means of dimension reduction is single-index modeling.

Single-index models can be viewed as a generalized version of

projection pursuit, in that only the most influential direction is

retained to keep the model tractable and to reduce dimension.

Since its introduction in Härdle and Stoker (1989), single-index

modeling has been widely studied and used. A comprehensive

summary of the model is given in Härdle, Müller, Sperlich, and

Werwatz (2004). Let $Z = (R,S^{T})^{T}$ where R is a scalar. We consider here the generalized partially linear single-index model (GPLSIM) of Carroll et al. (1997), namely, the exponential family (20) with $\eta(X,Z) = X^{T}\beta_0 + \theta_0(Z^{T}\alpha_0)$, where $\theta_0(\cdot)$ is an unknown function and for identifiability purposes $\|\alpha_0\| = 1$. Because identifiability requires that one of the components of Z be a nontrivial predictor of Y, for convenience we will make the very small modification that one component of Z, what we call R, is a known nontrivial predictor of Y. The reason for making this modification can be seen in theorem 4 of Carroll et al. (1997) where the final limit distribution of the estimate of $\alpha_0$ has a singular covariance matrix. In addition, their main asymptotic expansion, given in their equation (A.12), is about the nonsingular transformation $(I - \alpha_0\alpha_0^{T})(\hat\alpha - \alpha_0)$.

With this modification, we write the model as

$$E(Y\mid X,Z) = C^{(1)}[c\{\eta(X,Z)\}] = \mu\{X^{T}\beta_0 + \theta_0(R + S^{T}\gamma_0)\}, \tag{14}$$

where $\gamma_0$ is unrestricted.

Carroll et al. (1997) used profile likelihood to estimate $B_0 = (\gamma_0,\beta_0)$ and $\theta_0(\cdot)$, although they presented no results concerning the estimate of $\phi_0$, their interest largely being in logistic regression where $\phi_0 = 1$ is known. Rewrite the likelihood function (20) as $L\{Y,X,\beta,\theta(R + S^{T}\gamma),\phi\}$. Then, given $B = (\gamma^{T},\beta^{T})^{T}$, they formed $U(\gamma) = R + S^{T}\gamma$ and computed the estimate $\hat\theta\{u(\gamma),B\}$ by local likelihood of Y on $\{X,U(\gamma)\}$ as in Severini and Staniswalis (1994), using the data with $\delta = 1$. Then they maximized $\sum_{i=1}^{n}\delta_i\log[L\{Y_i,X_i,\beta,\hat\theta(R_i + S_i^{T}\gamma,B),\phi\}]$ in B and $\phi$.

Our goal is to estimate $\kappa_{SI} = E[F\{X,\theta_0(R + S^{T}\gamma_0),\beta_0,\phi_0\}]$. Our proposed estimate is $\hat\kappa_{SI} = n^{-1}\sum_{i=1}^{n}F\{X_i,\hat\theta(R_i + S_i^{T}\hat\gamma,\hat B),\hat\beta,\hat\phi\}$.

Our main result is as follows. First, define $U = R + S^{T}\gamma_0$ and

$$G = D_\phi(Y,\phi_0) - [Yc\{X^{T}\beta_0 + \theta_0(U)\} - C\{c(\cdot)\}]/\phi_0^2.$$

Also define $\Lambda = \{S^{T}\theta_0^{(1)}(U),X^{T}\}^{T}$, $\rho_\ell(\cdot) = \{\mu^{(1)}(\cdot)\}^{\ell}/V(\cdot)$, and $\epsilon = [Y - \mu\{X^{T}\beta_0 + \theta_0(U)\}]\rho_1\{X^{T}\beta_0 + \theta_0(U)\}$. Define $N_i = \Lambda_i - [E\{\delta\rho_2(\cdot)\mid U_i\}]^{-1}E\{\delta_i\Lambda_i\rho_2(\cdot)\mid U_i\}$ and $Q = E\{\delta NN^{T}\rho_2(\cdot)\}$. Make the further definitions $F_\beta(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\beta_0$, $F_\phi(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\phi_0$, and $F_\theta(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\theta_0(U)$. Also define

$$J(U) = [E\{\delta\rho_2(\cdot)\mid U\}]^{-1}E\{F_\theta(\cdot)\mid U\},$$

$$D = \begin{bmatrix} E\{F_\theta(\cdot)\theta^{(1)}(U)S\} - E(F_\theta(\cdot)[E\{\delta\rho_2(\cdot)\mid U\}]^{-1}\theta^{(1)}(U)E\{\delta S\rho_2(\cdot)\mid U\}) \\ E\{F_\beta(\cdot)\} - E(F_\theta(\cdot)[E\{\delta\rho_2(\cdot)\mid U\}]^{-1}E\{\delta X\rho_2(\cdot)\mid U\}) \end{bmatrix}.$$

Then we have the following result regarding the asymptotic distribution of $\hat\kappa_{SI}$.

Theorem 2. Assume that $(Y_i,\delta_i,X_i,Z_i)$, $i = 1,2,\ldots,n$, are iid and that the conditions in Carroll et al. (1997) hold, in particular, that $nh^4 \to 0$. Then

$$n^{1/2}(\hat\kappa_{SI} - \kappa_{SI}) = n^{-1/2}\sum_{i=1}^{n}\left[F\{X_i,\theta_0(U_i),\beta_0,\phi_0\} - \kappa_{SI} + D^{T}Q^{-1}\delta_i N_i\epsilon_i + \delta_i J(U_i)\epsilon_i + \delta_i G_i E\{F_\phi(\cdot)\}/E(\delta G^2)\right] + o_p(1) \tag{15}$$

$$\Rightarrow \mathrm{Normal}(0,V),$$

where $V = E[F\{X,\theta_0(U),\beta_0,\phi_0\} - \kappa_{SI}]^2 + D^{T}Q^{-1}D + \mathrm{var}\{\delta J(U)\epsilon\} + E(\delta G^2)[E\{F_\phi(\cdot)\}]^2/\{E(\delta G^2)\}^2$. Further, $\hat\kappa_{SI}$ is semiparametric efficient.

4. MOTIVATING EXAMPLE

4.1 Introduction

There is considerable interest in understanding the distribu-

tion of dietary intake in various populations. For example, as

obesity rates continue to rise in the United States (Flegal, Car-

roll, Ogden, and Johnson 2002), the demand for information


about diet and nutrition is increasing. Information on dietary

intake has implications for establishing population norms, con-

ducting research, and making public policy decisions (Woteki

2003).

We wish to emphasize that there are no missing response data

in this example. We also emphasize that the problem is vastly

different from simply estimating the population mean using a

Gaussian partially linear model. The strength of our approach

is that once we have proposed a semiparametric model, then

our methodology, asymptotics, and semiparametric efficiency

results are readily employed.

This article was motivated by the analysis of the Eating at

America’s Table Study (EATS) (Subar et al. 2001), where esti-

mating the distribution of the consumption of episodically con-

sumed foods is of interest. The data consist of four 24-hour re-

calls over the course of a year as well as the National Cancer

Institute’s (NCI) dietary history questionnaire (DHQ), a particular

version of a food frequency questionnaire (FFQ; see Willett

et al. 1985 and Block et al. 1986). The goal is to estimate the

distribution of usual intake, defined as the average daily intake

of a dietary component by an individual in a fixed time period,

a year in the case of EATS. There were n = 886 individuals in

the dataset.

When the responses are continuous random variables, this

is a classical problem of measurement error, with a large

literature. However, little of the literature is relevant to episodi-

cally consumed foods, as we now describe. Consider, for ex-

ample, consumption of red meat, dark-green vegetables, and

deep-yellow vegetables, all of interest in nutritional surveil-

lance. In the EATS data, 45% of the 24-hour recalls reported

no red-meat consumption. In addition, 5.5% of the individu-

als reported no red-meat consumption on any of the four sepa-

rate 24-hour recalls: For deep-yellow vegetables these numbers

are 63% and 20%, respectively, while for dark-green vegetables

the numbers are 78% and 46%, respectively. Clearly, methods

aimed at understanding usual intakes for continuous data are

inappropriate for episodically consumed foods with so many

zero-reported intakes.

4.2 Model

To handle episodically consumed foods, two-part models

have been developed (Tooze, Grunwald, and Jones 2002).

These are basically zero-inflated repeated-measures examples.

Our methods are applicable to such problems when the covari-

ate Z is evaluated only once for each subject, as it is in our

example.

We describe here a simplification of this approach, used to il-

lustrate our methodology. On each individual, we measure age

and gender, the collection being what we call R. We also ob-

serve energy (calories) as measured by the DHQ, the logarithm

of which we call Z. The reader should note that Z is evalu-

ated only once per individual, and, hence, while there are re-

peated measures on the responses, there are no repeated mea-

sures on Z: θ0(Z) occurs only once in the likelihood function,

and our methodology applies.

Let X = (R,Z). The response data for an individual i consist

of four 24-hour recalls of red-meat consumption. Let $\Delta_{ij} = 1$ if red meat is reported consumed on the jth 24-hour recall for $j = 1,\ldots,4$. Let $Y_{ij}$ be the product of $\Delta_{ij}$ and the logarithm of reported red-meat consumption, with the convention that $0\log(0) = 0$. Then the response data are $Y_i = (\Delta_{ij},Y_{ij})_{j=1}^{4}$.

4.2.1 Modeling the Probability of Zero Response. The first part of the model is whether the subject reports red-meat consumption. We model this as a repeated-measures logistic regression, so that

$$\mathrm{pr}(\Delta_{ij} = 1\mid R_i,Z_i,U_{i1}) = H(\beta_0 + X_i^{T}\beta_1 + U_{i1}), \tag{16}$$

where $H(\cdot)$ is the logistic distribution function and $U_{i1} = \mathrm{Normal}(0,\sigma_{u1}^2)$ is a person-specific random effect. Note that, for simplicity, we have modeled the effect of energy consumption as linear, because in the data there is little hint of nonlinearity.

4.2.2 Modeling Positive Responses. The second part of the model consists of a distribution of the logarithm of red-meat consumption on days when consumption is reported, namely,

$$[Y_{ij}\mid\Delta_{ij} = 1,R_i,Z_i,U_{i2}] = \mathrm{Normal}\{R_i^{T}\beta_2 + \theta(Z_i) + U_{i2},\sigma^2\}, \tag{17}$$

where $U_{i2} = \mathrm{Normal}(0,\sigma_{u2}^2)$ is a person-specific random effect, which we take to be independent of $U_{i1}$. Note that (17) means that the nonzero Y data within an individual marginally have the same mean $R_i^{T}\beta_2 + \theta(Z_i)$, variance $\sigma^2 + \sigma_{u2}^2$, and common covariance $\sigma_{u2}^2$.

4.2.3 Likelihood Function. The collection of parameters is B, consisting of $\beta_0$, $\beta_1$, $\beta_2$, $\sigma_{u1}^2$, $\sigma_{u2}^2$, and $\sigma^2$. The log-likelihood function $L(\cdot)$ is readily computed with numerical integration as follows:

$$\exp\{L(\cdot)\} = \frac{1}{\sigma_{u1}}\int\phi\left(\frac{u_1}{\sigma_{u1}}\right)\prod_{j=1}^{4}\{H(\beta_0 + X_i^{T}\beta_1 + u_1)\}^{\Delta_{ij}}\{1 - H(\beta_0 + X_i^{T}\beta_1 + u_1)\}^{1-\Delta_{ij}}\,du_1$$

$$\times\ \sigma_{u2}^{-1}\sigma^{-\sum_j\Delta_{ij}}\int\phi\left(\frac{u_2}{\sigma_{u2}}\right)\prod_{j=1}^{4}\left[\phi\left(\frac{Y_{ij} - \{R_i^{T}\beta_2 + \theta(Z_i) + u_2\}}{\sigma}\right)\right]^{\Delta_{ij}}du_2.$$

Of course, the second numerical integral is not necessary, because the integration can be done analytically.

4.2.4 Defining Usual Intake at the Individual Level. Noting from (17) that reported intake on days of consumption follows a log-normal distribution, the usual intake for an individual is defined as

$$G\{X,U_1,U_2,B,\theta(Z)\} = H(\beta_0 + X^{T}\beta_1 + U_1)\times\exp\{R^{T}\beta_2 + \theta(Z) + U_2 + \sigma^2/2\}. \tag{18}$$

The goal is to understand the distribution of $G\{X,U_1,U_2,B,\theta(Z)\}$ across a population. In particular, for arbitrary c we wish to estimate $\mathrm{pr}[G\{X,U_1,U_2,B,\theta(Z)\} > c]$. Define $F\{X,B,\theta(Z)\} = \mathrm{pr}[G\{X,U_1,U_2,B,\theta(Z)\} > c\mid X,Z]$, a quantity that can be computed by numerical integration. Then $\kappa_0 = E[F\{X,B,\theta(Z)\}]$ is the percentage of the population whose long-term reported daily average consumption of red meat exceeds c.
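The conditional exceedance probability for usual intake defined in (18) can be computed with a single one-dimensional integral, because the integral over the second random effect has a closed form in terms of the normal distribution function. The sketch below is one way to do this, assuming a simple trapezoid rule over the first random effect. The function and argument names are hypothetical: `lin1` stands for the linear predictor of the logistic part, `lin2` for the mean of the log-intake part, `sigma2` for the within-person variance, and `sd_u1`, `sd_u2` for the random-effect standard deviations. The numbers in the usage lines are made up, not the EATS estimates.

```python
import math


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def norm_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def exceed_prob(c, lin1, lin2, sigma2, sd_u1, sd_u2, grid=2001, width=8.0):
    # Usual intake exceeds c  iff
    #   U2 > log(c / H(lin1 + U1)) - lin2 - sigma2 / 2,
    # so, given U1, the probability is a normal tail probability in U2.
    # The remaining integral over U1 uses a trapezoid rule on
    # [-width * sd_u1, width * sd_u1].
    step = 2.0 * width * sd_u1 / (grid - 1)
    total = 0.0
    for k in range(grid):
        u1 = -width * sd_u1 + k * step
        p_consume = logistic(lin1 + u1)
        inner = norm_cdf((lin2 + sigma2 / 2.0 - math.log(c / p_consume)) / sd_u2)
        end_weight = 0.5 if k in (0, grid - 1) else 1.0
        total += end_weight * inner * norm_pdf(u1 / sd_u1) / sd_u1 * step
    return total


# Hypothetical inputs for one covariate pattern.
p_low = exceed_prob(0.1, lin1=-1.0, lin2=0.5, sigma2=0.76, sd_u1=0.8, sd_u2=0.2)
p_high = exceed_prob(5.0, lin1=-1.0, lin2=0.5, sigma2=0.76, sd_u1=0.8, sd_u2=0.2)
print(p_low, p_high)
```

Averaging this quantity over the observed covariates, with the fitted parameters and fitted nonparametric function plugged in, gives a plug-in estimate of the population exceedance probability in the spirit of (4).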


4.3 Bias in Naive Estimates, and a Simulation Study

We emphasize that the distribution of mean intake cannot

be estimated consistently by the simple device of computing

the sample percentage of the observed 24-hour recalls that ex-

ceed c, and, as a consequence, going through the model-fitting

process is actually necessary. To see this, suppose only one

24-hour recall per person was computed and the percentage of

these 24-hour recalls exceeding c was computed. In large sam-

ples, this percentage converges to

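$$\kappa_{24hr} = E\left[H(\beta_0 + X^{T}\beta_1 + U_1)\times\Phi\left(\{R^{T}\beta_2 + \theta(Z) - \log(c)\}/(\sigma^2 + \sigma_2^2)^{1/2}\right)\right].$$

In contrast, for $\sigma_2 > 0$,

$$\kappa_0 = E\left\{\Phi\left(\left[R^{T}\beta_2 + \theta(Z) + \sigma^2/2 - \log\{c/H(\beta_0 + X^{T}\beta_1 + U_1)\}\right]/\sigma_2\right)\right\}.$$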

As the number of replicates m of the 24-hour recall ap-

proaches ∞, the percentage κm,24hrof the means of the 24-hour

recalls that exceed c → κ0, so we would expect that the fewer

the replicates, the less our estimate agrees with the sample ver-

sion of κm,24hr, a phenomenon observed in our data; see below.
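The gap between the naive share and the share based on usual intake can be seen in a small Monte Carlo sketch of the two-part model. Everything numeric below is hypothetical (no covariates, made-up parameter values, a made-up threshold), and the function name is an assumption; the point is only that the share of means of m recalls exceeding c moves toward the share of usual intakes exceeding c as m grows.

```python
import math
import random


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def exceedance_shares(c, m_values, n=2000, seed=1):
    rng = random.Random(seed)
    # Hypothetical parameter values, with no covariates for simplicity.
    lin1, lin2 = 0.0, 0.5                 # logistic part, log-intake part
    sd_u1, sd_u2, sd_e = 0.8, 0.2, 0.87   # random effects, within-person sd
    m_max = max(m_values)
    usual_hits = 0
    recall_hits = {m: 0 for m in m_values}
    for _ in range(n):
        u1 = rng.gauss(0.0, sd_u1)
        u2 = rng.gauss(0.0, sd_u2)
        p_consume = logistic(lin1 + u1)
        # Usual intake as in (18): consumption probability times the mean
        # amount on consumption days (log-normal mean).
        usual = p_consume * math.exp(lin2 + u2 + sd_e ** 2 / 2.0)
        usual_hits += usual > c
        intakes = []
        for _ in range(m_max):
            if rng.random() < p_consume:
                intakes.append(math.exp(rng.gauss(lin2 + u2, sd_e)))
            else:
                intakes.append(0.0)
        for m in m_values:
            recall_hits[m] += (sum(intakes[:m]) / m) > c
    return usual_hits / n, {m: recall_hits[m] / n for m in m_values}


true_share, naive = exceedance_shares(c=1.0, m_values=[1, 4, 100])
print(true_share, naive)
```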

To see this numerically, we ran the following simulation study. Gender, age, and the DHQ were kept the same as in the EATS. The parameters $(\beta_0,\beta_1,\beta_2,\sigma^2,\sigma_1^2,\sigma_2^2)$ were the same as our estimated values; see below. The function $\theta(\cdot)$ was roughly in accord with our estimated function, for simplicity, being quadratic in the logarithm of the DHQ, standardized to have minimum 0 and maximum 1.0, with intercept, slope, and quadratic parameters being .50, 1.50, and −.75, respectively. The true survival function, that is, 1 − the cdf, was computed analytically, while the survival functions for the mean of two 24-hour recalls and the mean of four 24-hour recalls were computed from 1,000 simulated datasets.

The results are given in Figure 1, where the bias from not using a model is evident.

We used our methods with a nonparametrically estimated function, a bandwidth h = .30, and the Epanechnikov kernel function. We generated 300 datasets, with results displayed in Figure 2. The mean over the simulation was almost exactly the correct function, not surprising given that the sample size is large (n = 886). In Figure 2 we also display a 90% confidence range from the simulated datasets, indicating that in the EATS data at least, the results of our approach are relatively accurate.

4.4 Data Analysis

We standardized age to have mean 0 and variance 1. In the

logistic part of the model, the intercept was estimated as −8.15,

with the coefficients for (gender, age, DHQ) = (.13,.14,1.09).

The random-effect variance was estimated as $\hat\sigma_1^2 = .66$. In the continuous part of the model, we used bandwidths ranging from .05 to .40, with little change in any of the estimates, as described in more detail in Section 5.1. With a bandwidth h = .30, our estimates were $\hat\sigma^2 = .76$, $\hat\sigma_2^2 = .043$, and the coefficients for gender and age were −.25 and .02, respectively. The coefficient for the person-specific random effect $\sigma_2^2$ appears intrinsic to the data: We used other methods such as mixed models with polynomial fits and obtained roughly the same answers.

Figure 1. Results of the Simulation Study Meant to Mimic the EATS Study. All results are averages over 1,000 simulated datasets. Plotted are the mean of the semiparametric estimator of the survival curve, which is almost identical to the true survival curve; the empirical survival function of the mean of two 24-hour recalls from 1,000 simulated datasets; and the empirical survival function of the mean of four 24-hour recalls from 1,000 simulated datasets.

We display the computed survival function in Figure 3. Dis-

played there are our method, along with the empirical survival

functions for the mean of the first two 24-hour recalls and the

mean of all four 24-hour recalls. While these are biased, it is

interesting to note that using the mean of only two 24-hour re-

calls is more different from our method than using the mean of

four 24-hour recalls, which is expected as described previously.

The similarity of Figures 1 and 3 is striking, mainly indicating

that naive approaches, such as using the mean of two 24-hour

recalls, can result in badly biased estimates of κ0.

Figure 2. Results of the Simulation Study Meant to Mimic the EATS

Study. Plotted is the mean survival function for 300 simulated datasets,

along with the 90% pointwise confidence intervals. The mean fitted func-

tion is almost exact.


Figure 3. Results From the EATS Example. Plotted are estimates of

the survival function (1 − the cdf) of usual intake of red meat. The solid

line is the semiparametric method described in Section 4. The dotted

line is the empirical survival function of the mean of the first two 24-hour

recalls per person, while the dashed line is the survival function of the mean

of all the 24-hour recalls per person.

5. BANDWIDTH SELECTION, THE PARTIALLY LINEAR

MODEL, AND THE SAMPLE MEAN

5.1 Bandwidth Selection

5.1.1 Background. We have used a standard first-order kernel density function, that is, one with mean 0 and positive variance. With this choice, in Theorem 1 we have assumed that the bandwidth satisfies $nh^4 \to 0$: for estimation of the population mean in the partially linear model. In contrast, if one were interested only in $B_0$, then it is well known that by using profile likelihood the usual bandwidth order $h \sim n^{-1/5}$ is acceptable, and off-the-shelf bandwidth selection techniques yield an asymptotically normal limit distribution.

The reason for the technical need for undersmoothing is the inclusion of $\theta_0(\cdot)$ in $\kappa_0$. For example, suppose that $\kappa_0 = E\{\theta_0(Z)\}$. Then it follows from (5) that $\hat\kappa - \kappa_0 = O_p(h^2 + n^{-1/2})$. Thus, in order for $n^{1/2}(\hat\kappa - \kappa_0) = O_p(1)$, we require that $nh^4 = O_p(1)$. The additional restriction that $nh^4 \to 0$ merely removes the bias term entirely.

Note that $\kappa_0$ is not a parameter in the model, being a mixture of the parametric part $B_0$, the nonparametric part $\theta_0(\cdot)$, and the joint distribution of (X,Z). Thus, it does not appear that $\kappa_0$ can be estimated by profiling ideas.
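The rate arithmetic behind the undersmoothing requirement can be made concrete with a short worked calculation. It assumes a bandwidth of the form $h = n^{-a}$; the exponent $a$ is introduced here purely for illustration and is not notation from the article.

```latex
\begin{align*}
n^{1/2}(\hat\kappa-\kappa_0)
  &= O_p\{n^{1/2}(h^2+n^{-1/2})\}
   = O_p(n^{1/2-2a}+1), \qquad h=n^{-a},\\
a=1/5:&\quad n^{1/2}h^2 = n^{1/10}\to\infty
  \quad\text{(bias dominates; the usual rate is too large)},\\
a=1/4:&\quad n^{1/2}h^2 = 1
  \quad\text{(bias is bounded but does not vanish, i.e., } nh^4=O(1)),\\
a=1/3:&\quad n^{1/2}h^2 = n^{-1/6}\to 0
  \quad\text{(bias vanishes, i.e., } nh^4\to 0).
\end{align*}
```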

5.1.2 Optimal Estimation. As seen in Theorem 1, the asymptotic distribution of $n^{1/2}(\hat\kappa - \kappa_0)$ is unaffected by the bandwidth, at least to first order. In Section 5.1.3 we give intuitive and numerical evidence of the lack of sensitivity to the bandwidth choice; see also Section 5.3 for further numerical evidence. In Section 5.1.4 we describe three different, simple practical methods for bandwidth selection in this problem, all of which work quite well in our simulations and example.

Because first-order calculations do not get squarely at the choice of bandwidth, other than to suggest that it is not particularly crucial, an alternative theoretical device is to do second-order calculations. Define $\eta(n,h) = n^{1/2}h^2 + (n^{1/2}h)^{-1}$. In a problem similar to ours, Sepanski, Knickerbocker, and Carroll (1994) showed that the variance of linear combinations of the estimate of $B_0$ has a second-order expansion as follows. Suppose we want to estimate $\xi^T B_0$. Then, for constants $(a_1,a_2)$,

$$n^{1/2}(\xi^T \hat B - \xi^T B_0) = V_n + o_p\{\eta(n,h)\},$$

$$\mathrm{cov}(V_n) = \text{constant} + \{a_1 n^{1/2}h^2 + a_2 (hn^{1/2})^{-1}\}^2.$$

This means that the optimal bandwidth is on the order of $h = cn^{-1/3}$ for a constant c depending on $(a_1,a_2)$, which, in turn, depend on the problem, that is, on the distribution of (Y,X,Z) as well as on $B_0$ and $\theta_0(\cdot)$. In their practical implementation, translated from the Gaussian kernel function to our Epanechnikov kernel function, Sepanski et al. (1994) suggested the following device, namely, that if the optimal bandwidth for estimating $\theta_0(\cdot)$ is $h_o = cn^{-1/5}$, then one should use the correct-order bandwidth $h = cn^{-1/3}$. They also did sensitivity analysis, for example, $h = (1/2)cn^{-1/3}$, but found little change in their simulations. One of our three methods of practical bandwidth selection is exactly this one.

A problem not within our framework but carrying a similar flavor was considered by Powell and Stoker (1996) and Newey, Hsieh, and Robins (2004), namely, the estimation of the weighted average derivative $\kappa_{AD} = E\{Y\theta_0^{(1)}(Z)\}$. As done by Sepanski et al. (1994), Powell and Stoker (1996) showed that the optimal bandwidth constructed from second-order calculations is an undersmoothed bandwidth. Newey et al. (2004) suggested that a simple device of choosing the bandwidth is to choose something optimal when using a standard second-order kernel function but to then undersmooth, in effect, by using a higher order kernel such as the twicing kernel. This is our second bandwidth selection method described in Section 5.1.4. Like the first, it appears to be an effective means of eliminating the bias term.

In our problem, the article by Sepanski et al. (1994) is more relevant. Preliminary calculations based on the basic tools in that article suggest that for our problem, the optimal bandwidth is also of order $n^{-1/3}$. We intend to pursue these very calculations in another article.
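The claim that the second-order term is minimized by a bandwidth of order $n^{-1/3}$ can be checked numerically. The following is a minimal sketch, assuming positive constants $a_1$ and $a_2$ (the values used are hypothetical, as are the function names); it compares a brute-force grid minimizer of $\{a_1 n^{1/2}h^2 + a_2(hn^{1/2})^{-1}\}^2$ with the closed form obtained by setting the derivative in h to zero.

```python
import numpy as np

def second_order_term(h, n, a1, a2):
    """Bandwidth-dependent part of cov(V_n): {a1 n^(1/2) h^2 + a2 (h n^(1/2))^(-1)}^2."""
    return (a1 * np.sqrt(n) * h**2 + a2 / (h * np.sqrt(n))) ** 2

def h_closed_form(n, a1, a2):
    """Minimizer when a1, a2 > 0: solve 2 a1 n^(1/2) h = a2 / (h^2 n^(1/2))."""
    return (a2 / (2.0 * a1 * n)) ** (1.0 / 3.0)

def h_grid_search(n, a1, a2, num=200001):
    """Brute-force minimizer over a fine log-spaced grid on [1e-4, 1]."""
    grid = np.logspace(-4, 0, num)
    return float(grid[np.argmin(second_order_term(grid, n, a1, a2))])

a1, a2 = 1.0, 1.0  # hypothetical constants, chosen only for illustration
for n in (100, 800, 6400):
    # Multiplying n by 8 should halve the minimizing bandwidth (rate n^(-1/3)).
    print(n, h_closed_form(n, a1, a2), h_grid_search(n, a1, a2))
```

If $a_1$ and $a_2$ had opposite signs the bracketed term could be driven to zero, so the closed form above applies only under the positivity assumption stated in the code.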

5.1.3 Lack of Sensitivity to Bandwidth. We have used the term technical need for undersmoothing because that is what it really is. In practice, as Theorem 1 states, the asymptotic distribution of $\hat\kappa$ is unaffected by bandwidth choice for very broad ranges of bandwidths. This is totally different from what happens with estimation of the function $\theta_0(\cdot)$, where bandwidth selection is typically critical in practice, and this is seen in theory through the usual bias–variance tradeoff.

In practice, we expect little effect of the bandwidth selection on estimation of $B_0$, and even less effect on estimation of $\kappa_0$. The reason is that broad ranges of bandwidths lead to no asymptotic effect on the distribution of $\hat B$. The extra amount of smoothing inherent in the summation in (4) should mean that $\hat\kappa$ will be even less sensitive to the bandwidth, the so-called double-smoothing phenomenon.

To see this issue, consider the simulation in Wang et al. (2004). They set X and Z to be independent, with X = Normal(1,1) and Z = Uniform[0,1]. In the partially linear


model, they set $B_0 = 1.5$, $\epsilon$ = Normal(0,1), and $\theta_0(z) = 3.2z^2 - 1$. They used the kernel function $(15/16)(1 - z^2)^2 I(|z| \le 1)$, and they fixed the bandwidth to be $h = n^{-2/3}$, which at least asymptotically is very great undersmoothing, because $h \sim n^{-1/3}$ is already acceptable and typically something like $nh^2/\log(n) \to \infty$ is required. In their case 3, they used

effective sample sizes for complete data of 18, 36, and 60, with

corresponding bandwidths .146, .092, and .065, respectively.

We reran the simulation of Wang et al. (2004), with com-

plete response data and n = 60. We used bandwidths .02, .06,

.10, and .14, ranging from a very small bandwidth, less than

1/3 that used by Wang et al. (2004), to a larger bandwidth,

more than double that used. As another perspective, if one sets $h = \sigma_z n^{-c}$, where $\sigma_z$ is the standard deviation of Z, then the bandwidths used are equivalent to c = .73, .46, .34, and .26. In other words, a bandwidth here of h = .02 is very great undersmoothing, while even h = .14 satisfies the theoretical constraint on the bandwidth.

In Figure 4 we plot the results for a single dataset, where, as

in Wang et al. (2004), interest lies in estimating κ0= E(Y). As

is obvious from this figure, the bandwidth choice is very impor-

tant for estimation of the function, but trivially unimportant for

estimation of κ0, the estimate of which ranged from 1.818 to

1.828.

In Figure 5 we plot the mean estimated functions from 100

simulated datasets. Again, the bandwidth matters a great deal

for estimating the function θ0(·). Again, too, the bandwidth

matters hardly at all for estimating κ0. Thus, for estimating κ0,

the mean estimates across the bandwidths range from 1.513 to

1.526, and the standard deviations of the estimates range from

.249 to .252. There is somewhat more effect of bandwidth on

the estimate of B0: For h ≥ .06, there is almost no effect, but

choosing h = .02 results in a 50% increase in standard devia-

tion.

In other words, as expected by theory and intuition, band-

width selection has little effect on the estimate of B0, except

when the bandwidth is much too small, and very little effect

on the estimation of κ0= E(Y). Similar results occur when one

looks at the variance of the errors as the parameter, and κ0is

the population variance.
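The double-smoothing phenomenon described above can be explored with a small simulation. The sketch below is not the authors' code: it assumes a local-constant (Nadaraya–Watson) smoother in place of the local likelihood fit used in the article, a profile-type estimate of $B_0$ from smoothed residuals, an arbitrary random seed, and hypothetical helper names. It generates one dataset from the design quoted in the text (n = 60, X = Normal(1,1), Z = Uniform[0,1], $B_0 = 1.5$, $\theta_0(z) = 3.2z^2 - 1$) and reports $\hat B$, $\hat\kappa$ for $\kappa_0 = E(Y)$, and a roughness measure of the fitted curve for each bandwidth.

```python
import numpy as np

def quartic_kernel(u):
    """Kernel (15/16)(1 - u^2)^2 I(|u| <= 1) quoted in the text."""
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)

def smoother_matrix(z, h):
    """Row-normalized kernel weights (local-constant smoother; an assumption here)."""
    w = quartic_kernel((z[:, None] - z[None, :]) / h)
    return w / w.sum(axis=1, keepdims=True)  # diagonal is K(0) > 0, so rows never sum to 0

def fit_plm(y, x, z, h):
    """Return (B_hat, theta_hat at the Z_i, kappa_hat) for Y = X*B + theta(Z) + eps."""
    s = smoother_matrix(z, h)
    y_res, x_res = y - s @ y, x - s @ x
    b_hat = float(x_res @ y_res / (x_res @ x_res))
    theta_hat = s @ (y - x * b_hat)
    kappa_hat = float(np.mean(x * b_hat + theta_hat))  # plug-in estimate of E(Y)
    return b_hat, theta_hat, kappa_hat

def roughness(theta_hat, z):
    """Total variation of the fitted curve along sorted Z."""
    return float(np.sum(np.abs(np.diff(theta_hat[np.argsort(z)]))))

rng = np.random.default_rng(0)  # arbitrary seed
n = 60
x = rng.normal(1.0, 1.0, n)
z = rng.uniform(0.0, 1.0, n)
y = 1.5 * x + 3.2 * z**2 - 1.0 + rng.normal(0.0, 1.0, n)

results = {h: fit_plm(y, x, z, h) for h in (0.02, 0.06, 0.10, 0.14)}
for h, (b_hat, theta_hat, kappa_hat) in results.items():
    print(f"h={h:.2f}  B_hat={b_hat:.3f}  kappa_hat={kappa_hat:.3f}  "
          f"roughness={roughness(theta_hat, z):.2f}")
```

Under these assumptions one would expect the roughness of the fitted curve to change markedly with h while $\hat\kappa$ stays close to the sample mean of Y; the exact numbers will differ from those in Figures 4 and 5 because the smoother and the random draws are not the ones used in the article.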

5.1.4 Bandwidth Selection. As described in Section 5.1.3, bandwidth selection is not a vital issue for estimating $\kappa_0$: Of course, it is vital for estimating $\theta_0(\cdot)$. Effectively, what this means is that the real need is simply to get bandwidths that satisfy the technical assumption of undersmoothing but are not too ridiculously small: A precise target is often unnecessary. In addition, because the asymptotic distribution of $\hat\kappa$ does not depend on the bandwidth, simple first-order methods of the type that are used in bandwidth selection for function estimation are not possible. Thus, in our example, we used three different methods, all of which gave answers that were as nearly identical as in the simulation of Wang et al. (2004).

All the methods are based on a so-called “typical device” to get an optimal bandwidth for estimating $\theta_0$, of the form $h_{opt} = c\sigma_z n^{-1/5}$. In practice, this can be accomplished by constructing a finite grid of bandwidths of the form $h_{grid} = c_{grid}\sigma_z n^{-1/5}$: We use a grid from .20 to 5.0. After estimating $B_0$ by $\hat B(h_{grid})$, this value is fixed, and then a log-likelihood cross-validation score is obtained. The maximizer of the log-likelihood cross-validation score is selected as $h_{opt}$.

• If $h_{opt} = c\sigma_z n^{-1/5}$, an extremely simple device is simply to set $h = h_{opt} n^{-2/15} = c\sigma_z n^{-1/3}$, which satisfies the technical condition of undersmoothing without becoming ridiculously small. This device may seem terribly ad hoc, but the theory, the simulation of Wang et al. (2004), the discussion in Section 5.1.3, and our own work suggest that this method actually works reasonably well. Note, too, that in Section 5.1.2 we give evidence that this bandwidth rate is most likely optimal.

• A second approach is taken by Newey et al. (2004) and is also an effective practical device. The technical need for undersmoothing comes from the fact that the bias term in a first-order local likelihood kernel regression is of order $O(h^2)$. One can use higher order kernels to get the bias to be of order $O(h^{2s})$ for $s \ge 2$, but this does not really help in that the variance remains of order $O\{(nh)^{-1}\}$, so that the optimal mean squared error kernel estimator has $h = O\{n^{-1/(4s+1)}\}$, and thus undersmoothing to estimate $\kappa_0$ is still required. However, as Newey et al. (2004) pointed out, if one uses the optimal bandwidth $h_{opt} = c\sigma_z n^{-1/5}$, but then does the estimation procedure replacing the first-order kernel by a higher order kernel, then the bias is $O(h_{opt}^{2s}) = o(n^{-1/2})$ if $s \ge 2$. A convenient higher order kernel is the second-order twicing kernel $K_{tw}(u) = 2K(u) - \int K(u - v)K(v)\,dv$, where $K(\cdot)$ is a first-order kernel.

• One can also use log-likelihood cross-validation, but with the grid of values being of the form $h_{grid} = c_{grid}\sigma_z n^{-1/3}$. Because cross-validation scores often have multiple modes, this is not the same as optimal smoothing.
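The first two devices in the list above can be sketched in a few lines. This is an illustrative sketch only: the constants `c` and `sigma_z`, the sample size, the grid spacing, and all function names are assumptions made here, and the Epanechnikov kernel is used as the first-order kernel K. The code rescales an optimal-rate bandwidth by $n^{-2/15}$ and tabulates the twicing kernel $K_{tw}(u) = 2K(u) - \int K(u-v)K(v)\,dv$ by numerical convolution, then checks that it integrates to one and has a vanishing second moment, which is what makes it a higher order kernel.

```python
import numpy as np

def undersmooth(h_opt, n):
    """First device: h = h_opt * n^(-2/15), turning rate n^(-1/5) into n^(-1/3)."""
    return h_opt * n ** (-2.0 / 15.0)

def epanechnikov(u):
    """First-order kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def twicing_kernel_on_grid(du=1e-3):
    """Tabulate K_tw(u) = 2 K(u) - (K * K)(u) on [-2, 2] via a discrete convolution."""
    m = int(round(1.0 / du))
    v = np.arange(-m, m + 1) * du          # grid on [-1, 1], support of K
    u = np.arange(-2 * m, 2 * m + 1) * du  # grid on [-2, 2], support of K * K
    conv = np.convolve(epanechnikov(v), epanechnikov(v)) * du
    return u, 2.0 * epanechnikov(u) - conv

n, c, sigma_z = 500, 1.0, 0.29            # hypothetical values for illustration
h_opt = c * sigma_z * n ** (-1.0 / 5.0)   # optimal-rate bandwidth for theta_0
h = undersmooth(h_opt, n)                 # equals c * sigma_z * n^(-1/3)

u, k_tw = twicing_kernel_on_grid()
du = u[1] - u[0]
mass = float(k_tw.sum() * du)                     # should be close to 1
second_moment = float((u**2 * k_tw).sum() * du)   # should be close to 0
print(h_opt, h, mass, second_moment)
```

The vanishing second moment follows because the second moment of the convolution $K*K$ is twice that of K, so it cancels against the $2K(u)$ term.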

It may be worth pointing out again that Wang et al. (2004) set

h = n−2/3, and even then, with too much undersmoothing (as-

ymptotically), the performance of the method is rather good.

5.2 Efficiency and Robustness of the Sample Mean

In general problems with complete data, with no assumptions about the response Y other than that it has a second moment, the sample mean $\bar{Y}$ is semiparametric efficient for estimating the population mean $\kappa_0 = E(Y)$; see, for example, Newey (1990).

Somewhat remarkably, Wang et al. (2004) showed that in the

partially linear model with Gaussian errors, with complete data

the sample mean is still semiparametric efficient. This fact is

crucial, of course, in establishing that with missing response

data, their estimators are still semiparametric efficient.

It is clear that with complete data, the sample mean will not

be semiparametric efficient for all semiparametric likelihood

models. Simple counterexamples abound, for example, the par-

tially linear model for Laplace or t errors. More complex exam-

ples can be constructed, for example, the partially linear model

in the Gamma family with log-linear mean exp{XTB0+θ0(Z)}:

Details follow from Lemma 4.

The model robustness of the sample mean for estimating

the population mean in complete data is nonetheless a pow-

erful feature. It is, therefore, of considerable interest to know

whether there are cases of semiparametric likelihood problems

where the sample mean is still semiparametric efficient and,

thus, would be used because of its model robustness. It turns

out that such cases exist. In particular, the sample mean for

complete response data is semiparametric efficient in canoni-

cal exponential families with partially linear form.


Figure 4. Results for a Single Dataset in a Simulation as in Wang et al. (2004), the Partially Linear Model With n = 60, Complete Response Data, and When $\kappa_0 = E(Y)$. Various bandwidths are used, and the estimates of the function $\theta_0(\cdot)$ are displayed. Note how the bandwidth has a major impact on the function estimate when the bandwidth is too small (h = .02), but very little effect on the estimate of $\kappa_0$. (h = .02, κ = 1.828; h = .06, κ = 1.818; h = .10, κ = 1.826; h = .14, κ = 1.818.)

Figure 5. Results for 100 Simulated Datasets in a Simulation as in Wang et al. (2004), the Partially Linear Model With n = 60 and Complete Response Data. Various bandwidths are used, and the mean estimates of the function $\theta_0(\cdot)$ are displayed. Note how even over these simulations, the bandwidth has a clear impact on the function estimate: There is almost no impact on estimates of the population mean and variance. (Curves are shown for h = .02, h = .06, h = .10, and h = .14.)


Lemma 3. Recall that $\epsilon$ is defined in (8). If there are no missing data, the sample mean is a semiparametric efficient estimator of the population mean only if

$$Y - E(Y|X,Z) = E(Y\epsilon^T)M_1^{-1}\epsilon + L_\theta(\cdot)\frac{E\{YL_\theta(\cdot)|Z\}}{E\{L_\theta^2(\cdot)|Z\}}. \qquad (19)$$

It is interesting to consider (19) in the special case of exponential families with likelihood function

$$f(y|x,z) = \exp\left[\frac{yc\{\eta(x,z)\} - C[c\{\eta(x,z)\}]}{\phi} + D(y,\phi)\right], \qquad (20)$$

where $\eta(x,z) = x^T\beta_0 + \theta_0(z)$, so that $E(Y|X,Z) = C^{(1)}[c\{\eta(X,Z)\}] = \mu\{\eta(X,Z)\} = \mu(X,Z)$ and $\mathrm{var}(Y|X,Z) = \phi C^{(2)}[c\{\eta(X,Z)\}] = \phi V[\mu\{\eta(X,Z)\}]$.

As it turns out, effectively, (19) holds and the sample mean is semiparametric efficient only in the canonical exponential family for which c(t) = t. More precisely, we show in Section A.6 the following result.

Lemma 4. If there are no missing data, under the exponential model (20), the sample mean is a semiparametric efficient estimate of the population mean if $\partial c\{X^T\beta + \theta(Z)\}/\partial\theta(Z)$ is a function only of Z for all β, for example, the canonical exponential family. Otherwise, the sample mean is generally not semiparametric efficient: The precise condition is given in (A.29) in the Appendix. In particular, outside the canonical exponential family, the only possibility for the sample mean to be semiparametric efficient is that, for some known (a,b), $c\{x^T\beta + \theta(z)\} = a + b\log\{x^T\beta + \theta(z)\}$.
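As a rough illustration of the sufficient condition in Lemma 4, the following worked check contrasts the canonical case with the Gamma family with log-linear mean mentioned in Section 5.2. The Gamma parameterization used below, $C(c) = -\log(-c)$, is a standard one assumed here for illustration and is not taken from the article.

```latex
\begin{align*}
\text{Canonical family: }& c(t) = t
  \;\Rightarrow\; \frac{\partial c\{X^T\beta+\theta(Z)\}}{\partial\theta(Z)} = 1,
  \text{ which is free of } X.\\[4pt]
\text{Gamma, log-linear mean: }& C(c) = -\log(-c),\quad
  C^{(1)}(c) = -\frac{1}{c} = \mu = \exp(\eta)
  \;\Rightarrow\; c(\eta) = -\exp(-\eta),\\
& \frac{\partial c\{X^T\beta+\theta(Z)\}}{\partial\theta(Z)}
  = \exp\{-X^T\beta-\theta(Z)\},
  \text{ which depends on } X \text{ unless } \beta = 0.
\end{align*}
```

In the second case the derivative is not a function of Z alone, and $c(\eta) = -\exp(-\eta)$ is not of the form $a + b\log(\eta)$, which is consistent with the Gamma log-linear model being listed as a counterexample.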

Remark 5. We consider Lemmas 3 and 4 to be positive re-

sults, although an earlier version of the paper had a misplaced

emphasis. Effectively, we have characterized the cases, with

complete data, that the sample mean is both model free and

semiparametric efficient. In these cases, one would use the sam-

ple mean, or perhaps a robust version of it, rather than fit a po-

tentially complex semiparametric model that can do no better

and if that model is incorrect, can incur nontrivial bias.

5.3 Numerical Experience and Theoretical Insights

in the Partially Linear Model, and Some

Tentative Conclusions

In responding to a referee about the estimation of the popu-

lation mean in the partially linear model (1), we collect here

a few remarks based on our numerical experience. Because

the problem of estimating the population mean is the prob-

lem focused on by Chen et al. (2003), we focus on the simu-

lation setup in their article, although some of the conclusions

we reach may be supportable in general cases. To remind the

reader, in their simulation, X and Z are independent, with X = Normal(0,1), Z = Uniform[0,1], β = 1.5, $\theta(z) = 3.2z^2 - 1$, and $\epsilon$ = Normal(0,1).

5.3.1 Can Semiparametric Methods Improve Upon the Sample Mean? When there are missing response data, the simulations in Wang et al. (2004) show conclusively that substantial gains in efficiency can be made over using the sample mean of the observed responses alone. In addition, if missingness depends on (X,Z), the sample mean of the observed responses will be biased.

This leaves the issue of what happens when there are no missing data. Obviously, if one thought that $\epsilon$ were normally distributed, it would be delusional to use anything other than the sample mean, it being efficient.

Theoretically, some insight can be gained by the following considerations. Suppose that X and Z are independent. Suppose also that $\epsilon$ has a symmetric density function known up to a scale parameter. Let $\sigma_\epsilon^2$ be the variance of $\epsilon$ and let $\zeta \le \sigma_\epsilon^2$ be the inverse of the Fisher information for estimating the mean in the model $Y = \mu + \epsilon$. Then, it can be shown that $E\{F_B(\cdot)\} = 0$, that $\theta_B(z,B) = 0$, and that the asymptotic mean squared error (MSE) efficiency of the semiparametric efficient estimate of the population mean compared to the sample mean is

$$\text{MSE efficiency of sample mean} = \frac{\beta^2\mathrm{var}(X) + \mathrm{var}\{\theta(Z)\} + \zeta}{\beta^2\mathrm{var}(X) + \mathrm{var}\{\theta(Z)\} + \sigma_\epsilon^2} \le 1.$$

Note that there are cases where $\zeta/\sigma_\epsilon^2$ may be quite small, especially when $\epsilon$ is heavy tailed, so that if β = 0 and θ(·) is approximately constant, the MSE efficiency of the sample mean would be $\zeta/\sigma_\epsilon^2$, and then substantial gains in efficiency could be made. However, the usual motivation for fitting semiparametric models is that the regression function is not constant, in which case the MSE efficiency gain will be attenuated toward 1.0, often dramatically.

We conclude then that with no missing data, in the partially

linear model, substantial improvements upon the sample mean

will be realized mainly when the regression errors are heavy

tailed and the regression signal is slight.
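The attenuation toward 1.0 can be seen by evaluating the efficiency ratio numerically. The sketch below assumes Laplace errors, which are not a case worked out in this section: for a Laplace density with scale b the variance is $2b^2$ and the Fisher information for location is $1/b^2$, so that $\zeta/\sigma_\epsilon^2 = 1/2$. The function name and the chosen inputs are illustrative; the signal values follow the simulation design quoted above (β = 1.5, var(X) = 1, $\theta(z) = 3.2z^2 - 1$ with Z uniform on [0,1]).

```python
def mse_efficiency(beta, var_x, var_theta, zeta, sigma2_eps):
    """Ratio {beta^2 var(X) + var(theta) + zeta} / {beta^2 var(X) + var(theta) + sigma_eps^2}."""
    signal = beta**2 * var_x + var_theta
    return (signal + zeta) / (signal + sigma2_eps)

# Laplace errors with unit variance (assumption): sigma_eps^2 = 2 b^2 = 1, zeta = b^2 = 1/2.
sigma2_eps = 1.0
zeta = sigma2_eps / 2.0

# var{theta(Z)} for theta(z) = 3.2 z^2 - 1, Z ~ Uniform[0,1]: 3.2^2 * (E Z^4 - (E Z^2)^2).
var_theta = 3.2**2 * (1.0 / 5.0 - 1.0 / 9.0)

no_signal = mse_efficiency(0.0, 1.0, 0.0, zeta, sigma2_eps)
with_signal = mse_efficiency(1.5, 1.0, var_theta, zeta, sigma2_eps)
print(f"no regression signal:  {no_signal:.3f}")
print(f"simulation-like signal: {with_signal:.3f}")
```

With no regression signal the ratio equals $\zeta/\sigma_\epsilon^2$, while with a signal of the size used in the simulation design it moves much closer to 1, in line with the conclusion of this subsection.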

We point out that in the example that motivated this work

(Sec. 4), there is no simple analog to the sample mean, one that

could avoid fitting models to the data.

5.3.2 How Critical Are Our Assumptions on Z? We have made two assumptions on Z: It has a compact support, and its density function is positive on that support. We have indicated in Section A.1.2 that all general articles in the semiparametric kernel-based literature make this assumption and that it appears to be critical for deriving asymptotic results for problems such as our example in Section 4. It is certainly well beyond our capabilities to weaken this assumption as it applies to problems such as our motivating example.

The condition that the density of Z be bounded away from 0 warns users that the method will deteriorate if there are a few sparsely observed outlying Z values; see below for numerical evidence of this phenomenon.

Estimation in subpopulations formed by compact subsets of Z can also be of considerable interest in practice, and these compact subsets can be chosen to avoid density sparseness and meet our assumptions. A simple example might be where Z is age, and one might be interested in population summaries for those in the 40- to 60-year age range.

The partially linear model is a special case, however, because all estimates are explicit and what few Taylor expansions are necessary simplify tremendously. That is, the estimates are simple functions of sums of random variables. Cheng (1994) considered a different problem where there is no X and where local constant estimation of the nonparametric function is used, rather than local linear estimation, so that $\hat\theta(z_0) = \sum_{i=1}^n K_h(Z_i - z_0)Y_i / \sum_{i=1}^n K_h(Z_i - z_0)$. He indicated that the essential condition for this case is that the tails of the density of Z decay exponentially fast.

We tested this numerically in the normal-based simulation of Wang et al. (2004) with the sample size of n = 500: Similar results were found with n = 100. We used the Epanechnikov kernel and estimated the bandwidth using the following methods. First, we regressed Y and X separately on Z, using the direct plug-in (DPI) bandwidth selection method of Ruppert, Sheather, and Wand (1995) to form different estimated bandwidths on each. We then calculated the residuals from these fits and regressed the residual in Y on the residual in X to get a preliminary estimate $\hat\beta_{start}$ of β. Following this, we regressed $Y - X^T\hat\beta_{start}$ on Z to get a common bandwidth, then undersmoothed it by multiplication by $n^{-2/15}$ to get a bandwidth of order $n^{-1/3}$ to eliminate bias, and then reestimated β and θ(·).

We found that for various Beta distributions on Z, for example, the Beta(2, 1) that violates our assumptions, the sample mean and the semiparametric efficient method were equally efficient. The same occurs for the case that Z is normally distributed. However, when Z has a t distribution with 9 degrees of freedom, the sample mean greatly outperforms the undersmoothed estimator (MSE efficiency ≈ 2.0), which, in turn, outperformed the method that did not employ undersmoothing (MSE efficiency ≈ 2.5). An interesting quote from Ma, Chiou, and Wang (2006) is relevant here: Also operating in a partial linear model, they stated “This condition enables us to simplify asymptotic expression of certain sums of functions of variables...also excludes pathological cases where the number of observations in a window defined by the bandwidth may not increase to infinity when n → ∞.”

We conclude that if the design density in Z is at all heavy tailed, then the semiparametric methods will be badly affected. If such a phenomenon happens in the simple case of the partially linear model, it is likely to hold in most other cases. Otherwise, in practice at least, as long as there are no design “stragglers,” the assumption is likely to be one required by the technicalities of the problem. How well this generalizes to complex nonlinear problems is unknown.
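The role of design “stragglers” can be illustrated with the local constant estimator written out above. The sketch below is a toy illustration and not a reproduction of the reported simulation: the seed, the bandwidth constant (taken as 1), the way the most outlying design point is picked, and the helper names are all assumptions made here. It counts how many observations fall in the kernel window around the most outlying Z value for a uniform design and for a t design with 9 degrees of freedom.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel 0.75 (1 - u^2) on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_constant(z0, z, y, h):
    """theta_hat(z0) = sum_i K_h(Z_i - z0) Y_i / sum_i K_h(Z_i - z0)."""
    w = epanechnikov((z - z0) / h)
    return float(w @ y / w.sum())  # the denominator is a (scaled) density estimate at z0

def points_in_window(z0, z, h):
    """Number of design points within one bandwidth of z0 (including z0 itself)."""
    return int(np.sum(np.abs(z - z0) <= h))

rng = np.random.default_rng(1)  # arbitrary seed
n = 500
designs = {"uniform": rng.uniform(0.0, 1.0, n), "t9": rng.standard_t(9, n)}

counts = {}
for name, z in designs.items():
    h = z.std() * n ** (-1.0 / 3.0)  # undersmoothed rate with constant 1 (assumption)
    z_extreme = z[np.argmax(np.abs(z - np.median(z)))]  # most outlying design point
    counts[name] = points_in_window(z_extreme, z, h)
    print(f"{name}: h={h:.3f}, points in window at the extreme Z = {counts[name]}")
```

With a compactly supported design the window at the extreme point still contains many observations, whereas with a heavier-tailed design it typically contains little more than the point itself, so the denominator of the estimator is driven by very few observations there.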

6. DISCUSSION

In this article we considered the problem of estimating

population-level quantities κ0such as the mean, variance, and

probabilities. Previous literature on the topic applies only to

the simple special case of estimating a population mean in the

Gaussian partially linear model. The problem was motivated by an important issue in nutritional epidemiology, estimating the distribution of usual intake for episodically consumed food, where we considered a zero-inflated mixture measurement error model: Such a problem is very different from the partially linear model, and the main interest is not in the population mean.

The key feature of the problem that distinguishes it from most work in semiparametric modeling is that the quantities of interest are based on both the parametric and the nonparametric parts of the model. Results were obtained for two general classes of semiparametric models: (1) general semiparametric regression models depending on a function $\theta_0(Z)$ and (2) generalized linear single-index models. Within these semiparametric frameworks, we suggested a straightforward estimation methodology, derived its limiting distribution, and showed semiparametric efficiency. An interesting part of the approach is that we also allow for partially missing responses.

In the case of standard semiparametric models, we have con-

sidered the case where the unknown function θ0(Z) is a scalar

function of a scalar argument. The results, though, readily ex-

tend to the case of a multivariate function of a scalar argument.

We have also assumed that $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ and F(·) are scalar, which, in principle, excludes the estimation of the population variance and standard deviation. It is, however, readily seen that both F(·) and $\kappa_0$ or $\kappa_{SI}$ can be multivariate, and, hence, the obvious modification of our estimates is semiparametric efficient.

APPENDIX: SKETCH OF TECHNICAL ARGUMENTS

In what follows, the arguments for L and its derivatives are in the

form L(·) = L{Y,X,B0,θ0(Z)}. The arguments for F and its deriva-

tives are F(·) = F{X,θ0(Z),B0}.

Also, note that in our arguments about semiparametric efficiency,

we use the symbol d exactly as it was used by Newey (1990). It does

not stand for differential.

A.1 Assumptions and Remarks

A.1.1 General Considerations. The main results needed for the asymptotic distribution of our estimator are (5) and (6). The single-index model assumptions were given already in Carroll et al. (1997).

Results (5) and (6) hold under smoothness and moment conditions for the likelihood function and under smoothness and boundedness conditions for θ(·). The strength of these conditions depends on the generality of the problem. For the partially linear Gaussian model of Wang et al. (2004), because the profile likelihood estimator of β is an explicit function of regressions of Y and X on Z, the conditions are simply conditions about uniform expansions for kernel regression estimators, as in, for example, Claeskens and Van Keilegom (2003). For generalized partially linear models, Severini and Staniswalis (1994) gave a series of moment and smoothness conditions toward this end. For general likelihood problems, Claeskens and Carroll (2007) stated that the conditions needed are as follows.

(C1) The bandwidth sequence $h_n \to 0$ as $n \to \infty$ in such a way that $nh_n/\log(n) \to \infty$ and $h_n \ge \{\log(n)/n\}^{1-2/\lambda}$ for $\lambda$ as in condition (C4).

(C2) The kernel function K is a symmetric, continuously differentiable pdf on [−1,1] taking on the value 0 at the boundaries. The design density f(·) is differentiable on an interval $\mathcal{B} = [b_1,b_2]$, the derivative is continuous, and $\inf_{z\in\mathcal{B}} f(z) > 0$. The function θ(·,B) has two continuous derivatives on $\mathcal{B}$ and is also twice differentiable with respect to B.

(C3) For $B \ne B'$, the Kullback–Leibler distance between $L\{\cdot,\cdot,B,\theta(\cdot,B)\}$ and $L\{\cdot,\cdot,B',\theta(\cdot,B')\}$ is strictly positive. For every (y,x), third partial derivatives of $L\{y,x,B,\theta(z)\}$ with respect to B exist and are continuous in B. The fourth partial derivative exists for almost all (y,x). Further, mixed partial derivatives

$$\frac{\partial^{r+s}}{\partial B^r\partial v^s}L\{y,x,B,v\}\Big|_{v=\theta(z)},$$

with $0 \le r,s \le 4$, $r + s \le 4$, exist for almost all (y,x) and

$$E\left\{\sup_B\sup_v\left|\frac{\partial^{r+s}}{\partial B^r\partial v^s}L\{y,x,B,v\}\right|^2\right\} < \infty.$$

The Fisher information, G(z), possesses a continuous derivative and $\inf_{z\in\mathcal{B}} G(z) > 0$.

(C4) There exists a neighborhood $N\{B_0,\theta_0(z)\}$ such that

$$\max_{k=1,2}\,\sup_{z\in\mathcal{B}}\left\|\sup_{(B,\theta)\in N\{B_0,\theta_0(z)\}}\left|\frac{\partial^k}{\partial\theta^k}\log\{L(Y,X,B,\theta)\}\right|\right\|_{\lambda,z} < \infty$$

for some $\lambda \in (2,\infty]$, where $\|\cdot\|_{\lambda,z}$ is the $L^\lambda$-norm, conditional on Z = z. Further,

$$\sup_{z\in\mathcal{B}} E_z\left[\sup_{(B,\theta)\in N\{B_0,\theta_0(z)\}}\left|\frac{\partial^3}{\partial\theta^3}\log\{L(Y,X,B,\theta)\}\right|\right] < \infty.$$


The preceding regularity conditions are the same as those used in a lo-

cal likelihood setting where one wishes to obtain strong uniform con-

sistency of the local likelihood estimators. Condition (C3) requires the

fourth partial derivative of the log profile likelihood to have a bounded

second moment; it further requires the Fisher information matrix to

be invertible and to be differentiable with respect to z. Condition (C4)

requires a bound on the first and second derivatives of the log profile

likelihood and of the first moment of the third partial derivative, in a

neighborhood of the true parameter values.

A.1.2 Compactly Supported Z. Multiple reviewers of earlier drafts of this article commented that the assumption that Z be compactly supported with density positive on this support is too strong.

However, this assumption is completely standard in the kernel-based semiparametric literature for estimation of $B_0$, because it is needed for uniform expansions for estimation of $\theta_0(\cdot)$. The assumption was made in the founding articles on semiparametric likelihood estimation (Severini and Wong 1992, p. 1875, part e); the first article on generalized linear models (Severini and Staniswalis 1994, p. 511, assumption D); the first article on efficient estimation of partially linear single index models (Carroll et al. 1997, p. 485, condition 2a); and the precursor article to ours that was focused on estimation of the population mean in a partially linear model (Wang et al. 2004, p. 341, condition C.T). The uniform expansions for local likelihood given in Claeskens and Van Keilegom (2003) also make this assumption; see their page 1869, condition R0. Thus, our assumption on the design density of Z is a standard one.

The reason this assumption is made has to do with kernel technology, where proofs generally require a uniform expansion for the kernel regression or at least uniform in all observed values of Z, which is the same thing. The Nadaraya–Watson estimator, for example, has a denominator that is a density estimate, and the condition on Z stops this denominator from getting too close to 0. Ma et al. (2006), who made the same assumption (their condition 6 on p. 83), stated that it is necessary to avoid “pathological cases.”

A.2 Proof of Theorem 1

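A.2.1 Asymptotic Expansion. We first show (9). First, note that L is a log-likelihood function conditioned on (X,Z), so that we have

$$E\{\delta L_{\theta\theta}(\cdot)|X,Z\} = -E\{\delta L_\theta(\cdot)L_\theta(\cdot)|X,Z\},$$

$$E\{\delta L_{\theta B}(\cdot)|X,Z\} = -E\{\delta L_\theta(\cdot)L_B(\cdot)|X,Z\}. \qquad (A.1)$$

By a Taylor expansion,

$$n^{1/2}(\hat\kappa - \kappa_0) = n^{-1/2}\sum_{i=1}^n\left[F_i(\cdot) - \kappa_0 + \{F_{iB}(\cdot) + F_{i\theta}(\cdot)\theta_B(Z_i,B_0)\}^T(\hat B - B_0) + F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0) - \theta_0(Z_i)\}\right] + o_p(1)$$

$$= M_2^T n^{1/2}(\hat B - B_0) + n^{-1/2}\sum_{i=1}^n\left[F_i(\cdot) - \kappa_0 + F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0) - \theta_0(Z_i)\}\right] + o_p(1).$$

Because $nh^4 \to 0$, using (5), we see that

$$n^{-1/2}\sum_{i=1}^n F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0) - \theta_0(Z_i)\}$$

$$= -n^{-1/2}\sum_{i=1}^n F_{i\theta}(\cdot)\,n^{-1}\sum_{j=1}^n \delta_j K_h(Z_j - Z_i)L_{j\theta}(\cdot)/\Omega(Z_i) + o_p(1)$$

$$= -n^{-1/2}\sum_{i=1}^n \delta_i L_{i\theta}(\cdot)\,n^{-1}\sum_{j=1}^n K_h(Z_j - Z_i)F_{j\theta}(\cdot)/\Omega(Z_j) + o_p(1)$$

$$= n^{-1/2}\sum_{i=1}^n \delta_i D_i(\cdot) + o_p(1),$$

the last step following because the interior sum is a kernel regression converging to $D_i$; see Carroll et al. (1997) for details. Result (9) now follows from (6). The limiting variance (10) is an easy calculation, noting that (A.1) implies that

$$E\{\delta\epsilon L_\theta(\cdot)|Z\} = E\{\delta L_\theta(\cdot)L_B(\cdot) + \delta L_\theta(\cdot)L_\theta(\cdot)\theta_B(Z,B_0)|Z\}$$

$$= -E\{\delta L_{B\theta}(\cdot) + \delta L_{\theta\theta}(\cdot)\theta_B(Z,B_0)|Z\} = 0 \qquad (A.2)$$

by the definition of $\theta_B(\cdot)$ given in (7), and, hence, the last two terms in (9) are uncorrelated. We will use (A.2) repeatedly in what follows.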

A.2.2 Pathwise Differentiability. We now turn to the semiparametric efficiency, using the results of Newey (1990). The relevant text of his article is in his section 3, especially through his equation (9). A parameter $\kappa = \kappa(\Theta)$ is pathwise differentiable under two conditions. The first is that $\kappa(\Theta)$ is differentiable for all smooth parametric submodels: In our case, the parametric submodels include B, parametric submodels for θ(·), and parametric submodels for the distribution of (X,Z) and the probability function pr(δ = 1|X,Z). This condition is standard in the literature and fairly well required. Our motivating example clearly satisfies this condition.

The second condition is that there exists a random vector d such that $E(d^Td) < \infty$ and $\partial\kappa(\Theta)/\partial\Theta = E(dS_\Theta^T)$, where $S_\Theta$ is the log-likelihood score for the parametric submodel. Newey noted that pathwise differentiability also holds if the first condition holds and if there is a regular estimator in the semiparametric problem. Generally, as Newey noted, finding a suitable random variable d can be difficult.

Assuming pathwise differentiability, which we show later, the efficient influence function is calculated by projecting d onto the nuisance tangent space. One innovation here is that we can calculate the efficient influence function without having an explicit representation for d.

Our development in Section A.2.3 will consist of two steps. In the first, we will assume pathwise differentiability and derive the efficient score function under that assumption. Using this derivation, we will then exhibit a random variable d that has the requisite property.

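A.2.3 Efficiency.

Recall that $\mathrm{pr}(\delta = 1 \mid X,Z) = \pi(X,Z)$. Let $f_{X,Z}(x,z)$ be the density function of $(X,Z)$. Let the model under consideration be denoted by $M_0$. Now consider a smooth parametric submodel $M_\lambda$, with $f_{X,Z}(x,z,\alpha_1)$, $\theta(z,\alpha_2)$, and $\pi(X,Z,\alpha_3)$ in place of $f_{X,Z}(x,z)$, $\theta_0(z)$, and $\pi(X,Z)$, respectively. Then, under $M_\lambda$, the log-likelihood is given by

$$
\mathcal{L}(\cdot) = \delta L(\cdot) + \delta \log\{\pi(X,Z,\alpha_3)\} + (1-\delta)\log\{1 - \pi(X,Z,\alpha_3)\} + \log\{f_{X,Z}(X,Z,\alpha_1)\},
$$

where $(\cdot)$ represents the argument $\{Y,X,\theta(Z,\alpha_2),B_0\}$. Then the score functions in this parametric submodel are given by

$$
\begin{aligned}
\partial\mathcal{L}(\cdot)/\partial B &= \delta L_B(\cdot), \\
\partial\mathcal{L}(\cdot)/\partial\alpha_1 &= \partial \log\{f_{X,Z}(X,Z,\alpha_1)\}/\partial\alpha_1, \\
\partial\mathcal{L}(\cdot)/\partial\alpha_2 &= \delta L_\theta(\cdot)\,\partial\theta(Z,\alpha_2)/\partial\alpha_2, \\
\partial\mathcal{L}(\cdot)/\partial\alpha_3 &= \{\partial\pi(X,Z,\alpha_3)/\partial\alpha_3\}\{\delta - \pi(X,Z,\alpha_3)\} / \bigl[\pi(X,Z,\alpha_3)\{1 - \pi(X,Z,\alpha_3)\}\bigr].
\end{aligned}
$$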


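Thus, the tangent space is spanned by the functions $\delta L_B(\cdot)^T$, $s_f(x,z)$, $\delta L_\theta(\cdot) g(Z)$, and $a(X,Z)\{\delta - \pi(X,Z)\}$, where $s_f(x,z)$ is any function with mean 0, while $g(z)$ and $a(X,Z)$ are any functions. For computational convenience, we rewrite the tangent space as the linear span of four subspaces $T_1$, $T_2$, $T_3$, and $T_4$ that are orthogonal to each other (see below) and defined as follows:

$$
\begin{aligned}
T_1 &= \delta L_B(\cdot)^T + \delta L_\theta(\cdot)\theta_B^T(Z,B_0), \\
T_2 &= s_f(x,z), \\
T_3 &= \delta L_\theta(\cdot) g(Z), \\
T_4 &= a(X,Z)\{\delta - \pi(X,Z)\}.
\end{aligned}
$$

To show that these spaces are orthogonal, we first note that, by assumption, the data are missing at random, and, hence, $\mathrm{pr}(\delta = 1 \mid Y,X,Z) = \pi(X,Z)$. This means that $T_4$ is orthogonal to the other three spaces. Note also that, by assumption, $E\{L_B(\cdot) \mid X,Z\} = E\{L_\theta(\cdot) \mid X,Z\} = 0$. This shows that $T_2$ is orthogonal to $T_1$ and $T_3$. It remains to show that $T_1$ and $T_3$ are orthogonal, which we showed in (A.2). Thus, the spaces $T_1$–$T_4$ are orthogonal.

Note that, under model $M_\lambda$,

$$
\kappa_0 = \int F\{x,\theta(z,\alpha_2),B_0\}\, f_{X,Z}(x,z,\alpha_1)\,dx\,dz.
$$

Hence, we have

$$
\begin{aligned}
\partial\kappa_0/\partial B &= E\{F_B(\cdot)\}, \\
\partial\kappa_0/\partial\alpha_1 &= E\bigl[F(\cdot)\,\partial\log\{f_{X,Z}(X,Z,\alpha_1)\}/\partial\alpha_1\bigr], \\
\partial\kappa_0/\partial\alpha_2 &= E\{F_\theta(\cdot)\,\partial\theta(Z,\alpha_2)/\partial\alpha_2\}, \\
\partial\kappa_0/\partial\alpha_3 &= 0.
\end{aligned}
$$

Now, by pathwise differentiability and equation (7) of Newey (1990), there exists a random variable $d$, which we need not compute, such that

$$
\begin{aligned}
E\{F_B(\cdot)\} &= E\bigl[d\{\delta L_B(\cdot)\}\bigr], && \text{(A.3)} \\
E\{F(\cdot) s_f(X,Z)\} &= E\{d\, s_f(X,Z)\}, && \text{(A.4)} \\
E\{F_\theta(\cdot) g(Z)\} &= E\{d\,\delta L_\theta(\cdot) g(Z)\}, && \text{(A.5)} \\
0 &= E\bigl[d\, a(X,Z)\{\delta - \pi(X,Z)\}\bigr]. && \text{(A.6)}
\end{aligned}
$$

Next, we compute the projections of $d$ onto $T_1$, $T_2$, $T_3$, and $T_4$. First, note that, by (A.4), for any function $s_f(X,Z)$ with expectation 0, we have $E[\{d - F(\cdot) + \kappa_0\} s_f(X,Z)] = 0$, which implies that the projection of $d$ onto $T_2$ is given by

$$
\Pi(d \mid T_2) = F(\cdot) - \kappa_0. \tag{A.7}
$$

Also, by (A.1) and (A.5), for any function $g(Z)$, we have

$$
\begin{aligned}
E[\{d - \delta D(\cdot)\}\,\delta g(Z) L_\theta(\cdot)]
&= E\{F_\theta(\cdot) g(Z)\} + E\bigl[\delta g(Z) L_\theta^2(\cdot)\, E\{F_\theta(\cdot) \mid Z\}/E\{\delta L_{\theta\theta}(\cdot) \mid Z\}\bigr] \\
&= 0,
\end{aligned}
$$

and, hence, the projection of $d$ onto $T_3$ is given by

$$
\Pi(d \mid T_3) = \delta D(\cdot). \tag{A.8}
$$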

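In addition, by (A.3) and (A.5),

$$
\begin{aligned}
E[\{d - M_2^T M_1^{-1}\delta\epsilon\}\,\delta\epsilon^T]
&= E\{F_B^T(\cdot)\} + E\{F_\theta(\cdot)\theta_B^T(Z,B_0)\} - E(M_2^T M_1^{-1}\delta\epsilon\epsilon^T) \\
&= 0.
\end{aligned}
$$

Hence, the projection of $d$ onto $T_1$ is given by

$$
\Pi(d \mid T_1) = \delta M_2^T M_1^{-1}\epsilon. \tag{A.9}
$$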

Also, by (A.6), we have $\Pi(d \mid T_4) = 0$. Using (A.7), (A.8), and (A.9), we get that the efficient influence function for $\kappa_0$ is

$$
\begin{aligned}
\psi_{\mathrm{eff}} &= \Pi(d \mid T_1) + \Pi(d \mid T_2) + \Pi(d \mid T_3) + \Pi(d \mid T_4) \\
&= F(\cdot) - \kappa_0 + \delta M_2^T M_1^{-1}\epsilon + \delta D(\cdot),
\end{aligned}
$$

which is the same as (9), hence completing the proof under the assumption of pathwise differentiability. In the calculations that follow, we will write $F_B$ rather than $F_B(\cdot)$, $a$ rather than $a(X,Z)$, and so on.
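The decomposition of the efficient influence function as a sum of projections onto $T_1$ through $T_4$ rests on the mutual orthogonality of those spaces. A finite-dimensional analogue may make the logic concrete. The sketch below is only an illustration under assumed inputs: the vectors in $\mathbb{R}^{50}$, the block sizes, and the helper `project` are hypothetical stand-ins, with Euclidean inner products replacing expectations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Six orthonormal columns in R^50, split into four mutually orthogonal blocks
# that play the role of T1, ..., T4 (block sizes are arbitrary).
q, _ = np.linalg.qr(rng.normal(size=(50, 6)))
blocks = [q[:, :2], q[:, 2:3], q[:, 3:5], q[:, 5:6]]

def project(v, basis):
    """Least-squares projection of v onto the column span of basis."""
    coef, *_ = np.linalg.lstsq(basis, v, rcond=None)
    return basis @ coef

d = rng.normal(size=50)                      # stand-in for the random variable d

sum_of_projections = sum(project(d, b) for b in blocks)
joint_projection = project(d, q)             # projection onto the whole span

gap = float(np.max(np.abs(sum_of_projections - joint_projection)))
```

Because the blocks are orthogonal, projecting onto each block separately and adding the results reproduces the projection onto their joint span, up to floating-point error.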

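We now show pathwise differentiability and, hence, semiparametric efficiency; that is, we show that (A.3)–(A.6) hold for $d = F - \kappa_0 + \delta D + \delta M_2^T M_1^{-1}\epsilon$. To verify (A.3), we see that

$$
\begin{aligned}
E(d\,\delta L_B)
&= E[(F - \kappa_0 + \delta D + \delta M_2^T M_1^{-1}\epsilon)\,\delta L_B] \\
&= E[\delta D L_B + \delta M_2^T M_1^{-1}\epsilon L_B] \\
&= -E\left[ L_\theta \frac{E(F_\theta \mid Z)}{E(\delta L_{\theta\theta} \mid Z)} L_B \delta \right]
   + E\{\delta L_B (L_B + L_\theta\theta_B)^T\} M_1^{-1} M_2 \\
&= E\left[ \delta L_{\theta B} \frac{E(F_\theta \mid Z)}{E(\delta L_{\theta\theta} \mid Z)} \right]
   - E\{\delta (L_{BB} + L_{B\theta}\theta_B^T)\} M_1^{-1} M_2 \\
&= -E(F_\theta\theta_B) + M_2 \\
&= E(F_B).
\end{aligned}
$$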

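To verify (A.4), we see that

$$
\begin{aligned}
E(d\, s_f)
&= E\{(F - \kappa_0 + \delta D + \delta M_2^T M_1^{-1}\epsilon)\, s_f\} \\
&= E(F s_f) - \kappa_0 E(s_f) + E\{E(\delta D + \delta M_2^T M_1^{-1}\epsilon \mid X,Z)\, s_f\} \\
&= E(F s_f).
\end{aligned}
$$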

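To verify (A.5), we see that

$$
\begin{aligned}
E(d\,\delta L_\theta g)
&= E\{(F - \kappa_0 + \delta D + \delta M_2^T M_1^{-1}\epsilon)\,\delta L_\theta g\} \\
&= E(D L_\theta \delta g) + M_2^T M_1^{-1} E(\epsilon L_\theta \delta g) \\
&= -E\left[ L_\theta \frac{E(F_\theta \mid Z)}{E(\delta L_{\theta\theta} \mid Z)} L_\theta \delta g \right]
   + M_2^T M_1^{-1} E\{(L_B + L_\theta\theta_B) L_\theta \delta g\} \\
&= E(F_\theta g) - M_2^T M_1^{-1} E\{(L_{B\theta} + L_{\theta\theta}\theta_B)\,\delta g\} \\
&= E(F_\theta g) - M_2^T M_1^{-1} E\{E(\delta L_{B\theta} + \delta L_{\theta\theta}\theta_B \mid Z)\, g\} \\
&= E(F_\theta g),
\end{aligned}
$$

where again we have used (A.2). Finally, because the responses are missing at random, (A.6) is immediate. This completes the proof.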

A.3 Sketch of Lemma 1

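We have

$$
\hat\kappa_{\mathrm{marg}}
= n^{-1} \sum_{i=1}^n \left[ \frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)} G(Y_i)
  + \left\{ 1 - \frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)} \right\} F\{X_i,\hat\theta(Z_i,\hat B),\hat B\} \right]
= A_1 + A_2.
$$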


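By calculations that are similar to those given previously, and using (11), we can readily show that

$$
A_1 = n^{-1} \sum_{i=1}^n \frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)} G(Y_i)
 - n^{-1} \sum_{i=1}^n \{\delta_i - \pi_{\mathrm{marg}}(Z_i)\}\,
   E\left[ \frac{\delta_i G(Y_i)}{\{\pi_{\mathrm{marg}}(Z_i)\}^2} \,\Big|\, Z_i \right]
 + o_p(n^{-1/2}).
$$

We can write

$$
\begin{aligned}
A_2 &= B_1 + B_2 + o_p(n^{-1/2}), \\
B_1 &= n^{-1} \sum_{i=1}^n \left\{ 1 - \frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)} \right\} F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}, \\
B_2 &= n^{-1} \sum_{i=1}^n \frac{\delta_i F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}}{\{\pi_{\mathrm{marg}}(Z_i)\}^2}
       \{\hat\pi_{\mathrm{marg}}(Z_i) - \pi_{\mathrm{marg}}(Z_i)\}.
\end{aligned}
$$

Using (5) and (6), we can show that

$$
B_1 = n^{-1} \sum_{i=1}^n \left\{ 1 - \frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)} \right\} F_i(\cdot)
 + M_{2,\mathrm{marg}}^T M_1^{-1}\, n^{-1} \sum_{i=1}^n \delta_i \epsilon_i
 + n^{-1} \sum_{i=1}^n \delta_i D_{i,\mathrm{marg}}(\cdot)
 + o_p(n^{-1/2}).
$$

Using (11) once again, we see that

$$
B_2 = n^{-1} \sum_{i=1}^n \{\delta_i - \pi_{\mathrm{marg}}(Z_i)\}\,
   E\left[ \frac{\delta_i F_i(\cdot)}{\{\pi_{\mathrm{marg}}(Z_i)\}^2} \,\Big|\, Z_i \right]
 + o_p(n^{-1/2}).
$$

Collecting terms and noting that

$$
0 = E\left[ \frac{\delta_i \{G(Y_i) - F_i(\cdot)\}}{\{\pi_{\mathrm{marg}}(Z_i)\}^2} \,\Big|\, Z_i \right]
$$

proves (12).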
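The structure of the estimator expanded in this sketch (an inverse-probability-weighted term in $G(Y_i)$ plus an augmentation term in $F$) can be made concrete on simulated data. What follows is a minimal illustration under stated assumptions, not the authors' procedure: $G(Y) = Y$, the data-generating model, the linear-plus-sinusoid working regression standing in for $F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}$, the Gaussian-kernel smoother for $\hat\pi_{\mathrm{marg}}$, and the bandwidth are all hypothetical choices.

```python
import numpy as np

def nw_smoother(z_eval, z, w, h):
    """Nadaraya-Watson smoother of w on z with a Gaussian kernel and bandwidth h."""
    u = (z[None, :] - z_eval[:, None]) / h
    k = np.exp(-0.5 * u**2)
    return k @ w / k.sum(axis=1)

rng = np.random.default_rng(1)
n = 2000
z = rng.uniform(0.0, 1.0, size=n)
x = rng.normal(size=n)
y = x + np.sin(2.0 * np.pi * z) + rng.normal(scale=0.5, size=n)   # E(Y) = 0

# Responses are missing at random with probability depending on Z only.
pi_marg = 0.4 + 0.4 * z
delta = rng.binomial(1, pi_marg)
observed = delta == 1

# Working regression fitted on complete cases (a stand-in for the
# semiparametric fit F{X, theta_hat(Z, B_hat), B_hat}).
design = np.column_stack([np.ones(n), x,
                          np.sin(2.0 * np.pi * z), np.cos(2.0 * np.pi * z)])
coef, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
f_hat = design @ coef

# Kernel estimate of the marginal selection probability pr(delta = 1 | Z).
pi_hat = nw_smoother(z, z, delta.astype(float), h=0.1)

# G(Y) = Y; unobserved responses never enter because they are multiplied by delta = 0.
g_obs = np.where(observed, y, 0.0)
kappa_marg = float(np.mean(delta / pi_hat * g_obs + (1.0 - delta / pi_hat) * f_hat))
```

In this simulation the target is the population mean of $Y$, which is 0, so `kappa_marg` should be close to 0 up to sampling error.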

A.4 Sketch of Lemma 2

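We have

$$
\hat\kappa
= n^{-1} \sum_{i=1}^n \left[ \frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)} G(Y_i)
  + \left\{ 1 - \frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)} \right\} F\{X_i,\hat\theta(Z_i,\hat B),\hat B\} \right]
= A_1 + A_2,
$$

say. By a simple Taylor series expansion,

$$
A_1 = n^{-1} \sum_{i=1}^n \frac{\delta_i}{\pi(X_i,Z_i,\zeta)} G(Y_i)
 - E\left[ \frac{1}{\pi(X,Z,\zeta)} G(Y)\,\pi_\zeta(X,Z,\zeta) \right]^T n^{-1} \sum_{i=1}^n \psi_{i\zeta}
 + o_p(n^{-1/2}).
$$

In addition,

$$
\begin{aligned}
A_2 &= B_1 + B_2 + o_p(n^{-1/2}), \\
B_1 &= n^{-1} \sum_{i=1}^n \left\{ 1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)} \right\} F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}, \\
B_2 &= n^{-1} \sum_{i=1}^n \frac{\delta_i F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}}{\{\pi(X_i,Z_i,\zeta)\}^2}
       \pi_\zeta(X_i,Z_i,\zeta)^T (\hat\zeta - \zeta) + o_p(n^{-1/2}).
\end{aligned}
$$

Using the fact that

$$
0 = E\left[ 1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)} \,\Big|\, X_i, Z_i \right],
$$

we can easily show that

$$
B_1 = n^{-1} \sum_{i=1}^n \left\{ 1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)} \right\} F_i(\cdot) + o_p(n^{-1/2}).
$$

It also follows that

$$
B_2 = E\left[ \frac{1}{\pi(X,Z,\zeta)} F(\cdot)\,\pi_\zeta(X,Z,\zeta) \right]^T n^{-1} \sum_{i=1}^n \psi_{i\zeta}(\cdot) + o_p(n^{-1/2}).
$$

Collecting terms and using the fact that $E\{G(Y) \mid X,Z\} = F(\cdot)$, we obtain the result.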

A.5 Proof of Theorem 2

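A.5.1 Asymptotic Expansion.

We first show the expansion (15). Recall that $B = (\gamma,\beta)$. The only things that differ from the calculations of Carroll et al. (1997) are that we add terms involving $\delta_i$ and that we need not worry about any constraint on $\gamma$, and, thus, we avoid terms such as their $P_\alpha$ on their page 487.

In their equation (A.12), they showed that

$$
n^{1/2}(\hat B - B_0) = n^{-1/2} Q^{-1} \sum_{i=1}^n \delta_i N_i \epsilon_i + o_p(1). \tag{A.10}
$$

Also, in their equation (A.11), they showed that

$$
\begin{aligned}
\hat\theta(R + S^T\hat\gamma,\hat B) - \theta_0(R + S^T\gamma_0)
&= \theta_0^{(1)}(R + S^T\gamma_0)\, S^T(\hat\gamma - \gamma_0) \\
&\quad + \hat\theta(R + S^T\gamma_0,\hat B) - \theta_0(R + S^T\gamma_0) + o_p(n^{-1/2}).
\end{aligned}
\tag{A.11}
$$

Define $H(u) = [E\{\rho_2(\cdot) \mid U = u\}]^{-1}$. In their equation (A.13), Carroll et al. (1997) showed that

$$
\begin{aligned}
\hat\theta(u,\hat B) - \theta_0(u)
&= n^{-1} \sum_{i=1}^n \delta_i K_h(U_i - u)\,\epsilon_i H(u)/f(u) \\
&\quad - H(u)\bigl[E\{\delta\Lambda\rho_2(\cdot) \mid U = u\}\bigr]^T (\hat B - B_0) + o_p(n^{-1/2}).
\end{aligned}
\tag{A.12}
$$

Carroll et al. (1997) did not consider an estimate of $\phi$. Define

$$
G\{\phi,Y,X,B,\theta(U)\} = D_\phi(Y,\phi) - \bigl[Y c\{X^T\beta + \theta(U)\} - C\{c(\cdot)\}\bigr]/\phi^2.
$$

Of course, $G(\cdot)$ is the likelihood score for $\phi$. If there are no arguments, $G = G\{\phi_0,Y,X,B_0,\theta_0(R + S^T\gamma_0)\}$. The estimate $\hat\phi$ solves

$$
0 = n^{-1/2} \sum_{i=1}^n \delta_i G\{\hat\phi,Y_i,X_i,\hat B,\hat\theta(R_i + S_i^T\hat\gamma,\hat B)\}.
$$

Because $G$ is a likelihood score, it follows that

$$
E\bigl[G_\phi\{\phi_0,Y,X,B_0,\theta_0(R + S^T\gamma_0)\} \mid X,R,S\bigr] = -E\{G^2 \mid X,R,S\}.
$$

By a Taylor series,

$$
\begin{aligned}
E(\delta G^2)\, n^{1/2}(\hat\phi - \phi_0)
&= n^{-1/2} \sum_{i=1}^n \delta_i G\{\phi_0,Y_i,X_i,\hat B,\hat\theta(R_i + S_i^T\hat\gamma,\hat B)\} + o_p(1) \\
&= n^{-1/2} \sum_{i=1}^n \delta_i G_i + E(\delta G_B^T)\, n^{1/2}(\hat B - B_0) \\
&\quad + n^{-1/2} \sum_{i=1}^n \delta_i G_{i\theta}\{\hat\theta(R_i + S_i^T\hat\gamma,\hat B) - \theta_0(R_i + S_i^T\gamma_0)\} + o_p(1).
\end{aligned}
$$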


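However, it is readily verified that $E(\delta G_B \mid X,R,S) = 0$ and that $E(\delta G_\theta \mid X,R,S) = 0$. It, thus, follows via a simple calculation using (A.11) that

$$
\begin{aligned}
E(\delta G^2)\, n^{1/2}(\hat\phi - \phi_0)
&= n^{-1/2} \sum_{i=1}^n \delta_i G_i + n^{-1/2} \sum_{i=1}^n \delta_i G_{i\theta}\{\hat\theta(U_i,B_0) - \theta_0(U_i)\} + o_p(1) \\
&= n^{-1/2} \sum_{i=1}^n \delta_i G_i + o_p(1),
\end{aligned}
\tag{A.13}
$$

the last step following from an application of (A.12).

With some considerable algebra, (15) now follows from calculations similar to those in Section A.2. The variance calculation follows because it is readily shown that, for any function $h(U)$,

$$
0 = E\bigl[(N\epsilon)\{\delta h(U)\epsilon\}\bigr].
$$

A.5.2 Efficiency.

We now turn to semiparametric efficiency. Recall that the GPLSIM follows the form (20) with $X^T\beta_0 + \theta_0(R + S^T\gamma_0)$ and that $U = R + S^T\gamma_0$. It is immediate that $V\{\mu(t)\} = \mu^{(1)}(t)/c^{(1)}(t)$, that $c^{(1)}(t) = \rho_1(t)$, and that $\rho_2(t) = \rho_1^2(t) V\{\mu(t)\} = c^{(1)}(t)\mu^{(1)}(t)$. We also have

$$
E(\epsilon \mid X,Z) = 0, \tag{A.14}
$$

$$
\begin{aligned}
E(\epsilon^2 \mid X,Z)
&= E\bigl[\bigl(Y - \mu\{X^T\beta_0 + \theta_0(U)\}\bigr)^2 \mid X,Z\bigr]\,\bigl[\rho_1\{X^T\beta_0 + \theta_0(U)\}\bigr]^2 \\
&= \mathrm{var}(Y \mid X,Z)\,\bigl[\rho_1\{X^T\beta_0 + \theta_0(U)\}\bigr]^2 \\
&= \phi\rho_2(\cdot).
\end{aligned}
\tag{A.15}
$$

Let the semiparametric model be denoted by $M_0$. Consider a parametric submodel $M_\lambda$ with $f_{X,Z}(X,Z;\nu_1)$, $\theta_0(R + S^T\gamma_0,\nu_2)$, and $\pi(X,Z,\nu_3)$. The joint log-likelihood of $Y$, $X$, and $Z$ under $M_\lambda$ is given by

$$
\begin{aligned}
\mathcal{L}(\cdot) &= (\delta/\phi)\bigl[Y c\{X^T\beta_0 + \theta_0(R + S^T\gamma_0,\nu_2)\} - C\bigl[c\{X^T\beta_0 + \theta_0(R + S^T\gamma_0,\nu_2)\}\bigr]\bigr] \\
&\quad + \delta D(Y,\phi) + \log\{f_{X,Z}(X,Z,\nu_1)\} \\
&\quad + \delta\log\{\pi(X,Z,\nu_3)\} + (1-\delta)\log\{1 - \pi(X,Z,\nu_3)\}.
\end{aligned}
$$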
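The moment identities (A.14) and (A.15) can be checked exactly in a small non-canonical example. The sketch below assumes a Bernoulli response with a probit mean function, so that $\mu(t) = \Phi(t)$, $V(\mu) = \mu(1-\mu)$, and $\phi = 1$; this model and the helper names are illustrative assumptions and do not come from the article. It uses only the relations $c^{(1)}(t) = \mu^{(1)}(t)/V\{\mu(t)\}$, $\rho_1 = c^{(1)}$, $\rho_2 = c^{(1)}\mu^{(1)}$, and $\epsilon = \rho_1(\cdot)\{Y - \mu(\cdot)\}$.

```python
import math

def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def normal_cdf(t):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def epsilon_moments(t):
    """Exact E(eps) and E(eps^2) for a probit Bernoulli model, plus rho_2(t)."""
    mu = normal_cdf(t)            # mean function mu(t)
    mu1 = normal_pdf(t)           # derivative mu^(1)(t)
    var_fn = mu * (1.0 - mu)      # variance function V{mu(t)}; phi = 1 here
    rho1 = mu1 / var_fn           # c^(1)(t) = mu^(1)(t) / V{mu(t)}
    rho2 = rho1 * mu1             # rho_2(t) = c^(1)(t) * mu^(1)(t)
    probs = {0: 1.0 - mu, 1: mu}  # distribution of Y given the linear predictor
    first = sum(p * rho1 * (y - mu) for y, p in probs.items())
    second = sum(p * (rho1 * (y - mu)) ** 2 for y, p in probs.items())
    return first, second, rho2

results = [epsilon_moments(t) for t in (-1.0, 0.0, 0.7)]
```

For each value of the linear predictor, the first moment is zero and the second moment equals $\phi\rho_2$ with $\phi = 1$, in line with (A.14) and (A.15).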

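As before, recall that $\epsilon = \rho_1(\cdot)\{Y - \mu(\cdot)\} = c^{(1)}(\cdot)\{Y - \mu(\cdot)\}$. Then the score functions evaluated at $M_0$ are

$$
\begin{aligned}
\partial\mathcal{L}/\partial\beta &= \delta X c^{(1)}(\cdot)\{Y - \mu(\cdot)\}/\phi = \delta X\epsilon/\phi, \\
\partial\mathcal{L}/\partial\gamma &= \delta\theta^{(1)}(U) S c^{(1)}(\cdot)\{Y - \mu(\cdot)\}/\phi = \delta\theta^{(1)}(U) S\epsilon/\phi, \\
\partial\mathcal{L}/\partial\nu_1 &= s_f(X,Z), \\
\partial\mathcal{L}/\partial\nu_2 &= \delta h(U) c^{(1)}(\cdot)\{Y - \mu(\cdot)\}/\phi = \delta h(U)\epsilon/\phi, \\
\partial\mathcal{L}/\partial\nu_3 &= a(X,Z)\{\delta - \pi(X,Z)\}, \\
\partial\mathcal{L}/\partial\phi &= \delta D_\phi(Y,\phi) - \delta\bigl[Y c(\cdot) - C\{c(\cdot)\}\bigr]/\phi^2 = \delta G,
\end{aligned}
$$

where $D_\phi(Y,\phi)$ is the derivative of $D(Y,\phi)$ with respect to $\phi$, $s_f(X,Z)$ is a mean-zero function, and $h(U)$ and $a(X,Z)$ are any functions. This means that the tangent space is spanned by

$$
\bigl[\, T_1 = \delta\{S^T\theta_0^{(1)}(U), X^T\}\epsilon/\phi,\quad T_2 = s_f(X,Z),\quad T_3 = \delta h(U)\epsilon/\phi,\quad T_4 = a(X,Z)\{\delta - \pi(X,Z)\},\quad T_5 = \delta G \,\bigr].
$$

An orthogonal basis of the tangent space is given by $[T_1 = \delta N^T\epsilon,\ T_2 = s_f(X,Z),\ T_3 = \delta h(U)\epsilon,\ T_4 = a(X,Z)\{\delta - \pi(X,Z)\}]$ and $T_5 = \delta G$; the orthogonality is a straightforward calculation. Now notice that

$$
\kappa_0 = \int F\{x,\theta_0(z;\nu_2),B_0,\phi_0\}\, f_{X,Z}(x,z;\nu_1)\,dx\,dz
$$

and, hence,

$$
\begin{aligned}
\partial\kappa_0/\partial\beta &= E\{F_\beta(\cdot)\}, \\
\partial\kappa_0/\partial\gamma &= E\bigl[F_\theta(\cdot)\theta^{(1)}(U) S\bigr], \\
\partial\kappa_0/\partial\nu_1 &= E\{F(\cdot) s_f(X,Z)\}, \\
\partial\kappa_0/\partial\nu_2 &= E\{F_\theta(\cdot) h(U)\}, \\
\partial\kappa_0/\partial\nu_3 &= 0, \\
\partial\kappa_0/\partial\phi &= E\{F_\phi(\cdot)\}.
\end{aligned}
$$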

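As before, we first assume pathwise differentiability to construct the efficient score. We then verify this in Section A.5.3.

By equation (7) of Newey (1990), there is a random quantity $d$ such that

$$
\begin{aligned}
E(d\,\delta X\epsilon/\phi) &= E\{F_\beta(\cdot)\}, && \text{(A.16)} \\
E\bigl[d\,\delta\theta^{(1)}(U) S\epsilon/\phi\bigr] &= E\bigl[F_\theta(\cdot)\theta^{(1)}(U) S\bigr], && \text{(A.17)} \\
E\{d\, s_f(X,Z)\} &= E\{F(\cdot) s_f(X,Z)\}, && \text{(A.18)} \\
E\{d\,\delta h(U)\epsilon/\phi\} &= E\{F_\theta(\cdot) h(U)\}, && \text{(A.19)} \\
E\bigl[d\, a(X,Z)\{\delta - \pi(X,Z)\}\bigr] &= 0, && \text{(A.20)} \\
E(d\,\delta G) &= E\{F_\phi(\cdot)\}. && \text{(A.21)}
\end{aligned}
$$

Now we compute the projection of $d$ onto the tangent space. It is immediate that $\Pi(d \mid T_2) = F(\cdot) - \kappa_0$ and that $\Pi(d \mid T_4) = 0$. Because $E[\{\delta J(U)\epsilon\}\{\delta h(U)\epsilon/\phi\}] = E\{h(U) F_\theta(\cdot)\}$, it is readily shown that $\Pi(d \mid T_3) = \delta J(U)\epsilon$. It is a similarly direct calculation to show that $\Pi(d \mid T_1) = D^T Q^{-1}\delta N\epsilon$. Finally, $\Pi(d \mid T_5) = \delta G\, E\{F_\phi(\cdot)\}/E(\delta G^2)$.

These calculations, thus, show that, assuming pathwise differentiability, the efficient influence function for $\kappa_0$ is

$$
D^T Q^{-1}\delta N\epsilon + F(\cdot) - \kappa_0 + \delta J(U)\epsilon + \delta G\, E\{F_\phi(\cdot)\}/E(\delta G^2).
$$

Hence, from (15), we see that $\hat\kappa_{SI}$ has the semiparametric optimal influence function and is asymptotically efficient.

A.5.3 Pathwise Differentiability.

For $d = D^T Q^{-1}\delta N\epsilon + F(\cdot) - \kappa_0 + \delta J(U)\epsilon + \delta G\, E\{F_\phi(\cdot)\}/E(\delta G^2)$, we have to show that (A.16)–(A.21) hold. Let

$$
\begin{aligned}
d_1 &= D^T Q^{-1}\delta N\epsilon, \\
d_2 &= F(\cdot) - \kappa_0, \\
d_3 &= \delta J(U)\epsilon, \\
d_4 &= 0, \\
d_5 &= \delta G\, E\{F_\phi(\cdot)\}/E(\delta G^2).
\end{aligned}
$$

Then $d = d_1 + \cdots + d_5$. Because $T_1$, $T_2$, $T_3$, $T_4$, and $T_5$ are orthogonal and $d_i \in T_i$ for $i = 1,\ldots,5$, we have

$$
\begin{aligned}
E(d_1 T_1) &= E(D^T Q^{-1}\delta N N^T\epsilon^2) = \phi E(D^T), && \text{(A.22)} \\
E(d_2 T_2) &= E\bigl[\{F(\cdot) - \kappa_0\} s_f(X,Z)\bigr] = E\{F(\cdot) s_f(X,Z)\}, && \text{(A.23)} \\
E(d_3 T_3) &= E\{\delta J(U) h(U)\epsilon^2\} = E\{\pi(X,Z) J(U) h(U)\phi\rho_2(\cdot)\}, && \text{(A.24)} \\
E(d_4 T_4) &= 0, && \text{(A.25)} \\
E(d_5 T_5) &= E\bigl[\delta G^2 E\{F_\phi(\cdot)\}/E(\delta G^2)\bigr] = E\{F_\phi(\cdot)\}, && \text{(A.26)} \\
E(d_i T_j) &= 0, \quad i \neq j. && \text{(A.27)}
\end{aligned}
$$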