Efficient Estimation of Population-Level Summaries
in General Semiparametric Regression Models
Arnab MAITY, Yanyuan MA, and Raymond J. CARROLL
This article considers a wide class of semiparametric regression models in which interest focuses on population-level quantities that combine
both the parametric and the nonparametric parts of the model. Special cases in this approach include generalized partially linear models,
generalized partially linear single-index models, structural measurement error models, and many others. For estimating the parametric part
of the model efficiently, profile likelihood kernel estimation methods are well established in the literature. Here our focus is on estimating
general population-level quantities that combine the parametric and nonparametric parts of the model (e.g., population mean, probabilities,
etc.). We place this problem in a general context, provide a general kernel-based methodology, and derive the asymptotic distributions of
estimates of these population-level quantities, showing that in many cases the estimates are semiparametric efficient. For estimating the
population mean with no missing data, we show that the sample mean is semiparametric efficient for canonical exponential families, but not
in general. We apply the methods to a problem in nutritional epidemiology, where estimating the distribution of usual intake is of primary
interest and semiparametric methods are not available. Extensions to the case of missing response data are also discussed.
KEY WORDS: Generalized estimating equations; Kernel methods; Measurement error; Missing data; Nonparametric regression; Nutrition; Partially linear model; Profile method; Semiparametric efficient score; Semiparametric information bound; Single-index models.
1. INTRODUCTION
This article is about semiparametric regression models when
one is interested in estimating a population quantity such as
the mean, variance, and probabilities. The unique feature of
the problem is that the quantities of interest are functions of
both the parametric and the nonparametric parts of the model.
We will also allow for partially missing responses, but handling
such a modification is relatively easy. The main aim of the ar-
ticle is to estimate population quantities that involve both the
parametric and the nonparametric parts of the model and to do
so efficiently and in considerable generality.
We will construct estimators of these population-level quan-
tities that exploit the semiparametric structure of the problem,
derive their limiting distributions, and show in many cases that
the methods are semiparametric efficient. The work is motivated by and illustrated with an important problem in nutritional
epidemiology, namely, estimating the distribution of usual intake
for episodically consumed foods such as red meat.
A special simple case of our results is already established
in the literature (Wang, Linton, and Härdle 2004 and the refer-
ences therein), namely, the partially linear model
$$Y_i = X_i^{\mathrm{T}}\beta_0 + \theta_0(Z_i) + \xi_i, \tag{1}$$
where $\theta_0(\cdot)$ is an unknown function and $\xi_i = \mathrm{Normal}(0,\sigma_0^2)$. We allow the responses to be partially missing, important in cases where the response is difficult to measure but the predictors are not. Suppose that $Y$ is partially missing and let $\delta = 1$ indicate that $Y$ is observed, so that the observed data are $(\delta_i Y_i, X_i, Z_i, \delta_i)$. Suppose further that $Y$ is missing at random, so that $\mathrm{pr}(\delta = 1|Y,X,Z) = \mathrm{pr}(\delta = 1|X,Z)$.
Arnab Maity is a Graduate Student (E-mail: amaity@stat.tamu.edu), Yanyuan Ma is Assistant Professor (E-mail: ma@stat.tamu.edu), and Raymond J. Carroll is Distinguished Professor (E-mail: carroll@stat.tamu.edu), Department of Statistics, Texas A&M University, College Station, TX 77843. This work was supported by grants from the National Cancer Institute (CA57030 for AM and RJC; CA74552 for YM) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106). The authors are grateful to Janet Tooze, Amy Subar, Victor Kipnis, and Douglas Midthune for introducing us to the problem of episodically consumed foods and for allowing us to use their data. The authors thank Naisyin Wang for reading the final manuscript and helping us with replies to a referee. Part of the original work of the last two authors originally occurred during a visit to the Centre of Excellence for Mathematics and Statistics of Complex Systems at the Australian National University, whose support they gratefully acknowledge. The authors especially wish to thank three referees, the associate editor, and the joint editor for helping turn the original submission into a publishable article. Their patience and many helpful suggestions are very greatly appreciated.
Usually, of course, the main interest is in estimating $\beta_0$ efficiently. This is not the problem we discuss, because in our example the parameters $\beta_0$ are themselves of relatively minor interest. In their work, Wang et al. (2004) estimated the marginal mean $\kappa_0 = E(Y) = E\{X^{\mathrm{T}}\beta_0 + \theta_0(Z)\}$. Note how this combines both the parametric and the nonparametric parts of the model. One of the results of Wang et al. is that if one uses only the complete data that $Y$ is observed, then fits the standard profile likelihood estimator to obtain $\hat\beta$ and $\hat\theta(\cdot,\hat\beta)$, it transpires that a semiparametric efficient estimator of the population mean $\kappa_0$ is $n^{-1}\sum_{i=1}^n \{X_i^{\mathrm{T}}\hat\beta + \hat\theta(Z_i,\hat\beta)\}$. If there are no missing data, the sample mean is also semiparametric efficient.
Actually, quite a bit more is true even in this relatively simple Gaussian case. Let $B = (\beta^{\mathrm{T}},\sigma^2)^{\mathrm{T}}$ and let $\hat B$ and $\hat\theta(\cdot,\hat B)$ be the profile likelihood estimates in the complete data; see, for example, Severini and Wong (1992) for local constant estimation and Claeskens and Carroll (2007) for local linear estimation. Consider estimating any functional $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ for some function $F(\cdot)$ that is thrice continuously differentiable: This, of course, includes such quantities as population mean, and probabilities. Then one very special case of our results is that the semiparametric efficient estimate of $\kappa_0$ is just $\hat\kappa = n^{-1}\sum_{i=1}^n F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}$.
In contrast to Wang et al. (2004), we deal with general semiparametric models and general population-level quantities. Thus, consider a semiparametric problem in which the log-likelihood function given $(X,Z)$ is $L\{Y,X,\theta(Z),B\}$. If we define $L_B(\cdot)$ and $L_\theta(\cdot)$ to be derivatives of the log-likelihood with respect to $B$ and $\theta(Z)$, we have the properties that $E[L_B\{Y,X,\theta_0(Z),B_0\}|X,Z] = 0$ and similarly for $L_\theta(\cdot)$. We use profile likelihood methods computed at the observed data. With missing data, this local linear kernel version of the profile likelihood method of Severini and Wong (1992) works
© 2007 American Statistical Association
Journal of the American Statistical Association
March 2007, Vol. 102, No. 477, Theory and Methods
DOI 10.1198/016214506000001103
as follows. Let $K(\cdot)$ be a smooth symmetric density function with bounded support, let $h$ be a bandwidth, and let $K_h(z) = h^{-1}K(z/h)$. For any fixed $B$, let $(\hat\alpha_0,\hat\alpha_1)$ be the local likelihood estimator obtained by maximizing, in $(\alpha_0,\alpha_1)$,
$$\sum_{i=1}^n \delta_i K_h(Z_i - z)\, L\{Y_i, X_i, \alpha_0 + \alpha_1(Z_i - z), B\}, \tag{2}$$
and then setting $\hat\theta(z,B) = \hat\alpha_0$. The profile likelihood estimator of $B_0$ modified for missing responses is obtained by maximizing in $B$
$$\sum_{i=1}^n \delta_i L\{Y_i, X_i, \hat\theta(Z_i,B), B\}. \tag{3}$$
Our estimator of $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ is then
$$\hat\kappa = n^{-1}\sum_{i=1}^n F\{X_i, \hat\theta(Z_i,\hat B), \hat B\}. \tag{4}$$
We emphasize that the possibility of missing response data and finding a semiparametric efficient estimate of $B_0$ is not the focus of the article. Instead, the focus is on estimating quantities $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ that depend on both the parametric and the nonparametric parts of the model: This is a very different problem than simply estimating $B_0$. Previous work in the area considered only the partially linear model and only estimation of the population mean: Our work deals with general semiparametric models and general population-level quantities.
An outline of this article is as follows. In Section 2 we discuss the general semiparametric problem with log-likelihood $L\{Y,X,\theta(Z),B\}$ and a general goal of estimating $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$. We derive the limiting distribution of (4) and show that it is semiparametric efficient. We also discuss the general problem where the population quantity $\kappa_0$ of interest is the expectation of a function of $Y$ alone and describe doubly robust estimators in this context.
In Section 3 we consider the class of generalized partially linear single-index models (Carroll, Fan, Gijbels, and Wand 1997). Single-index modeling, see Härdle and Stoker (1989) and Härdle, Hall, and Ichimura (1993), is an important means of dimension reduction, one that is finding increased use in this age of high-dimensional data. We develop methods for estimating population quantities in the generalized partially linear single-index modeling framework and show that the methods are semiparametric efficient.
Section 4 describes an example from nutritional epidemiology that motivated this work, namely, estimating the distribution of usual intake of episodically consumed foods such as red meat. The model used in this area is far more complex than the simple partially linear Gaussian model (1), and while the population mean is of some interest, of considerably more interest is the probability that usual intake exceeds thresholds. We will illustrate why in this context one cannot simply adopt the percentages of the observed responses that exceed a certain threshold.
Section 5 describes three issues of importance: (1) bandwidth selection (Sec. 5.1), (2) the efficiency and robustness of the sample mean when the population mean is of interest (Sec. 5.2), and numerical and theoretical insights into the partially linear model and the nature of our assumptions (Sec. 5.3). An interesting special case is, of course, the partially linear model when $\kappa_0$ is the population mean. For this problem, we show in Section 5.2 that, with no missing data, the sample mean is semiparametric efficient for canonical exponential families but not of course in general, thus extending and clarifying the results of Wang et al. (2004) that were specific to the Gaussian case.
Section 6 gives concluding remarks and results. All technical
results are given in the Appendix.
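The three steps (2)-(4) can be sketched in code for the simplest special case, the Gaussian partially linear model (1) with $\kappa_0$ the population mean. This is a minimal illustration under stated assumptions, not the general-likelihood implementation: in the Gaussian case the local likelihood (2) reduces to local-linear weighted least squares, so that $\hat\theta(\cdot,\beta)$ is a linear smoother applied to $Y - X^{\mathrm{T}}\beta$, and (3) reduces to least squares on smoother residuals. The function names, the Epanechnikov kernel choice, and the array layout are assumptions made for the example.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: a symmetric density with bounded support [-1, 1]."""
    return 0.75 * np.maximum(1.0 - u ** 2, 0.0)

def local_linear_smoother(z_eval, z_obs, delta, h):
    """Matrix S with theta_hat(z_eval) = S @ r for any working response r.

    Row k holds the local-linear intercept weights at z_eval[k]. Columns with
    delta == 0 get weight zero, so only complete cases contribute, as in (2).
    """
    d = z_obs[None, :] - z_eval[:, None]            # Z_i - z
    w = delta[None, :] * epanechnikov(d / h) / h    # delta_i * K_h(Z_i - z)
    s0, s1, s2 = w.sum(1), (w * d).sum(1), (w * d * d).sum(1)
    num = w * (s2[:, None] - d * s1[:, None])
    return num / (s0 * s2 - s1 ** 2)[:, None]

def fit_partially_linear(y, x, z, delta, h):
    """Profile estimate of beta and of theta(Z_i) in the Gaussian model (1)."""
    obs = delta == 1
    y0 = np.where(obs, y, 0.0)     # a missing Y never enters: its weight is 0
    S = local_linear_smoother(z, z, delta.astype(float), h)
    x_res, y_res = x - S @ x, y0 - S @ y0
    # Step (3): with Gaussian errors, profiling is least squares on residuals.
    beta = np.linalg.lstsq(x_res[obs], y_res[obs], rcond=None)[0]
    # Step (2) evaluated at beta_hat.
    theta = S @ (y0 - x @ beta)
    return beta, theta

def kappa_hat_mean(y, x, z, delta, h):
    """Step (4) with F{X, theta(Z), B} = X'beta + theta(Z), the population mean."""
    beta, theta = fit_partially_linear(y, x, z, delta, h)
    return float(np.mean(x @ beta + theta))
```

Here `x` is an n-by-p array, `z` and `y` are length-n arrays, and `delta` is a 0/1 array; entries of `y` with `delta == 0` may be NaN. The average in the last step runs over all individuals, observed or not, which is what distinguishes (4) from a complete-case mean.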
2. SEMIPARAMETRIC MODELS WITH
A SINGLE COMPONENT
2.1 Main Results
We benefit from the fact that the limiting expansions for $\hat B$ and $\hat\theta(\cdot)$ are essentially already well known, with the minor modification of incorporating the missing response indicators. Let $f(z)$ be the density function of $Z$, which is assumed to have bounded support and to be positive on that support. Let $\Omega(z) = f(z)E\{\delta L_{\theta\theta}(\cdot)|Z = z\}$. Let $L_{i\theta}(\cdot) = L_\theta\{Y_i,X_i,\theta_0(Z_i),B_0\}$ and so on. Then it follows from standard results (see the App. for more discussion) that as a minor modification of the work of Severini and Wong (1992),
$$\hat\theta(z,\hat B) - \theta_0(z) = (h^2/2)\theta_0^{(2)}(z) - n^{-1}\sum_{i=1}^n \delta_i K_h(Z_i - z)L_{i\theta}(\cdot)/\Omega(z) + \theta_B(z,B_0)(\hat B - B_0) + o_p(n^{-1/2}), \tag{5}$$
$$\hat B - B_0 = M_1^{-1} n^{-1}\sum_{i=1}^n \delta_i\epsilon_i + o_p(n^{-1/2}), \tag{6}$$
where
$$\theta_B(z,B_0) = -E\{\delta L_{B\theta}(\cdot)|Z = z\}/E\{\delta L_{\theta\theta}(\cdot)|Z = z\},$$
$$\epsilon_i = \{L_{iB}(\cdot) + L_{i\theta}(\cdot)\theta_B(Z_i,B_0)\},$$
$$M_1 = E(\delta\epsilon\epsilon^{\mathrm{T}}) = -E[\delta\{L_{BB}(\cdot) + L_{B\theta}(\cdot)\theta_B^{\mathrm{T}}(Z,B_0)\}],$$
and where, under regularity conditions, (5) is uniform in $z$. Conditions guaranteeing (6) are well known; see the Appendix.
Define
$$D_i(\cdot) = -L_{i\theta}(\cdot)\,\frac{E\{F_\theta(\cdot)|Z_i\}}{E\{\delta L_{\theta\theta}(\cdot)|Z_i\}}, \tag{7}$$
$$M_2 = E\{F_B(\cdot) + F_\theta(\cdot)\theta_B(Z,B_0)\}. \tag{8}$$
In the Appendix we show the following result.
Theorem 1. Suppose that $nh^4 \to 0$ and that (5) and (6) hold, the former uniformly in $z$. Suppose also that $Z$ has compact support, that its density is bounded away from 0 on that support, and that the kernel function also has a finite support. Then the estimator $\hat\kappa$ of $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ is semiparametric efficient in the sense of Newey (1990). In addition, as $n \to \infty$,
$$n^{1/2}(\hat\kappa - \kappa_0) = n^{-1/2}\sum_{i=1}^n \{F_i(\cdot) - \kappa_0 + M_2^{\mathrm{T}} M_1^{-1}\delta_i\epsilon_i + \delta_i D_i(\cdot)\} + o_p(1) \tag{9}$$
$$\Rightarrow \mathrm{Normal}[0,\; E\{F(\cdot) - \kappa_0\}^2 + M_2^{\mathrm{T}} M_1^{-1} M_2 + E\{\delta D^2(\cdot)\}]. \tag{10}$$
Remark 1. To obtain asymptotically correct inference about $\kappa_0$, there are two possible routes. The first is to use the bootstrap: Whereas Chen, Linton, and Van Keilegom (2003) only justified the bootstrap for estimating $B_0$, we conjecture that the bootstrap works for $\kappa_0$ as well. More formally, one requires only a consistent estimate of the limiting variance in (10). This is a straightforward exercise, although programming intense: One merely replaces all the expectations by sums in that expression and all the regression functions by kernel estimates.
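The bootstrap route mentioned in Remark 1 can be sketched as follows. This is an illustrative nonparametric bootstrap that resamples individuals with replacement and recomputes a user-supplied estimator; the function name and interface are assumptions for the example, and, as the remark notes, validity of the bootstrap for $\kappa_0$ is conjectured rather than proved.

```python
import numpy as np

def bootstrap_se(estimator, arrays, n_boot=500, seed=0):
    """Bootstrap standard error of estimator(*arrays).

    `arrays` is a list of equal-length arrays (for example y, x, z, delta);
    whole individuals are resampled so that each person's data stay together.
    """
    rng = np.random.default_rng(seed)
    n = len(arrays[0])
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # indices drawn with replacement
        stats[b] = estimator(*(a[idx] for a in arrays))
    return float(stats.std(ddof=1))
```

Any estimator of $\kappa_0$ that maps the data arrays to a scalar can be passed in; in practice the bandwidth would be held fixed or re-selected inside `estimator`, a design choice the source does not settle.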
Remark 2. Our analysis of semiparametric efficiency in the
sense of Newey (1990) has this outline. We first assume path-
wise differentiability of κ; see Section A.2.2 for a definition.
Working with this assumption, we derive the semiparametric
efficient score. With this score in hand, we then prove pathwise
differentiability. Details are given in the Appendix.
Remark 3. With a slight modification using a device introduced to semiparametric methods by Bickel (1982), Theorem 1 also holds for estimated bandwidths. We confine our discussion to bandwidths of order $n^{-1/3}$; see Section 5.1.2 for a reason. Write such bandwidths as $h_n = cn^{-1/3}$, where, following Bickel, the values for $c$ are allowed to take values in the set $U = a\{0,\pm 1,\pm 2,\ldots\}$, where $a$ is an arbitrary small number. We discretize bandwidths so that they take on values $cn^{-1/3}$ with $c \in U$. Denote estimators as $\hat\kappa(h_n)$ and note that for an arbitrary $c_*$ and an arbitrary fixed, deterministic sequence $c_n \to c_0$ for finite $c_0$, Theorem 1 shows that $n^{1/2}\{\hat\kappa(c_n n^{-1/3}) - \hat\kappa(c_0 n^{-1/3})\} = o_p(1)$ and that $n^{1/2}\{\hat\kappa(c_0 n^{-1/3}) - \hat\kappa(c_* n^{-1/3})\} = o_p(1)$. Hence, it follows from Bickel (1982, p. 653, just after eq. 3.7) that if $\hat h_n = \hat c_n n^{-1/3}$, with $\hat c \in U$, is an estimated bandwidth with the property that $\hat h_n = O_p(n^{-1/3})$, then $n^{1/2}\{\hat\kappa(\hat c_n n^{-1/3}) - \hat\kappa(c_* n^{-1/3})\} = o_p(1)$. Hence, Theorem 1 holds for these estimated bandwidths.
2.2 General Functions of the Response
and Double Robustness
It is important to consider estimation in problems where
κ0 can be constructed outside the model. Suppose that κ0=
E{G(Y)} and define F{X,θ0(Z),B0} = E{G(Y)|X,Z}. We will
discuss two estimators with the properties that (1) if there are
no missing response data, the semiparametric model is not used
and the estimator is consistent; and (2) under certain circum-
stances, the estimator is consistent if either the semiparametric
model is correct or if a model for the missing-data process is
correct.
Our motivating example discussed in Section 4 does not fall
into the category discussed in this section.
The two estimators are based on different constructions for
estimating the missing-data process. The first is based on a
nonparametric formulation for estimating pr(δ = 1|Z) = πmarg,
where the subscript indicates a marginal estimation of the prob-
ability that Y is observed. The second is based on a paramet-
ric formulation for estimating pr(δ = 1|Y,X,Z) = π(X,Z,ζ),
where ζ is an unknown parameter estimated by standard logis-
tic regression of δ on (X,Z).
The first estimator, similar to one defined by Wang et al. (2004) and efficient in the Gaussian partially linear model, can be constructed as follows. Estimate $\pi_{\mathrm{marg}}$ by local linear logistic regression of $\delta$ on $Z$, leading to the usual asymptotic expansion
$$\hat\pi_{\mathrm{marg}}(z) - \pi_{\mathrm{marg}}(z) = n^{-1}\sum_{j=1}^n \{\delta_j - \pi_{\mathrm{marg}}(Z_j)\}K_h(z - Z_j)/f_Z(z) + o_p(n^{-1/2}), \tag{11}$$
assuming that $nh^4 \to 0$. Then construct the estimator
$$\hat\kappa_{\mathrm{marg}} = n^{-1}\sum_{i=1}^n \left[\frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}G(Y_i) + \left\{1 - \frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}\right\} F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right].$$
The estimator has two useful properties: (1) if there are no missing data, it does not depend on the model and is, hence, consistent for $\kappa_0$; and (2) if observation of the response $Y$ depends only on $Z$, it is consistent even if the semiparametric model is not correct.
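The estimator just defined combines an inverse-probability-weighted term with a model-based correction term, and that algebraic form is easy to sketch. The function below is a hedged illustration: it takes the fitted observation probabilities and the fitted values $F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}$ as inputs rather than estimating them, and its name and argument names are assumptions for the example.

```python
import numpy as np

def augmented_ipw(g_y, f_hat, delta, pi_hat):
    """Average of delta/pi * G(Y) + (1 - delta/pi) * F_hat over all individuals.

    g_y    : G(Y_i); may be NaN where the response is missing
    f_hat  : fitted F{X_i, theta_hat(Z_i), B_hat} from the semiparametric model
    delta  : 0/1 indicators that Y_i is observed
    pi_hat : estimated probabilities that Y_i is observed
    """
    obs = delta == 1
    ratio = np.where(obs, 1.0 / pi_hat, 0.0)          # delta_i / pi_i
    ipw_term = np.where(obs, ratio * g_y, 0.0)        # missing Y contributes 0
    return float(np.mean(ipw_term + (1.0 - ratio) * f_hat))
```

With no missing data and `pi_hat` equal to 1, the second term vanishes and the function returns the sample mean of $G(Y)$, which is property (1) above.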
In a similar vein, the second estimate, also similar to another estimate of Wang et al. (2004), is given as
$$\hat\kappa = n^{-1}\sum_{i=1}^n \left[\frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}\right\} F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right].$$
This estimator has the double-robustness property that if either the parametric model $\pi(X,Z,\zeta)$ or the underlying semiparametric model for $\{B,\theta(\cdot)\}$ is correct, then $\hat\kappa$ is consistent and asymptotically normally distributed. Generally, the second terms in both $\hat\kappa_{\mathrm{marg}}$ and $\hat\kappa$ improve efficiency: They are also important for the double-robustness property of $\hat\kappa$.
If both models are correct, then the following results are obtained as a consequence of (5) and (6); see the Appendix for a sketch.
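Lemma 1. Define
$$M_{2,\mathrm{marg}} = E\left[\left\{1 - \frac{\delta}{\pi_{\mathrm{marg}}(Z)}\right\}\{F_B(\cdot) + F_\theta(\cdot)\theta_B(Z,B_0)\}^{\mathrm{T}}\right],$$
$$D_{i,\mathrm{marg}}(\cdot) = -L_{i\theta}(\cdot)\, E\left[\left\{1 - \frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F_{i\theta}(\cdot)\,\Big|\,Z_i\right] \Big/ E\{\delta L_{\theta\theta}(\cdot)|Z_i\}.$$
Then, to terms of order $o_p(1)$,
$$n^{1/2}(\hat\kappa_{\mathrm{marg}} - \kappa_0) \approx n^{-1/2}\sum_{i=1}^n \left[\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F_i(\cdot) - \kappa_0\right] + M_{2,\mathrm{marg}}M_1^{-1}n^{-1/2}\sum_{i=1}^n \delta_i\epsilon_i + n^{-1/2}\sum_{i=1}^n \delta_i D_{i,\mathrm{marg}}(\cdot). \tag{12}$$
Lemma 2. Define $\pi_\zeta(X,Z,\zeta) = \partial\pi(X,Z,\zeta)/\partial\zeta$. Assume that $n^{1/2}(\hat\zeta - \zeta) = n^{-1/2}\sum_{i=1}^n \psi_{i\zeta}(\cdot) + o_p(1)$ with $E\{\psi_\zeta(\cdot)|X,Z\} = 0$. Then, to terms of order $o_p(1)$,
$$n^{1/2}(\hat\kappa - \kappa_0) \approx n^{-1/2}\sum_{i=1}^n \left[\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\{G(Y_i) - \kappa_0\} + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}\{F_i(\cdot) - \kappa_0\}\right]. \tag{13}$$
Remark 4. The expansions (12) and (13) show that $\hat\kappa_{\mathrm{marg}}$ and $\hat\kappa$ are asymptotically normally distributed. One can show that the asymptotic variances are given as
$$V_{\kappa,\mathrm{marg}} = \mathrm{var}\left[\frac{\delta}{\pi_{\mathrm{marg}}(Z)}G(Y) + \left\{1 - \frac{\delta}{\pi_{\mathrm{marg}}(Z)}\right\}F(\cdot) + M_{2,\mathrm{marg}}M_1^{-1}\delta\epsilon + \delta D_{\mathrm{marg}}(\cdot)\right],$$
$$V_\kappa = \mathrm{var}\left[\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}F_i(\cdot)\right],$$
respectively, from which estimates are readily derived.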
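As one way to read "readily derived" for the second variance in Remark 4, the sketch below forms the sample variance of the summands with fitted quantities plugged in. It is an illustration under the remark's own assumption that both models are correct; the function name and inputs are hypothetical, and the extra terms appearing in the marginal-probability variance are not included.

```python
import numpy as np

def dr_variance_estimate(g_y, f_hat, delta, pi_hat):
    """Plug-in estimate of var[delta/pi * G(Y) + (1 - delta/pi) * F].

    Inputs follow the estimator itself: G(Y_i) (NaN allowed where missing),
    fitted F values, 0/1 observation indicators, and fitted probabilities.
    """
    obs = delta == 1
    ratio = np.where(obs, 1.0 / pi_hat, 0.0)
    terms = np.where(obs, ratio * g_y, 0.0) + (1.0 - ratio) * f_hat
    return float(np.var(terms, ddof=1))
```

An approximate standard error for the estimate of $\kappa_0$ would then be the square root of this value divided by $n$.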
Finally, we note that Claeskens and Carroll (2007) showed
that in general likelihood problems, if there is an omitted co-
variate, then under contiguous alternatives the effect on estima-
tors is to add an asymptotic bias, without changing the asymp-
totic variance.
3. SINGLE–INDEX MODELS
One means of dimension reduction is single-index modeling.
Single-index models can be viewed as a generalized version of
projection pursuit, in that only the most influential direction is
retained to keep the model tractable and to reduce dimension.
Since its introduction in Härdle and Stoker (1989), single-index
modeling has been widely studied and used. A comprehensive
summary of the model is given in Härdle, Müller, Sperlich, and
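Werwatz (2004). Let $Z = (R,S^{\mathrm{T}})^{\mathrm{T}}$ where $R$ is a scalar. We consider here the generalized partially linear single-index model (GPLSIM) of Carroll et al. (1997), namely, the exponential family (20) with $\eta(X,Z) = X^{\mathrm{T}}\beta_0 + \theta_0(Z^{\mathrm{T}}\alpha_0)$, where $\theta_0(\cdot)$ is an unknown function and for identifiability purposes $\|\alpha_0\| = 1$. Because identifiability requires that one of the components of $Z$ be a nontrivial predictor of $Y$, for convenience we will make the very small modification that one component of $Z$, what we call $R$, is a known nontrivial predictor of $Y$. The reason for making this modification can be seen in theorem 4 of Carroll et al. (1997) where the final limit distribution of the estimate of $\alpha_0$ has a singular covariance matrix. In addition, their main asymptotic expansion, given in their equation (A.12), is about the nonsingular transformation $(I - \alpha_0\alpha_0^{\mathrm{T}})(\hat\alpha - \alpha_0)$. With this modification, we write the model as
$$E(Y|X,Z) = C^{(1)}[c\{\eta(X,Z)\}] = \mu\{X^{\mathrm{T}}\beta_0 + \theta_0(R + S^{\mathrm{T}}\gamma_0)\}, \tag{14}$$
where $\gamma_0$ is unrestricted.
Carroll et al. (1997) used profile likelihood to estimate $B_0 = (\gamma_0,\beta_0)$ and $\theta_0(\cdot)$, although they presented no results concerning the estimate of $\phi_0$, their interest largely being in logistic regression where $\phi_0 = 1$ is known. Rewrite the likelihood function (20) as $L\{Y,X,\beta,\theta(R + S^{\mathrm{T}}\gamma),\phi\}$. Then, given $B = (\gamma^{\mathrm{T}},\beta^{\mathrm{T}})^{\mathrm{T}}$, they formed $U(\gamma) = R + S^{\mathrm{T}}\gamma$ and computed the estimate $\hat\theta\{u(\gamma),B\}$ by local likelihood of $Y$ on $\{X,U(\gamma)\}$ as in Severini and Staniswalis (1994), using the data with $\delta = 1$. Then they maximized $\sum_{i=1}^n \delta_i\log[L\{Y_i,X_i,\beta,\hat\theta(R_i + S_i^{\mathrm{T}}\gamma,B),\phi\}]$ in $B$ and $\phi$.
Our goal is to estimate $\kappa_{\mathrm{SI}} = E[F\{X,\theta_0(R + S^{\mathrm{T}}\gamma_0),\beta_0,\phi_0\}]$. Our proposed estimate is $\hat\kappa_{\mathrm{SI}} = n^{-1}\sum_{i=1}^n F\{X_i,\hat\theta(R_i + S_i^{\mathrm{T}}\hat\gamma,\hat B),\hat\beta,\hat\phi\}$.
Our main result is as follows. First, define $U = R + S^{\mathrm{T}}\gamma_0$ and
$$\mathcal{G} = D_\phi(Y,\phi_0) - [Yc\{X^{\mathrm{T}}\beta_0 + \theta_0(U)\} - C\{c(\cdot)\}]/\phi_0^2.$$
Also define $\Lambda = \{S^{\mathrm{T}}\theta_0^{(1)}(U),X^{\mathrm{T}}\}^{\mathrm{T}}$, $\rho_\ell(\cdot) = \{\mu^{(1)}(\cdot)\}^\ell/V(\cdot)$, and $\epsilon = [Y - \mu\{X^{\mathrm{T}}\beta_0 + \theta_0(U)\}]\rho_1\{X^{\mathrm{T}}\beta_0 + \theta_0(U)\}$. Define $N_i = \Lambda_i - [E\{\delta\rho_2(\cdot)|U_i\}]^{-1}E\{\delta_i\Lambda_i\rho_2(\cdot)|U_i\}$ and $Q = E\{\delta NN^{\mathrm{T}}\rho_2(\cdot)\}$. Make the further definitions $F_\beta(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\beta_0$, $F_\phi(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\phi_0$, and $F_\theta(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\theta_0(U)$. Also define
$$J(U) = [E\{\delta\rho_2(\cdot)|U\}]^{-1}E\{F_\theta(\cdot)|U\},$$
$$D = \begin{pmatrix} E\{F_\theta(\cdot)\theta^{(1)}(U)S\} - E(F_\theta(\cdot)[E\{\delta\rho_2(\cdot)|U\}]^{-1}\theta^{(1)}(U)E\{\delta S\rho_2(\cdot)|U\}) \\ E\{F_\beta(\cdot)\} - E(F_\theta(\cdot)[E\{\delta\rho_2(\cdot)|U\}]^{-1}E\{\delta X\rho_2(\cdot)|U\}) \end{pmatrix}.$$
Then we have the following result regarding the asymptotic distribution of $\hat\kappa_{\mathrm{SI}}$.
Theorem 2. Assume that $(Y_i,\delta_i,X_i,Z_i)$, $i = 1,2,\ldots,n$, are iid and that the conditions in Carroll et al. (1997) hold, in particular, that $nh^4 \to 0$. Then
$$n^{1/2}(\hat\kappa_{\mathrm{SI}} - \kappa_{\mathrm{SI}}) = n^{-1/2}\sum_{i=1}^n [F\{X_i,\theta_0(U_i),\beta_0,\phi_0\} - \kappa_{\mathrm{SI}} + D^{\mathrm{T}}Q^{-1}\delta_i N_i\epsilon_i + \delta_i J(U_i)\epsilon_i + \delta_i\mathcal{G}_i E\{F_\phi(\cdot)\}/E(\delta\mathcal{G}^2)] + o_p(1) \Rightarrow \mathrm{Normal}(0,V), \tag{15}$$
where $V = E[F\{X,\theta_0(U),\beta_0,\phi_0\} - \kappa_{\mathrm{SI}}]^2 + D^{\mathrm{T}}Q^{-1}D + \mathrm{var}\{\delta J(U)\epsilon\} + E(\delta\mathcal{G}^2)[E\{F_\phi(\cdot)\}]^2/\{E(\delta\mathcal{G}^2)\}^2$. Further, $\hat\kappa_{\mathrm{SI}}$ is semiparametric efficient.
4. MOTIVATING EXAMPLE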
4.1 Introduction
There is considerable interest in understanding the distribu-
tion of dietary intake in various populations. For example, as
obesity rates continue to rise in the United States (Flegal, Car-
roll, Ogden, and Johnson 2002), the demand for information
about diet and nutrition is increasing. Information on dietary
intake has implications for establishing population norms, con-
ducting research, and making public policy decisions (Woteki
2003).
We wish to emphasize that there are no missing response data
in this example. We also emphasize that the problem is vastly
different from simply estimating the population mean using a
Gaussian partially linear model. The strength of our approach
is that once we have proposed a semiparametric model, then
our methodology, asymptotics, and semiparametric efficiency
results are readily employed.
This article was motivated by the analysis of the Eating at
America’s Table Study (EATS) (Subar et al. 2001), where esti-
mating the distribution of the consumption of episodically con-
sumed foods is of interest. The data consist of four 24-hour re-
calls over the course of a year as well as the National Cancer
Institute’s (NCI) dietary history questionnaire (DHQ), a particular version of a food frequency questionnaire (FFQ; see Willett
et al. 1985 and Block et al. 1986). The goal is to estimate the
distribution of usual intake, defined as the average daily intake
of a dietary component by an individual in a fixed time period,
a year in the case of EATS. There were n = 886 individuals in
the dataset.
When the responses are continuous random variables, this
is a classical problem of measurement error, with a large
literature. However, little of the literature is relevant to episodi-
cally consumed foods, as we now describe. Consider, for ex-
ample, consumption of red meat, dark-green vegetables, and
deep-yellow vegetables, all of interest in nutritional surveil-
lance. In the EATS data, 45% of the 24-hour recalls reported
no red-meat consumption. In addition, 5.5% of the individu-
als reported no red-meat consumption on any of the four sepa-
rate 24-hour recalls: For deep-yellow vegetables these numbers
are 63% and 20%, respectively, while for dark-green vegetables
the numbers are 78% and 46%, respectively. Clearly, methods
aimed at understanding usual intakes for continuous data are
inappropriate for episodically consumed foods with so many
zero-reported intakes.
4.2 Model
To handle episodically consumed foods, two-part models
have been developed (Tooze, Grunwald, and Jones 2002).
These are basically zero-inflated repeated-measures examples.
Our methods are applicable to such problems when the covari-
ate Z is evaluated only once for each subject, as it is in our
example.
We describe here a simplification of this approach, used to il-
lustrate our methodology. On each individual, we measure age
and gender, the collection being what we call R. We also ob-
serve energy (calories) as measured by the DHQ, the logarithm
of which we call Z. The reader should note that Z is evalu-
ated only once per individual, and, hence, while there are re-
peated measures on the responses, there are no repeated mea-
sures on Z: θ0(Z) occurs only once in the likelihood function,
and our methodology applies.
Let $X = (R,Z)$. The response data for an individual $i$ consist of four 24-hour recalls of red-meat consumption. Let $\Delta_{ij} = 1$ if red meat is reported consumed on the $j$th 24-hour recall for $j = 1,\ldots,4$. Let $Y_{ij}$ be the product of $\Delta_{ij}$ and the logarithm of reported red-meat consumption, with the convention that $0\log(0) = 0$. Then the response data are $Y_i = (\Delta_{ij},Y_{ij})_{j=1}^4$.
4.2.1 Modeling the Probability of Zero Response. The first part of the model is whether the subject reports red-meat consumption. We model this as a repeated-measures logistic regression, so that
$$\mathrm{pr}(\Delta_{ij} = 1|R_i,Z_i,U_{i1}) = H(\beta_0 + X_i^{\mathrm{T}}\beta_1 + U_{i1}), \tag{16}$$
where $H(\cdot)$ is the logistic distribution function and $U_{i1} = \mathrm{Normal}(0,\sigma_{u1}^2)$ is a person-specific random effect. Note that, for simplicity, we have modeled the effect of energy consumption as linear, because in the data there is little hint of nonlinearity.
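4.2.2 Modeling Positive Responses. The second part of the model consists of a distribution of the logarithm of red-meat consumption on days when consumption is reported, namely,
$$[Y_{ij}|\Delta_{ij} = 1,R_i,Z_i,U_{i2}] = \mathrm{Normal}\{R_i^{\mathrm{T}}\beta_2 + \theta(Z_i) + U_{i2},\sigma^2\}, \tag{17}$$
where $U_{i2} = \mathrm{Normal}(0,\sigma_{u2}^2)$ is a person-specific random effect, which we take to be independent of $U_{i1}$. Note that (17) means that the nonzero $Y$ data within an individual marginally have the same mean $R_i^{\mathrm{T}}\beta_2 + \theta(Z_i)$, variance $\sigma^2 + \sigma_{u2}^2$, and common covariance $\sigma_{u2}^2$.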
4.2.3 Likelihood Function. The collection of parameters is $B$, consisting of $\beta_0$, $\beta_1$, $\beta_2$, $\sigma_{u1}^2$, $\sigma_{u2}^2$, and $\sigma^2$. The log-likelihood function $L(\cdot)$ is readily computed with numerical integration as follows:
$$\exp\{L(\cdot)\} = \frac{1}{\sigma_{u1}}\int \phi\left(\frac{u_1}{\sigma_{u1}}\right)\prod_{j=1}^4 \{H(\beta_0 + X^{\mathrm{T}}\beta_1 + u_1)\}^{\Delta_{ij}}\{1 - H(\beta_0 + X^{\mathrm{T}}\beta_1 + u_1)\}^{1-\Delta_{ij}}\,du_1$$
$$\times\ \sigma_{u2}^{-1}\sigma^{-\sum_j \Delta_{ij}}\int \phi\left(\frac{u_2}{\sigma_{u2}}\right)\prod_{j=1}^4 \left[\phi\left(\frac{Y_{ij} - \{R_i^{\mathrm{T}}\beta_2 + \theta(Z_i) + u_2\}}{\sigma}\right)\right]^{\Delta_{ij}}du_2.$$
Of course, the second numerical integral is not necessary, because the integration can be done analytically.
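The likelihood above is a product of two one-dimensional normal integrals, so one individual's contribution can be sketched with Gauss-Hermite quadrature. The choice of quadrature rule, the number of nodes, and all function and argument names below are assumptions for the example; the source says only that numerical integration is used. The linear predictors are passed in as precomputed scalars.

```python
import math
import numpy as np

# Nodes and weights for integrals against exp(-x^2); 40 points is an arbitrary choice.
NODES, WEIGHTS = np.polynomial.hermite.hermgauss(40)

def normal_expectation(func, sd):
    """E[func(U)] for U ~ Normal(0, sd^2), by Gauss-Hermite quadrature."""
    values = func(math.sqrt(2.0) * sd * NODES)
    return float(np.sum(WEIGHTS * values) / math.sqrt(math.pi))

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def std_normal_pdf(t):
    return np.exp(-0.5 * t ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood_one_subject(consumed, log_amount, lin_prob, lin_mean,
                               sigma_u1, sigma_u2, sigma):
    """Log-likelihood of one person's recalls under the two-part model.

    consumed   : 0/1 indicators over the recalls
    log_amount : log reported intake (ignored where consumed == 0)
    lin_prob   : beta0 + X'beta1, the logistic linear predictor
    lin_mean   : R'beta2 + theta(Z), the mean of log intake on consumption days
    """
    consumed = np.asarray(consumed)
    pos = np.asarray(log_amount, dtype=float)[consumed == 1]
    k, m = int(consumed.sum()), consumed.size

    def part_one(u1):       # product of Bernoulli terms given u1
        p = expit(lin_prob + u1)
        return p ** k * (1.0 - p) ** (m - k)

    def part_two(u2):       # product of normal densities on consumption days
        resid = (pos[:, None] - lin_mean - u2[None, :]) / sigma
        return np.prod(std_normal_pdf(resid) / sigma, axis=0)

    return (math.log(normal_expectation(part_one, sigma_u1))
            + math.log(normal_expectation(part_two, sigma_u2)))
```

As the text notes, the second integral has a closed form (a multivariate normal density with compound-symmetric covariance), which gives a convenient check on the quadrature.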
4.2.4 Defining Usual Intake at the Individual Level. Noting from (17) that reported intake on days of consumption follows a log-normal distribution, the usual intake for an individual is defined as
$$G\{X,U_1,U_2,B,\theta(Z)\} = H(\beta_0 + X^{\mathrm{T}}\beta_1 + U_1) \times \exp\{R^{\mathrm{T}}\beta_2 + \theta(Z) + U_2 + \sigma^2/2\}. \tag{18}$$
The goal is to understand the distribution of $G\{X,U_1,U_2,B,\theta(Z)\}$ across a population. In particular, for arbitrary $c$ we wish to estimate $\mathrm{pr}[G\{X,U_1,U_2,B,\theta(Z)\} > c]$. Define $F\{X,B,\theta(Z)\} = \mathrm{pr}[G\{X,U_1,U_2,B,\theta(Z)\} > c|X,Z]$, a quantity that can be computed by numerical integration. Then $\kappa_0 = E[F\{X,B,\theta(Z)\}]$ is the percentage of the population whose long-term reported daily average consumption of red meat exceeds $c$.
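The conditional probability $F\{X,B,\theta(Z)\}$ can be computed by integrating over the two random effects in (18). One hedged way to do this: given $U_1$, the event $G > c$ is an event about the normal variable $U_2$ alone, so its probability is a normal cdf, and only the integral over $U_1$ needs quadrature. The quadrature rule, node count, and names below are assumptions for the example.

```python
import math
import numpy as np

NODES, WEIGHTS = np.polynomial.hermite.hermgauss(60)   # 60 nodes: arbitrary choice
_erf = np.vectorize(math.erf)

def std_normal_cdf(t):
    return 0.5 * (1.0 + _erf(t / math.sqrt(2.0)))

def exceedance_probability(c, lin_prob, lin_mean, sigma2, sigma_u1, sigma_u2):
    """pr[G > c | X, Z] for usual intake G as defined in (18).

    lin_prob : beta0 + X'beta1
    lin_mean : R'beta2 + theta(Z)
    sigma2   : within-person variance sigma^2 of log intake
    """
    u1 = math.sqrt(2.0) * sigma_u1 * NODES
    log_h = -np.log1p(np.exp(-(lin_prob + u1)))        # log H(lin_prob + u1)
    # Given u1, G > c  iff  U2 > log(c) - log H - lin_mean - sigma2 / 2.
    t = (lin_mean + 0.5 * sigma2 + log_h - math.log(c)) / sigma_u2
    return float(np.sum(WEIGHTS * std_normal_cdf(t)) / math.sqrt(math.pi))
```

Averaging this quantity over the individuals in the sample, with estimated parameters and $\hat\theta(Z_i)$ plugged in, gives an estimate of $\kappa_0$ of the form (4).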
4.3 Bias in Naive Estimates, and a Simulation Study
We emphasize that the distribution of mean intake cannot be estimated consistently by the simple device of computing the sample percentage of the observed 24-hour recalls that exceed $c$, and, as a consequence, going through the model-fitting process is actually necessary. To see this, suppose only one 24-hour recall per person was computed and the percentage of these 24-hour recalls exceeding $c$ was computed. In large samples, this percentage converges to
$$\kappa_{24\mathrm{hr}} = E[H(\beta_0 + X^{\mathrm{T}}\beta_1 + U_1) \times \Phi(\{R^{\mathrm{T}}\beta_2 + \theta(Z) - \log(c)\}/(\sigma^2 + \sigma_2^2)^{1/2})].$$
In contrast, for $\sigma_2 > 0$,
$$\kappa_0 = E\{\Phi([R^{\mathrm{T}}\beta_2 + \theta(Z) + \sigma^2/2 - \log\{c/H(\beta_0 + X^{\mathrm{T}}\beta_1 + U_1)\}]/\sigma_2)\}.$$
As the number of replicates m of the 24-hour recall ap-
proaches ∞, the percentage κm,24hrof the means of the 24-hour
recalls that exceed c → κ0, so we would expect that the fewer
the replicates, the less our estimate agrees with the sample ver-
sion of κm,24hr, a phenomenon observed in our data; see below.
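The contrast between the naive percentage and $\kappa_0$ can be illustrated with a small simulation. The sketch below is not the simulation study reported in this section: it has no covariates, the parameter values are hypothetical (chosen loosely near the magnitudes reported in Section 4.4), and the threshold is set as a multiple of the median usual intake purely for convenience.

```python
import numpy as np

def simulate_exceedance(n_people, n_recalls, c_multiple, seed=0):
    """Return (true, naive) exceedance probabilities in a toy two-part model.

    true  : fraction of people whose usual intake (18) exceeds c
    naive : fraction of people whose mean of n_recalls 24-hour recalls exceeds c
    """
    rng = np.random.default_rng(seed)
    lin_prob, lin_mean = 0.2, 1.0                    # hypothetical linear predictors
    sigma, sigma_u1, sigma_u2 = 0.87, 0.81, 0.21     # hypothetical standard deviations
    u1 = sigma_u1 * rng.normal(size=n_people)
    u2 = sigma_u2 * rng.normal(size=n_people)
    p_consume = 1.0 / (1.0 + np.exp(-(lin_prob + u1)))
    usual = p_consume * np.exp(lin_mean + u2 + 0.5 * sigma ** 2)
    shape = (n_people, n_recalls)
    consumed = rng.uniform(size=shape) < p_consume[:, None]
    log_amount = lin_mean + u2[:, None] + sigma * rng.normal(size=shape)
    recalls = consumed * np.exp(log_amount)          # zero on non-consumption days
    c = c_multiple * np.median(usual)
    return float(np.mean(usual > c)), float(np.mean(recalls.mean(axis=1) > c))
```

Calling the function with `n_recalls` equal to 1, 2, and 4 at a fixed upper-tail threshold shows how the naive percentage moves toward the true one as replicates are added, in line with the argument above.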
To see this numerically, we ran the following simulation study. Gender, age, and the DHQ were kept the same as in the EATS. The parameters $(\beta_0,\beta_1,\beta_2,\sigma^2,\sigma_1^2,\sigma_2^2)$ were the same as our estimated values; see below. The function $\theta(\cdot)$ was roughly in accord with our estimated function, for simplicity, being quadratic in the logarithm of the DHQ, standardized to have minimum .0 and maximum 1.0, with intercept, slope, and quadratic parameters being .50, 1.50, and −.75, respectively. The true survival function, that is, 1 − the cdf, was computed analytically, while the survival functions for the mean of two 24-hour recalls and the mean of four 24-hour recalls were computed by 1,000 simulated datasets.
The results are given in Figure 1, where the bias from not using a model is evident.
We used our methods with a nonparametrically estimated function, a bandwidth $h = .30$, and the Epanechnikov kernel function. We generated 300 datasets, with results displayed in Figure 2. The mean over the simulation was almost exactly the correct function, not surprising given that the sample size is large ($n = 886$). In Figure 2 we also display a 90% confidence range from the simulated datasets, indicating that in the EATS data at least, the results of our approach are relatively accurate.
4.4 Data Analysis
We standardized age to have mean 0 and variance 1. In the logistic part of the model, the intercept was estimated as −8.15, with the coefficients for (gender, age, DHQ) = (.13, .14, 1.09). The random-effect variance was estimated as $\hat\sigma_1^2 = .66$. In the continuous part of the model, we used bandwidths ranging from .05 to .40, with little change in any of the estimates, as described in more detail in Section 5.1. With a bandwidth $h = .30$, our estimates were $\hat\sigma^2 = .76$, $\hat\sigma_2^2 = .043$, and the coefficients for gender and age were −.25 and .02, respectively. The coefficient for the person-specific random effect $\sigma_2^2$ appears intrinsic to the data: We used other methods such as mixed models with polynomial fits and obtained roughly the same answers.
Figure 1. Results of the Simulation Study Meant to Mimic the EATS Study. All results are averages over 1,000 simulated datasets. Shown are the mean of the semiparametric estimator of the survival curve, which is almost identical to the true survival curve; the empirical survival function of the mean of two 24-hour recalls from 1,000 simulated datasets; and the empirical survival function of the mean of four 24-hour recalls from 1,000 simulated datasets.
We display the computed survival function in Figure 3. Dis-
played there are our method, along with the empirical survival
functions for the mean of the first two 24-hour recalls and the
mean of all four 24-hour recalls. While these are biased, it is
interesting to note that using the mean of only two 24-hour re-
calls is more different from our method than using the mean of
four 24-hour recalls, which is expected as described previously.
The similarity of Figures 1 and 3 is striking, mainly indicating
that naive approaches, such as using the mean of two 24-hour
recalls, can result in badly biased estimates of κ0.
Figure 2. Results of the Simulation Study Meant to Mimic the EATS
Study. Plotted is the mean survival function for 300 simulated datasets,
along with the 90% pointwise confidence intervals. The mean fitted func-
tion is almost exact.
Figure 3. Results From the EATS Example. Plotted are estimates of
the survival function (1 − the cdf) of usual intake of red meat. The solid
line is the semiparametric method described in Section 4. The dotted
line is the empirical survival function of the mean of the first two 24-hour
recalls per person, while the dashed line is the survival function of the mean
of all the 24-hour recalls per person.
5. BANDWIDTH SELECTION, THE PARTIALLY LINEAR
MODEL, AND THE SAMPLE MEAN
5.1 Bandwidth Selection
5.1.1 Background. We have used a standard first-order kernel density function, that is, one with mean 0 and positive variance. With this choice, in Theorem 1 we have assumed that the bandwidth satisfies $nh^4 \to 0$: for estimation of the population mean in the partially linear model. In contrast, if one were interested only in $B_0$, then it is well known that by using profile likelihood the usual bandwidth order $h \sim n^{-1/5}$ is acceptable, and off-the-shelf bandwidth selection techniques yield an asymptotically normal limit distribution.
The reason for the technical need for undersmoothing is the inclusion of $\theta_0(\cdot)$ in $\kappa_0$. For example, suppose that $\kappa_0 = E\{\theta_0(Z)\}$. Then it follows from (5) that $\hat\kappa - \kappa_0 = O_p(h^2 + n^{-1/2})$. Thus, in order for $n^{1/2}(\hat\kappa - \kappa_0) = O_p(1)$, we require that $nh^4 = O_p(1)$. The additional restriction that $nh^4 \to 0$ merely removes the bias term entirely.
Note that $\kappa_0$ is not a parameter in the model, being a mixture of the parametric part $B_0$, the nonparametric part $\theta_0(\cdot)$, and the joint distribution of (X,Z). Thus, it does not appear that $\kappa_0$ can be estimated by profiling ideas.
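As a worked check of the rate conditions in this subsection, the display below substitutes two bandwidth rates into the bound quoted from (5). The constant $c > 0$ is an illustrative placeholder, not a quantity defined in the article.

```latex
\begin{align*}
n^{1/2}(\hat\kappa - \kappa_0) &= n^{1/2}\, O_p(h^2 + n^{-1/2}) = O_p(n^{1/2}h^2 + 1),\\
n^{1/2}h^2 = O(1) &\iff nh^4 = O(1),\\
h = c\,n^{-1/5}: \quad nh^4 &= c^4 n^{1/5} \to \infty,\\
h = c\,n^{-1/3}: \quad nh^4 &= c^4 n^{-1/3} \to 0 .
\end{align*}
```

Under this reading, the rate $n^{-1/5}$ that is optimal for the function estimate leaves a bias that the $n^{1/2}$ scaling inflates, whereas $n^{-1/3}$ satisfies the undersmoothing condition.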
5.1.2 Optimal Estimation. As seen in Theorem 1, the asymptotic distribution of $n^{1/2}(\hat\kappa - \kappa_0)$ is unaffected by the bandwidth, at least to first order. In Section 5.1.3 we give intuitive and numerical evidence of the lack of sensitivity to the bandwidth choice; see also Section 5.3 for further numerical evidence. In Section 5.1.4 we describe three different, simple practical methods for bandwidth selection in this problem, all of which work quite well in our simulations and example.
Because first-order calculations do not get squarely at the choice of bandwidth, other than to suggest that it is not particularly crucial, an alternative theoretical device is to do second-order calculations. Define $\eta(n,h) = n^{1/2}h^2 + (n^{1/2}h)^{-1}$. In a problem similar to ours, Sepanski, Knickerbocker, and Carroll (1994) showed that the variance of linear combinations of the estimate of $B_0$ has a second-order expansion as follows. Suppose we want to estimate $\xi^T B_0$. Then, for constants $(a_1,a_2)$,
$$n^{1/2}(\xi^T \hat{B} - \xi^T B_0) = V_n + o_p\{\eta(n,h)\},$$
$$\mathrm{cov}(V_n) = \text{constant} + \{a_1 n^{1/2}h^2 + a_2 (h n^{1/2})^{-1}\}^2.$$
This means that the optimal bandwidth is on the order of $h = cn^{-1/3}$ for a constant c depending on $(a_1,a_2)$, which, in turn, depend on the problem, that is, on the distribution of (Y,X,Z) as well as on $B_0$ and $\theta_0(\cdot)$. In their practical implementation, translated from the Gaussian kernel function to our Epanechnikov kernel function, Sepanski et al. (1994) suggested the following device, namely, that if the optimal bandwidth for estimating $\theta_0(\cdot)$ is $h_o = cn^{-1/5}$, then one should use the correct-order bandwidth $h = cn^{-1/3}$. They also did sensitivity analysis, for example, $h = (1/2)cn^{-1/3}$, but found little change in their simulations. One of our three methods of practical bandwidth selection is exactly this one.
A problem not within our framework but carrying a similar flavor was considered by Powell and Stoker (1996) and Newey, Hsieh, and Robins (2004), namely, the estimation of the weighted average derivative $\kappa_{AD} = E\{Y\theta_0^{(1)}(Z)\}$. As done by Sepanski et al. (1994), Powell and Stoker (1996) showed that the optimal bandwidth constructed from second-order calculations is an undersmoothed bandwidth. Newey et al. (2004) suggested that a simple device of choosing the bandwidth is to choose something optimal when using a standard second-order kernel function but to then undersmooth, in effect, by using a higher order kernel such as the twicing kernel. This is our second bandwidth selection method described in Section 5.1.4. Like the first, it appears to be an effective means of eliminating the bias term.
In our problem, the article by Sepanski et al. (1994) is more relevant. Preliminary calculations based on the basic tools in that article suggest that for our problem, the optimal bandwidth is also of order $n^{-1/3}$. We intend to pursue these very calculations in another article.
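The $n^{-1/3}$ rate can be read off the second-order expansion of Sepanski et al. (1994) directly. The following is only a sketch, under the added assumption that $a_1$ and $a_2$ are nonzero and share the same sign; the article itself says only that they are constants.

```latex
\begin{align*}
g(h) &= a_1 n^{1/2} h^2 + a_2 (h n^{1/2})^{-1}, \qquad
\operatorname{cov}(V_n) = \text{constant} + g(h)^2,\\
g'(h) &= 2 a_1 n^{1/2} h - a_2 n^{-1/2} h^{-2} = 0
\;\Longrightarrow\; h^3 = \frac{a_2}{2 a_1 n}
\;\Longrightarrow\; h = \Bigl(\frac{a_2}{2 a_1}\Bigr)^{1/3} n^{-1/3}.
\end{align*}
```

If instead $a_1$ and $a_2$ had opposite signs, $g(h)$ would vanish at $h^3 = -a_2/(a_1 n)$, which is again of order $n^{-1/3}$, so the rate does not depend on the sign assumption.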
5.1.3 Lack of Sensitivity to Bandwidth. We have used the term technical need for undersmoothing because that is what it really is. In practice, as Theorem 1 states, the asymptotic distribution of $\hat\kappa$ is unaffected by bandwidth choice for very broad ranges of bandwidths. This is totally different from what happens with estimation of the function $\theta_0(\cdot)$, where bandwidth selection is typically critical in practice, and this is seen in theory through the usual bias–variance tradeoff.
In practice, we expect little effect of the bandwidth selection on estimation of $B_0$, and even less effect on estimation of $\kappa_0$. The reason is that broad ranges of bandwidths lead to no asymptotic effect on the distribution of $\hat{B}$. The extra amount of smoothing inherent in the summation in (4) should mean that $\hat\kappa$ will be even less sensitive to the bandwidth, the so-called double-smoothing phenomenon.
To see this issue, consider the simulation in Wang et al. (2004). They set X and Z to be independent, with X = Normal(1,1) and Z = Uniform[0,1]. In the partially linear
130 Journal of the American Statistical Association, March 2007
model, they set $B_0 = 1.5$, $\epsilon$ = Normal(0,1), and $\theta_0(z) = 3.2z^2 - 1$. They used the kernel function $(15/16)(1 - z^2)^2 I(|z| \le 1)$, and they fixed the bandwidth to be $h = n^{-2/3}$, which at least asymptotically is very great undersmoothing, because $h \sim n^{-1/3}$ is already acceptable and something like $nh^2/\log(n) \to \infty$ is usually required. In their case 3, they used effective sample sizes for complete data of 18, 36, and 60, with corresponding bandwidths .146, .092, and .065, respectively.
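The bandwidths quoted above follow directly from the rule $h = n^{-2/3}$. A minimal sketch of that arithmetic is given here; the function name `wang_bandwidth` is ours, chosen for illustration only.

```python
def wang_bandwidth(n: int) -> float:
    """Bandwidth h = n^(-2/3), the fixed rule the text attributes to Wang et al. (2004)."""
    return n ** (-2.0 / 3.0)


# Effective sample sizes quoted in the text for complete data.
for n in (18, 36, 60):
    print(f"n = {n:2d}: h = {wang_bandwidth(n):.3f}")

# Ratios of the smallest and largest bandwidths tried in the rerun (n = 60)
# to the bandwidth implied by the n^(-2/3) rule.
h60 = wang_bandwidth(60)
print(f"0.02 / h = {0.02 / h60:.2f}, 0.14 / h = {0.14 / h60:.2f}")
```

The two printed ratios correspond to the statements that the smallest bandwidth in the rerun is less than 1/3 of, and the largest more than double, the one used by Wang et al. (2004).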
We reran the simulation of Wang et al. (2004), with com-
plete response data and n = 60. We used bandwidths .02, .06,
.10, and .14, ranging from a very small bandwidth, less than
1/3 that used by Wang et al. (2004), to a larger bandwidth,
more than double that used. As another perspective, if one sets
h = σzn−c, where σzis the standard deviation of Z, then the
bandwidths used are equivalent to c = .73, .46, .34, and .26.
In other words, a bandwidth here of h = .02 is very great un-
dersmoothing, while even h = .14 satisfies the theoretical con-
straint on the bandwidth.
In Figure 4 we plot the results for a single dataset, where, as
in Wang et al. (2004), interest lies in estimating κ0= E(Y). As
is obvious from this figure, the bandwidth choice is very impor-
tant for estimation of the function, but trivially unimportant for
estimation of κ0, the estimate of which ranged from 1.818 to
1.828.
In Figure 5 we plot the mean estimated functions from 100
simulated datasets. Again, the bandwidth matters a great deal
for estimating the function θ0(·). Again, too, the bandwidth
matters hardly at all for estimating κ0. Thus, for estimating κ0,
the mean estimates across the bandwidths range from 1.513 to
1.526, and the standard deviations of the estimates range from
.249 to .252. There is somewhat more effect of bandwidth on
the estimate of B0: For h ≥ .06, there is almost no effect, but
choosing h = .02 results in a 50% increase in standard devia-
tion.
In other words, as expected by theory and intuition, band-
width selection has little effect on the estimate of B0, except
when the bandwidth is much too small, and very little effect
on the estimation of κ0= E(Y). Similar results occur when one
looks at the variance of the errors as the parameter, and κ0is
the population variance.
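The lack of sensitivity described in this subsection can be explored with a small simulation. The sketch below is not the authors' implementation: it assumes the simulation design quoted above, replaces local linear profile likelihood by a simpler local constant (Nadaraya-Watson) smoother, and uses names (`nw_smooth`, `plm_mean`, `simulate`) and a random seed that are our own choices.

```python
import random


def biweight(u):
    """Quartic kernel (15/16)(1 - u^2)^2 on |u| <= 1, as quoted in the text."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def nw_smooth(z, v, h):
    """Local constant (Nadaraya-Watson) fit of v on z, evaluated at each z[i]."""
    fitted = []
    for zi in z:
        w = [biweight((zj - zi) / h) for zj in z]  # weight at zj = zi is 15/16 > 0
        fitted.append(sum(wj * vj for wj, vj in zip(w, v)) / sum(w))
    return fitted


def plm_mean(y, x, z, h):
    """Return (beta_hat, kappa_hat) in the model Y = X * beta + theta(Z) + error."""
    # Step 1: residuals of Y and X after smoothing each on Z.
    ry = [a - b for a, b in zip(y, nw_smooth(z, y, h))]
    rx = [a - b for a, b in zip(x, nw_smooth(z, x, h))]
    # Step 2: least squares of the Y residuals on the X residuals.
    beta = sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)
    # Step 3: smooth the partial residuals Y - X * beta on Z to estimate theta.
    theta = nw_smooth(z, [yi - beta * xi for yi, xi in zip(y, x)], h)
    # Step 4: average the fitted values to estimate kappa_0 = E(Y).
    kappa = sum(beta * xi + ti for xi, ti in zip(x, theta)) / len(y)
    return beta, kappa


def simulate(n=60, seed=1):
    """One dataset from the design quoted in the text (seed is arbitrary)."""
    rng = random.Random(seed)
    z = [rng.random() for _ in range(n)]
    x = [rng.gauss(1.0, 1.0) for _ in range(n)]
    y = [1.5 * xi + 3.2 * zi ** 2 - 1.0 + rng.gauss(0.0, 1.0)
         for xi, zi in zip(x, z)]
    return y, x, z


y, x, z = simulate()
print("sample mean of Y:", round(sum(y) / len(y), 3))
for h in (0.02, 0.06, 0.10, 0.14):
    beta_hat, kappa_hat = plm_mean(y, x, z, h)
    print(f"h = {h:.2f}: beta_hat = {beta_hat:.3f}, kappa_hat = {kappa_hat:.3f}")
```

Because this sketch uses a different smoother and a different random draw, its numbers should not be expected to reproduce those reported in Figures 4 and 5; it is meant only to show how the estimate of the mean can be compared across bandwidths.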
5.1.4 Bandwidth Selection. As described in Section 5.1.3, bandwidth selection is not a vital issue for estimating $\kappa_0$: Of course, it is vital for estimating $\theta_0(\cdot)$. Effectively, what this means is that the real need is simply to get bandwidths that satisfy the technical assumption of undersmoothing but are not too ridiculously small: A precise target is often unnecessary. In addition, because the asymptotic distribution of $\hat\kappa$ does not depend on the bandwidth, simple first-order methods of the type that are used in bandwidth selection for function estimation are not possible. Thus, in our example, we used three different methods, all of which gave answers that were as nearly identical as in the simulation of Wang et al. (2004).
All the methods are based on a so-called “typical device” to get an optimal bandwidth for estimating $\theta_0$, of the form $h_{opt} = c\sigma_z n^{-1/5}$. In practice, this can be accomplished by constructing a finite grid of bandwidths of the form $h_{grid} = c_{grid}\sigma_z n^{-1/5}$: We use a grid from .20 to 5.0. After estimating $B_0$ by $\hat{B}(h_{grid})$, this value is fixed, and then a log-likelihood cross-validation score is obtained. The maximizer of the log-likelihood cross-validation score is selected as $h_{opt}$.
• If $h_{opt} = c\sigma_z n^{-1/5}$, an extremely simple device is simply to set $h = h_{opt} n^{-2/15} = c\sigma_z n^{-1/3}$, which satisfies the technical condition of undersmoothing without becoming ridiculously small. This device may seem terribly ad hoc, but the theory, the simulation of Wang et al. (2004), the discussion in Section 5.1.3, and our own work suggest that this method actually works reasonably well. Note, too, that in Section 5.1.2 we give evidence that this bandwidth rate is most likely optimal.
• A second approach is taken by Newey et al. (2004) and is also an effective practical device. The technical need for undersmoothing comes from the fact that the bias term in a first-order local likelihood kernel regression is of order $O(h^2)$. One can use higher order kernels to get the bias to be of order $O(h^{2s})$ for $s \ge 2$, but this does not really help in that the variance remains of order $O\{(nh)^{-1}\}$, so that the optimal mean squared error kernel estimator has $h = O\{n^{-1/(4s+1)}\}$, and thus undersmoothing to estimate $\kappa_0$ is still required. However, as Newey et al. (2004) pointed out, if one uses the optimal bandwidth $h_{opt} = c\sigma_z n^{-1/5}$, but then does the estimation procedure replacing the first-order kernel by a higher order kernel, then the bias is $O(h_{opt}^{2s}) = o(n^{-1/2})$ if $s \ge 2$. A convenient higher order kernel is the second-order twicing kernel $K_{tw}(u) = 2K(u) - \int K(u - v)K(v)\,dv$, where $K(\cdot)$ is a first-order kernel.
• One can also use log-likelihood cross-validation, but with the grid of values being of the form $h_{grid} = c_{grid}\sigma_z n^{-1/3}$. Because cross-validation scores often have multiple modes, this is not the same as optimal smoothing.
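Two of the devices in the list above are easy to check numerically. The sketch below, which assumes an Epanechnikov first-order kernel and a simple midpoint rule for the integrals (both our choices, as are all function names), builds the twicing kernel and verifies that it integrates to 1 while its second moment vanishes, which is the property that removes the leading $O(h^2)$ bias term. It also encodes the rescaling $h = h_{opt} n^{-2/15}$.

```python
def epanechnikov(u):
    """First-order kernel K(u) = 0.75 (1 - u^2) on |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def midpoints(a, b, m):
    """Midpoint-rule nodes on [a, b] and the common step length."""
    step = (b - a) / m
    return [a + (k + 0.5) * step for k in range(m)], step


def self_convolution(kernel, u, m=400):
    """Midpoint approximation of the integral of K(u - v) K(v) over v in [-1, 1]."""
    nodes, step = midpoints(-1.0, 1.0, m)
    return step * sum(kernel(u - v) * kernel(v) for v in nodes)


def twicing(kernel, u):
    """Twicing kernel K_tw(u) = 2 K(u) - (K * K)(u); supported on [-2, 2]."""
    return 2.0 * kernel(u) - self_convolution(kernel, u)


def moment(func, order, a=-2.0, b=2.0, m=400):
    """Midpoint approximation of the integral of u^order * func(u) over [a, b]."""
    nodes, step = midpoints(a, b, m)
    return step * sum(u ** order * func(u) for u in nodes)


def undersmooth(h_opt, n):
    """Rescale an n^(-1/5)-rate bandwidth to the n^(-1/3) rate."""
    return h_opt * n ** (-2.0 / 15.0)


def k_tw(u):
    return twicing(epanechnikov, u)


print("integral of K_tw:     ", round(moment(k_tw, 0), 4))
print("second moment of K_tw:", round(moment(k_tw, 2), 4))
print("second moment of K:   ", round(moment(epanechnikov, 2), 4))
```

The exponent arithmetic behind `undersmooth` is $-1/5 - 2/15 = -1/3$, so a bandwidth $c\sigma_z n^{-1/5}$ is mapped to $c\sigma_z n^{-1/3}$ with the same constant.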
It may be worth pointing out again that Wang et al. (2004) set
h = n−2/3, and even then, with too much undersmoothing (as-
ymptotically), the performance of the method is rather good.
5.2 Efficiency and Robustness of the Sample Mean
In general problems with complete data, with no assumptions about the response Y other than that it has a second moment, the sample mean $\bar{Y}$ is semiparametric efficient for estimating the population mean $\kappa_0 = E(Y)$; see, for example, Newey (1990).
Somewhat remarkably, Wang et al. (2004) showed that in the
partially linear model with Gaussian errors, with complete data
the sample mean is still semiparametric efficient. This fact is
crucial, of course, in establishing that with missing response
data, their estimators are still semiparametric efficient.
It is clear that with complete data, the sample mean will not
be semiparametric efficient for all semiparametric likelihood
models. Simple counterexamples abound, for example, the par-
tially linear model for Laplace or t errors. More complex exam-
ples can be constructed, for example, the partially linear model
in the Gamma family with log-linear mean exp{XTB0+θ0(Z)}:
Details follow from Lemma 4.
The model robustness of the sample mean for estimating
the population mean in complete data is nonetheless a pow-
erful feature. It is, therefore, of considerable interest to know
whether there are cases of semiparametric likelihood problems
where the sample mean is still semiparametric efficient and,
thus, would be used because of its model robustness. It turns
out that such cases exist. In particular, the sample mean for
complete response data is semiparametric efficient in canoni-
cal exponential families with partially linear form.
Figure 4. Results for a Single Dataset in a Simulation as in Wang et al. (2004), the Partially Linear Model With n = 60, Complete Response Data, and When $\kappa_0 = E(Y)$. Various bandwidths are used, and the estimates of the function $\theta_0(\cdot)$ are displayed. Note how the bandwidth has a major impact on the function estimate when the bandwidth is too small (h = .02), but very little effect on the estimate of $\kappa_0$. (h = .02, κ = 1.828; h = .06, κ = 1.818; h = .10, κ = 1.826; h = .14, κ = 1.818.)
Figure 5. Results for 100 Simulated Datasets in a Simulation as in Wang et al. (2004), the Partially Linear Model With n = 60 and Complete Response Data. Various bandwidths are used, and the mean estimates of the function $\theta_0(\cdot)$ are displayed. Note how even over these simulations, the bandwidth has a clear impact on the function estimate: There is almost no impact on estimates of the population mean and variance. (h = .02; h = .06; h = .10; h = .14.)
Lemma 3. Recall that $\epsilon$ is defined in (8). If there are no missing data, the sample mean is a semiparametric efficient estimator of the population mean only if
$$Y - E(Y|X,Z) = E(Y\epsilon^T) M_1^{-1}\epsilon + L_\theta(\cdot)\,\frac{E\{Y L_\theta(\cdot)|Z\}}{E\{L_\theta^2(\cdot)|Z\}}. \tag{19}$$
It is interesting to consider (19) in the special case of exponential families with likelihood function
$$f(y|x,z) = \exp\left[\frac{y\,c\{\eta(x,z)\} - C[c\{\eta(x,z)\}]}{\phi} + D(y,\phi)\right], \tag{20}$$
where $\eta(x,z) = x^T\beta_0 + \theta_0(z)$, so that $E(Y|X,Z) = C^{(1)}[c\{\eta(X,Z)\}] = \mu\{\eta(X,Z)\} = \mu(X,Z)$ and $\mathrm{var}(Y|X,Z) = \phi C^{(2)}[c\{\eta(X,Z)\}] = \phi V[\mu\{\eta(X,Z)\}]$.
As it turns out, effectively, (19) holds and the sample mean is semiparametric efficient only in the canonical exponential family for which $c(t) = t$. More precisely, we show in Section A.6 the following result.
Lemma 4. If there are no missing data, under the exponen-
tial model (20), the sample mean is a semiparametric efficient
estimate of the population mean if ∂c{XTβ + θ(Z)}/∂θ(Z) is
a function only of Z for all β, for example, the canonical
exponential family. Otherwise, the sample mean is generally
not semiparametric efficient: The precise condition is given
in (A.29) in the Appendix. In particular, outside the canonical exponential family, the only possibility for the sample mean to be semiparametric efficient is that, for some known (a,b), $c\{x^T\beta + \theta(z)\} = a + b\log\{x^T\beta + \theta(z)\}$.
Remark 5. We consider Lemmas 3 and 4 to be positive re-
sults, although an earlier version of the paper had a misplaced
emphasis. Effectively, we have characterized the cases, with
complete data, that the sample mean is both model free and
semiparametric efficient. In these cases, one would use the sam-
ple mean, or perhaps a robust version of it, rather than fit a po-
tentially complex semiparametric model that can do no better
and if that model is incorrect, can incur nontrivial bias.
5.3 Numerical Experience and Theoretical Insights
in the Partially Linear Model, and Some
Tentative Conclusions
In responding to a referee about the estimation of the popu-
lation mean in the partially linear model (1), we collect here
a few remarks based on our numerical experience. Because
the problem of estimating the population mean is the prob-
lem focused on by Chen et al. (2003), we focus on the simu-
lation setup in their article, although some of the conclusions
we reach may be supportable in general cases. To remind the
reader, in their simulation, X and Z are independent, with X = Normal(0,1), Z = Uniform[0,1], $\beta = 1.5$, $\theta(z) = 3.2z^2 - 1$, and $\epsilon$ = Normal(0,1).
5.3.1 Can Semiparametric Methods Improve Upon the Sample Mean? When there are missing response data, the simulations in Wang et al. (2004) show conclusively that substantial gains in efficiency can be made over using the sample mean of the observed responses alone. In addition, if missingness depends on (X,Z), the sample mean of the observed responses will be biased.
This leaves the issue of what happens when there are no missing data. Obviously, if one thought that $\epsilon$ were normally distributed, it would be delusional to use anything other than the sample mean, it being efficient.
Theoretically, some insight can be gained by the following considerations. Suppose that X and Z are independent. Suppose also that $\epsilon$ has a symmetric density function known up to a scale parameter. Let $\sigma_\epsilon^2$ be the variance of $\epsilon$ and let $\zeta \le \sigma_\epsilon^2$ be the inverse of the Fisher information for estimating the mean in the model $Y = \mu + \epsilon$. Then, it can be shown that $E\{F_B(\cdot)\} = 0$, that $\theta_B(z,B) = 0$, and that the asymptotic mean squared error (MSE) efficiency of the semiparametric efficient estimate of the population mean compared to the sample mean is
$$\text{MSE efficiency of sample mean} = \frac{\beta^2\mathrm{var}(X) + \mathrm{var}\{\theta(Z)\} + \zeta}{\beta^2\mathrm{var}(X) + \mathrm{var}\{\theta(Z)\} + \sigma_\epsilon^2} \le 1.$$
Note that there are cases where $\zeta/\sigma_\epsilon^2$ may be quite small, especially when $\epsilon$ is heavy tailed, so that if $\beta = 0$ and $\theta(\cdot)$ is approximately constant, the MSE efficiency of the sample mean would be $\zeta/\sigma_\epsilon^2$, and then substantial gains in efficiency would be gained. However, the usual motivation for fitting semiparametric models is that the regression function is not constant, in which case the MSE efficiency gain will be attenuated toward 1.0, often dramatically.
We conclude then that with no missing data, in the partially
linear model, substantial improvements upon the sample mean
will be realized mainly when the regression errors are heavy
tailed and the regression signal is slight.
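The attenuation toward 1.0 can be made concrete by evaluating the efficiency ratio above. The sketch below assumes Laplace errors, for which the variance is $2b^2$ and the Fisher information for location is $1/b^2$, so that $\zeta = \sigma_\epsilon^2/2$. The choice of Laplace errors, the unit error variance, and the function names are ours; the regression signal is taken from the simulation design quoted in Section 5.3.

```python
def mse_efficiency(beta, var_x, var_theta, zeta, sigma2):
    """MSE efficiency of the sample mean, as given by the ratio in the text."""
    signal = beta ** 2 * var_x + var_theta
    return (signal + zeta) / (signal + sigma2)


def laplace_zeta(sigma2):
    """Inverse Fisher information for the location of a Laplace error.

    With scale b the variance is 2 b^2 and the information is 1 / b^2,
    so zeta = b^2 = sigma2 / 2.
    """
    return sigma2 / 2.0


sigma2 = 1.0  # assumed error variance
zeta = laplace_zeta(sigma2)

# No regression signal: beta = 0 and theta constant.
print("no signal:  ", mse_efficiency(0.0, 1.0, 0.0, zeta, sigma2))

# Signal from the quoted design: beta = 1.5, var(X) = 1,
# theta(z) = 3.2 z^2 - 1 with Z ~ Uniform[0, 1], so
# var{theta(Z)} = 3.2^2 * (E Z^4 - (E Z^2)^2) = 3.2^2 * (1/5 - 1/9).
var_theta = 3.2 ** 2 * (1.0 / 5.0 - 1.0 / 9.0)
print("with signal:", round(mse_efficiency(1.5, 1.0, var_theta, zeta, sigma2), 4))
```

Under these assumed inputs the ratio moves from one half with no signal to roughly 0.88 with the quoted signal, which illustrates how a nonconstant regression function dilutes the potential gain.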
We point out that in the example that motivated this work
(Sec. 4), there is no simple analog to the sample mean, one that
could avoid fitting models to the data.
5.3.2 How Critical Are Our Assumptions on Z? We have made two assumptions on Z: It has a compact support, and its density function is positive on that support. We have indicated in Section A.1.2 that all general articles in the semiparametric kernel-based literature make this assumption and that it appears to be critical for deriving asymptotic results for problems such as our example in Section 4. It is certainly well beyond our capabilities to weaken this assumption as it applies to problems such as our motivating example.
The condition that the density of Z be bounded away from 0 warns users that the method will deteriorate if there are a few sparsely observed outlying Z values; see below for numerical evidence of this phenomenon.
Estimation in subpopulations formed by compact subsets of Z can also be of considerable interest in practice, and these compact subsets can be chosen to avoid density sparseness and meet our assumptions. A simple example might be where Z is age, and one might be interested in population summaries for those in the 40- to 60-year age range.
The partially linear model is a special case, however, because all estimates are explicit and what few Taylor expansions are necessary simplify tremendously. That is, the estimates are simple functions of sums of random variables. Cheng (1994) considered a different problem where there is no X and where local constant estimation of the nonparametric function is used, rather than local linear estimation, so that $\hat\theta(z_0) = \sum_{i=1}^n K_h(Z_i - z_0)Y_i / \sum_{i=1}^n K_h(Z_i - z_0)$. He indicated that the essential condition for this case is that the tails of the density of Z decay exponentially fast.
We tested this numerically in the normal-based simulation of Wang et al. (2004) with the sample size of n = 500: Similar results were found with n = 100. We used the Epanechnikov kernel and estimated the bandwidth using the following methods. First, we regressed Y and X separately on Z, using the direct plug-in (DPI) bandwidth selection method of Ruppert, Sheather, and Wand (1995) to form different estimated bandwidths on each. We then calculated the residuals from these fits and regressed the residual in Y on the residual in X to get a preliminary estimate $\hat\beta_{start}$ of $\beta$. Following this, we regressed $Y - X^T\hat\beta_{start}$ on Z to get a common bandwidth, then undersmoothed it by multiplication by $n^{-2/15}$ to get a bandwidth of order $n^{-1/3}$ to eliminate bias, and then reestimated $\beta$ and $\theta(\cdot)$.
We found that for various Beta distributions on Z, for example, the Beta(2, 1) that violates our assumptions, the sample mean and the semiparametric efficient method were equally efficient. The same occurs for the case that Z is normally distributed. However, when Z has a t distribution with 9 degrees of freedom, the sample mean greatly outperforms the undersmoothed estimator (MSE efficiency ≈ 2.0), which, in turn, outperformed the method that did not employ undersmoothing (MSE efficiency ≈ 2.5). An interesting quote from Ma, Chiou, and Wang (2006) is relevant here: Also operating in a partial linear model, they stated “This condition enables us to simplify asymptotic expression of certain sums of functions of variables...also excludes pathological cases where the number of observations in a window defined by the bandwidth may not increase to infinity when n → ∞.”
We conclude that if the design density in Z is at all heavy tailed, then the semiparametric methods will be badly affected. If such a phenomenon happens in the simple case of the partially linear model, it is likely to hold in most other cases. Otherwise, in practice at least, as long as there are no design “stragglers,” the assumption is likely to be one required by the technicalities of the problem. How well this generalizes to complex nonlinear problems is unknown.
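The role of design “stragglers” can be seen in the local constant estimator itself. The sketch below implements the ratio $\sum_i K_h(Z_i - z_0)Y_i / \sum_i K_h(Z_i - z_0)$ with an Epanechnikov kernel and counts the observations in each window. The toy design, with one outlying Z value, and all function names are hypothetical illustrations rather than anything taken from the simulations reported above.

```python
from typing import List, Optional


def epanechnikov(u: float) -> float:
    """Epanechnikov kernel 0.75 (1 - u^2) on |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nadaraya_watson(z0: float, z: List[float], y: List[float],
                    h: float) -> Optional[float]:
    """Local constant estimate at z0; None when the kernel window is empty."""
    w = [epanechnikov((zi - z0) / h) for zi in z]
    denom = sum(w)  # proportional to a kernel density estimate at z0
    if denom == 0.0:
        return None
    return sum(wi * yi for wi, yi in zip(w, y)) / denom


def window_count(z0: float, z: List[float], h: float) -> int:
    """Number of design points within distance h of z0."""
    return sum(1 for zi in z if abs(zi - z0) <= h)


# Toy design: an even grid on [0, 1] plus a single straggler at Z = 8.
z = [i / 20.0 for i in range(21)] + [8.0]
y = [3.2 * zi ** 2 - 1.0 for zi in z]  # noise-free responses, for clarity
h = 0.15

for z0 in (0.5, 8.0):
    print(f"z0 = {z0}: points in window = {window_count(z0, z, h)}, "
          f"estimate = {nadaraya_watson(z0, z, y, h)}")
```

At the straggler the window contains a single observation, so the estimate simply reproduces that one response and no averaging takes place, which is the kind of pathological case the positivity condition on the design density is meant to exclude.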
6. DISCUSSION
In this article we considered the problem of estimating
population-level quantities κ0such as the mean, variance, and
probabilities. Previous literature on the topic applies only to
the simple special case of estimating a population mean in the
Gaussian partially linear model. The problem was motivated
by an important issue in nutritional epidemiology, estimating
the distribution of usual intake for episodically consumed food,
where we considered a zero-inflated mixture measurement error model: Such a problem is very different from the partially linear model, and the main interest is not in the population mean.
The key feature of the problem that distinguishes it from
most work in semiparametric modeling is that the quantities
of interest are based on both the parametric and the nonparametric parts of the model. Results were obtained for two general classes of semiparametric models: (1) general semiparametric regression models depending on a function $\theta_0(Z)$ and
(2) generalized linear single-index models. Within these semi-
parametric frameworks, we suggested a straightforward estima-
tion methodology, derived its limiting distribution, and showed
semiparametric efficiency. An interesting part of the approach
is that we also allow for partially missing responses.
In the case of standard semiparametric models, we have con-
sidered the case where the unknown function θ0(Z) is a scalar
function of a scalar argument. The results, though, readily ex-
tend to the case of a multivariate function of a scalar argument.
We have also assumed that κ0= E[F{X,θ0(Z),B0}] and
F(·) are scalar, which, in principle, excludes the estimation of
the population variance and standard deviation. It is, however,
readily seen that both F(·) and κ0or κSIcan be multivariate,
and, hence, the obvious modification of our estimates is semi-
parametric efficient.
APPENDIX: SKETCH OF TECHNICAL ARGUMENTS
In what follows, the arguments for L and its derivatives are in the
form L(·) = L{Y,X,B0,θ0(Z)}. The arguments for F and its deriva-
tives are F(·) = F{X,θ0(Z),B0}.
Also, note that in our arguments about semiparametric efficiency,
we use the symbol d exactly as it was used by Newey (1990). It does
not stand for differential.
A.1 Assumptions and Remarks
A.1.1 General Considerations. The main results needed for the asymptotic distribution of our estimator are (5) and (6). The single-index model assumptions were given already in Carroll et al. (1997). Results (5) and (6) hold under smoothness and moment conditions for the likelihood function and under smoothness and boundedness conditions for $\theta(\cdot)$. The strength of these conditions depends on the generality of the problem. For the partially linear Gaussian model of Wang et al. (2004), because the profile likelihood estimator of $\beta$ is an explicit function of regressions of Y and X on Z, the conditions are simply conditions about uniform expansions for kernel regression estimators, as in, for example, Claeskens and Van Keilegom (2003). For generalized partially linear models, Severini and Staniswalis (1994) gave a series of moment and smoothness conditions toward this end. For general likelihood problems, Claeskens and Carroll (2007) stated that the conditions needed are as follows.
(C1) The bandwidth sequence $h_n \to 0$ as $n \to \infty$ in such a way that $nh_n/\log(n) \to \infty$ and $h_n \ge \{\log(n)/n\}^{1-2/\lambda}$ for $\lambda$ as in condition (C4).
(C2) The kernel function K is a symmetric, continuously differentiable pdf on [−1,1] taking on the value 0 at the boundaries. The design density $f(\cdot)$ is differentiable on an interval $\mathcal{B} = [b_1,b_2]$, the derivative is continuous, and $\inf_{z\in\mathcal{B}} f(z) > 0$. The function $\theta(\cdot,B)$ has two continuous derivatives on $\mathcal{B}$ and is also twice differentiable with respect to B.
(C3) For $B \ne B'$, the Kullback–Leibler distance between $L\{\cdot,\cdot,B,\theta(\cdot,B)\}$ and $L\{\cdot,\cdot,B',\theta(\cdot,B')\}$ is strictly positive. For every (y,x), third partial derivatives of $L\{y,x,B,\theta(z)\}$ with respect to B exist and are continuous in B. The fourth partial derivative exists for almost all (y,x). Further, mixed partial derivatives
$$\frac{\partial^{r+s}}{\partial B^r \partial v^s} L\{y,x,B,v\}\Big|_{v=\theta(z)},$$
with $0 \le r,s \le 4$, $r + s \le 4$ exist for almost all (y,x) and
$$E\left\{\sup_B \sup_v \left|\frac{\partial^{r+s}}{\partial B^r \partial v^s} L\{y,x,B,v\}\right|^2\right\} < \infty.$$
The Fisher information, $G(z)$, possesses a continuous derivative and $\inf_{z\in\mathcal{B}} G(z) > 0$.
(C4) There exists a neighborhood $N\{B_0,\theta_0(z)\}$ such that
$$\max_{k=1,2}\ \sup_{z\in\mathcal{B}} \left\| \sup_{(B,\theta)\in N\{B_0,\theta_0(z)\}} \left|\frac{\partial^k}{\partial\theta^k}\log\{L(Y,X,B,\theta)\}\right| \right\|_{\lambda,z} < \infty$$
for some $\lambda \in (2,\infty]$, where $\|\cdot\|_{\lambda,z}$ is the $L^\lambda$-norm, conditional on $Z = z$. Further,
$$\sup_{z\in\mathcal{B}} E_z\left[\sup_{(B,\theta)\in N\{B_0,\theta_0(z)\}} \left|\frac{\partial^3}{\partial\theta^3}\log\{L(Y,X,B,\theta)\}\right|\right] < \infty.$$
The preceding regularity conditions are the same as those used in a lo-
cal likelihood setting where one wishes to obtain strong uniform con-
sistency of the local likelihood estimators. Condition (C3) requires the
fourth partial derivative of the log profile likelihood to have a bounded
second moment; it further requires the Fisher information matrix to
be invertible and to be differentiable with respect to z. Condition (C4)
requires a bound on the first and second derivatives of the log profile
likelihood and of the first moment of the third partial derivative, in a
neighborhood of the true parameter values.
A.1.2 Compactly Supported Z. Multiple reviewers of earlier drafts of this article commented that the assumption that Z be compactly supported with density positive on this support is too strong.
However, this assumption is completely standard in the kernel-based semiparametric literature for estimation of $B_0$, because it is needed for uniform expansions for estimation of $\theta_0(\cdot)$. The assumption was made in the founding articles on semiparametric likelihood estimation (Severini and Wong 1992, p. 1875, part e), the first article on generalized linear models (Severini and Staniswalis 1994, p. 511, assumption D), the first article on efficient estimation of partially linear single index models (Carroll et al. 1997, p. 485, condition 2a), and the precursor article to ours that was focused on estimation of the population mean in a partially linear model (Wang et al. 2004, p. 341, condition C.T). The uniform expansions for local likelihood given in Claeskens and Van Keilegom (2003) also make this assumption; see their page 1869, condition R0. Thus, our assumption on the design density of Z is a standard one.
The reason this assumption is made has to do with kernel technology, where proofs generally require a uniform expansion for the kernel regression or at least uniform in all observed values of Z, which is the same thing. The Nadaraya–Watson estimator, for example, has a denominator that is a density estimate, and the condition on Z stops this denominator from getting too close to 0. Ma et al. (2006), who made the same assumption (their condition 6 on p. 83), stated that it is necessary to avoid “pathological cases.”
A.2 Proof of Theorem 1
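A.2.1 Asymptotic Expansion. We first show (9). First, note that L is a log-likelihood function conditioned on (X,Z), so that we have
$$E\{\delta L_{\theta\theta}(\cdot)|X,Z\} = -E\{\delta L_\theta(\cdot)L_\theta(\cdot)|X,Z\}, \qquad E\{\delta L_{\theta B}(\cdot)|X,Z\} = -E\{\delta L_\theta(\cdot)L_B(\cdot)|X,Z\}. \tag{A.1}$$
By a Taylor expansion,
$$n^{1/2}(\hat\kappa - \kappa_0) = n^{-1/2}\sum_{i=1}^n \left[F_i(\cdot) - \kappa_0 + \{F_{iB}(\cdot) + F_{i\theta}(\cdot)\theta_B(Z_i,B_0)\}^T(\hat{B} - B_0) + F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0) - \theta_0(Z_i)\}\right] + o_p(1)$$
$$= M_2^T n^{1/2}(\hat{B} - B_0) + n^{-1/2}\sum_{i=1}^n \left[F_i(\cdot) - \kappa_0 + F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0) - \theta_0(Z_i)\}\right] + o_p(1).$$
Because $nh^4 \to 0$, using (5), we see that
$$n^{-1/2}\sum_{i=1}^n F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0) - \theta_0(Z_i)\}$$
$$= -n^{-1/2}\sum_{i=1}^n F_{i\theta}(\cdot)\, n^{-1}\sum_{j=1}^n \delta_j K_h(Z_j - Z_i)L_{j\theta}(\cdot)/\Omega(Z_i) + o_p(1)$$
$$= -n^{-1/2}\sum_{i=1}^n \delta_i L_{i\theta}(\cdot)\, n^{-1}\sum_{j=1}^n K_h(Z_j - Z_i)F_{j\theta}(\cdot)/\Omega(Z_j) + o_p(1)$$
$$= n^{-1/2}\sum_{i=1}^n \delta_i D_i(\cdot) + o_p(1),$$
the last step following because the interior sum is a kernel regression converging to $D_i$; see Carroll et al. (1997) for details. Result (9) now follows from (6). The limiting variance (10) is an easy calculation, noting that (A.1) implies that
$$E\{\delta\epsilon L_\theta(\cdot)|Z\} = E\{\delta L_\theta(\cdot)L_B(\cdot) + \delta L_\theta(\cdot)L_\theta(\cdot)\theta_B(Z,B_0)|Z\} = -E\{\delta L_{B\theta}(\cdot) + \delta L_{\theta\theta}(\cdot)\theta_B(Z,B_0)|Z\} = 0 \tag{A.2}$$
by the definition of $\theta_B(\cdot)$ given in (7), and, hence, the last two terms in (9) are uncorrelated. We will use (A.2) repeatedly in what follows.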
A.2.2 Pathwise Differentiability. We now turn to the semiparametric efficiency, using the results of Newey (1990). The relevant text of his article is in his section 3, especially through his equation (9). A parameter $\kappa = \kappa(\Theta)$ is pathwise differentiable under two conditions. The first is that $\kappa(\Theta)$ is differentiable for all smooth parametric submodels: In our case, the parametric submodels include B, parametric submodels for $\theta(\cdot)$, and parametric submodels for the distribution of (X,Z) and the probability function $\mathrm{pr}(\delta = 1|X,Z)$. This condition is standard in the literature and fairly well required. Our motivating example clearly satisfies this condition.
The second condition is that there exists a random vector d such that $E(d^T d) < \infty$ and $\partial\kappa(\Theta)/\partial\Theta = E(dS_\Theta^T)$, where $S_\Theta$ is the log-likelihood score for the parametric submodel. Newey noted that pathwise differentiability also holds if the first condition holds and if there is a regular estimator in the semiparametric problem. Generally, as Newey noted, finding a suitable random variable d can be difficult.
Assuming pathwise differentiability, which we show later, the efficient influence function is calculated by projecting d onto the nuisance tangent space. One innovation here is that we can calculate the efficient influence function without having an explicit representation for d.
Our development in Section A.2.3 will consist of two steps. In the first, we will assume pathwise differentiability and derive the efficient score function under that assumption. Using this derivation, we will then exhibit a random variable d that has the requisite property.
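A.2.3 Efficiency. Recall that $\mathrm{pr}(\delta = 1|X,Z) = \pi(X,Z)$. Let $f_{X,Z}(x,z)$ be the density function of (X,Z). Let the model under consideration be denoted by $M_0$. Now consider a smooth parametric submodel $M_\lambda$, with $f_{X,Z}(x,z,\alpha_1)$, $\theta(z,\alpha_2)$, and $\pi(X,Z,\alpha_3)$ in place of $f_{X,Z}(x,z)$, $\theta_0(z)$, and $\pi(X,Z)$, respectively. Then, under $M_\lambda$, the log-likelihood is given by
$$\mathcal{L}(\cdot) = \delta L(\cdot) + \delta\log\{\pi(X,Z,\alpha_3)\} + (1-\delta)\log\{1-\pi(X,Z,\alpha_3)\} + \log\{f_{X,Z}(X,Z,\alpha_1)\},$$
where $(\cdot)$ represents the argument $\{Y,X,\theta(Z,\alpha_2),B_0\}$. Then the score functions in this parametric submodel are given by
$$\partial\mathcal{L}(\cdot)/\partial B = \delta L_B(\cdot),$$
$$\partial\mathcal{L}(\cdot)/\partial\alpha_1 = \partial\log\{f_{X,Z}(X,Z,\alpha_1)\}/\partial\alpha_1,$$
$$\partial\mathcal{L}(\cdot)/\partial\alpha_2 = \delta L_\theta(\cdot)\,\partial\theta(Z,\alpha_2)/\partial\alpha_2,$$
$$\partial\mathcal{L}(\cdot)/\partial\alpha_3 = \{\partial\pi(X,Z,\alpha_3)/\partial\alpha_3\}\{\delta - \pi(X,Z,\alpha_3)\}/[\pi(X,Z,\alpha_3)\{1-\pi(X,Z,\alpha_3)\}].$$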
Thus, the tangent space is spanned by the functions $\delta L_B(\cdot)^T$, $s_f(x,z)$, $\delta L_\theta(\cdot)g(Z)$, and $a(X,Z)\{\delta-\pi(X,Z)\}$, where $s_f(x,z)$ is any function with mean 0, while $g(z)$ and $a(X,Z)$ are any functions. For computational convenience, we rewrite the tangent space as the linear span of four subspaces $T_1$, $T_2$, $T_3$, and $T_4$ that are orthogonal to each other (see below) and defined as follows:
\begin{align*}
T_1 &= \delta L_B(\cdot)^T + \delta L_\theta(\cdot)\theta_B^T(Z,B_0), \\
T_2 &= s_f(x,z), \\
T_3 &= \delta L_\theta(\cdot)g(Z), \\
T_4 &= a(X,Z)\{\delta-\pi(X,Z)\}.
\end{align*}
To show that these spaces are orthogonal, we first note that, by assumption, the data are missing at random, and, hence, $\mathrm{pr}(\delta = 1|Y,X,Z) = \pi(X,Z)$. This means that $T_4$ is orthogonal to the other three spaces. Note also that, by assumption, $E\{L_B(\cdot)|X,Z\} = E\{L_\theta(\cdot)|X,Z\} = 0$. This shows that $T_2$ is orthogonal to $T_1$ and $T_3$. It remains to show that $T_1$ and $T_3$ are orthogonal, which we showed in (A.2). Thus, the spaces $T_1$–$T_4$ are orthogonal.
Note that, under model $M_\lambda$,
\[
\kappa_0 = \int F\{x,\theta(z,\alpha_2),B_0\} f_{X,Z}(x,z,\alpha_1)\,dx\,dz.
\]
Hence, we have
\begin{align*}
\partial\kappa_0/\partial B &= E\{F_B(\cdot)\}, \\
\partial\kappa_0/\partial\alpha_1 &= E[F(\cdot)\,\partial\log\{f_{X,Z}(X,Z,\alpha_1)\}/\partial\alpha_1], \\
\partial\kappa_0/\partial\alpha_2 &= E\{F_\theta(\cdot)\,\partial\theta(Z,\alpha_2)/\partial\alpha_2\}, \\
\partial\kappa_0/\partial\alpha_3 &= 0.
\end{align*}
Now, by pathwise differentiability and equation (7) of Newey (1990), there exists a random variable $d$, which we need not compute, such that
\begin{align*}
E\{F_B(\cdot)\} &= E[d\{\delta L_B(\cdot)\}], \tag{A.3} \\
E\{F(\cdot)s_f(X,Z)\} &= E\{d\,s_f(X,Z)\}, \tag{A.4} \\
E\{F_\theta(\cdot)g(Z)\} &= E\{d\,\delta L_\theta(\cdot)g(Z)\}, \tag{A.5} \\
0 &= E[d\,a(X,Z)\{\delta-\pi(X,Z)\}]. \tag{A.6}
\end{align*}
Next, we compute the projections of $d$ onto $T_1$, $T_2$, $T_3$, and $T_4$. First, note that, by (A.4), for any function $s_f(X,Z)$ with expectation 0, we have $E[\{d-F(\cdot)+\kappa_0\}s_f(X,Z)] = 0$, which implies that the projection of $d$ onto $T_2$ is given by
\[
\Pi(d|T_2) = F(\cdot)-\kappa_0. \tag{A.7}
\]
Also, by (A.1) and (A.5), for any function $g(Z)$, we have
\begin{align*}
E[\{d-\delta D(\cdot)\}\delta g(Z)L_\theta(\cdot)]
&= E\{F_\theta(\cdot)g(Z)\} + E[\delta g(Z)L_\theta^2(\cdot)E\{F_\theta(\cdot)|Z\}/E\{\delta L_{\theta\theta}(\cdot)|Z\}] \\
&= 0,
\end{align*}
and, hence, the projection of $d$ onto $T_3$ is given by
\[
\Pi(d|T_3) = \delta D(\cdot). \tag{A.8}
\]
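In addition, by (A.3) and (A.5),
\begin{align*}
E[\{d - M_2^T M_1^{-1}\delta\epsilon\}\delta\epsilon^T]
&= E\{F_B^T(\cdot)\} + E\{F_\theta(\cdot)\theta_B^T(Z,B_0)\} - E(M_2^T M_1^{-1}\delta\epsilon\epsilon^T) \\
&= 0.
\end{align*}
Hence, the projection of $d$ onto $T_1$ is given by
\[
\Pi(d|T_1) = \delta M_2^T M_1^{-1}\epsilon. \tag{A.9}
\]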
Also, by (A.6), we have $\Pi(d|T_4) = 0$. Using (A.7), (A.8), and (A.9), we get that the efficient influence function for $\kappa_0$ is
\begin{align*}
\psi_{\mathrm{eff}} &= \Pi(d|T_1)+\Pi(d|T_2)+\Pi(d|T_3)+\Pi(d|T_4) \\
&= F(\cdot)-\kappa_0+\delta M_2^T M_1^{-1}\epsilon+\delta D(\cdot),
\end{align*}
which is the same as (9), hence completing the proof under the assumption of pathwise differentiability. In the calculations that follow, we will write $F_B$ rather than $F_B(\cdot)$, $a$ rather than $a(X,Z)$, and so on.
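The orthogonality of $T_4$ to the other subspaces, which the projection argument above relies on, can be checked numerically. The following is a minimal Monte Carlo sketch under assumed ingredients that are not taken from the paper: a Gaussian response with mean `x + sin(2*pi*z)`, a logistic missingness probability depending on (X, Z) only, and arbitrary illustrative choices of $s_f$, $g$, and $a$. The function name `t4_cross_moments` is hypothetical.

```python
import math
import random


def expit(t):
    """Logistic function, used here as an assumed missingness model."""
    return 1.0 / (1.0 + math.exp(-t))


def t4_cross_moments(n=100_000, seed=1):
    """Monte Carlo estimates of E(T4*T2) and E(T4*T3) under missing at random."""
    rng = random.Random(seed)
    sum_t4_t2 = 0.0
    sum_t4_t3 = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.random()
        mu = x + math.sin(2.0 * math.pi * z)   # assumed E(Y | X, Z)
        y = mu + rng.gauss(0.0, 1.0)
        pi = expit(0.5 + 0.8 * x - z)          # depends on (X, Z) only, so MAR holds
        delta = 1.0 if rng.random() < pi else 0.0
        t2 = x                                  # a mean-zero function s_f(X, Z)
        t3 = delta * (y - mu) * math.cos(z)     # delta * L_theta * g(Z), Gaussian score
        t4 = (1.0 + x * z) * (delta - pi)       # a(X, Z) * {delta - pi(X, Z)}
        sum_t4_t2 += t4 * t2
        sum_t4_t3 += t4 * t3
    return sum_t4_t2 / n, sum_t4_t3 / n
```

Both returned averages should be close to zero up to Monte Carlo error. The first uses only E(δ|X,Z) = π(X,Z); the second also uses the fact that the score has conditional mean zero given (X,Z).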
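We now show pathwise differentiability and, hence, semiparametric efficiency; that is, we show that (A.3)–(A.6) hold for $d = F-\kappa_0+\delta D+\delta M_2^T M_1^{-1}\epsilon$.
To verify (A.3), we see that
\begin{align*}
E(d\delta L_B) &= E[(F-\kappa_0+\delta D+\delta M_2^T M_1^{-1}\epsilon)\delta L_B] \\
&= E[\delta D L_B+\delta M_2^T M_1^{-1}\epsilon L_B] \\
&= -E\left[L_\theta\frac{E(F_\theta|Z)}{E(\delta L_{\theta\theta}|Z)}L_B\delta\right] + E\{\delta L_B(L_B+L_\theta\theta_B)^T\}M_1^{-1}M_2 \\
&= E\left[\delta L_{\theta B}\frac{E(F_\theta|Z)}{E(\delta L_{\theta\theta}|Z)}\right] - E\{\delta(L_{BB}+L_{B\theta}\theta_B^T)\}M_1^{-1}M_2 \\
&= -E(F_\theta\theta_B)+M_2 \\
&= E(F_B).
\end{align*}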
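To verify (A.4), we see that
\begin{align*}
E(d s_f) &= E\{(F-\kappa_0+\delta D+\delta M_2^T M_1^{-1}\epsilon)s_f\} \\
&= E(F s_f)-\kappa_0E(s_f)+E\{E(\delta D+\delta M_2^T M_1^{-1}\epsilon|X,Z)s_f\} \\
&= E(F s_f).
\end{align*}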
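To verify (A.5), we see that
\begin{align*}
E(d\delta L_\theta g) &= E\{(F-\kappa_0+\delta D+\delta M_2^T M_1^{-1}\epsilon)\delta L_\theta g\} \\
&= E(D L_\theta\delta g)+M_2^T M_1^{-1}E(\epsilon L_\theta\delta g) \\
&= -E\left[L_\theta\frac{E(F_\theta|Z)}{E(\delta L_{\theta\theta}|Z)}L_\theta\delta g\right] + M_2^T M_1^{-1}E\{(L_B+L_\theta\theta_B)L_\theta\delta g\} \\
&= E(F_\theta g)-M_2^T M_1^{-1}E\{(L_{B\theta}+L_{\theta\theta}\theta_B)\delta g\} \\
&= E(F_\theta g)-M_2^T M_1^{-1}E\{E(\delta L_{B\theta}+\delta L_{\theta\theta}\theta_B|Z)g\} \\
&= E(F_\theta g),
\end{align*}
where again we have used (A.2). Finally, because the responses are missing at random, (A.6) is immediate. This completes the proof.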
A.3 Sketch of Lemma 1
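We have
\begin{align*}
\hat\kappa_{\mathrm{marg}} &= n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}G(Y_i) + \left\{1-\frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right] \\
&= A_1+A_2.
\end{align*}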
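By calculations that are similar to those given previously, and using (11), we can readily show that
\[
A_1 = n^{-1}\sum_{i=1}^{n}\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}G(Y_i) - n^{-1}\sum_{i=1}^{n}\{\delta_i-\pi_{\mathrm{marg}}(Z_i)\}E\left[\frac{\delta_iG(Y_i)}{\{\pi_{\mathrm{marg}}(Z_i)\}^2}\,\Big|\,Z_i\right] + o_p(n^{-1/2}).
\]
We can write
\begin{align*}
A_2 &= B_1+B_2+o_p(n^{-1/2}), \\
B_1 &= n^{-1}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}, \\
B_2 &= n^{-1}\sum_{i=1}^{n}\frac{\delta_iF\{X_i,\hat\theta(Z_i,\hat B),\hat B\}}{\{\pi_{\mathrm{marg}}(Z_i)\}^2}\{\hat\pi_{\mathrm{marg}}(Z_i)-\pi_{\mathrm{marg}}(Z_i)\}.
\end{align*}
Using (5) and (6), we can show that
\[
B_1 = n^{-1}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F_i(\cdot) + M_{2,\mathrm{marg}}M_1^{-1}n^{-1}\sum_{i=1}^{n}\delta_i\epsilon_i + n^{-1}\sum_{i=1}^{n}\delta_iD_{i,\mathrm{marg}}(\cdot) + o_p(n^{-1/2}).
\]
Using (11) once again, we see that
\[
B_2 = n^{-1}\sum_{i=1}^{n}\{\delta_i-\pi_{\mathrm{marg}}(Z_i)\}E\left[\frac{\delta_iF_i(\cdot)}{\{\pi_{\mathrm{marg}}(Z_i)\}^2}\,\Big|\,Z_i\right] + o_p(n^{-1/2}).
\]
Collecting terms and noting that
\[
0 = E\left[\frac{\delta_i\{G(Y_i)-F_i(\cdot)\}}{\{\pi_{\mathrm{marg}}(Z_i)\}^2}\,\Big|\,Z_i\right]
\]
proves (12).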
A.4 Sketch of Lemma 2
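We have
\begin{align*}
\hat\kappa &= n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}G(Y_i) + \left\{1-\frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right] \\
&= A_1+A_2,
\end{align*}
say. By a simple Taylor series expansion,
\[
A_1 = n^{-1}\sum_{i=1}^{n}\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}G(Y_i) - E\left[\frac{1}{\pi(X,Z,\zeta)}G(Y)\pi_\zeta(X,Z,\zeta)\right]^T n^{-1}\sum_{i=1}^{n}\psi_{i\zeta} + o_p(n^{-1/2}).
\]
In addition,
\begin{align*}
A_2 &= B_1+B_2+o_p(n^{-1/2}), \\
B_1 &= n^{-1}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}, \\
B_2 &= n^{-1}\sum_{i=1}^{n}\frac{\delta_iF\{X_i,\hat\theta(Z_i,\hat B),\hat B\}}{\{\pi(X_i,Z_i,\zeta)\}^2}\pi_\zeta(X_i,Z_i,\zeta)^T(\hat\zeta-\zeta) + o_p(n^{-1/2}).
\end{align*}
Using the fact that
\[
0 = E\left[1-\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\,\Big|\,X,Z\right],
\]
we can easily show that
\[
B_1 = n^{-1}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}F_i(\cdot) + o_p(n^{-1/2}).
\]
It also follows that
\[
B_2 = E\left[\frac{1}{\pi(X,Z,\zeta)}F(\cdot)\pi_\zeta(X,Z,\zeta)\right]^T n^{-1}\sum_{i=1}^{n}\psi_{i\zeta}(\cdot) + o_p(n^{-1/2}).
\]
Collecting terms and using the fact that $E\{G(Y)|X,Z\} = F(\cdot)$, we obtain the result.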
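The form of the estimator expanded in this sketch can be illustrated with a small simulation. The code below is only a sketch under simplifying assumptions that are not part of the paper: the selection probability π(X,Z) and the regression F = E{G(Y)|X,Z} are treated as known rather than estimated, G(Y) = Y, and the data-generating model is a hypothetical toy model with E(Y) = 0.5. The function names `kappa_hat` and `simulate` are illustrative.

```python
import math
import random


def kappa_hat(delta, pi, g_y, f_fit):
    """n^{-1} sum [ delta_i/pi_i * G(Y_i) + (1 - delta_i/pi_i) * F_i ]."""
    total = 0.0
    for d, p, g, f in zip(delta, pi, g_y, f_fit):
        w = d / p
        observed = w * g if d else 0.0   # G(Y_i) is only used when Y_i is observed
        total += observed + (1.0 - w) * f
    return total / len(delta)


def simulate(n=100_000, seed=2):
    """Toy data with E(Y) = 0.5 and responses missing at random."""
    rng = random.Random(seed)
    delta, pi, g_y, f_fit = [], [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.random()
        y = x + z + rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(0.5 + x)))   # assumed known pi(X, Z)
        d = 1 if rng.random() < p else 0
        delta.append(d)
        pi.append(p)
        g_y.append(y if d else None)              # G(Y) = Y, missing when d == 0
        f_fit.append(x + z)                       # assumed known F = E{G(Y) | X, Z}
    return delta, pi, g_y, f_fit
```

In this toy setting the weighted estimator should land near 0.5, whereas the mean of the observed responses alone is biased upward because larger X makes a response more likely to be observed.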
A.5 Proof of Theorem 2
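A.5.1 Asymptotic Expansion. We first show the expansion (15). Recall that $B = (\gamma,\beta)$. The only things that differ from the calculations of Carroll et al. (1997) are that we add terms involving $\delta_i$ and that we need not worry about any constraint on $\gamma$; thus, we avoid terms such as their $P_\alpha$ on their page 487.
In their equation (A.12), they showed that
\[
n^{1/2}(\hat B-B_0) = n^{-1/2}Q^{-1}\sum_{i=1}^{n}\delta_iN_i\epsilon_i+o_p(1). \tag{A.10}
\]
Also, in their equation (A.11), they showed that
\begin{align*}
\hat\theta(R+S^T\hat\gamma,\hat B)-\theta_0(R+S^T\gamma_0)
&= \theta_0^{(1)}(R+S^T\gamma_0)S^T(\hat\gamma-\gamma_0) \\
&\quad + \hat\theta(R+S^T\gamma_0,\hat B)-\theta_0(R+S^T\gamma_0)+o_p(n^{-1/2}). \tag{A.11}
\end{align*}
Define $H(u) = [E\{\rho_2(\cdot)|U = u\}]^{-1}$. In their equation (A.13), Carroll et al. (1997) showed that
\begin{align*}
\hat\theta(u,\hat B)-\theta_0(u)
&= n^{-1}\sum_{i=1}^{n}\delta_iK_h(U_i-u)\epsilon_iH(u)/f(u) \\
&\quad - H(u)[E\{\delta\Lambda\rho_2(\cdot)|U = u\}]^T(\hat B-B_0)+o_p(n^{-1/2}). \tag{A.12}
\end{align*}
Carroll et al. (1997) did not consider an estimate of $\phi$. Define
\[
G\{\phi,Y,X,B,\theta(U)\} = D_\phi(Y,\phi)-[Yc\{X^T\beta+\theta(U)\}-C\{c(\cdot)\}]/\phi^2.
\]
Of course, $G(\cdot)$ is the likelihood score for $\phi$. If there are no arguments, $G = G\{\phi_0,Y,X,B_0,\theta_0(R+S^T\gamma_0)\}$. The estimating function for $\phi$ solves
\[
0 = n^{-1/2}\sum_{i=1}^{n}\delta_iG\{\hat\phi,Y_i,X_i,\hat B,\hat\theta(R_i+S_i^T\hat\gamma,\hat B)\}.
\]
Because $G$ is a likelihood score, it follows that
\[
E[G_\phi\{\phi_0,Y,X,B_0,\theta_0(R+S^T\gamma_0)\}|X,R,S] = -E\{G^2|X,R,S\}.
\]
By a Taylor series,
\begin{align*}
E(\delta G^2)n^{1/2}(\hat\phi-\phi_0)
&= n^{-1/2}\sum_{i=1}^{n}\delta_iG\{\phi_0,Y_i,X_i,\hat B,\hat\theta(R_i+S_i^T\hat\gamma,\hat B)\}+o_p(1) \\
&= n^{-1/2}\sum_{i=1}^{n}\delta_iG_i+E(\delta G_B^T)n^{1/2}(\hat B-B_0) \\
&\quad + n^{-1/2}\sum_{i=1}^{n}\delta_iG_{i\theta}\{\hat\theta(R_i+S_i^T\hat\gamma,\hat B)-\theta_0(R_i+S_i^T\gamma_0)\}+o_p(1).
\end{align*}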
However, it is readily verified that $E(\delta G_B|X,R,S) = 0$ and that $E(\delta G_\theta|X,R,S) = 0$. It, thus, follows via a simple calculation using (A.11) that
\[
\begin{aligned}
E(\delta G^2)n^{1/2}(\hat\phi-\phi_0)
&= n^{-1/2}\sum_{i=1}^{n}\delta_iG_i+n^{-1/2}\sum_{i=1}^{n}\delta_iG_{i\theta}\{\hat\theta(U_i,B_0)-\theta_0(U_i)\}+o_p(1) \\
&= n^{-1/2}\sum_{i=1}^{n}\delta_iG_i+o_p(1),
\end{aligned}
\tag{A.13}
\]
the last step following from an application of (A.12).
With some considerable algebra, (15) now follows from calculations similar to those in Section A.2. The variance calculation follows because it is readily shown that, for any function $h(U)$,
\[
0 = E[(N\epsilon)\{\delta h(U)\epsilon\}].
\]
A.5.2 Efficiency. We now turn to semiparametric efficiency. Recall that the GPLSIM follows the form (20) with $X^T\beta_0+\theta_0(R+S^T\gamma_0)$ and that $U = R+S^T\gamma_0$. It is immediate that $V\{\mu(t)\} = \mu^{(1)}(t)/c^{(1)}(t)$, that $c^{(1)}(t) = \rho_1(t)$, and that $\rho_2(t) = \rho_1^2(t)V\{\mu(t)\} = c^{(1)}(t)\mu^{(1)}(t)$. We also have
\[
E(\epsilon|X,Z) = 0, \tag{A.14}
\]
\begin{align*}
E(\epsilon^2|X,Z)
&= E([Y-\mu\{X^T\beta_0+\theta_0(U)\}]^2|X,Z)[\rho_1\{X^T\beta_0+\theta_0(U)\}]^2 \\
&= \mathrm{var}(Y|X,Z)[\rho_1\{X^T\beta_0+\theta_0(U)\}]^2 \\
&= \phi\rho_2(\cdot).
\end{align*}
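The identities relating $V$, $\rho_1$, and $\rho_2$, and the variance formula $E(\epsilon^2|X,Z) = \phi\rho_2(\cdot)$, can be checked on a concrete noncanonical example. The sketch below assumes, purely for illustration, a Bernoulli response with probit mean $\mu(t) = \Phi(t)$, so that $\phi = 1$ and $c(t) = \mathrm{logit}\{\Phi(t)\}$; this example and the function names are not taken from the paper.

```python
import math
import random


def probit_quantities(t):
    """mu, c^(1), mu^(1), V, rho1, rho2 for a Bernoulli response with probit mean."""
    mu = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))            # mu(t) = Phi(t)
    mu1 = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)    # mu^(1)(t), normal density
    c1 = mu1 / (mu * (1.0 - mu))     # derivative of c(t) = logit{Phi(t)}
    v = mu1 / c1                     # V{mu(t)} = mu^(1)(t) / c^(1)(t)
    rho1 = c1                        # rho_1(t) = c^(1)(t)
    rho2 = rho1 ** 2 * v             # rho_2(t) = rho_1(t)^2 V{mu(t)}
    return mu, c1, mu1, v, rho1, rho2


def mean_eps_squared(t, n=100_000, seed=3):
    """Monte Carlo estimate of E(eps^2) at linear predictor t, eps = rho1 * (Y - mu)."""
    rng = random.Random(seed)
    mu, _, _, _, rho1, _ = probit_quantities(t)
    total = 0.0
    for _ in range(n):
        y = 1.0 if rng.random() < mu else 0.0
        total += (rho1 * (y - mu)) ** 2
    return total / n
```

For any fixed `t`, `v` should equal the Bernoulli variance `mu * (1 - mu)`, `rho2` should equal `c1 * mu1`, and the simulated second moment of ϵ should be close to `rho2`.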
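Let the semiparametric model be denoted by $M_0$. Consider a parametric submodel $M_\lambda$ with $f_{X,Z}(X,Z;\nu_1)$, $\theta_0(R+S^T\gamma_0,\nu_2)$, and $\pi(X,Z,\nu_3)$. The joint log-likelihood of $Y$, $X$, and $Z$ under $M_\lambda$ is given by
\begin{align*}
\mathcal{L}(\cdot) &= (\delta/\phi)\bigl(Yc\{X^T\beta_0+\theta_0(R+S^T\gamma_0,\nu_2)\} - C[c\{X^T\beta_0+\theta_0(R+S^T\gamma_0,\nu_2)\}]\bigr) \\
&\quad + \delta D(Y,\phi)+\log\{f_{X,Z}(X,Z,\nu_1)\} \\
&\quad + \delta\log\{\pi(X,Z,\nu_3)\}+(1-\delta)\log\{1-\pi(X,Z,\nu_3)\}. \tag{A.15}
\end{align*}
As before, recall that $\epsilon = \rho_1(\cdot)\{Y-\mu(\cdot)\} = c^{(1)}(\cdot)\{Y-\mu(\cdot)\}$. Then the score functions evaluated at $M_0$ are
\begin{align*}
\partial\mathcal{L}/\partial\beta &= \delta Xc^{(1)}(\cdot)\{Y-\mu(\cdot)\}/\phi = \delta X\epsilon/\phi, \\
\partial\mathcal{L}/\partial\gamma &= \delta\theta^{(1)}(U)Sc^{(1)}(\cdot)\{Y-\mu(\cdot)\}/\phi = \delta\theta^{(1)}(U)S\epsilon/\phi, \\
\partial\mathcal{L}/\partial\nu_1 &= s_f(X,Z), \\
\partial\mathcal{L}/\partial\nu_2 &= \delta h(U)c^{(1)}(\cdot)\{Y-\mu(\cdot)\}/\phi = \delta h(U)\epsilon/\phi, \\
\partial\mathcal{L}/\partial\nu_3 &= a(X,Z)\{\delta-\pi(X,Z)\}, \\
\partial\mathcal{L}/\partial\phi &= \delta D_\phi(Y,\phi)-\delta[Yc(\cdot)-C\{c(\cdot)\}]/\phi^2 = \delta G,
\end{align*}
where $D_\phi(Y,\phi)$ is the derivative of $D(Y,\phi)$ with respect to $\phi$, $s_f(X,Z)$ is a mean-zero function, and $h(U)$ and $a(X,Z)$ are any functions. This means that the tangent space is spanned by
\[
\bigl[T_1 = \delta\{S^T\theta_0^{(1)}(U),X^T\}\epsilon/\phi,\; T_2 = s_f(X,Z),\; T_3 = \delta h(U)\epsilon/\phi,\; T_4 = a(X,Z)\{\delta-\pi(X,Z)\},\; T_5 = \delta G\bigr].
\]
An orthogonal basis of the tangent space is given by $[T_1 = \delta N^T\epsilon, T_2 = s_f(X,Z), T_3 = \delta h(U)\epsilon, T_4 = a(X,Z)\{\delta-\pi(X,Z)\}]$ and $T_5 = \delta G$; the orthogonality is a straightforward calculation. Now notice that
\[
\kappa_0 = \int F\{x,\theta_0(z;\nu_2),B_0,\phi_0\}f_{X,Z}(x,z;\nu_1)\,dx\,dz
\]
and, hence,
\begin{align*}
\partial\kappa_0/\partial\beta &= E\{F_\beta(\cdot)\}, \\
\partial\kappa_0/\partial\gamma &= E[F_\theta(\cdot)\theta^{(1)}(U)S], \\
\partial\kappa_0/\partial\nu_1 &= E\{F(\cdot)s_f(X,Z)\}, \\
\partial\kappa_0/\partial\nu_2 &= E\{F_\theta(\cdot)h(U)\}, \\
\partial\kappa_0/\partial\nu_3 &= 0, \\
\partial\kappa_0/\partial\phi &= E\{F_\phi(\cdot)\}.
\end{align*}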
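As before, we first assume pathwise differentiability to construct the efficient score. We then verify this in Section A.5.3.
By equation (7) of Newey (1990), there is a random quantity $d$ such that
\begin{align*}
E(d\delta X\epsilon/\phi) &= E\{F_\beta(\cdot)\}, \tag{A.16} \\
E[d\delta\theta^{(1)}(U)S\epsilon/\phi] &= E[F_\theta(\cdot)\theta^{(1)}(U)S], \tag{A.17} \\
E\{d\,s_f(X,Z)\} &= E\{F(\cdot)s_f(X,Z)\}, \tag{A.18} \\
E\{d\delta h(U)\epsilon/\phi\} &= E\{F_\theta(\cdot)h(U)\}, \tag{A.19} \\
E[d\,a(X,Z)\{\delta-\pi(X,Z)\}] &= 0, \tag{A.20} \\
E(d\delta G) &= E\{F_\phi(\cdot)\}. \tag{A.21}
\end{align*}
Now we compute the projection of $d$ onto the tangent space. It is immediate that $\Pi(d|T_2) = F(\cdot)-\kappa_0$ and that $\Pi(d|T_4) = 0$. Because $E[\{\delta J(U)\epsilon\}\{\delta h(U)\epsilon/\phi\}] = E\{h(U)F_\theta(\cdot)\}$, it is readily shown that $\Pi(d|T_3) = \delta J(U)\epsilon$. It is a similarly direct calculation to show that $\Pi(d|T_1) = D^TQ^{-1}\delta N\epsilon$. Finally, $\Pi(d|T_5) = \delta GE\{F_\phi(\cdot)\}/E(\delta G^2)$.
These calculations, thus, show that, assuming pathwise differentiability, the efficient influence function for $\kappa_0$ is
\[
\Psi = D^TQ^{-1}\delta N\epsilon+F(\cdot)-\kappa_0+\delta J(U)\epsilon+\delta GE\{F_\phi(\cdot)\}/E(\delta G^2).
\]
Hence, from (15), we see that $\hat\kappa_{SI}$ has the semiparametric optimal influence function and is asymptotically efficient.
A.5.3 Pathwise Differentiability. For $d = D^TQ^{-1}\delta N\epsilon+F(\cdot)-\kappa_0+\delta J(U)\epsilon+\delta GE\{F_\phi(\cdot)\}/E(\delta G^2)$, we have to show that (A.16)–(A.21) hold. Let
\begin{align*}
d_1 &= D^TQ^{-1}\delta N\epsilon, \\
d_2 &= F(\cdot)-\kappa_0, \\
d_3 &= \delta J(U)\epsilon, \\
d_4 &= 0, \\
d_5 &= \delta GE\{F_\phi(\cdot)\}/E(\delta G^2).
\end{align*}
Then $d = d_1+\cdots+d_5$. Because $T_1$, $T_2$, $T_3$, $T_4$, and $T_5$ are orthogonal and $d_i \in T_i$ for $i = 1,\ldots,5$, we have
\begin{align*}
E(d_1T_1) &= E(D^TQ^{-1}\delta NN^T\epsilon^2) = \phi E(D^T), \tag{A.22} \\
E(d_2T_2) &= E[\{F(\cdot)-\kappa_0\}s_f(X,Z)] = E\{F(\cdot)s_f(X,Z)\}, \tag{A.23} \\
E(d_3T_3) &= E\{\delta J(U)h(U)\epsilon^2\} = E\{\pi(X,Z)J(U)h(U)\phi\rho_2(\cdot)\}, \tag{A.24} \\
E(d_4T_4) &= 0, \tag{A.25} \\
E(d_5T_5) &= E[\delta G^2E\{F_\phi(\cdot)\}/E(\delta G^2)] = E\{F_\phi(\cdot)\}, \tag{A.26} \\
E(d_iT_j) &= 0, \qquad i \neq j. \tag{A.27}
\end{align*}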
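To verify (A.16) and (A.17), we have to prove
\[
E\begin{bmatrix} d\delta\theta^{(1)}(U)S\epsilon/\phi \\ d\delta X\epsilon/\phi \end{bmatrix}^T
= E\begin{bmatrix} F_\theta(\cdot)\theta^{(1)}(U)S \\ F_\beta(\cdot) \end{bmatrix}^T.
\]
Recall that $\Lambda = \{\theta^{(1)}(U)S^T,X^T\}^T$. Therefore,
\begin{align*}
E\begin{bmatrix} d\delta\theta^{(1)}(U)S\epsilon/\phi \\ d\delta X\epsilon/\phi \end{bmatrix}^T
&= E(d\delta\Lambda^T\epsilon/\phi) \\
&= E\bigl[d\delta\bigl(N^T+[E\{\delta\rho_2(\cdot)|U_i\}]^{-1}E\{\delta_i\Lambda_i^T\rho_2(\cdot)|U_i\}\bigr)\epsilon/\phi\bigr] \\
&= E\{d\delta N^T\epsilon/\phi\}+E\bigl(d\delta\epsilon[E\{\delta\rho_2(\cdot)|U_i\}]^{-1}E\{\delta_i\Lambda_i^T\rho_2(\cdot)|U_i\}/\phi\bigr) \\
&= E(dT_1/\phi)+E(d\delta h(U)\epsilon/\phi) \\
&= B_1+B_2,
\end{align*}
where $h(U) = [E\{\delta\rho_2(\cdot)|U\}]^{-1}E\{\delta\Lambda^T\rho_2(\cdot)|U\}$. Hence, using (A.22), (A.24), and (A.27), we see $B_1 = E(d_1T_1/\phi) = E(D^T)$ and
\begin{align*}
B_2 &= E(d_3\delta h(U)\epsilon/\phi) \\
&= E\{\pi(X,Z)J(U)h(U)\rho_2(\cdot)\} \\
&= E\{\delta J(U)h(U)\rho_2(\cdot)\} \\
&= E[J(U)h(U)E\{\delta\rho_2(\cdot)|U\}] \\
&= E\bigl(F_\theta(\cdot)[E\{\delta\rho_2(\cdot)|U\}]^{-1}E\{\delta\Lambda^T\rho_2(\cdot)|U\}\bigr),
\end{align*}
and, hence,
\begin{align*}
B_1+B_2 &= E(D^T)+E\bigl(F_\theta(\cdot)[E\{\delta\rho_2(\cdot)|U\}]^{-1}E\{\delta\Lambda^T\rho_2(\cdot)|U\}\bigr) \\
&= E\begin{bmatrix} F_\theta(\cdot)\theta^{(1)}(U)S \\ F_\beta(\cdot) \end{bmatrix}^T.
\end{align*}
To verify (A.19), we use (A.24) and (A.27) to get
\begin{align*}
E\{d\delta h(U)\epsilon/\phi\} &= E(d_3T_3/\phi) \\
&= E\{\pi(X,Z)J(U)h(U)\rho_2(\cdot)\} \\
&= E\{\delta J(U)h(U)\rho_2(\cdot)\} \\
&= E[J(U)h(U)E\{\delta\rho_2(\cdot)|U\}] \\
&= E[F_\theta(\cdot)h(U)].
\end{align*}
Finally, (A.18) follows directly from (A.23) and (A.27), (A.20) follows directly from (A.25) and (A.27), and (A.21) follows directly from (A.26) and (A.27).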
A.6 Proof of Lemma 3
Denote the model under consideration by M0. Now consider any
regular parametric submodel Mλ, with fX,Z(x,z,α1) and θ(z,α2) in
place of fX,Z(x,z) and θ0(z), respectively. For the model Mλ, we have
the joint log-likelihood of Y,X, and Z,
L(y,z,x) = L(·)+log{fX,Z(x,z,α1)},
where (·) represents the argument {Y,X,θ(Z,α2),B0}. The score
functions are given by
∂L/∂B = LB(·),
∂L/∂α1= ∂ log{fX,Z(x,z,α1)}/∂α1,
∂L/∂α2= Lθ(·)∂θ(z,α2)/∂α2.
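The tangent space is spanned by $S_\lambda = \{L_B(\cdot)^T, s_f(x,z)^T, L_\theta(\cdot)g(z)^T\}$ or, equivalently, by
\[
T = \bigl[T_1 = L_B(\cdot)^T+L_\theta(\cdot)\theta_B^T(Z,B_0) = \epsilon^T,\; T_2 = s_f(X,Z)^T,\; T_3 = g(Z)^TL_\theta(\cdot)\bigr],
\]
where $s_f(x,z)$ is any function with expectation 0 and $g(z)$ is any function of $z$. Note that, under model $M_\lambda$, $\kappa_0 = \int Y\exp\{L(\cdot)\}f_{X,Z}(x,z,\alpha_1)\,dy\,dx\,dz$. Hence, we have
\begin{align*}
\partial\kappa_0/\partial B &= E\{YL_B(\cdot)\} = E\{Y(\partial L/\partial B)\}, \\
\partial\kappa_0/\partial\alpha_1 &= E\{Ys_f(X,Z)\} = E\{Y(\partial L/\partial\alpha_1)\}, \\
\partial\kappa_0/\partial\alpha_2 &= E\{YL_\theta(\cdot)g(Z)\} = E\{Y(\partial L/\partial\alpha_2)\}.
\end{align*}
Hence, we see that $\kappa_0$ is pathwise differentiable and $d = Y$. The projection of $d$ onto $T$ is then given by
\begin{align*}
\Pi(d|T_1) &= E(Y\epsilon^T)M_1^{-1}\epsilon, \\
\Pi(d|T_2) &= E(Y|X,Z)-\kappa_0, \\
\Pi(d|T_3) &= L_\theta(\cdot)E\{YL_\theta(\cdot)|Z\}/E[\{L_\theta(\cdot)\}^2|Z],
\end{align*}
and, hence, the efficient influence function is
\[
\Pi(d|T) = E(Y\epsilon^T)M_1^{-1}\epsilon+\{E(Y|X,Z)-\kappa_0\}+L_\theta(\cdot)E\{YL_\theta(\cdot)|Z\}/E[\{L_\theta(\cdot)\}^2|Z].
\]
But we see that the influence function of the sample mean is $Y-\kappa_0$. Hence, the sample mean is semiparametric efficient if and only if (19) holds.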
A.7 Proof of Lemma 4
It suffices to consider only the case that φ = 1 is known, because
the estimates of β0and θ0(z) do not depend on the value of φ.
It is convenient to write c{η(x,z)} as d(x,z) and to denote the deriv-
ative of d(x,z) with respect to θ0(z) as dθ(x,z). Note that the derivative
with respect to β is dβ(x,z) = Xdθ(x,z). Direct calculations show that
\begin{align*}
L_\theta(\cdot) &= d_\theta(X,Z)\{Y-\mu(X,Z)\}, \\
L_\beta(\cdot) &= Xd_\theta(X,Z)\{Y-\mu(X,Z)\}, \\
\theta_\beta(Z) &= -\frac{E[Xd_\theta^2(X,Z)V\{\mu(X,Z)\}|Z]}{E[d_\theta^2(X,Z)V\{\mu(X,Z)\}|Z]}, \\
\epsilon &= \{X+\theta_\beta(Z)\}d_\theta(X,Z)\{Y-\mu(X,Z)\}, \\
E(Y\epsilon) &= E[\{X+\theta_\beta(Z)\}d_\theta(X,Z)V\{\mu(X,Z)\}], \\
\frac{L_\theta(\cdot)E\{YL_\theta(\cdot)|Z\}}{E\{L_\theta^2(\cdot)|Z\}} &= \{Y-\mu(X,Z)\}d_\theta(X,Z)\frac{E[d_\theta(X,Z)V\{\mu(X,Z)\}|Z]}{E[d_\theta^2(X,Z)V\{\mu(X,Z)\}|Z]}.
\end{align*}
If $d_\theta(x,z)$ depends only on $z$, then $\theta_\beta(Z) = -E[XV\{\mu(X,Z)\}|Z]/E[V\{\mu(X,Z)\}|Z]$, $E(Y\epsilon) = 0$, and
\[
1 \equiv d_\theta(X,Z)\frac{E[d_\theta(X,Z)V\{\mu(X,Z)\}|Z]}{E[d_\theta^2(X,Z)V\{\mu(X,Z)\}|Z]}, \tag{A.28}
\]
so that by Lemma 3 the sample mean is semiparametric efficient.
The cases where the sample mean is not semiparametric efficient are the following. Consider problems not of canonical exponential forms. First of all, it cannot be semiparametric efficient if $E(Y\epsilon) = 0$ and $d_\theta(x,z)$ depends on $x$, for then (A.28) fails. This means then that $d_\theta(x,z)$ cannot be a function of $x$; that is, the data must follow a canonical exponential family.
If $E(Y\epsilon) \neq 0$, we must have
\[
1 \equiv d_\theta(X,Z)\left(E(Y\epsilon^T)M_1^{-1}\{X+\theta_\beta(Z)\}+\frac{E[d_\theta(X,Z)V\{\mu(X,Z)\}|Z]}{E[d_\theta^2(X,Z)V\{\mu(X,Z)\}|Z]}\right). \tag{A.29}
\]
Examples where (A.29) fails to hold are easily constructed. Because
the term inside the parentheses in (A.29) is linear in X and a function
of Z, (A.29) can only hold, in principle, if d(x,z) = c{xTβ + θ(z)} =
a+blog{xTβ +θ(z)} for known constants (a,b).
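As a concrete check of Lemmas 3 and 4, consider the Gaussian partially linear model, a special case chosen here only for illustration. The worked equations below assume $Y = X^T\beta_0+\theta_0(Z)+e$ with $e \sim N(0,1)$ independent of $(X,Z)$, $\phi = 1$, and the canonical form $c(t) = t$. Under these assumptions the three projections sum to the influence function of the sample mean.

```latex
% Illustrative special case (an assumption, not taken from the paper):
% Y = X^T \beta_0 + \theta_0(Z) + e,  e ~ N(0,1) independent of (X,Z),  \phi = 1,  c(t) = t.
\begin{align*}
d_\theta(X,Z) &= 1, \qquad V\{\mu(X,Z)\} = 1, \\
\theta_\beta(Z) &= -E(X \mid Z), \\
\epsilon &= \{X - E(X \mid Z)\}\{Y - \mu(X,Z)\}, \\
E(Y\epsilon) &= E\{X - E(X \mid Z)\} = 0, \\
\frac{L_\theta(\cdot)E\{YL_\theta(\cdot) \mid Z\}}{E\{L_\theta^2(\cdot) \mid Z\}} &= Y - \mu(X,Z), \\
\Pi(d \mid T) &= 0 + \{E(Y \mid X,Z) - \kappa_0\} + \{Y - \mu(X,Z)\} = Y - \kappa_0 .
\end{align*}
```

The last line uses $E(Y \mid X,Z) = \mu(X,Z)$, so in this canonical case the efficient influence function coincides with that of the sample mean.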
[Received January 2006. Revised August 2006.]
REFERENCES
Bickel, P. J. (1982), “On Adaptive Estimation,” The Annals of Statistics, 10,
647–671.
Block, G., Hartman, A. M., Dresser, C. M., Carroll, M. D., Gannon, J., and
Gardner, L. (1986), “A Data-Based Approach to Diet Questionnaire Design
and Testing,” American Journal of Epidemiology, 124, 453–469.
Carroll, R. J., Fan, J., Gijbels, I., and Wand, M. P. (1997), “Generalized Par-
tially Linear Single-Index Models,” Journal of the American Statistical Asso-
ciation, 92, 477–489.
Chen, X., Linton, O., and Van Keilegom, I. (2003), “Estimation of Semipara-
metric Models When the Criterion Function Is not Smooth,” Econometrica,
71, 1591–1608.
Cheng, P. E. (1994), “Nonparametric Estimation of Mean Functionals With
Data Missing at Random,” Journal of the American Statistical Association,
89, 81–87.
Claeskens, G., and Carroll, R. J. (2007), “Post-Model Selection Inference in
Semiparametric Models,” Biometrika, to appear.
Claeskens, G., and Van Keilegom, I. (2003), “Bootstrap Confidence Bands
for Regression Curves and Their Derivatives,” The Annals of Statistics, 31,
1852–1884.
Flegal, K. M., Carroll, M. D., Ogden, C. L., and Johnson, C. L. (2002), “Preva-
lence and Trends in Obesity Among US Adults, 1999–2000,” Journal of the
American Medical Association, 288, 1723–1727.
Härdle, W., Hall, P., and Ichimura, H. (1993), “Optimal Smoothing in Single-
Index Models,” The Annals of Statistics, 21, 157–178.
Härdle, W., Müller, M., Sperlich, S., and Werwatz, A. (2004), Nonparametric
and Semiparametric Models, Berlin: Springer-Verlag.
Härdle, W., and Stoker, T. M. (1989), “Investigating Smooth Multiple Regres-
sion by the Method of Average Derivatives,” Journal of the American Statis-
tical Association, 84, 986–995.
Ma, Y., Chiou, J. M., and Wang, N. (2006), “Efficient Semiparametric Estimator
for Heteroscedastic Partially Linear Models,” Biometrika, 93, 75–84.
Newey, W. K. (1990), “Semiparametric Efficiency Bounds,” Journal of Applied
Econometrics, 5, 99–135.
Newey, W. K., Hsieh, F., and Robins, J. M. (2004), “Twicing Kernels and
a Small Bias Property of Semiparametric Estimators,” Econometrica, 72,
947–962.
Powell, J. L., and Stoker, T. M. (1996), “Optimal Bandwidth Choice for
Density-Weighted Averages,” Journal of Econometrics, 75, 291–316.
Ruppert, D., Sheather, S. J., and Wand, M. P. (1995), “An Effective Bandwidth
Selector for Local Least Squares Regression,” Journal of the American Sta-
tistical Association, 90, 1257–1270; Corrigenda (1996), 91, 1380.
Sepanski, J. H., Knickerbocker, R., and Carroll, R. J. (1994), “A Semiparamet-
ric Correction for Attenuation,” Journal of the American Statistical Associa-
tion, 89, 1366–1373.
Severini, T. A., and Staniswalis, J. G. (1994), “Quasilikelihood Estimation in
Semiparametric Models,” Journal of the American Statistical Association, 89,
501–511.
Severini, T. A., and Wong, W. H. (1992), “Profile Likelihood and Conditionally
Parametric Models,” The Annals of Statistics, 20, 1768–1802.
Subar, A. F., Thompson, F. E., Kipnis, V., Midthune, D., Hurwitz, P., McNutt,
S., McIntosh, A., and Rosenfeld, S. (2001), “Comparative Validation of the
Block, Willett and National Cancer Institute Food Frequency Questionnaires:
The Eating at America’s Table Study,” American Journal of Epidemiology,
154, 1089–1099.
Tooze, J. A., Grunwald, G. K., and Jones, R. H. (2002), “Analysis of Repeated
Measures Data With Clumping at Zero,” Statistical Methods in Medical Re-
search, 11, 341–355.
Wang, Q., Linton, O., and Härdle, W. (2004), “Semiparametric Regression
Analysis With Missing Response at Random,” Journal of the American Sta-
tistical Association, 99, 334–345.
Willett, W. C., Sampson, L., Stampfer, M. J., Rosner, B., Bain, C., Witschi, J.,
Hennekens, C. H., and Speizer, F. E. (1985), “Reproducibility and Validity
of a Semiquantitative Food Frequency Questionnaire,” American Journal of
Epidemiology, 122, 51–65.
Woteki, C. E. (2003), “Integrated NHANES: Uses in National Policy,” Journal
of Nutrition, 133, 582S–584S.