Efficient Estimation of Population-Level Summaries in General Semiparametric Regression Models
Arnab MAITY, Yanyuan MA, and Raymond J. CARROLL

This article considers a wide class of semiparametric regression models in which interest focuses on population-level quantities that combine both the parametric and the nonparametric parts of the model. Special cases in this approach include generalized partially linear models, generalized partially linear single-index models, structural measurement error models, and many others. For estimating the parametric part of the model efficiently, profile likelihood kernel estimation methods are well established in the literature. Here our focus is on estimating general population-level quantities that combine the parametric and nonparametric parts of the model (e.g., population mean, probabilities, etc.). We place this problem in a general context, provide a general kernel-based methodology, and derive the asymptotic distributions of estimates of these population-level quantities, showing that in many cases the estimates are semiparametric efficient. For estimating the population mean with no missing data, we show that the sample mean is semiparametric efficient for canonical exponential families, but not in general. We apply the methods to a problem in nutritional epidemiology, where estimating the distribution of usual intake is of primary interest and semiparametric methods are not available. Extensions to the case of missing response data are also discussed.

KEY WORDS: Generalized estimating equations; Kernel methods; Measurement error; Missing data; Nonparametric regression; Nutrition; Partially linear model; Profile method; Semiparametric efficient score; Semiparametric information bound; Single-index models.
1. INTRODUCTION
This article is about semiparametric regression models when one is interested in estimating a population quantity such as the mean, variance, and probabilities. The unique feature of the problem is that the quantities of interest are functions of both the parametric and the nonparametric parts of the model. We will also allow for partially missing responses, but handling such a modification is relatively easy. The main aim of the article is to estimate population quantities that involve both the parametric and the nonparametric parts of the model and to do so efficiently and in considerable generality.

We will construct estimators of these population-level quantities that exploit the semiparametric structure of the problem, derive their limiting distributions, and show in many cases that the methods are semiparametric efficient. The work is motivated by and illustrated with an important problem in nutritional epidemiology, namely, estimating the distribution of usual intake for episodically consumed foods such as red meat.

A special simple case of our results is already established in the literature (Wang, Linton, and Härdle 2004 and the references therein), namely, the partially linear model
$$Y_i = X_i^{T}\beta_0 + \theta_0(Z_i) + \xi_i, \tag{1}$$

where $\theta_0(\cdot)$ is an unknown function and $\xi_i = \mathrm{Normal}(0, \sigma_0^2)$. We allow the responses to be partially missing, important in cases where the response is difficult to measure but the predictors are not. Suppose that $Y$ is partially missing and let $\delta = 1$ indicate that $Y$ is observed, so that the observed data are $(\delta_i Y_i, X_i, Z_i, \delta_i)$. Suppose further that $Y$ is missing at random, so that $\mathrm{pr}(\delta = 1 \mid Y, X, Z) = \mathrm{pr}(\delta = 1 \mid X, Z)$.

Arnab Maity is a Graduate Student (E-mail: amaity@stat.tamu.edu), Yanyuan Ma is Assistant Professor (E-mail: ma@stat.tamu.edu), and Raymond J. Carroll is Distinguished Professor (E-mail: carroll@stat.tamu.edu), Department of Statistics, Texas A&M University, College Station, TX 77843. This work was supported by grants from the National Cancer Institute (CA57030 for AM and RJC; CA74552 for YM) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30ES09106). The authors are grateful to Janet Tooze, Amy Subar, Victor Kipnis, and Douglas Midthune for introducing us to the problem of episodically consumed foods and for allowing us to use their data. The authors thank Naisyin Wang for reading the final manuscript and helping us with replies to a referee. Part of the original work of the last two authors originally occurred during a visit to the Centre of Excellence for Mathematics and Statistics of Complex Systems at the Australian National University, whose support they gratefully acknowledge. The authors especially wish to thank three referees, the associate editor, and the joint editor for helping turn the original submission into a publishable article. Their patience and many helpful suggestions are very greatly appreciated.
Usually, of course, the main interest is in estimating $\beta_0$ efficiently. This is not the problem we discuss, because in our example the parameters $\beta_0$ are themselves of relatively minor interest. In their work, Wang et al. (2004) estimated the marginal mean $\kappa_0 = E(Y) = E\{X^{T}\beta_0 + \theta_0(Z)\}$. Note how this combines both the parametric and the nonparametric parts of the model. One of the results of Wang et al. is that if one uses only the complete data, that is, the cases in which $Y$ is observed, and then fits the standard profile likelihood estimator to obtain $\hat\beta$ and $\hat\theta(\cdot,\hat\beta)$, it transpires that a semiparametric efficient estimator of the population mean $\kappa_0$ is $n^{-1}\sum_{i=1}^{n}\{X_i^{T}\hat\beta + \hat\theta(Z_i,\hat\beta)\}$. If there are no missing data, the sample mean is also semiparametric efficient.

Actually, quite a bit more is true even in this relatively simple Gaussian case. Let $B = (\beta^{T},\sigma^2)^{T}$ and let $\hat B$ and $\hat\theta(\cdot,\hat B)$ be the profile likelihood estimates in the complete data; see, for example, Severini and Wong (1992) for local constant estimation and Claeskens and Carroll (2007) for local linear estimation. Consider estimating any functional $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ for some function $F(\cdot)$ that is thrice continuously differentiable: This, of course, includes such quantities as the population mean and probabilities. Then one very special case of our results is that the semiparametric efficient estimate of $\kappa_0$ is just $\hat\kappa = n^{-1}\sum_{i=1}^{n}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}$.

In contrast to Wang et al. (2004), we deal with general semiparametric models and general population-level quantities. Thus, consider a semiparametric problem in which the loglikelihood function given $(X,Z)$ is $L\{Y,X,\theta(Z),B\}$. If we define $L_B(\cdot)$ and $L_\theta(\cdot)$ to be derivatives of the loglikelihood with respect to $B$ and $\theta(Z)$, we have the properties that $E[L_B\{Y,X,\theta_0(Z),B_0\} \mid X,Z] = 0$ and similarly for $L_\theta(\cdot)$. We use profile likelihood methods computed at the observed data. With missing data, this local linear kernel version of the profile likelihood method of Severini and Wong (1992) works as follows. Let $K(\cdot)$ be a smooth symmetric density function with bounded support, let $h$ be a bandwidth, and let $K_h(z) = h^{-1}K(z/h)$. For any fixed $B$, let $(\hat\alpha_0,\hat\alpha_1)$ be the local likelihood estimator obtained by maximizing, in $(\alpha_0,\alpha_1)$,

$$\sum_{i=1}^{n}\delta_i K_h(Z_i - z)\,L\{Y_i,X_i,\alpha_0+\alpha_1(Z_i - z),B\}, \tag{2}$$

and then setting $\hat\theta(z,B) = \hat\alpha_0$. The profile likelihood estimator of $B_0$ modified for missing responses is obtained by maximizing in $B$

$$\sum_{i=1}^{n}\delta_i L\{Y_i,X_i,\hat\theta(Z_i,B),B\}. \tag{3}$$

Our estimator of $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ is then

$$\hat\kappa = n^{-1}\sum_{i=1}^{n}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}. \tag{4}$$

We emphasize that the possibility of missing response data and finding a semiparametric efficient estimate of $B_0$ is not the focus of the article. Instead, the focus is on estimating quantities $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ that depend on both the parametric and the nonparametric parts of the model: This is a very different problem than simply estimating $B_0$. Previous work in the area considered only the partially linear model and only estimation of the population mean: Our work deals with general semiparametric models and general population-level quantities.

An outline of this article is as follows. In Section 2 we discuss the general semiparametric problem with loglikelihood $L\{Y,X,\theta(Z),B\}$ and a general goal of estimating $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$. We derive the limiting distribution of (4) and show that it is semiparametric efficient. We also discuss the general problem where the population quantity $\kappa_0$ of interest is the expectation of a function of $Y$ alone and describe doubly robust estimators in this context.

In Section 3 we consider the class of generalized partially linear single-index models (Carroll, Fan, Gijbels, and Wand 1997). Single-index modeling, see Härdle and Stoker (1989) and Härdle, Hall, and Ichimura (1993), is an important means of dimension reduction, one that is finding increased use in this age of high-dimensional data. We develop methods for estimating population quantities in the generalized partially linear single-index modeling framework and show that the methods are semiparametric efficient.

Section 4 describes an example from nutritional epidemiology that motivated this work, namely, estimating the distribution of usual intake of episodically consumed foods such as red meat. The model used in this area is far more complex than the simple partially linear Gaussian model (1), and while the population mean is of some interest, of considerably more interest is the probability that usual intake exceeds thresholds. We will illustrate why in this context one cannot simply adopt the percentages of the observed responses that exceed a certain threshold.

Section 5 describes three issues of importance: (1) bandwidth selection (Sec. 5.1), (2) the efficiency and robustness of the sample mean when the population mean is of interest (Sec. 5.2), and (3) numerical and theoretical insights into the partially linear model and the nature of our assumptions (Sec. 5.3). An interesting special case is, of course, the partially linear model when $\kappa_0$ is the population mean. For this problem, we show in Section 5.2 that, with no missing data, the sample mean is semiparametric efficient for canonical exponential families but not of course in general, thus extending and clarifying the results of Wang et al. (2004) that were specific to the Gaussian case.

Section 6 gives concluding remarks and results. All technical results are given in the Appendix.

© 2007 American Statistical Association
Journal of the American Statistical Association
March 2007, Vol. 102, No. 477, Theory and Methods
DOI 10.1198/016214506000001103
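To make the construction in (2)–(4) concrete, the following is a minimal sketch for the Gaussian partially linear model (1) only, where the local likelihood step reduces to kernel-weighted least squares and the profile step reduces to least squares on smoothed residuals. The function names, the Epanechnikov kernel, and the fixed bandwidth are illustrative assumptions, not part of the article's general method.

```python
import numpy as np

def epanechnikov(u):
    """Smooth symmetric density with bounded support (an assumed kernel choice)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_linear_smoother(z, delta, h):
    """Matrix S whose row i holds the local linear weights at z[i].

    Only complete cases (delta == 1) receive weight, mirroring the
    delta_i factor in the local likelihood (2).
    """
    n = len(z)
    S = np.zeros((n, n))
    for i in range(n):
        d = z - z[i]
        w = delta * epanechnikov(d / h) / h
        s0, s1, s2 = w.sum(), (w * d).sum(), (w * d**2).sum()
        # Intercept of the kernel-weighted least-squares line in d.
        S[i] = w * (s2 - s1 * d) / (s0 * s2 - s1**2)
    return S

def profile_mean(y, x, z, delta, h):
    """Profile estimate of beta as in (3), then kappa_hat as in (4) with F = X^T beta + theta."""
    S = local_linear_smoother(z, delta, h)
    obs = delta == 1
    y0 = np.where(obs, y, 0.0)           # missing responses carry zero weight in S
    x_res = (x - S @ x)[obs]             # covariates with the smooth in Z removed
    y_res = (y0 - S @ y0)[obs]           # responses with the smooth in Z removed
    beta = np.linalg.lstsq(x_res, y_res, rcond=None)[0]
    theta = S @ (y0 - x @ beta)          # theta_hat(Z_i, beta_hat)
    return beta, float(np.mean(x @ beta + theta))
```

With complete data the returned value should be close to the sample mean of `y`; the sketch does not estimate $\sigma^2$, which does not enter the population mean.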
2. SEMIPARAMETRIC MODELS WITH
A SINGLE COMPONENT
2.1 Main Results
We benefit from the fact that the limiting expansions for $\hat B$ and $\hat\theta(\cdot)$ are essentially already well known, with the minor modification of incorporating the missing response indicators. Let $f(z)$ be the density function of $Z$, which is assumed to have bounded support and to be positive on that support. Let $\Omega(z) = f(z)E\{\delta L_{\theta\theta}(\cdot) \mid Z = z\}$. Let $L_{i\theta}(\cdot) = L_\theta\{Y_i,X_i,\theta_0(Z_i),B_0\}$ and so on. Then it follows from standard results (see the Appendix for more discussion) that as a minor modification of the work of Severini and Wong (1992),

$$\hat\theta(z,\hat B) - \theta_0(z) = (h^2/2)\,\theta_0^{(2)}(z) - n^{-1}\sum_{i=1}^{n}\delta_i K_h(Z_i - z)L_{i\theta}(\cdot)/\Omega(z) + \theta_B(z,B_0)(\hat B - B_0) + o_p(n^{-1/2}), \tag{5}$$

$$\hat B - B_0 = M_1^{-1}n^{-1}\sum_{i=1}^{n}\delta_i\epsilon_i + o_p(n^{-1/2}), \tag{6}$$

where

$$\theta_B(z,B_0) = -E\{\delta L_{B\theta}(\cdot) \mid Z = z\}/E\{\delta L_{\theta\theta}(\cdot) \mid Z = z\},$$

$$\epsilon_i = \{L_{iB}(\cdot) + L_{i\theta}(\cdot)\theta_B(Z_i,B_0)\},$$

$$M_1 = E(\delta\epsilon\epsilon^{T}) = -E\big[\delta\{L_{BB}(\cdot) + L_{B\theta}(\cdot)\theta_B^{T}(Z,B_0)\}\big],$$

and where, under regularity conditions, (5) is uniform in $z$. Conditions guaranteeing (6) are well known; see the Appendix.

Define

$$D_i(\cdot) = -L_{i\theta}(\cdot)\,\frac{E\{F_\theta(\cdot) \mid Z_i\}}{E\{\delta L_{\theta\theta}(\cdot) \mid Z_i\}}, \tag{7}$$

$$M_2 = E\{F_B(\cdot) + F_\theta(\cdot)\theta_B(Z,B_0)\}. \tag{8}$$

In the Appendix we show the following result.

Theorem 1. Suppose that $nh^4 \to 0$ and that (5) and (6) hold, the former uniformly in $z$. Suppose also that $Z$ has compact support, that its density is bounded away from 0 on that support, and that the kernel function also has a finite support. Then the estimator $\hat\kappa$ of $\kappa_0 = E[F\{X,\theta_0(Z),B_0\}]$ is semiparametric efficient in the sense of Newey (1990). In addition, as $n \to \infty$,

$$n^{1/2}(\hat\kappa - \kappa_0) = n^{-1/2}\sum_{i=1}^{n}\big[F_i(\cdot) - \kappa_0 + M_2^{T}M_1^{-1}\delta_i\epsilon_i + \delta_i D_i(\cdot)\big] + o_p(1) \tag{9}$$

$$\Rightarrow \mathrm{Normal}\big[0,\; E\{F(\cdot) - \kappa_0\}^2 + M_2^{T}M_1^{-1}M_2 + E\{\delta D^2(\cdot)\}\big]. \tag{10}$$
Remark 1. To obtain asymptotically correct inference about $\kappa_0$, there are two possible routes. The first is to use the bootstrap: Whereas Chen, Linton, and Van Keilegom (2003) only justified the bootstrap for estimating $B_0$, we conjecture that the bootstrap works for $\kappa_0$ as well. More formally, one requires only a consistent estimate of the limiting variance in (10). This is a straightforward exercise, although programming intense: One merely replaces all the expectations by sums in that expression and all the regression functions by kernel estimates.

Remark 2. Our analysis of semiparametric efficiency in the sense of Newey (1990) has this outline. We first assume pathwise differentiability of $\kappa$; see Section A.2.2 for a definition. Working with this assumption, we derive the semiparametric efficient score. With this score in hand, we then prove pathwise differentiability. Details are given in the Appendix.

Remark 3. With a slight modification using a device introduced to semiparametric methods by Bickel (1982), Theorem 1 also holds for estimated bandwidths. We confine our discussion to bandwidths of order $n^{-1/3}$; see Section 5.1.2 for a reason. Write such bandwidths as $h_n = cn^{-1/3}$, where, following Bickel, the values for $c$ are allowed to take values in the set $U = a\{0,\pm 1,\pm 2,\ldots\}$, where $a$ is an arbitrary small number. We discretize bandwidths so that they take on values $cn^{-1/3}$ with $c \in U$. Denote estimators as $\hat\kappa(h_n)$ and note that for an arbitrary $c_*$ and an arbitrary fixed, deterministic sequence $c_n \to c_0$ for finite $c_0$, Theorem 1 shows that $n^{1/2}\{\hat\kappa(c_n n^{-1/3}) - \hat\kappa(c_0 n^{-1/3})\} = o_p(1)$ and that $n^{1/2}\{\hat\kappa(c_0 n^{-1/3}) - \hat\kappa(c_* n^{-1/3})\} = o_p(1)$. Hence, it follows from Bickel (1982, p. 653, just after eq. 3.7) that if $\hat h_n = \hat c_n n^{-1/3}$, with $\hat c \in U$, is an estimated bandwidth with the property that $\hat h_n = O_p(n^{-1/3})$, then $n^{1/2}\{\hat\kappa(\hat c_n n^{-1/3}) - \hat\kappa(c_* n^{-1/3})\} = o_p(1)$. Hence, Theorem 1 holds for these estimated bandwidths.
2.2 General Functions of the Response
and Double Robustness
It is important to consider estimation in problems where $\kappa_0$ can be constructed outside the model. Suppose that $\kappa_0 = E\{G(Y)\}$ and define $F\{X,\theta_0(Z),B_0\} = E\{G(Y) \mid X,Z\}$. We will discuss two estimators with the properties that (1) if there are no missing response data, the semiparametric model is not used and the estimator is consistent; and (2) under certain circumstances, the estimator is consistent if either the semiparametric model is correct or a model for the missing-data process is correct.

Our motivating example discussed in Section 4 does not fall into the category discussed in this section.

The two estimators are based on different constructions for estimating the missing-data process. The first is based on a nonparametric formulation for estimating $\mathrm{pr}(\delta = 1 \mid Z) = \pi_{\mathrm{marg}}$, where the subscript indicates a marginal estimation of the probability that $Y$ is observed. The second is based on a parametric formulation for estimating $\mathrm{pr}(\delta = 1 \mid Y,X,Z) = \pi(X,Z,\zeta)$, where $\zeta$ is an unknown parameter estimated by standard logistic regression of $\delta$ on $(X,Z)$.
The first estimator, similar to one defined by Wang et al. (2004) and efficient in the Gaussian partially linear model, can be constructed as follows. Estimate $\pi_{\mathrm{marg}}$ by local linear logistic regression of $\delta$ on $Z$, leading to the usual asymptotic expansion

$$\hat\pi_{\mathrm{marg}}(z) - \pi_{\mathrm{marg}}(z) = n^{-1}\sum_{j=1}^{n}\{\delta_j - \pi_{\mathrm{marg}}(Z_j)\}K_h(z - Z_j)/f_Z(z) + o_p(n^{-1/2}), \tag{11}$$

assuming that $nh^4 \to 0$. Then construct the estimator

$$\hat\kappa_{\mathrm{marg}} = n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}G(Y_i) + \left\{1 - \frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right].$$

The estimator has two useful properties: (1) if there are no missing data, it does not depend on the model and is, hence, consistent for $\kappa_0$; and (2) if observation of the response $Y$ depends only on $Z$, it is consistent even if the semiparametric model is not correct.
In a similar vein, the second estimate, also similar to another estimate of Wang et al. (2004), is given as

$$\hat\kappa = n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right].$$

This estimator has the double-robustness property that if either the parametric model $\pi(X,Z,\zeta)$ or the underlying semiparametric model for $\{B,\theta(\cdot)\}$ is correct, then $\hat\kappa$ is consistent and asymptotically normally distributed. Generally, the second terms in both $\hat\kappa_{\mathrm{marg}}$ and $\hat\kappa$ improve efficiency: They are also important for the double-robustness property of $\hat\kappa$.

If both models are correct, then the following results are obtained as a consequence of (5) and (6); see the Appendix for a sketch.
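Both estimators in this section share the same algebraic form once the observation probabilities and the model-based predictions have been computed. The sketch below evaluates that form from precomputed arrays; the function and argument names are hypothetical, and the fitting of the probabilities and of $F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}$ is assumed to have been done elsewhere.

```python
import numpy as np

def doubly_robust_mean(delta, g_y, pi_hat, f_hat):
    """Average of (delta/pi) * G(Y) + (1 - delta/pi) * F over the sample.

    delta  : 1 if the response was observed, 0 otherwise
    g_y    : G(Y_i); entries with delta == 0 are ignored (may be NaN)
    pi_hat : estimated probability that the response is observed
    f_hat  : model-based estimate of E{G(Y) | X, Z}
    """
    delta = np.asarray(delta, dtype=float)
    g_obs = np.where(delta == 1, np.asarray(g_y, dtype=float), 0.0)
    w = delta / np.asarray(pi_hat, dtype=float)
    return float(np.mean(w * g_obs + (1.0 - w) * np.asarray(f_hat, dtype=float)))
```

When every response is observed and the probabilities equal 1, the second term vanishes and the function returns the plain average of `g_y`, which is property (1) stated above.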
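Lemma 1. Define

$$M_{2,\mathrm{marg}} = E\left[\left\{1 - \frac{\delta}{\pi_{\mathrm{marg}}(Z)}\right\}\{F_B(\cdot) + F_\theta(\cdot)\theta_B(Z,B_0)\}^{T}\right],$$

$$D_{i,\mathrm{marg}}(\cdot) = -L_{i\theta}(\cdot)\,\frac{E\left[\left\{1 - \dfrac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F_{i\theta}(\cdot)\,\Big|\,Z_i\right]}{E\{\delta L_{\theta\theta}(\cdot) \mid Z_i\}}.$$

Then, to terms of order $o_p(1)$,

$$n^{1/2}(\hat\kappa_{\mathrm{marg}} - \kappa_0) \approx n^{-1/2}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F_i(\cdot) - \kappa_0\right] + M_{2,\mathrm{marg}}M_1^{-1}n^{-1/2}\sum_{i=1}^{n}\delta_i\epsilon_i + n^{-1/2}\sum_{i=1}^{n}\delta_i D_{i,\mathrm{marg}}(\cdot). \tag{12}$$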
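Lemma 2. Define $\pi_\zeta(X,Z,\zeta) = \partial\pi(X,Z,\zeta)/\partial\zeta$. Assume that $n^{1/2}(\hat\zeta - \zeta) = n^{-1/2}\sum_{i=1}^{n}\psi_{i\zeta}(\cdot) + o_p(1)$ with $E\{\psi_\zeta(\cdot) \mid X,Z\} = 0$. Then, to terms of order $o_p(1)$,

$$n^{1/2}(\hat\kappa - \kappa_0) \approx n^{-1/2}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\{G(Y_i) - \kappa_0\} + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}\{F_i(\cdot) - \kappa_0\}\right]. \tag{13}$$

Remark 4. The expansions (12) and (13) show that $\hat\kappa_{\mathrm{marg}}$ and $\hat\kappa$ are asymptotically normally distributed. One can show that the asymptotic variances are given as

$$V_{\kappa,\mathrm{marg}} = \mathrm{var}\left[\frac{\delta}{\pi_{\mathrm{marg}}(Z)}G(Y) + \left\{1 - \frac{\delta}{\pi_{\mathrm{marg}}(Z)}\right\}F(\cdot) + M_{2,\mathrm{marg}}M_1^{-1}\delta\epsilon + \delta D_{\mathrm{marg}}(\cdot)\right],$$

$$V_\kappa = \mathrm{var}\left[\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}G(Y_i) + \left\{1 - \frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}F_i(\cdot)\right],$$

respectively, from which estimates are readily derived.

Finally, we note that Claeskens and Carroll (2007) showed that in general likelihood problems, if there is an omitted covariate, then under contiguous alternatives the effect on estimators is to add an asymptotic bias, without changing the asymptotic variance.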
3. SINGLE-INDEX MODELS

One means of dimension reduction is single-index modeling. Single-index models can be viewed as a generalized version of projection pursuit, in that only the most influential direction is retained to keep the model tractable and to reduce dimension. Since its introduction in Härdle and Stoker (1989), single-index modeling has been widely studied and used. A comprehensive summary of the model is given in Härdle, Müller, Sperlich, and Werwatz (2004). Let $Z = (R,S^{T})^{T}$ where $R$ is a scalar. We consider here the generalized partially linear single-index model (GPLSIM) of Carroll et al. (1997), namely, the exponential family (20) with $\eta(X,Z) = X^{T}\beta_0 + \theta_0(Z^{T}\alpha_0)$, where $\theta_0(\cdot)$ is an unknown function and for identifiability purposes $\|\alpha_0\| = 1$. Because identifiability requires that one of the components of $Z$ be a nontrivial predictor of $Y$, for convenience we will make the very small modification that one component of $Z$, what we call $R$, is a known nontrivial predictor of $Y$. The reason for making this modification can be seen in theorem 4 of Carroll et al. (1997) where the final limit distribution of the estimate of $\alpha_0$ has a singular covariance matrix. In addition, their main asymptotic expansion, given in their equation (A.12), is about the nonsingular transformation $(I - \alpha_0\alpha_0^{T})(\hat\alpha - \alpha_0)$.

With this modification, we write the model as

$$E(Y \mid X,Z) = C^{(1)}\big[c\{\eta(X,Z)\}\big] = \mu\{X^{T}\beta_0 + \theta_0(R + S^{T}\gamma_0)\}, \tag{14}$$

where $\gamma_0$ is unrestricted.

Carroll et al. (1997) used profile likelihood to estimate $B_0 = (\gamma_0,\beta_0)$ and $\theta_0(\cdot)$, although they presented no results concerning the estimate of $\phi_0$, their interest largely being in logistic regression where $\phi_0 = 1$ is known. Rewrite the likelihood function (20) as $L\{Y,X,\beta,\theta(R + S^{T}\gamma),\phi\}$. Then, given $B = (\gamma^{T},\beta^{T})^{T}$, they formed $U(\gamma) = R + S^{T}\gamma$ and computed the estimate $\hat\theta\{u(\gamma),B\}$ by local likelihood of $Y$ on $\{X,U(\gamma)\}$ as in Severini and Staniswalis (1994), using the data with $\delta = 1$. Then they maximized $\sum_{i=1}^{n}\delta_i\log[L\{Y_i,X_i,\beta,\hat\theta(R_i + S_i^{T}\gamma,B),\phi\}]$ in $B$ and $\phi$.

Our goal is to estimate $\kappa_{SI} = E[F\{X,\theta_0(R + S^{T}\gamma_0),\beta_0,\phi_0\}]$. Our proposed estimate is $\hat\kappa_{SI} = n^{-1}\sum_{i=1}^{n}F\{X_i,\hat\theta(R_i + S_i^{T}\hat\gamma,\hat B),\hat\beta,\hat\phi\}$.

Our main result is as follows. First, define $U = R + S^{T}\gamma_0$ and

$$G = D_\phi(Y,\phi_0) - \big[Yc\{X^{T}\beta_0 + \theta_0(U)\} - C\{c(\cdot)\}\big]/\phi_0^2.$$

Also define $\Lambda = \{S^{T}\theta_0^{(1)}(U),X^{T}\}^{T}$, $\rho_\ell(\cdot) = \{\mu^{(1)}(\cdot)\}^{\ell}/V(\cdot)$, and $\epsilon = [Y - \mu\{X^{T}\beta_0 + \theta_0(U)\}]\rho_1\{X^{T}\beta_0 + \theta_0(U)\}$. Define $N_i = \Lambda_i - [E\{\delta\rho_2(\cdot) \mid U_i\}]^{-1}E\{\delta_i\Lambda_i\rho_2(\cdot) \mid U_i\}$ and $Q = E\{\delta NN^{T}\rho_2(\cdot)\}$. Make the further definitions $F_\beta(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\beta_0$, $F_\phi(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\phi_0$, and $F_\theta(\cdot) = \partial F\{X,\theta_0(U),\beta_0,\phi_0\}/\partial\theta_0(U)$. Also define

$$J(U) = [E\{\delta\rho_2(\cdot) \mid U\}]^{-1}E\{F_\theta(\cdot) \mid U\},$$

$$D = \begin{pmatrix} E\{F_\theta(\cdot)\theta^{(1)}(U)S\} - E\big(F_\theta(\cdot)[E\{\delta\rho_2(\cdot) \mid U\}]^{-1}\theta^{(1)}(U)E\{\delta S\rho_2(\cdot) \mid U\}\big) \\ E\{F_\beta(\cdot)\} - E\big(F_\theta(\cdot)[E\{\delta\rho_2(\cdot) \mid U\}]^{-1}E\{\delta X\rho_2(\cdot) \mid U\}\big) \end{pmatrix}.$$

Then we have the following result regarding the asymptotic distribution of $\hat\kappa_{SI}$.

Theorem 2. Assume that $(Y_i,\delta_i,X_i,Z_i)$, $i = 1,2,\ldots,n$, are iid and that the conditions in Carroll et al. (1997) hold, in particular, that $nh^4 \to 0$. Then

$$n^{1/2}(\hat\kappa_{SI} - \kappa_{SI}) = n^{-1/2}\sum_{i=1}^{n}\big[F\{X_i,\theta_0(U_i),\beta_0,\phi_0\} - \kappa_{SI} + D^{T}Q^{-1}\delta_i N_i\epsilon_i + \delta_i J(U_i)\epsilon_i + \delta_i G_i E\{F_\phi(\cdot)\}/E(\delta G^2)\big] + o_p(1) \tag{15}$$

$$\Rightarrow \mathrm{Normal}(0,V),$$

where $V = E[F\{X,\theta_0(U),\beta_0,\phi_0\} - \kappa_{SI}]^2 + D^{T}Q^{-1}D + \mathrm{var}\{\delta J(U)\epsilon\} + E(\delta G^2)[E\{F_\phi(\cdot)\}]^2/\{E(\delta G^2)\}^2$. Further, $\hat\kappa_{SI}$ is semiparametric efficient.

4. MOTIVATING EXAMPLE
4.1 Introduction
There is considerable interest in understanding the distribution of dietary intake in various populations. For example, as obesity rates continue to rise in the United States (Flegal, Carroll, Ogden, and Johnson 2002), the demand for information about diet and nutrition is increasing. Information on dietary intake has implications for establishing population norms, conducting research, and making public policy decisions (Woteki 2003).

We wish to emphasize that there are no missing response data in this example. We also emphasize that the problem is vastly different from simply estimating the population mean using a Gaussian partially linear model. The strength of our approach is that once we have proposed a semiparametric model, then our methodology, asymptotics, and semiparametric efficiency results are readily employed.

This article was motivated by the analysis of the Eating at America's Table Study (EATS) (Subar et al. 2001), where estimating the distribution of the consumption of episodically consumed foods is of interest. The data consist of four 24-hour recalls over the course of a year as well as the National Cancer Institute's (NCI) dietary history questionnaire (DHQ), a particular version of a food frequency questionnaire (FFQ; see Willett et al. 1985 and Block et al. 1986). The goal is to estimate the distribution of usual intake, defined as the average daily intake of a dietary component by an individual in a fixed time period, a year in the case of EATS. There were n = 886 individuals in the dataset.

When the responses are continuous random variables, this is a classical problem of measurement error, with a large literature. However, little of the literature is relevant to episodically consumed foods, as we now describe. Consider, for example, consumption of red meat, dark-green vegetables, and deep-yellow vegetables, all of interest in nutritional surveillance. In the EATS data, 45% of the 24-hour recalls reported no red-meat consumption. In addition, 5.5% of the individuals reported no red-meat consumption on any of the four separate 24-hour recalls: For deep-yellow vegetables these numbers are 63% and 20%, respectively, while for dark-green vegetables the numbers are 78% and 46%, respectively. Clearly, methods aimed at understanding usual intakes for continuous data are inappropriate for episodically consumed foods with so many zero-reported intakes.
4.2 Model
To handle episodically consumed foods, two-part models have been developed (Tooze, Grunwald, and Jones 2002). These are basically zero-inflated repeated-measures examples. Our methods are applicable to such problems when the covariate $Z$ is evaluated only once for each subject, as it is in our example.

We describe here a simplification of this approach, used to illustrate our methodology. On each individual, we measure age and gender, the collection being what we call $R$. We also observe energy (calories) as measured by the DHQ, the logarithm of which we call $Z$. The reader should note that $Z$ is evaluated only once per individual, and, hence, while there are repeated measures on the responses, there are no repeated measures on $Z$: $\theta_0(Z)$ occurs only once in the likelihood function, and our methodology applies.

Let $X = (R,Z)$. The response data for an individual $i$ consist of four 24-hour recalls of red-meat consumption. Let $\Delta_{ij} = 1$ if red meat is reported consumed on the $j$th 24-hour recall for $j = 1,\ldots,4$. Let $Y_{ij}$ be the product of $\Delta_{ij}$ and the logarithm of reported red-meat consumption, with the convention that $0\log(0) = 0$. Then the response data are $Y_i = (\Delta_{ij},Y_{ij})_{j=1}^{4}$.
4.2.1 Modeling the Probability of Zero Response. The first part of the model is whether the subject reports red-meat consumption. We model this as a repeated-measures logistic regression, so that

$$\mathrm{pr}(\Delta_{ij} = 1 \mid R_i,Z_i,U_{i1}) = H(\beta_0 + X_i^{T}\beta_1 + U_{i1}), \tag{16}$$

where $H(\cdot)$ is the logistic distribution function and $U_{i1} = \mathrm{Normal}(0,\sigma_{u1}^2)$ is a person-specific random effect. Note that, for simplicity, we have modeled the effect of energy consumption as linear, because in the data there is little hint of nonlinearity.
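4.2.2 Modeling Positive Responses. The second part of the model consists of a distribution of the logarithm of red-meat consumption on days when consumption is reported, namely,

$$[Y_{ij} \mid \Delta_{ij} = 1,R_i,Z_i,U_{i2}] = \mathrm{Normal}\{R_i^{T}\beta_2 + \theta(Z_i) + U_{i2},\sigma^2\}, \tag{17}$$

where $U_{i2} = \mathrm{Normal}(0,\sigma_{u2}^2)$ is a person-specific random effect, which we take to be independent of $U_{i1}$. Note that (17) means that the nonzero $Y$ data within an individual marginally have the same mean $R_i^{T}\beta_2 + \theta(Z_i)$, variance $\sigma^2 + \sigma_{u2}^2$, and common covariance $\sigma_{u2}^2$.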
4.2.3 Likelihood Function. The collection of parameters is $B$, consisting of $\beta_0$, $\beta_1$, $\beta_2$, $\sigma_{u1}^2$, $\sigma_{u2}^2$, and $\sigma^2$. The loglikelihood function $L(\cdot)$ is readily computed with numerical integration as follows:

$$\exp\{L(\cdot)\} = \frac{1}{\sigma_{u1}}\int\phi\left(\frac{u_1}{\sigma_{u1}}\right)\prod_{j=1}^{4}\{H(\beta_0 + X^{T}\beta_1 + u_1)\}^{\Delta_{ij}}\{1 - H(\beta_0 + X^{T}\beta_1 + u_1)\}^{1-\Delta_{ij}}\,du_1 \times \sigma_{u2}^{-1}\sigma^{-\sum_j\Delta_{ij}}\int\phi\left(\frac{u_2}{\sigma_{u2}}\right)\prod_{j=1}^{4}\left[\phi\left(\frac{Y_{ij} - \{R_i^{T}\beta_2 + \theta(Z_i) + u_2\}}{\sigma}\right)\right]^{\Delta_{ij}}du_2.$$

Of course, the second numerical integral is not necessary, because the integration can be done analytically.
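The two integrals in the likelihood are expectations over the normal random effects, so one way to evaluate them numerically is Gauss–Hermite quadrature. The sketch below computes $\exp\{L(\cdot)\}$ for a single individual; the quadrature rule, the number of nodes, and all function and argument names are assumptions chosen for illustration, and the article does not state which integration method was used.

```python
import numpy as np

# Nodes and weights for integrals of the form  int exp(-x^2) f(x) dx.
NODES, WEIGHTS = np.polynomial.hermite.hermgauss(40)

def normal_expectation(g, sd):
    """E[g(U)] for U ~ Normal(0, sd^2), by Gauss-Hermite quadrature."""
    return float(np.sum(WEIGHTS * g(np.sqrt(2.0) * sd * NODES)) / np.sqrt(np.pi))

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def normal_pdf(t):
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def individual_likelihood(consumed, y, lin1, mean2, sd_u1, sd_u2, sd):
    """exp{L} for one person.

    consumed : 0/1 consumption indicators for the recalls
    y        : log amounts (ignored where consumed == 0)
    lin1     : beta_0 + X^T beta_1
    mean2    : R^T beta_2 + theta(Z)
    """
    consumed = np.asarray(consumed, dtype=float)
    y = np.where(consumed == 1, np.asarray(y, dtype=float), 0.0)

    def part1(u1):
        # Bernoulli likelihood of the consumption indicators given u1.
        p = logistic(lin1 + u1[:, None])
        return np.prod(p**consumed * (1.0 - p)**(1.0 - consumed), axis=1)

    def part2(u2):
        # Normal densities of the positive responses given u2.
        dens = normal_pdf((y - mean2 - u2[:, None]) / sd) / sd
        return np.prod(dens**consumed, axis=1)

    return normal_expectation(part1, sd_u1) * normal_expectation(part2, sd_u2)
```

As noted in the text, the second factor has a closed form, which makes it a convenient check on the quadrature.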
4.2.4 Defining Usual Intake at the Individual Level. Noting from (17) that reported intake on days of consumption follows a lognormal distribution, the usual intake for an individual is defined as

$$G\{X,U_1,U_2,B,\theta(Z)\} = H(\beta_0 + X^{T}\beta_1 + U_1) \times \exp\{R^{T}\beta_2 + \theta(Z) + U_2 + \sigma^2/2\}. \tag{18}$$

The goal is to understand the distribution of $G\{X,U_1,U_2,B,\theta(Z)\}$ across a population. In particular, for arbitrary $c$ we wish to estimate $\mathrm{pr}[G\{X,U_1,U_2,B,\theta(Z)\} > c]$. Define $F\{X,B,\theta(Z)\} = \mathrm{pr}[G\{X,U_1,U_2,B,\theta(Z)\} > c \mid X,Z]$, a quantity that can be computed by numerical integration. Then $\kappa_0 = E[F\{X,B,\theta(Z)\}]$ is the percentage of the population whose long-term reported daily average consumption of red meat exceeds $c$.
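The conditional probability $F\{X,B,\theta(Z)\}$ can be reduced to a one-dimensional integral: given $U_1$, the event $G > c$ in (18) is an inequality in the normal variable $U_2$, so its probability is a normal distribution function, which is then averaged over $U_1$. The sketch below does this with Gauss–Hermite quadrature; the quadrature rule and all names are illustrative assumptions rather than the authors' implementation.

```python
import math
import numpy as np

# Nodes and weights for integrals of the form  int exp(-x^2) f(x) dx.
NODES, WEIGHTS = np.polynomial.hermite.hermgauss(40)

def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def exceedance_probability(c, lin1, mean2, sd_u1, sd_u2, sd):
    """pr[G > c | X, Z] for usual intake G as defined in (18).

    lin1  : beta_0 + X^T beta_1
    mean2 : R^T beta_2 + theta(Z)
    """
    u1 = math.sqrt(2.0) * sd_u1 * NODES
    h = 1.0 / (1.0 + np.exp(-(lin1 + u1)))
    # Given u1, G > c  <=>  U2 > log(c / h) - mean2 - sd^2 / 2.
    arg = (mean2 + 0.5 * sd**2 - np.log(c / h)) / sd_u2
    probs = np.array([normal_cdf(a) for a in arg])
    return float(np.sum(WEIGHTS * probs) / math.sqrt(math.pi))
```

Averaging this quantity over the observed $(X_i, Z_i)$, with estimated parameters and $\hat\theta$ plugged in, gives an estimate of the form (4).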
4.3 Bias in Naive Estimates, and a Simulation Study

We emphasize that the distribution of mean intake cannot be estimated consistently by the simple device of computing the sample percentage of the observed 24-hour recalls that exceed $c$, and, as a consequence, going through the model-fitting process is actually necessary. To see this, suppose only one 24-hour recall per person was computed and the percentage of these 24-hour recalls exceeding $c$ was computed. In large samples, this percentage converges to

$$\kappa_{24\mathrm{hr}} = E\Big(H(\beta_0 + X^{T}\beta_1 + U_1) \times \Phi\big[\{R^{T}\beta_2 + \theta(Z) - \log(c)\}/(\sigma^2 + \sigma_2^2)^{1/2}\big]\Big).$$

In contrast, for $\sigma_2 > 0$,

$$\kappa_0 = E\Big(\Phi\big[\{R^{T}\beta_2 + \theta(Z) + \sigma^2/2 - \log\{c/H(\beta_0 + X^{T}\beta_1 + U_1)\}\}/\sigma_2\big]\Big).$$

As the number of replicates $m$ of the 24-hour recall approaches $\infty$, the percentage $\kappa_{m,24\mathrm{hr}}$ of the means of the 24-hour recalls that exceed $c$ converges to $\kappa_0$, so we would expect that the fewer the replicates, the less our estimate agrees with the sample version of $\kappa_{m,24\mathrm{hr}}$, a phenomenon observed in our data; see below.

To see this numerically, we ran the following simulation study. Gender, age, and the DHQ were kept the same as in the EATS. The parameters $(\beta_0,\beta_1,\beta_2,\sigma^2,\sigma_1^2,\sigma_2^2)$ were the same as our estimated values; see below. The function $\theta(\cdot)$ was roughly in accord with our estimated function, for simplicity, being quadratic in the logarithm of the DHQ, standardized to have minimum .0 and maximum 1.0, with intercept, slope, and quadratic parameters being .50, 1.50, and −.75, respectively. The true survival function, that is, 1 − the cdf, was computed analytically, while the survival functions for the mean of two 24-hour recalls and the mean of four 24-hour recalls were computed from 1,000 simulated datasets.

The results are given in Figure 1, where the bias from not using a model is evident.

We used our methods with a nonparametrically estimated function, a bandwidth h = .30, and the Epanechnikov kernel function. We generated 300 datasets, with results displayed in Figure 2. The mean over the simulation was almost exactly the correct function, not surprising given that the sample size is large (n = 886). In Figure 2 we also display a 90% confidence range from the simulated datasets, indicating that in the EATS data at least, the results of our approach are relatively accurate.
4.4 Data Analysis
We standardized age to have mean 0 and variance 1. In the logistic
part of the model, the intercept was estimated as −8.15, with the
coefficients for (gender, age, DHQ) = (.13,.14,1.09). The
random-effect variance was estimated as σ̂_1^2 = .66. In the
continuous part of the model, we used bandwidths ranging from .05 to
.40, with little change in any of the estimates, as described in
more detail in Section 5.1. With a bandwidth h = .30, our estimates
were σ̂^2 = .76, σ̂_2^2 = .043, and the coefficients for gender and
age were −.25 and .02, respectively. The coefficient for the
person-specific random effect σ_2^2 appears intrinsic to the data:
We used other methods such as mixed models with polynomial fits and
obtained roughly the same answers.
Figure 1. Results of the Simulation Study Meant to Mimic the EATS
Study. All results are averages over 1,000 simulated datasets.
Plotted are the mean of the semiparametric estimator of the survival
curve, which is almost identical to the true survival curve; the
empirical survival function of the mean of two 24-hour recalls from
1,000 simulated datasets; and the empirical survival function of the
mean of four 24-hour recalls from 1,000 simulated datasets.
We display the computed survival function in Figure 3. Displayed
there are our method, along with the empirical survival functions
for the mean of the first two 24-hour recalls and the mean of all
four 24-hour recalls. While these are biased, it is interesting to
note that using the mean of only two 24-hour recalls differs more
from our method than using the mean of four 24-hour recalls, which
is expected as described previously. The similarity of Figures 1 and
3 is striking, mainly indicating that naive approaches, such as
using the mean of two 24-hour recalls, can result in badly biased
estimates of κ0.
Figure 2. Results of the Simulation Study Meant to Mimic the EATS
Study. Plotted is the mean survival function for 300 simulated
datasets, along with the 90% pointwise confidence intervals. The
mean fitted function is almost exact.
Maity, Ma, and Carrol: Efficient Estimation of Population Quantities129
Figure 3. Results From the EATS Example. Plotted are estimates of
the survival function (1 − the cdf) of usual intake of red meat. The
solid line is the semiparametric method described in Section 4. The
dotted line is the empirical survival function of the mean of the
first two 24-hour recalls per person, while the dashed line is the
survival function of the mean of all the 24-hour recalls per person.
5. BANDWIDTH SELECTION, THE PARTIALLY LINEAR
MODEL, AND THE SAMPLE MEAN
5.1 Bandwidth Selection
5.1.1 Background. We have used a standard first-order kernel
density function, that is, one with mean 0 and positive variance.
With this choice, in Theorem 1 we have assumed that the bandwidth
satisfies nh^4 → 0: for estimation of the population mean in the
partially linear model. In contrast, if one were interested only in
B0, then it is well known that by using profile likelihood the usual
bandwidth order h ∼ n^{-1/5} is acceptable, and off-the-shelf
bandwidth selection techniques yield an asymptotically normal limit
distribution.
The reason for the technical need for undersmoothing is the
inclusion of θ0(·) in κ0. For example, suppose that κ0 = E{θ0(Z)}.
Then it follows from (5) that κ̂ − κ0 = Op(h^2 + n^{-1/2}). Thus, in
order for n^{1/2}(κ̂ − κ0) = Op(1), we require that nh^4 = Op(1). The
additional restriction that nh^4 → 0 merely removes the bias term
entirely.
Note that κ0 is not a parameter in the model, being a mixture of the
parametric part B0, the nonparametric part θ0(·), and the joint
distribution of (X,Z). Thus, it does not appear that κ0 can be
estimated by profiling ideas.
5.1.2 Optimal Estimation. As seen in Theorem 1, the asymptotic
distribution of n^{1/2}(κ̂ − κ0) is unaffected by the bandwidth, at
least to first order. In Section 5.1.3 we give intuitive and
numerical evidence of the lack of sensitivity to the bandwidth
choice; see also Section 5.3 for further numerical evidence. In
Section 5.1.4 we describe three different, simple practical methods
for bandwidth selection in this problem, all of which work quite
well in our simulations and example.
Because first-order calculations do not get squarely at the choice
of bandwidth, other than to suggest that it is not particularly
crucial, an alternative theoretical device is to do second-order
calculations. Define η(n,h) = n^{1/2}h^2 + (n^{1/2}h)^{-1}. In a
problem similar to ours, Sepanski, Knickerbocker, and Carroll (1994)
showed that the variance of linear combinations of the estimate of
B0 has a second-order expansion as follows. Suppose we want to
estimate ξ^T B0. Then, for constants (a1,a2),
\[
n^{1/2}(\xi^T\hat{B} - \xi^T B_0) = V_n + o_p\{\eta(n,h)\},
\]
\[
\mathrm{cov}(V_n) = \mathrm{constant}
  + \{a_1 n^{1/2}h^2 + a_2 (h n^{1/2})^{-1}\}^2.
\]
This means that the optimal bandwidth is on the order of
h = cn^{-1/3} for a constant c depending on (a1,a2), which, in turn,
depend on the problem, that is, on the distribution of (Y,X,Z) as
well as on B0 and θ0(·). In their practical implementation,
translated from the Gaussian kernel function to our Epanechnikov
kernel function, Sepanski et al. (1994) suggested the following
device, namely, that if the optimal bandwidth for estimating θ0(·)
is ho = cn^{-1/5}, then one should use the correct-order bandwidth
h = cn^{-1/3}. They also did sensitivity analysis, for example,
h = (1/2)cn^{-1/3}, but found little change in their simulations.
One of our three methods of practical bandwidth selection is exactly
this one.
A problem not within our framework but carrying a similar flavor was
considered by Powell and Stoker (1996) and Newey, Hsieh, and Robins
(2004), namely, the estimation of the weighted average derivative
κAD = E{Yθ0^{(1)}(Z)}. As done by Sepanski et al. (1994), Powell and
Stoker (1996) showed that the optimal bandwidth constructed from
second-order calculations is an undersmoothed bandwidth. Newey
et al. (2004) suggested that a simple device of choosing the
bandwidth is to choose something optimal when using a standard
second-order kernel function but to then undersmooth, in effect, by
using a higher order kernel such as the twicing kernel. This is our
second bandwidth selection method described in Section 5.1.4. Like
the first, it appears to be an effective means of eliminating the
bias term.
In our problem, the article by Sepanski et al. (1994) is more
relevant. Preliminary calculations based on the basic tools in that
article suggest that for our problem, the optimal bandwidth is also
of order n^{-1/3}. We intend to pursue these very calculations in
another article.
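As a worked check of the n^{-1/3} rate in the Sepanski et al. (1994) expansion, the bandwidth order can be read off the second-order term in cov(Vn). This is only a sketch: it assumes that a1 and a2 are nonzero constants, and it says nothing about the value of c.

```latex
% Sketch: order of the bandwidth minimizing the second-order term
% {a_1 n^{1/2} h^2 + a_2 (n^{1/2} h)^{-1}}^2, assuming a_1 a_2 \neq 0.
\begin{align*}
g(h) &= a_1 n^{1/2} h^2 + a_2 (n^{1/2} h)^{-1}, \\
a_1 a_2 > 0:\quad
g'(h) &= 2 a_1 n^{1/2} h - a_2 n^{-1/2} h^{-2} = 0
  \;\Longrightarrow\; h^3 = \frac{a_2}{2 a_1 n}, \\
a_1 a_2 < 0:\quad
g(h) &= 0
  \;\Longrightarrow\; h^3 = -\frac{a_2}{a_1 n}.
\end{align*}
% In either case h = c n^{-1/3}, and then n h^4 = c^4 n^{-1/3} \to 0,
% so a bandwidth of this order also satisfies the undersmoothing condition.
```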
5.1.3 Lack of Sensitivity to Bandwidth. We have used the term
technical need for undersmoothing because that is what it really is.
In practice, as Theorem 1 states, the asymptotic distribution of κ̂
is unaffected by bandwidth choice for very broad ranges of
bandwidths. This is totally different from what happens with
estimation of the function θ0(·), where bandwidth selection is
typically critical in practice, and this is seen in theory through
the usual bias–variance tradeoff.
In practice, we expect little effect of the bandwidth selection on
estimation of B0, and even less effect on estimation of κ0. The
reason is that broad ranges of bandwidths lead to no asymptotic
effect on the distribution of B̂. The extra amount of smoothing
inherent in the summation in (4) should mean that κ̂ will be even
less sensitive to the bandwidth, the so-called double-smoothing
phenomenon.
To see this issue, consider the simulation in Wang et al. (2004).
They set X and Z to be independent, with X = Normal(1,1) and
Z = Uniform[0,1]. In the partially linear model, they set B0 = 1.5,
ε = Normal(0,1), and θ0(z) = 3.2z^2 − 1. They used the kernel
function (15/16)(1 − z^2)^2 × I(|z| ≤ 1), and they fixed the
bandwidth to be h = n^{-2/3}, which at least asymptotically is very
great undersmoothing, because h ∼ n^{-1/3} is already acceptable and
typically something like nh^2/log(n) → ∞ is usually required. In
their case 3, they used effective sample sizes for complete data of
18, 36, and 60, with corresponding bandwidths .146, .092, and .065,
respectively.
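The quoted bandwidths follow directly from the rule h = n^{-2/3}. The short check below reproduces them and, for comparison, prints the rate-only value n^{-1/3} with the constant set to 1 (an assumption made purely for illustration; the function name is also made up).

```python
def wang_bandwidth(n):
    """Bandwidth h = n^(-2/3), the fixed rule attributed to Wang et al. (2004)."""
    return n ** (-2.0 / 3.0)


for n in (18, 36, 60):
    h = wang_bandwidth(n)
    # second column: h = n^(-2/3); third column: n^(-1/3) with constant 1
    print(n, round(h, 3), round(n ** (-1.0 / 3.0), 3))
```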
We reran the simulation of Wang et al. (2004), with complete
response data and n = 60. We used bandwidths .02, .06, .10, and .14,
ranging from a very small bandwidth, less than 1/3 that used by Wang
et al. (2004), to a larger bandwidth, more than double that used. As
another perspective, if one sets h = σz n^{-c}, where σz is the
standard deviation of Z, then the bandwidths used are equivalent to
c = .73, .46, .34, and .26. In other words, a bandwidth here of
h = .02 is very great undersmoothing, while even h = .14 satisfies
the theoretical constraint on the bandwidth.
In Figure 4 we plot the results for a single dataset, where, as in
Wang et al. (2004), interest lies in estimating κ0 = E(Y). As is
obvious from this figure, the bandwidth choice is very important for
estimation of the function, but trivially unimportant for estimation
of κ0, the estimate of which ranged from 1.818 to 1.828.
In Figure 5 we plot the mean estimated functions from 100 simulated
datasets. Again, the bandwidth matters a great deal for estimating
the function θ0(·). Again, too, the bandwidth matters hardly at all
for estimating κ0. Thus, for estimating κ0, the mean estimates
across the bandwidths range from 1.513 to 1.526, and the standard
deviations of the estimates range from .249 to .252. There is
somewhat more effect of bandwidth on the estimate of B0: For
h ≥ .06, there is almost no effect, but choosing h = .02 results in
a 50% increase in standard deviation.
In other words, as expected by theory and intuition, bandwidth
selection has little effect on the estimate of B0, except when the
bandwidth is much too small, and very little effect on the
estimation of κ0 = E(Y). Similar results occur when one looks at the
variance of the errors as the parameter, and κ0 is the population
variance.
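The insensitivity of κ̂ to the bandwidth can be explored with a small sketch of the partially linear model in the Wang et al. (2004) design described above. It is a simplification rather than the estimator used in this article: a Nadaraya–Watson (local constant) smoother replaces local linear or local likelihood smoothing, β is estimated from the residuals of kernel regressions of Y and X on Z, and the function names are hypothetical.

```python
import random


def quartic(u):
    """Kernel (15/16)(1 - u^2)^2 on |u| <= 1."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def smooth(z, v, h):
    """Nadaraya-Watson regression of v on z, evaluated at the observed z values."""
    fitted = []
    for z0 in z:
        w = [quartic((zj - z0) / h) for zj in z]
        fitted.append(sum(wj * vj for wj, vj in zip(w, v)) / sum(w))
    return fitted


def kappa_hat(x, z, y, h):
    """Plug-in estimate of E(Y) in the model Y = X*beta + theta(Z) + error."""
    x_res = [a - b for a, b in zip(x, smooth(z, x, h))]
    y_res = [a - b for a, b in zip(y, smooth(z, y, h))]
    beta = sum(a * b for a, b in zip(x_res, y_res)) / sum(a * a for a in x_res)
    theta = smooth(z, [yi - beta * xi for xi, yi in zip(x, y)], h)
    fitted = [beta * xi + ti for xi, ti in zip(x, theta)]
    return sum(fitted) / len(fitted), beta, theta


def simulate(n=60, seed=0):
    """X ~ Normal(1,1), Z ~ Uniform[0,1], beta = 1.5, theta(z) = 3.2 z^2 - 1."""
    rng = random.Random(seed)
    x = [rng.gauss(1.0, 1.0) for _ in range(n)]
    z = [rng.random() for _ in range(n)]
    y = [1.5 * xi + 3.2 * zi * zi - 1.0 + rng.gauss(0.0, 1.0)
         for xi, zi in zip(x, z)]
    return x, z, y
```

Running `kappa_hat(x, z, y, h)` for h in (.02, .06, .10, .14) on one simulated dataset gives the estimate of κ0, the estimate of β, and the fitted θ̂(Zi) for each bandwidth; comparing the first component across bandwidths with the sample mean of Y, and the third component across bandwidths with each other, shows the qualitative pattern described in this section.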
5.1.4 Bandwidth Selection. As described in Section 5.1.3, bandwidth
selection is not a vital issue for estimating κ0: Of course, it is
vital for estimating θ0(·). Effectively, what this means is that the
real need is simply to get bandwidths that satisfy the technical
assumption of undersmoothing but are not too ridiculously small: A
precise target is often unnecessary. In addition, because the
asymptotic distribution of κ̂ does not depend on the bandwidth,
simple first-order methods of the type that are used in bandwidth
selection for function estimation are not possible. Thus, in our
example, we used three different methods, all of which gave answers
that were as nearly identical as in the simulation of Wang et al.
(2004).
All the methods are based on a so-called “typical device” to get an
optimal bandwidth for estimating θ0, of the form
hopt = cσz n^{-1/5}. In practice, this can be accomplished by
constructing a finite grid of bandwidths of the form
hgrid = cgrid σz n^{-1/5}: We use a grid from .20 to 5.0. After
estimating B0 by B̂(hgrid), this value is fixed, and then a
loglikelihood cross-validation score is obtained. The maximizer of
the loglikelihood cross-validation score is selected as hopt.
• If hopt = cσz n^{-1/5}, an extremely simple device is simply to
set h = hopt n^{-2/15} = cσz n^{-1/3}, which satisfies the technical
condition of undersmoothing without becoming ridiculously small.
This device may seem terribly ad hoc, but the theory, the simulation
of Wang et al. (2004), the discussion in Section 5.1.3, and our own
work suggest that this method actually works reasonably well. Note,
too, that in Section 5.1.2 we give evidence that this bandwidth rate
is most likely optimal.
• A second approach is taken by Newey et al. (2004) and is also an
effective practical device. The technical need for undersmoothing
comes from the fact that the bias term in a first-order local
likelihood kernel regression is of order O(h^2). One can use higher
order kernels to get the bias to be of order O(h^{2s}) for s ≥ 2,
but this does not really help in that the variance remains of order
O{(nh)^{-1}}, so that the optimal mean squared error kernel
estimator has h = O{n^{-1/(4s+1)}}, and thus undersmoothing to
estimate κ0 is still required. However, as Newey et al. (2004)
pointed out, if one uses the optimal bandwidth hopt = cσz n^{-1/5},
but then does the estimation procedure replacing the first-order
kernel by a higher order kernel, then the bias is
O(hopt^{2s}) = o(n^{-1/2}) if s ≥ 2. A convenient higher order
kernel is the second-order twicing kernel
Ktw(u) = 2K(u) − ∫K(u − v)K(v)dv, where K(·) is a first-order
kernel.
• One can also use loglikelihood cross-validation, but with the grid
of values being of the form hgrid = cgrid σz n^{-1/3}. Because
cross-validation scores often have multiple modes, this is not the
same as optimal smoothing.
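The first two devices can be sketched numerically. The code below rescales a bandwidth by n^{-2/15} and builds the twicing kernel from the Epanechnikov kernel by a midpoint-rule convolution. The function names, the step size, and the choice of the Epanechnikov kernel as K are assumptions made for illustration only.

```python
def epanechnikov(u):
    """First-order kernel K(u) = 0.75 (1 - u^2) on |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def twicing(u, kernel=epanechnikov, step=0.01):
    """K_tw(u) = 2 K(u) - int K(u - v) K(v) dv, midpoint rule over v in [-1, 1]."""
    n = int(round(2.0 / step))
    conv = sum(kernel(u - v) * kernel(v)
               for v in (-1.0 + (i + 0.5) * step for i in range(n))) * step
    return 2.0 * kernel(u) - conv


def moment(fn, power, lo=-2.0, hi=2.0, step=0.01):
    """Midpoint-rule approximation to the integral of u^power * fn(u) on [lo, hi]."""
    n = int(round((hi - lo) / step))
    mids = (lo + (i + 0.5) * step for i in range(n))
    return sum(fn(u) * u ** power for u in mids) * step


def undersmoothed(h_opt, n):
    """First device: multiply an n^(-1/5)-rate bandwidth by n^(-2/15)."""
    return h_opt * n ** (-2.0 / 15.0)
```

With these definitions, `moment(twicing, 0)` is close to 1 and `moment(twicing, 2)` is close to 0, whereas the Epanechnikov kernel has second moment .2; the vanishing second moment is what removes the O(h^2) bias term.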
It may be worth pointing out again that Wang et al. (2004) set
h = n^{-2/3}, and even then, with too much undersmoothing
(asymptotically), the performance of the method is rather good.
5.2 Efficiency and Robustness of the Sample Mean
In general problems with complete data, with no assumptions about
the response Y other than that it has a second moment, the sample
mean Ȳ is semiparametric efficient for estimating the population
mean κ0 = E(Y); see, for example, Newey (1990). Somewhat remarkably,
Wang et al. (2004) showed that in the partially linear model with
Gaussian errors, with complete data the sample mean is still
semiparametric efficient. This fact is crucial, of course, in
establishing that with missing response data, their estimators are
still semiparametric efficient.
It is clear that with complete data, the sample mean will not be
semiparametric efficient for all semiparametric likelihood models.
Simple counterexamples abound, for example, the partially linear
model for Laplace or t errors. More complex examples can be
constructed, for example, the partially linear model in the Gamma
family with loglinear mean exp{X^T B0 + θ0(Z)}: Details follow from
Lemma 4.
The model robustness of the sample mean for estimating the
population mean in complete data is nonetheless a powerful feature.
It is, therefore, of considerable interest to know whether there are
cases of semiparametric likelihood problems where the sample mean is
still semiparametric efficient and, thus, would be used because of
its model robustness. It turns out that such cases exist. In
particular, the sample mean for complete response data is
semiparametric efficient in canonical exponential families with
partially linear form.
Figure 4. Results for a Single Dataset in a Simulation as in Wang
et al. (2004), the Partially Linear Model With n = 60, Complete
Response Data, and When κ0 = E(Y). Various bandwidths are used, and
the estimates of the function θ0(·) are displayed. Note how the
bandwidth has a major impact on the function estimate when the
bandwidth is too small (h = .02), but very little effect on the
estimate of κ0. (h = .02, κ = 1.828; h = .06, κ = 1.818; h = .10,
κ = 1.826; h = .14, κ = 1.818.)
Figure 5. Results for 100 Simulated Datasets in a Simulation as in
Wang et al. (2004), the Partially Linear Model With n = 60 and
Complete Response Data. Various bandwidths are used, and the mean
estimates of the function θ0(·) are displayed. Note how even over
these simulations, the bandwidth has a clear impact on the function
estimate: There is almost no impact on estimates of the population
mean and variance. (h = .02; h = .06; h = .10; h = .14.)
Lemma 3. Recall that ε is defined in (8). If there are no missing
data, the sample mean is a semiparametric efficient estimator of the
population mean only if
\[
Y - E(Y|X,Z) = E(Y\epsilon^T)M_1^{-1}\epsilon
  + L_\theta(\cdot)\,
    \frac{E\{Y L_\theta(\cdot)|Z\}}{E\{L_\theta^2(\cdot)|Z\}}.
\]
(19)
It is interesting to consider (19) in the special case of
exponential families with likelihood function
\[
f(y|x,z) = \exp\left(
  \frac{y\,c\{\eta(x,z)\} - C[c\{\eta(x,z)\}]}{\phi} + D(y,\phi)
\right),
\]
(20)
where η(x,z) = x^T β0 + θ0(z), so that
E(Y|X,Z) = C^{(1)}[c{η(X,Z)}] = µ{η(X,Z)} = µ(X,Z) and
var(Y|X,Z) = φC^{(2)}[c{η(X,Z)}] = φV[µ{η(X,Z)}].
As it turns out, effectively, (19) holds and the sample mean is
semiparametric efficient only in the canonical exponential family
for which c(t) = t. More precisely, we show in Section A.6 the
following result.
Lemma 4. If there are no missing data, under the exponential model
(20), the sample mean is a semiparametric efficient estimate of the
population mean if ∂c{X^T β + θ(Z)}/∂θ(Z) is a function only of Z
for all β, for example, the canonical exponential family. Otherwise,
the sample mean is generally not semiparametric efficient: The
precise condition is given in (A.29) in the Appendix. In particular,
outside the canonical exponential family, the only possibility for
the sample mean to be semiparametric efficient is that, for some
known (a,b), c{x^T β + θ(z)} = a + b log{x^T β + θ(z)}.
Remark 5. We consider Lemmas 3 and 4 to be positive results,
although an earlier version of the paper had a misplaced emphasis.
Effectively, we have characterized the cases, with complete data,
that the sample mean is both model free and semiparametric
efficient. In these cases, one would use the sample mean, or perhaps
a robust version of it, rather than fit a potentially complex
semiparametric model that can do no better and, if that model is
incorrect, can incur nontrivial bias.
5.3 Numerical Experience and Theoretical Insights
in the Partially Linear Model, and Some
Tentative Conclusions
In responding to a referee about the estimation of the population
mean in the partially linear model (1), we collect here a few
remarks based on our numerical experience. Because the problem of
estimating the population mean is the problem focused on by Chen
et al. (2003), we focus on the simulation setup in their article,
although some of the conclusions we reach may be supportable in
general cases. To remind the reader, in their simulation, X and Z
are independent, with X = Normal(0,1), Z = Uniform[0,1], β = 1.5,
θ(z) = 3.2z^2 − 1, and ε = Normal(0,1).
5.3.1 Can Semiparametric Methods Improve Upon the Sample Mean? When
there are missing response data, the simulations in Wang et al.
(2004) show conclusively that substantial gains in efficiency can be
made over using the sample mean of the observed responses alone. In
addition, if missingness depends on (X,Z), the sample mean of the
observed responses will be biased.
This leaves the issue of what happens when there are no missing
data. Obviously, if one thought that ε were normally distributed, it
would be delusional to use anything other than the sample mean, it
being efficient.
Theoretically, some insight can be gained by the following
considerations. Suppose that X and Z are independent. Suppose also
that ε has a symmetric density function known up to a scale
parameter. Let σ_ε^2 be the variance of ε and let ζ ≤ σ_ε^2 be the
inverse of the Fisher information for estimating the mean in the
model Y = µ + ε. Then, it can be shown that E{F_B(·)} = 0, that
θ_B(z,B) = 0, and that the asymptotic mean squared error (MSE)
efficiency of the semiparametric efficient estimate of the
population mean compared to the sample mean is
\[
\text{MSE efficiency of sample mean}
  = \frac{\beta^2\,\mathrm{var}(X) + \mathrm{var}\{\theta(Z)\} + \zeta}
         {\beta^2\,\mathrm{var}(X) + \mathrm{var}\{\theta(Z)\}
          + \sigma_\epsilon^2}
  \le 1.
\]
Note that there are cases where ζ/σ_ε^2 may be quite small,
especially when ε is heavy tailed, so that if β = 0 and θ(·) is
approximately constant, the MSE efficiency of the sample mean would
be ζ/σ_ε^2, and then substantial gains in efficiency would be
achieved. However, the usual motivation for fitting semiparametric
models is that the regression function is not constant, in which
case the MSE efficiency gain will be attenuated toward 1.0, often
dramatically.
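The attenuation can be made concrete with a short numerical sketch of the efficiency ratio. The Laplace error distribution used here is an illustrative assumption: for a Laplace density with scale b the variance is 2b^2 and the Fisher information for location is 1/b^2, so ζ/σ_ε^2 = 1/2. The signal values β = 1.5, var(X) = 1, and var{θ(Z)} = 3.2^2 × 4/45 for Z uniform on [0,1] correspond to the simulation design quoted in Section 5.3.

```python
def mse_efficiency(beta, var_x, var_theta, zeta, sigma2_eps):
    """MSE efficiency of the sample mean relative to the efficient estimate."""
    signal = beta ** 2 * var_x + var_theta
    return (signal + zeta) / (signal + sigma2_eps)


# Laplace errors scaled to variance 1: b^2 = 0.5, hence zeta = b^2 = 0.5.
sigma2_eps, zeta = 1.0, 0.5
var_theta = 3.2 ** 2 * (1.0 / 5.0 - 1.0 / 9.0)   # var(3.2 Z^2 - 1), Z ~ U[0,1]

no_signal = mse_efficiency(0.0, 1.0, 0.0, zeta, sigma2_eps)
with_signal = mse_efficiency(1.5, 1.0, var_theta, zeta, sigma2_eps)
print(no_signal, round(with_signal, 3))
```

With no regression signal the ratio equals ζ/σ_ε^2 = .5, while with the signal of this design it rises to roughly .88, which illustrates the attenuation toward 1.0.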
We conclude then that with no missing data, in the partially linear
model, substantial improvements upon the sample mean will be
realized mainly when the regression errors are heavy tailed and the
regression signal is slight.
We point out that in the example that motivated this work (Sec. 4),
there is no simple analog to the sample mean, one that could avoid
fitting models to the data.
5.3.2 How Critical Are Our Assumptions on Z? We have made two
assumptions on Z: It has a compact support, and its density function
is positive on that support. We have indicated in Section A.1.2 that
all general articles in the semiparametric kernel-based literature
make this assumption and that it appears to be critical for deriving
asymptotic results for problems such as our example in Section 4. It
is certainly well beyond our capabilities to weaken this assumption
as it applies to problems such as our motivating example.
The condition that the density of Z be bounded away from 0 warns
users that the method will deteriorate if there are a few sparsely
observed outlying Z values; see below for numerical evidence of this
phenomenon.
Estimation in subpopulations formed by compact subsets of Z can also
be of considerable interest in practice, and these compact subsets
can be chosen to avoid density sparseness and meet our assumptions.
A simple example might be where Z is age, and one might be
interested in population summaries for those in the 40- to 60-year
age range.
The partially linear model is a special case, however, because all
estimates are explicit and what few Taylor expansions are necessary
simplify tremendously. That is, the estimates are simple functions
of sums of random variables. Cheng (1994) considered a different
problem where there is no X and where local constant estimation of
the nonparametric function is used, rather than local linear
estimation, so that
\[
\hat\theta(z_0) = \sum_{i=1}^n K_h(Z_i - z_0) Y_i
  \Big/ \sum_{i=1}^n K_h(Z_i - z_0).
\]
He indicated that the essential condition for this case is that the
tails of the density of Z decay exponentially fast.
We tested this numerically in the normal-based simulation of Wang
et al. (2004) with the sample size of n = 500: Similar results were
found with n = 100. We used the Epanechnikov kernel and estimated
the bandwidth using the following methods. First, we regressed Y and
X separately on Z, using the direct plug-in (DPI) bandwidth
selection method of Ruppert, Sheather, and Wand (1995) to form
different estimated bandwidths on each. We then calculated the
residuals from these fits and regressed the residual in Y on the
residual in X to get a preliminary estimate β̂start of β. Following
this, we regressed Y − X^T β̂start on Z to get a common bandwidth,
then undersmoothed it by multiplication by n^{-2/15} to get a
bandwidth of order n^{-1/3} to eliminate bias, and then reestimated
β and θ(·).
We found that for various Beta distributions on Z, for example, the
Beta(2, 1) that violates our assumptions, the sample mean and the
semiparametric efficient method were equally efficient. The same
occurs for the case that Z is normally distributed. However, when Z
has a t distribution with 9 degrees of freedom, the sample mean
greatly outperforms the undersmoothed estimator (MSE efficiency
≈ 2.0), which, in turn, outperformed the method that did not employ
undersmoothing (MSE efficiency ≈ 2.5). An interesting quote from Ma,
Chiou, and Wang (2006) is relevant here: Also operating in a partial
linear model, they stated “This condition enables us to simplify
asymptotic expression of certain sums of functions of
variables...also excludes pathological cases where the number of
observations in a window defined by the bandwidth may not increase
to infinity when n → ∞.”
We conclude that if the design density in Z is at all heavy tailed,
then the semiparametric methods will be badly affected. If such a
phenomenon happens in the simple case of the partially linear model,
it is likely to hold in most other cases. Otherwise, in practice at
least, as long as there are no design “stragglers,” the assumption
is likely to be one required by the technicalities of the problem.
How well this generalizes to complex nonlinear problems is unknown.
6. DISCUSSION
In this article we considered the problem of estimating
population-level quantities κ0 such as the mean, variance, and
probabilities. Previous literature on the topic applies only to the
simple special case of estimating a population mean in the Gaussian
partially linear model. The problem was motivated by an important
issue in nutritional epidemiology, estimating the distribution of
usual intake for episodically consumed food, where we considered a
zero-inflated mixture measurement error model: Such a problem is
very different from the partially linear model, and the main
interest is not in the population mean.
The key feature of the problem that distinguishes it from most work
in semiparametric modeling is that the quantities of interest are
based on both the parametric and the nonparametric parts of the
model. Results were obtained for two general classes of
semiparametric models: (1) general semiparametric regression models
depending on a function θ0(Z) and (2) generalized linear
single-index models. Within these semiparametric frameworks, we
suggested a straightforward estimation methodology, derived its
limiting distribution, and showed semiparametric efficiency. An
interesting part of the approach is that we also allow for partially
missing responses.
In the case of standard semiparametric models, we have considered
the case where the unknown function θ0(Z) is a scalar function of a
scalar argument. The results, though, readily extend to the case of
a multivariate function of a scalar argument.
We have also assumed that κ0 = E[F{X,θ0(Z),B0}] and F(·) are scalar,
which, in principle, excludes the estimation of the population
variance and standard deviation. It is, however, readily seen that
both F(·) and κ0 or κSI can be multivariate, and, hence, the obvious
modification of our estimates is semiparametric efficient.
APPENDIX: SKETCH OF TECHNICAL ARGUMENTS
In what follows, the arguments for L and its derivatives are in the
form L(·) = L{Y,X,B0,θ0(Z)}. The arguments for F and its derivatives
are F(·) = F{X,θ0(Z),B0}.
Also, note that in our arguments about semiparametric efficiency, we
use the symbol d exactly as it was used by Newey (1990). It does not
stand for differential.
A.1 Assumptions and Remarks
A.1.1 General Considerations. The main results needed for the
asymptotic distribution of our estimator are (5) and (6). The
single-index model assumptions were given already in Carroll et al.
(1997). Results (5) and (6) hold under smoothness and moment
conditions for the likelihood function and under smoothness and
boundedness conditions for θ(·). The strength of these conditions
depends on the generality of the problem. For the partially linear
Gaussian model of Wang et al. (2004), because the profile likelihood
estimator of β is an explicit function of regressions of Y and X on
Z, the conditions are simply conditions about uniform expansions for
kernel regression estimators, as in, for example, Claeskens and Van
Keilegom (2003). For generalized partially linear models, Severini
and Staniswalis (1994) gave a series of moment and smoothness
conditions toward this end. For general likelihood problems,
Claeskens and Carroll (2007) stated that the conditions needed are
as follows.
(C1) The bandwidth sequence h_n → 0 as n → ∞ in such a way that
nh_n/log(n) → ∞ and h_n ≥ {log(n)/n}^{1−2/λ} for λ as in condition
(C4).
(C2) The kernel function K is a symmetric, continuously
differentiable pdf on [−1,1] taking on the value 0 at the
boundaries. The design density f(·) is differentiable on an interval
ℬ = [b1,b2], the derivative is continuous, and inf_{z∈ℬ} f(z) > 0.
The function θ(·,B) has two continuous derivatives on ℬ and is also
twice differentiable with respect to B.
(C3) For B ≠ B′, the Kullback–Leibler distance between
L{·,·,B,θ(·,B)} and L{·,·,B′,θ(·,B′)} is strictly positive. For
every (y,x), third partial derivatives of L{y,x,B,θ(z)} with respect
to B exist and are continuous in B. The fourth partial derivative
exists for almost all (y,x). Further, mixed partial derivatives
\[
\frac{\partial^{r+s}}{\partial B^r\,\partial v^s}
  L\{y,x,B,v\}\Big|_{v=\theta(z)},
\]
with 0 ≤ r,s ≤ 4, r + s ≤ 4 exist for almost all (y,x) and
\[
E\Bigl\{\sup_B \sup_v \Bigl|
  \frac{\partial^{r+s}}{\partial B^r\,\partial v^s} L\{y,x,B,v\}
\Bigr|^2\Bigr\} < \infty.
\]
The Fisher information, G(z), possesses a continuous derivative and
inf_{z∈ℬ} G(z) > 0.
(C4) There exists a neighborhood N{B0,θ0(z)} such that
\[
\max_{k=1,2}\,\sup_{z\in\mathcal{B}} \Bigl\|
  \sup_{(B,\theta)\in N\{B_0,\theta_0(z)\}} \Bigl|
    \frac{\partial^k}{\partial\theta^k}\log\{L(Y,X,B,\theta)\}
  \Bigr| \Bigr\|_{\lambda,z} < \infty
\]
for some λ ∈ (2,∞], where ‖·‖_{λ,z} is the L^λ norm, conditional on
Z = z. Further,
\[
\sup_{z\in\mathcal{B}} E_z\Bigl[
  \sup_{(B,\theta)\in N\{B_0,\theta_0(z)\}} \Bigl|
    \frac{\partial^3}{\partial\theta^3}\log\{L(Y,X,B,\theta)\}
  \Bigr| \Bigr] < \infty.
\]
The preceding regularity conditions are the same as those used in a
local likelihood setting where one wishes to obtain strong uniform
consistency of the local likelihood estimators. Condition (C3)
requires the fourth partial derivative of the log profile likelihood
to have a bounded second moment; it further requires the Fisher
information matrix to be invertible and to be differentiable with
respect to z. Condition (C4) requires a bound on the first and
second derivatives of the log profile likelihood and of the first
moment of the third partial derivative, in a neighborhood of the
true parameter values.
A.1.2 Compactly Supported Z. Multiple reviewers of earlier drafts of
this article commented that the assumption that Z be compactly
supported with density positive on this support is too strong.
However, this assumption is completely standard in the kernel-based
semiparametric literature for estimation of B0, because it is needed
for uniform expansions for estimation of θ0(·). The assumption was
made in the founding articles on semiparametric likelihood
estimation (Severini and Wong 1992, p. 1875, part e); the first
article on generalized linear models (Severini and Staniswalis 1994,
p. 511, assumption D); the first article on efficient estimation of
partially linear single-index models (Carroll et al. 1997, p. 485,
condition 2a); and the precursor article to ours that was focused on
estimation of the population mean in a partially linear model (Wang
et al. 2004, p. 341, condition C.T). The uniform expansions for
local likelihood given in Claeskens and Van Keilegom (2003) also
make this assumption; see their page 1869, condition R0. Thus, our
assumption on the design density of Z is a standard one.
The reason this assumption is made has to do with kernel technology,
where proofs generally require a uniform expansion for the kernel
regression or at least uniform in all observed values of Z, which is
the same thing. The Nadaraya–Watson estimator, for example, has a
denominator that is a density estimate, and the condition on Z stops
this denominator from getting too close to 0. Ma et al. (2006), who
made the same assumption (their condition 6 on p. 83), stated that
it is necessary to avoid “pathological cases.”
A.2 Proof of Theorem 1
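A.2.1 Asymptotic Expansion. We first show (9). First, note that $L$ is a loglikelihood function conditioned on $(X,Z)$, so that we have
\[
\begin{aligned}
E\{\delta L_{\theta\theta}(\cdot)\mid X,Z\} &= -E\{\delta L_{\theta}(\cdot)L_{\theta}(\cdot)\mid X,Z\},\\
E\{\delta L_{\theta B}(\cdot)\mid X,Z\} &= -E\{\delta L_{\theta}(\cdot)L_{B}(\cdot)\mid X,Z\}.
\end{aligned}
\tag{A.1}
\]
By a Taylor expansion,
\[
\begin{aligned}
n^{1/2}(\hat\kappa-\kappa_0)
&= n^{-1/2}\sum_{i=1}^{n}\bigl[F_i(\cdot)-\kappa_0+F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0)-\theta_0(Z_i)\}\\
&\qquad\qquad\quad +\{F_{iB}(\cdot)+F_{i\theta}(\cdot)\theta_B(Z_i,B_0)\}^{T}(\hat B-B_0)\bigr]+o_p(1)\\
&= M_2^{T}n^{1/2}(\hat B-B_0)
+n^{-1/2}\sum_{i=1}^{n}\bigl[F_i(\cdot)-\kappa_0+F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0)-\theta_0(Z_i)\}\bigr]+o_p(1).
\end{aligned}
\]
Because $nh^4\to 0$, using (5), we see that
\[
\begin{aligned}
n^{-1/2}\sum_{i=1}^{n}F_{i\theta}(\cdot)\{\hat\theta(Z_i,B_0)-\theta_0(Z_i)\}
&= -n^{-1/2}\sum_{i=1}^{n}F_{i\theta}(\cdot)\,n^{-1}\sum_{j=1}^{n}\delta_jK_h(Z_j-Z_i)L_{j\theta}(\cdot)/\Omega(Z_i)+o_p(1)\\
&= -n^{-1/2}\sum_{i=1}^{n}\delta_iL_{i\theta}(\cdot)\,n^{-1}\sum_{j=1}^{n}K_h(Z_j-Z_i)F_{j\theta}(\cdot)/\Omega(Z_j)+o_p(1)\\
&= n^{-1/2}\sum_{i=1}^{n}\delta_iD_i(\cdot)+o_p(1),
\end{aligned}
\]
the last step following because the interior sum is a kernel regression converging to $D_i$; see Carroll et al. (1997) for details. Result (9) now follows from (6). The limiting variance (10) is an easy calculation, noting that (A.1) implies that
\[
\begin{aligned}
E\{\delta\epsilon L_{\theta}(\cdot)\mid Z\}
&= E\{\delta L_{\theta}(\cdot)L_{B}(\cdot)+\delta L_{\theta}(\cdot)L_{\theta}(\cdot)\theta_B(Z,B_0)\mid Z\}\\
&= -E\{\delta L_{B\theta}(\cdot)+\delta L_{\theta\theta}(\cdot)\theta_B(Z,B_0)\mid Z\}\\
&= 0
\end{aligned}
\tag{A.2}
\]
by the definition of $\theta_B(\cdot)$ given in (7), and, hence, the last two terms in (9) are uncorrelated. We will use (A.2) repeatedly in what follows.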
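The information identity in (A.1) can be checked exactly in a simple special case. The sketch below is an illustration, not part of the proof: it assumes a Bernoulli loglikelihood with scalar argument θ and a missingness indicator δ that is independent of Y given the covariates, with a hypothetical observation probability `pi`. The expectations are computed by enumerating the four (y, δ) outcomes.

```python
from math import exp

def conditional_moments(theta, pi):
    """Exact E{delta * L_thetatheta} and -E{delta * L_theta^2} for a Bernoulli model.

    L(y, theta) = y * theta - log(1 + exp(theta)); delta ~ Bernoulli(pi) is
    drawn independently of y, mimicking missingness at random given covariates.
    """
    p = 1.0 / (1.0 + exp(-theta))
    lhs = 0.0
    rhs = 0.0
    for y, prob_y in ((0, 1.0 - p), (1, p)):
        for delta, prob_d in ((0, 1.0 - pi), (1, pi)):
            weight = prob_y * prob_d
            score = y - p               # first derivative L_theta
            hessian = -p * (1.0 - p)    # second derivative L_thetatheta
            lhs += weight * delta * hessian
            rhs -= weight * delta * score**2
    return lhs, rhs

lhs, rhs = conditional_moments(theta=0.3, pi=0.7)
print(lhs, rhs)  # the two sides of the first line of (A.1) in this special case
```

Both sides equal −π p(1 − p) here, which is the first line of (A.1) specialized to this model.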
A.2.2 Pathwise Differentiability. We now turn to the semiparametric efficiency, using the results of Newey (1990). The relevant text of his article is in his section 3, especially through his equation (9). A parameter $\kappa=\kappa(\Theta)$ is pathwise differentiable under two conditions. The first is that $\kappa(\Theta)$ is differentiable for all smooth parametric submodels: In our case, the parametric submodels include $B$, parametric submodels for $\theta(\cdot)$, and parametric submodels for the distribution of $(X,Z)$ and the probability function $\mathrm{pr}(\delta=1\mid X,Z)$. This condition is standard in the literature and fairly well required. Our motivating example clearly satisfies this condition.
The second condition is that there exists a random vector $d$ such that $E(d^{T}d)<\infty$ and $\partial\kappa(\Theta)/\partial\Theta=E(dS_{\Theta}^{T})$, where $S_{\Theta}$ is the loglikelihood score for the parametric submodel. Newey noted that pathwise differentiability also holds if the first condition holds and if there is a regular estimator in the semiparametric problem. Generally, as Newey noted, finding a suitable random variable $d$ can be difficult.
Assuming pathwise differentiability, which we show later, the efficient influence function is calculated by projecting $d$ onto the nuisance tangent space. One innovation here is that we can calculate the efficient influence function without having an explicit representation for $d$.
Our development in Section A.2.3 will consist of two steps. In the first, we will assume pathwise differentiability and derive the efficient score function under that assumption. Using this derivation, we will then exhibit a random variable $d$ that has the requisite property.
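A.2.3 Efficiency. Recall that $\mathrm{pr}(\delta=1\mid X,Z)=\pi(X,Z)$. Let $f_{X,Z}(x,z)$ be the density function of $(X,Z)$. Let the model under consideration be denoted by $M_0$. Now consider a smooth parametric submodel $M_\lambda$, with $f_{X,Z}(x,z,\alpha_1)$, $\theta(z,\alpha_2)$, and $\pi(X,Z,\alpha_3)$ in place of $f_{X,Z}(x,z)$, $\theta_0(z)$, and $\pi(X,Z)$, respectively. Then, under $M_\lambda$, the loglikelihood is given by
\[
L_\lambda(\cdot)=\delta L(\cdot)+\delta\log\{\pi(X,Z,\alpha_3)\}+(1-\delta)\log\{1-\pi(X,Z,\alpha_3)\}+\log\{f_{X,Z}(X,Z,\alpha_1)\},
\]
where $(\cdot)$ represents the argument $\{Y,X,\theta(Z,\alpha_2),B_0\}$. Then the score functions in this parametric submodel are given by
\[
\begin{aligned}
\partial L_\lambda(\cdot)/\partial B &= \delta L_B(\cdot),\\
\partial L_\lambda(\cdot)/\partial\alpha_1 &= \partial\log\{f_{X,Z}(X,Z,\alpha_1)\}/\partial\alpha_1,\\
\partial L_\lambda(\cdot)/\partial\alpha_2 &= \delta L_\theta(\cdot)\,\partial\theta(Z,\alpha_2)/\partial\alpha_2,\\
\partial L_\lambda(\cdot)/\partial\alpha_3 &= \{\partial\pi(X,Z,\alpha_3)/\partial\alpha_3\}\{\delta-\pi(X,Z,\alpha_3)\}/\bigl[\pi(X,Z,\alpha_3)\{1-\pi(X,Z,\alpha_3)\}\bigr].
\end{aligned}
\]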
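Thus, the tangent space is spanned by the functions $\delta L_B(\cdot)^{T}$, $s_f(x,z)$, $\delta L_\theta(\cdot)g(Z)$, and $a(X,Z)\{\delta-\pi(X,Z)\}$, where $s_f(x,z)$ is any function with mean 0, while $g(z)$ and $a(X,Z)$ are any functions. For computational convenience, we rewrite the tangent space as the linear span of four subspaces $T_1$, $T_2$, $T_3$, and $T_4$ that are orthogonal to each other (see below) and defined as follows:
\[
\begin{aligned}
T_1 &= \delta L_B(\cdot)^{T}+\delta L_\theta(\cdot)\theta_B^{T}(Z,B_0),\\
T_2 &= s_f(x,z),\\
T_3 &= \delta L_\theta(\cdot)g(Z),\\
T_4 &= a(X,Z)\{\delta-\pi(X,Z)\}.
\end{aligned}
\]
To show that these spaces are orthogonal, we first note that, by assumption, the data are missing at random, and, hence, $\mathrm{pr}(\delta=1\mid Y,X,Z)=\pi(X,Z)$. This means that $T_4$ is orthogonal to the other three spaces. Note also that, by assumption, $E\{L_B(\cdot)\mid X,Z\}=E\{L_\theta(\cdot)\mid X,Z\}=0$. This shows that $T_2$ is orthogonal to $T_1$ and $T_3$. It remains to show that $T_1$ and $T_3$ are orthogonal, which we showed in (A.2). Thus, the spaces $T_1$–$T_4$ are orthogonal.
Note that, under model $M_\lambda$,
\[
\kappa_0=\int F\{x,\theta(z,\alpha_2),B_0\}f_{X,Z}(x,z,\alpha_1)\,dx\,dz.
\]
Hence, we have
\[
\begin{aligned}
\partial\kappa_0/\partial B &= E\{F_B(\cdot)\},\\
\partial\kappa_0/\partial\alpha_1 &= E\bigl[F(\cdot)\,\partial\log\{f_{X,Z}(X,Z,\alpha_1)\}/\partial\alpha_1\bigr],\\
\partial\kappa_0/\partial\alpha_2 &= E\{F_\theta(\cdot)\,\partial\theta(Z,\alpha_2)/\partial\alpha_2\},\\
\partial\kappa_0/\partial\alpha_3 &= 0.
\end{aligned}
\]
Now, by pathwise differentiability and equation (7) of Newey (1990), there exists a random variable $d$, which we need not compute, such that
\[
E\{F_B(\cdot)\}=E\bigl[d\{\delta L_B(\cdot)\}\bigr], \tag{A.3}
\]
\[
E\{F(\cdot)s_f(X,Z)\}=E\{d\,s_f(X,Z)\}, \tag{A.4}
\]
\[
E\{F_\theta(\cdot)g(Z)\}=E\{d\,\delta L_\theta(\cdot)g(Z)\}, \tag{A.5}
\]
\[
0=E\bigl[d\,a(X,Z)\{\delta-\pi(X,Z)\}\bigr]. \tag{A.6}
\]
Next, we compute the projections of $d$ onto $T_1$, $T_2$, $T_3$, and $T_4$. First, note that, by (A.4), for any function $s_f(X,Z)$ with expectation 0, we have $E[\{d-F(\cdot)+\kappa_0\}s_f(X,Z)]=0$, which implies that the projection of $d$ onto $T_2$ is given by
\[
\Pi(d\mid T_2)=F(\cdot)-\kappa_0. \tag{A.7}
\]
Also, by (A.1) and (A.5), for any function $g(Z)$, we have
\[
\begin{aligned}
E[\{d-\delta D(\cdot)\}\delta g(Z)L_\theta(\cdot)]
&= E\{F_\theta(\cdot)g(Z)\}+E\bigl[\delta g(Z)L_\theta^{2}(\cdot)E\{F_\theta(\cdot)\mid Z\}/E\{\delta L_{\theta\theta}(\cdot)\mid Z\}\bigr]\\
&= 0,
\end{aligned}
\]
and, hence, the projection of $d$ onto $T_3$ is given by
\[
\Pi(d\mid T_3)=\delta D(\cdot). \tag{A.8}
\]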
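In addition, by (A.3) and (A.5),
\[
\begin{aligned}
E[\{d-M_2^{T}M_1^{-1}\delta\epsilon\}\delta\epsilon^{T}]
&= E\{F_B^{T}(\cdot)\}+E\{F_\theta(\cdot)\theta_B^{T}(Z,B_0)\}-E(M_2^{T}M_1^{-1}\delta\epsilon\epsilon^{T})\\
&= 0.
\end{aligned}
\]
Hence, the projection of $d$ onto $T_1$ is given by
\[
\Pi(d\mid T_1)=\delta M_2^{T}M_1^{-1}\epsilon. \tag{A.9}
\]
Also, by (A.6), we have $\Pi(d\mid T_4)=0$. Using (A.7), (A.8), and (A.9), we get that the efficient influence function for $\kappa_0$ is
\[
\begin{aligned}
\psi_{\mathrm{eff}} &= \Pi(d\mid T_1)+\Pi(d\mid T_2)+\Pi(d\mid T_3)+\Pi(d\mid T_4)\\
&= F(\cdot)-\kappa_0+\delta M_2^{T}M_1^{-1}\epsilon+\delta D(\cdot),
\end{aligned}
\]
which is the same as (9), hence completing the proof under the assumption of pathwise differentiability. In the calculations that follow, we will write $F_B$ rather than $F_B(\cdot)$, $a$ rather than $a(X,Z)$, and so on.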
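We now show pathwise differentiability and, hence, semiparametric efficiency; that is, we show that (A.3)–(A.6) hold for $d=F-\kappa_0+\delta D+\delta M_2^{T}M_1^{-1}\epsilon$.
To verify (A.3), we see that
\[
\begin{aligned}
E(d\delta L_B)
&= E[(F-\kappa_0+\delta D+\delta M_2^{T}M_1^{-1}\epsilon)\delta L_B]\\
&= E[\delta DL_B+\delta M_2^{T}M_1^{-1}\epsilon L_B]\\
&= -E\left\{L_\theta\frac{E(F_\theta\mid Z)}{E(\delta L_{\theta\theta}\mid Z)}L_B\delta\right\}
+E\{\delta L_B(L_B+L_\theta\theta_B)^{T}\}M_1^{-1}M_2\\
&= E\left\{\delta L_{\theta B}\frac{E(F_\theta\mid Z)}{E(\delta L_{\theta\theta}\mid Z)}\right\}
-E\{\delta(L_{BB}+L_{B\theta}\theta_B^{T})\}M_1^{-1}M_2\\
&= -E(F_\theta\theta_B)+M_2\\
&= E(F_B).
\end{aligned}
\]
To verify (A.4), we see that
\[
\begin{aligned}
E(ds_f)
&= E\{(F-\kappa_0+\delta D+\delta M_2^{T}M_1^{-1}\epsilon)s_f\}\\
&= E(Fs_f)-\kappa_0E(s_f)+E\{E(\delta D+\delta M_2^{T}M_1^{-1}\epsilon\mid X,Z)s_f\}\\
&= E(Fs_f).
\end{aligned}
\]
To verify (A.5), we see that
\[
\begin{aligned}
E(d\delta L_\theta g)
&= E\{(F-\kappa_0+\delta D+\delta M_2^{T}M_1^{-1}\epsilon)\delta L_\theta g\}\\
&= E(DL_\theta\delta g)+M_2^{T}M_1^{-1}E(\epsilon L_\theta\delta g)\\
&= -E\left\{L_\theta\frac{E(F_\theta\mid Z)}{E(\delta L_{\theta\theta}\mid Z)}L_\theta\delta g\right\}
+M_2^{T}M_1^{-1}E\{(L_B+L_\theta\theta_B)L_\theta\delta g\}\\
&= E(F_\theta g)-M_2^{T}M_1^{-1}E\{(L_{B\theta}+L_{\theta\theta}\theta_B)\delta g\}\\
&= E(F_\theta g)-M_2^{T}M_1^{-1}E\{E(\delta L_{B\theta}+\delta L_{\theta\theta}\theta_B\mid Z)g\}\\
&= E(F_\theta g),
\end{aligned}
\]
where again we have used (A.2). Finally, because the responses are missing at random, (A.6) is immediate. This completes the proof.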
A.3 Sketch of Lemma 1
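We have
\[
\hat\kappa_{\mathrm{marg}}
= n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}G(Y_i)
+\left\{1-\frac{\delta_i}{\hat\pi_{\mathrm{marg}}(Z_i)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right]
= A_1+A_2.
\]
By calculations that are similar to those given previously, and using (11), we can readily show that
\[
A_1 = n^{-1}\sum_{i=1}^{n}\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}G(Y_i)
-n^{-1}\sum_{i=1}^{n}\{\delta_i-\pi_{\mathrm{marg}}(Z_i)\}
E\left[\frac{\delta_iG(Y_i)}{\{\pi_{\mathrm{marg}}(Z_i)\}^{2}}\,\Big|\,Z_i\right]+o_p(n^{-1/2}).
\]
We can write
\[
\begin{aligned}
A_2 &= B_1+B_2+o_p(n^{-1/2}),\\
B_1 &= n^{-1}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\},\\
B_2 &= n^{-1}\sum_{i=1}^{n}\frac{\delta_iF\{X_i,\hat\theta(Z_i,\hat B),\hat B\}}{\{\pi_{\mathrm{marg}}(Z_i)\}^{2}}\{\hat\pi_{\mathrm{marg}}(Z_i)-\pi_{\mathrm{marg}}(Z_i)\}.
\end{aligned}
\]
Using (5) and (6), we can show that
\[
B_1 = n^{-1}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi_{\mathrm{marg}}(Z_i)}\right\}F_i(\cdot)
+M_{2,\mathrm{marg}}M_1^{-1}n^{-1}\sum_{i=1}^{n}\delta_i\epsilon_i
+n^{-1}\sum_{i=1}^{n}\delta_iD_{i,\mathrm{marg}}(\cdot)+o_p(n^{-1/2}).
\]
Using (11) once again, we see that
\[
B_2 = n^{-1}\sum_{i=1}^{n}\{\delta_i-\pi_{\mathrm{marg}}(Z_i)\}
E\left[\frac{\delta_iF_i(\cdot)}{\{\pi_{\mathrm{marg}}(Z_i)\}^{2}}\,\Big|\,Z_i\right]+o_p(n^{-1/2}).
\]
Collecting terms and noting that
\[
0=E\left[\frac{\delta_i\{G(Y_i)-F_i(\cdot)\}}{\{\pi_{\mathrm{marg}}(Z_i)\}^{2}}\,\Big|\,Z_i\right]
\]
proves (12).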
A.4 Sketch of Lemma 2
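We have
\[
\hat\kappa
= n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}G(Y_i)
+\left\{1-\frac{\delta_i}{\pi(X_i,Z_i,\hat\zeta)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\}\right]
= A_1+A_2,
\]
say. By a simple Taylor series expansion,
\[
A_1 = n^{-1}\sum_{i=1}^{n}\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}G(Y_i)
-E\left\{\frac{1}{\pi(X,Z,\zeta)}G(Y)\pi_\zeta(X,Z,\zeta)\right\}^{T}n^{-1}\sum_{i=1}^{n}\psi_{i\zeta}+o_p(n^{-1/2}).
\]
In addition,
\[
\begin{aligned}
A_2 &= B_1+B_2+o_p(n^{-1/2}),\\
B_1 &= n^{-1}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}F\{X_i,\hat\theta(Z_i,\hat B),\hat B\},\\
B_2 &= n^{-1}\sum_{i=1}^{n}\frac{\delta_iF\{X_i,\hat\theta(Z_i,\hat B),\hat B\}}{\{\pi(X_i,Z_i,\zeta)\}^{2}}\pi_\zeta(X_i,Z_i,\zeta)^{T}(\hat\zeta-\zeta)+o_p(n^{-1/2}).
\end{aligned}
\]
Using the fact that
\[
0=E\left\{1-\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\,\Big|\,X,Z\right\},
\]
we can easily show that
\[
B_1 = n^{-1}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi(X_i,Z_i,\zeta)}\right\}F_i(\cdot)+o_p(n^{-1/2}).
\]
It also follows that
\[
B_2 = E\left\{\frac{1}{\pi(X,Z,\zeta)}F(\cdot)\pi_\zeta(X,Z,\zeta)\right\}^{T}n^{-1}\sum_{i=1}^{n}\psi_{i\zeta}(\cdot)+o_p(n^{-1/2}).
\]
Collecting terms and using the fact that $E\{G(Y)\mid X,Z\}=F(\cdot)$, we obtain the result.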
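The form of the estimator at the start of this sketch, a weighted average of observed responses plus a model-based correction, can be written out in a few lines. The code below is a minimal illustration under stated assumptions: the function name `weighted_summary` and its arguments are hypothetical, and the fitted values of F and the selection probabilities are taken as given inputs rather than estimated by the semiparametric fits described in the article.

```python
import numpy as np

def weighted_summary(delta, g_y, f_fit, pi_hat):
    """Compute A1 + A2 for the estimator displayed at the start of the sketch.

    delta  : 0/1 indicators that the response was observed
    g_y    : G(Y_i); entries with delta == 0 are never used (may be NaN)
    f_fit  : fitted values of F at each observation (assumed supplied)
    pi_hat : fitted selection probabilities (assumed supplied, all > 0)
    """
    delta = np.asarray(delta, dtype=float)
    f_fit = np.asarray(f_fit, dtype=float)
    g_obs = np.where(delta == 1.0, np.asarray(g_y, dtype=float), 0.0)
    ratio = delta / np.asarray(pi_hat, dtype=float)
    a1 = np.mean(ratio * g_obs)            # weighted observed responses
    a2 = np.mean((1.0 - ratio) * f_fit)    # model-based correction term
    return a1 + a2, a1, a2

# Toy inputs: three of four responses observed, constant selection probability.
delta = [1, 0, 1, 1]
g_y = [2.0, np.nan, 4.0, 6.0]
f_fit = [3.0, 3.0, 3.0, 3.0]
kappa_hat, a1, a2 = weighted_summary(delta, g_y, f_fit, pi_hat=0.75)
print(kappa_hat, a1, a2)
```

In the toy inputs the correction term vanishes because the weights δ/π average to 1, so the estimate reduces to the weighted mean of the observed G(Y) values; with all responses observed and π = 1 it reduces to the sample mean of G(Y).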
A.5 Proof of Theorem 2
A.5.1 Asymptotic Expansion. We first show the expansion (15). Recall that $B=(\gamma,\beta)$. The only things that differ from the calculations of Carroll et al. (1997) are that we add terms involving $\delta_i$ and that we need not worry about any constraint on $\gamma$, and, thus, we avoid terms such as their $P_\alpha$ on their page 487.
In their equation (A.12), they showed that
\[
n^{1/2}(\hat B-B_0)=n^{-1/2}Q^{-1}\sum_{i=1}^{n}\delta_iN_i\epsilon_i+o_p(1). \tag{A.10}
\]
Also, in their equation (A.11), they showed that
\[
\begin{aligned}
\hat\theta(R+S^{T}\hat\gamma,\hat B)-\theta_0(R+S^{T}\gamma_0)
&= \theta_0^{(1)}(R+S^{T}\gamma_0)S^{T}(\hat\gamma-\gamma_0)\\
&\quad+\hat\theta(R+S^{T}\gamma_0,\hat B)-\theta_0(R+S^{T}\gamma_0)+o_p(n^{-1/2}).
\end{aligned}
\tag{A.11}
\]
Define $H(u)=[E\{\rho_2(\cdot)\mid U=u\}]^{-1}$. In their equation (A.13), Carroll et al. (1997) showed that
\[
\hat\theta(u,\hat B)-\theta_0(u)
= n^{-1}\sum_{i=1}^{n}\delta_iK_h(U_i-u)\epsilon_iH(u)/f(u)
-H(u)\bigl[E\{\delta\Lambda\rho_2(\cdot)\mid U=u\}\bigr]^{T}(\hat B-B_0)+o_p(n^{-1/2}). \tag{A.12}
\]
Carroll et al. (1997) did not consider an estimate of $\phi$. Define
\[
G\{\phi,Y,X,B,\theta(U)\}=D_\phi(Y,\phi)-\bigl[Yc\{X^{T}\beta+\theta(U)\}-C\{c(\cdot)\}\bigr]/\phi^{2}.
\]
Of course, $G(\cdot)$ is the likelihood score for $\phi$. If there are no arguments, $G=G\{\phi_0,Y,X,B_0,\theta_0(R+S^{T}\gamma_0)\}$. The estimating function for $\phi$ solves
\[
0=n^{-1/2}\sum_{i=1}^{n}\delta_iG\{\hat\phi,Y_i,X_i,\hat B,\hat\theta(R_i+S_i^{T}\hat\gamma,\hat B)\}.
\]
Because $G$ is a likelihood score, it follows that
\[
E\bigl[G_\phi\{\phi_0,Y,X,B_0,\theta_0(R+S^{T}\gamma_0)\}\mid X,R,S\bigr]=-E\{G^{2}\mid X,R,S\}.
\]
By a Taylor series,
\[
\begin{aligned}
E(\delta G^{2})n^{1/2}(\hat\phi-\phi_0)
&= n^{-1/2}\sum_{i=1}^{n}\delta_iG\{\phi_0,Y_i,X_i,\hat B,\hat\theta(R_i+S_i^{T}\hat\gamma,\hat B)\}+o_p(1)\\
&= n^{-1/2}\sum_{i=1}^{n}\delta_iG_i+E(\delta G_B^{T})n^{1/2}(\hat B-B_0)\\
&\quad+n^{-1/2}\sum_{i=1}^{n}\delta_iG_{i\theta}\{\hat\theta(R_i+S_i^{T}\hat\gamma,\hat B)-\theta_0(R_i+S_i^{T}\gamma_0)\}+o_p(1).
\end{aligned}
\]
However, it is readily verified that $E(\delta G_B\mid X,R,S)=0$ and that $E(\delta G_\theta\mid X,R,S)=0$. It, thus, follows via a simple calculation using (A.11) that
\[
\begin{aligned}
E(\delta G^{2})n^{1/2}(\hat\phi-\phi_0)
&= n^{-1/2}\sum_{i=1}^{n}\delta_iG_i+n^{-1/2}\sum_{i=1}^{n}\delta_iG_{i\theta}\{\hat\theta(U_i,B_0)-\theta_0(U_i)\}+o_p(1)\\
&= n^{-1/2}\sum_{i=1}^{n}\delta_iG_i+o_p(1),
\end{aligned}
\tag{A.13}
\]
the last step following from an application of (A.12).
With some considerable algebra, (15) now follows from calculations similar to those in Section A.2. The variance calculation follows because it is readily shown that, for any function $h(U)$,
\[
0=E\bigl[(N\epsilon)\{\delta h(U)\epsilon\}\bigr].
\]
A.5.2 Efficiency. We now turn to semiparametric efficiency. Recall that the GPLSIM follows the form (20) with $X^{T}\beta_0+\theta_0(R+S^{T}\gamma_0)$ and that $U=R+S^{T}\gamma_0$. It is immediate that $V\{\mu(t)\}=\mu^{(1)}(t)/c^{(1)}(t)$, that $c^{(1)}(t)=\rho_1(t)$, and that $\rho_2(t)=\rho_1^{2}(t)V\{\mu(t)\}=c^{(1)}(t)\mu^{(1)}(t)$.
We also have
\[
\begin{aligned}
E(\epsilon\mid X,Z) &= 0,\\
E(\epsilon^{2}\mid X,Z)
&= E\bigl([Y-\mu\{X^{T}\beta_0+\theta_0(U)\}]^{2}\mid X,Z\bigr)\bigl[\rho_1\{X^{T}\beta_0+\theta_0(U)\}\bigr]^{2}\\
&= \mathrm{var}(Y\mid X,Z)\bigl[\rho_1\{X^{T}\beta_0+\theta_0(U)\}\bigr]^{2}\\
&= \phi\rho_2(\cdot).
\end{aligned}
\tag{A.14}
\]
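The relation $E(\epsilon^{2}\mid X,Z)=\phi\rho_2(\cdot)$, with $\epsilon=\rho_1(\cdot)\{Y-\mu(\cdot)\}$ and $\rho_2=c^{(1)}\mu^{(1)}$, can be checked numerically in a concrete family. The sketch below is an illustration only: the Poisson family (for which $\phi=1$ and $V(\mu)=\mu$), the two choices of $c(\cdot)$, and the value of $t$ are assumptions made for the example, not taken from the article. The second moment is computed by a truncated sum over the Poisson probability mass function.

```python
from math import exp, lgamma, log

def poisson_second_moment(mu, rho1, terms=200):
    """E[{rho1 * (Y - mu)}^2] for Y ~ Poisson(mu), via a truncated pmf sum."""
    total = 0.0
    for y in range(terms):
        pmf = exp(y * log(mu) - mu - lgamma(y + 1))
        total += pmf * (rho1 * (y - mu)) ** 2
    return total

t = 1.5      # hypothetical value of the linear-plus-smooth predictor
phi = 1.0    # the Poisson family has dispersion 1

# Canonical case: c(t) = t, so mu(t) = exp(t), c'(t) = 1, mu'(t) = exp(t).
mu_canon, rho1_canon = exp(t), 1.0
rho2_canon = rho1_canon * exp(t)          # c'(t) * mu'(t)

# Noncanonical case: c(t) = log(t), so mu(t) = t, c'(t) = 1/t, mu'(t) = 1.
mu_ident, rho1_ident = t, 1.0 / t
rho2_ident = rho1_ident * 1.0             # c'(t) * mu'(t)

m2_canon = poisson_second_moment(mu_canon, rho1_canon)
m2_ident = poisson_second_moment(mu_ident, rho1_ident)
print(m2_canon, phi * rho2_canon)
print(m2_ident, phi * rho2_ident)
```

In both parameterizations the computed second moment of ε agrees with φρ2 up to truncation and rounding error.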
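Let the semiparametric model be denoted by $M_0$. Consider a parametric submodel $M_\lambda$ with $f_{X,Z}(X,Z;\nu_1)$, $\theta_0(R+S^{T}\gamma_0,\nu_2)$, and $\pi(X,Z,\nu_3)$. The joint loglikelihood of $Y$, $X$, and $Z$ under $M_\lambda$ is given by
\[
\begin{aligned}
L(\cdot) &= (\delta/\phi)\bigl[Yc\{X^{T}\beta_0+\theta_0(R+S^{T}\gamma_0,\nu_2)\}-C[c\{X^{T}\beta_0+\theta_0(R+S^{T}\gamma_0,\nu_2)\}]\bigr]\\
&\quad+\delta D(Y,\phi)+\log\{f_{X,Z}(X,Z,\nu_1)\}\\
&\quad+\delta\log\{\pi(X,Z,\nu_3)\}+(1-\delta)\log\{1-\pi(X,Z,\nu_3)\}.
\end{aligned}
\tag{A.15}
\]
As before, recall that $\epsilon=\rho_1(\cdot)\{Y-\mu(\cdot)\}=c^{(1)}(\cdot)\{Y-\mu(\cdot)\}$. Then the score functions evaluated at $M_0$ are
\[
\begin{aligned}
\partial L/\partial\beta &= \delta Xc^{(1)}(\cdot)\{Y-\mu(\cdot)\}/\phi=\delta X\epsilon/\phi,\\
\partial L/\partial\gamma &= \delta\theta^{(1)}(U)Sc^{(1)}(\cdot)\{Y-\mu(\cdot)\}/\phi=\delta\theta^{(1)}(U)S\epsilon/\phi,\\
\partial L/\partial\nu_1 &= s_f(X,Z),\\
\partial L/\partial\nu_2 &= \delta h(U)c^{(1)}(\cdot)\{Y-\mu(\cdot)\}/\phi=\delta h(U)\epsilon/\phi,\\
\partial L/\partial\nu_3 &= a(X,Z)\{\delta-\pi(X,Z)\},\\
\partial L/\partial\phi &= \delta D_\phi(Y,\phi)-\delta\bigl[Yc(\cdot)-C\{c(\cdot)\}\bigr]/\phi^{2}=\delta G,
\end{aligned}
\]
where $D_\phi(Y,\phi)$ is the derivative of $D(Y,\phi)$ with respect to $\phi$, $s_f(X,Z)$ is a mean-zero function, and $h(U)$ and $a(X,Z)$ are any functions. This means that the tangent space is spanned by
\[
\bigl[T_1=\delta\{S^{T}\theta_0^{(1)}(U),X^{T}\}\epsilon/\phi,\ T_2=s_f(X,Z),\ T_3=\delta h(U)\epsilon/\phi,\ T_4=a(X,Z)\{\delta-\pi(X,Z)\},\ T_5=\delta G\bigr].
\]
An orthogonal basis of the tangent space is given by $[T_1=\delta N^{T}\epsilon,\ T_2=s_f(X,Z),\ T_3=\delta h(U)\epsilon,\ T_4=a(X,Z)\{\delta-\pi(X,Z)\}]$ and $T_5=\delta G$; the orthogonality is a straightforward calculation. Now notice that
\[
\kappa_0=\int F\{x,\theta_0(z;\nu_2),B_0,\phi_0\}f_{X,Z}(x,z;\nu_1)\,dx\,dz
\]
and, hence,
\[
\begin{aligned}
\partial\kappa_0/\partial\beta &= E\{F_\beta(\cdot)\},\\
\partial\kappa_0/\partial\gamma &= E\{F_\theta(\cdot)\theta^{(1)}(U)S\},\\
\partial\kappa_0/\partial\nu_1 &= E\{F(\cdot)s_f(X,Z)\},\\
\partial\kappa_0/\partial\nu_2 &= E\{F_\theta(\cdot)h(U)\},\\
\partial\kappa_0/\partial\nu_3 &= 0,\\
\partial\kappa_0/\partial\phi &= E\{F_\phi(\cdot)\}.
\end{aligned}
\]
As before, we first assume pathwise differentiability to construct the efficient score. We then verify this in Section A.5.3.
By equation (7) of Newey (1990), there is a random quantity $d$ such that
\[
E(d\delta X\epsilon/\phi)=E\{F_\beta(\cdot)\}, \tag{A.16}
\]
\[
E\{d\delta\theta^{(1)}(U)S\epsilon/\phi\}=E\{F_\theta(\cdot)\theta^{(1)}(U)S\}, \tag{A.17}
\]
\[
E\{ds_f(X,Z)\}=E\{F(\cdot)s_f(X,Z)\}, \tag{A.18}
\]
\[
E\{d\delta h(U)\epsilon/\phi\}=E\{F_\theta(\cdot)h(U)\}, \tag{A.19}
\]
\[
E\bigl[da(X,Z)\{\delta-\pi(X,Z)\}\bigr]=0, \tag{A.20}
\]
\[
E(d\delta G)=E\{F_\phi(\cdot)\}. \tag{A.21}
\]
Now we compute the projection of $d$ onto the tangent space. It is immediate that $\Pi(d\mid T_2)=F(\cdot)-\kappa_0$ and that $\Pi(d\mid T_4)=0$. Because $E[\{\delta J(U)\epsilon\}\{\delta h(U)\epsilon/\phi\}]=E\{h(U)F_\theta(\cdot)\}$, it is readily shown that $\Pi(d\mid T_3)=\delta J(U)\epsilon$. It is a similarly direct calculation to show that $\Pi(d\mid T_1)=D^{T}Q^{-1}\delta N\epsilon$. Finally, $\Pi(d\mid T_5)=\delta GE\{F_\phi(\cdot)\}/E(\delta G^{2})$.
These calculations, thus, show that, assuming pathwise differentiability, the efficient influence function for $\kappa_0$ is
\[
\psi_{\mathrm{eff}}=D^{T}Q^{-1}\delta N\epsilon+F(\cdot)-\kappa_0+\delta J(U)\epsilon+\delta GE\{F_\phi(\cdot)\}/E(\delta G^{2}).
\]
Hence, from (15), we see that $\hat\kappa_{SI}$ has the semiparametric optimal influence function and is asymptotically efficient.
A.5.3 Pathwise Differentiability. For $d=D^{T}Q^{-1}\delta N\epsilon+F(\cdot)-\kappa_0+\delta J(U)\epsilon+\delta GE\{F_\phi(\cdot)\}/E(\delta G^{2})$, we have to show that (A.16)–(A.21) hold. Let
\[
\begin{aligned}
d_1 &= D^{T}Q^{-1}\delta N\epsilon,\\
d_2 &= F(\cdot)-\kappa_0,\\
d_3 &= \delta J(U)\epsilon,\\
d_4 &= 0,\\
d_5 &= \delta GE\{F_\phi(\cdot)\}/E(\delta G^{2}).
\end{aligned}
\]
Then $d=d_1+\cdots+d_5$. Because $T_1$, $T_2$, $T_3$, $T_4$, and $T_5$ are orthogonal and $d_i\in T_i$ for $i=1,\ldots,5$, we have
\[
E(d_1T_1)=E(D^{T}Q^{-1}\delta NN^{T}\epsilon^{2})=\phi E(D^{T}), \tag{A.22}
\]
\[
E(d_2T_2)=E\bigl[\{F(\cdot)-\kappa_0\}s_f(X,Z)\bigr]=E\{F(\cdot)s_f(X,Z)\}, \tag{A.23}
\]
\[
E(d_3T_3)=E\{\delta J(U)h(U)\epsilon^{2}\}=E\{\pi(X,Z)J(U)h(U)\phi\rho_2(\cdot)\}, \tag{A.24}
\]
\[
E(d_4T_4)=0, \tag{A.25}
\]
\[
E(d_5T_5)=E\bigl[\delta G^{2}E\{F_\phi(\cdot)\}/E(\delta G^{2})\bigr]=E\{F_\phi(\cdot)\}, \tag{A.26}
\]
\[
E(d_iT_j)=0,\qquad i\neq j. \tag{A.27}
\]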