Pak.j.stat.oper.res. Vol.VI No.2 2010 pp101-115
Generalised Model Based Confidence Intervals
in Two Stage Cluster Sampling
Christopher Ouma Onyango
Center for Mathematics
Strathmore University
Nairobi, Kenya
Chrisouma2004@yahoo.co.uk
Romanus Odhiambo Otieno
Department of Statistics
Jomo Kenyatta University
Nairobi, Kenya
romanusemod@yahoo.com
George Otieno Orwa
Department of Statistics
Jomo Kenyatta University
Nairobi, Kenya
orwagoti@yahoo.com
Abstract
Chambers and Dorfman (2002) constructed bootstrap confidence intervals in model based
estimation for finite population totals assuming that auxiliary values are available throughout a
target population and that the auxiliary values are independent. They also assumed that the
cluster sizes are known throughout the target population. We now extend to two stage sampling
in which the cluster sizes are known only for the sampled clusters, and we therefore predict the
unobserved part of the population total. Jan and Elinor (2008) have done similar work, but unlike
them, we use a general model, in which the auxiliary values $X_i$ are not
necessarily independent.
We demonstrate that the asymptotic properties of our proposed estimator and its coverage rates
are better than those constructed under the model assisted local polynomial regression model.
Keywords and Phrases: Model Based Surveys, Bootstrapping, Two Stage
Sampling
AMS 2000 subject classifications. Primary 60K35; Secondary 60K35.
1. Introduction
1.1 Background
In specifying a sampling strategy in survey sampling, there exist different
approaches: the design based approach, the model assisted approach, the
model-based approach and randomization-assisted model based approach. For
a detailed review of these approaches, see Smith (1976), Smith (1994). Our
concern is the model based approach. Ouma and Wafula (2005) reviewed the
work of Chambers and Dorfman (2002) and modified the conditions. However,
they limited their work to simple random sampling. Suppose that P is a finite
population of $N$ identifiable units, $Y$ denotes a survey variable having
population values $Y_i$ $(i = 1,2,3,\ldots,N)$ and $X$ denotes an auxiliary
variable with corresponding population values $X_i$ $(i = 1,2,3,\ldots,N)$. If
the values $X_i$ $(i = 1,2,3,\ldots,N)$ are all known but the characteristic
values $Y_i$ $(i = 1,2,3,\ldots,N)$ are known only for a sample, say $s$, of $n$
out of the $N$ population elements, one way of characterizing the sample
selection of the survey variable is to assume that for every unit on the
sampling frame, a new variable, say $S_i$, takes a value equivalent to the
number of times that particular population unit's value is observed. The
distribution of these values defines the design of the sample survey.
Once the sample has been chosen, the values $Y_i$ $(i \in s)$ are known. Now,
let the distribution of $S_i$ depend on the known population values of $X$ and
suppose that one wishes to use the sample values together with the known values
of $X$ to make an inference about the unknown but finite population total
$$T = \sum_{i=1}^{N} Y_i$$
of $Y$. A major concern in the model based approach to statistical survey
inference has been finding robust estimators for the population parameters
under model misspecification.
1.2 Outline of the paper
This paper is organised as follows. In Subsections 1.3 to 1.6, we briefly
review model based estimation, local polynomial estimation, confidence
intervals, and two stage cluster sampling respectively, in each case pointing
out some gaps that our proposed estimator attempts to fill. In Section 2, we
propose an estimator for the finite population total, and in Section 3 we
suggest a bootstrap confidence interval for it. In Section 4, we derive the
properties of our proposed estimator. We conclude this paper in Section 5 with
a simulation experiment and some discussion.
1.3 Review of model based estimation
The model based approach to statistical survey sampling has been developed
extensively. In particular, we build on the work of Dorfman (1992), who
proposed a non-parametric regression estimator for the population total under a
model based approach. He illustrated that the resulting estimator of the
population total performs better than the corresponding design based estimators
and linear regression estimators. The model due to Dorfman (1992) relies on the
assumptions that the regression line passes through the origin and that the
auxiliary values $X_i$ are independent. Suppose one or both of these
assumptions are incorrect. Will the prediction intervals still retain the same
nominal properties? And will the estimator of the population total still be
design unbiased?
1.4 Review of the local polynomial estimation
Dorfman (1992) considered a non-parametric regression model for estimating
population totals in finite populations and proposed a non-parametric
regression based estimator for the population total. To develop the estimator,
he assumed that the population values were generated by a model defined as
$$Y_i = m(x_i) + e_i \tag{1.1}$$
where $i = 1,2,3,\ldots,N$, $m(\cdot)$ is a smooth function and $e_i$ is an
independent random variable with mean zero and constant variance. The
non-parametric population total estimator due to Dorfman (1992) is defined as
$$T_D = \sum_{i \in s} Y_i + \sum_{i \notin s} \hat{m}(x_i) \tag{1.2}$$
where $\hat{m}(x_i) = \sum_{j \in s} w_j(x_i)\, y_j$ and
$w_j(x_i) = k_b(x_j - x_i) \big/ \sum_{l \in s} k_b(x_l - x_i)$ is the weight
associated with the $j$th unit of the sample. Further, $k(u)$ is a symmetric
density function, $b$ a scaling factor and $k_b(u) = b^{-1} k(u/b)$. In his
empirical study, Dorfman (1992) illustrated that the estimator $T_D$ performs
better than the corresponding design based and linear regression estimators.
These results were also confirmed by Cheng (1994), who applied non-parametric
regression in estimating population parameters under conditions of missing
data. Breidt and Opsomer (2000) also assumed model 1.1 and developed a new
class of model-assisted non-parametric regression estimators for the population
total, based on local polynomial smoothing, a kernel method. Their estimator is
defined as
$$T_{OB} = \sum_{i \in s} \frac{y_i - \hat{m}_i}{\pi_i} + \sum_{i \in U} \hat{m}_i \tag{1.3}$$
where $i = 1,2,3,\ldots,N$, $\pi_i = \Pr(i \in s)$, $\hat{m}_i = w_{si}' y_s$
and
$$w_{si} = \mathrm{diag}\left\{ \frac{1}{\pi_j h}\, k\!\left(\frac{x_i - x_j}{h}\right),\ j \in s \right\}$$
with $h$ denoting the bandwidth. In their simulation study, $T_{OB}$ performs
better than the Horvitz-Thompson estimator defined as
$$T_{HT} = \sum_{i \in s} \frac{Y_i}{\pi_i} \tag{1.4}$$
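As a rough illustration of estimators 1.2 and 1.4, the sketch below computes both on a simulated population. The Gaussian kernel, the mean function $1 + 2x^2$, the noise level, the bandwidth and the names `nw_predict`, `dorfman_total` and `horvitz_thompson_total` are all assumptions made for the example and are not taken from the cited papers.

```python
import math
import random


def gaussian_kernel(u):
    """Symmetric density k(u); the Gaussian is an assumed choice."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_predict(x0, xs, ys, b):
    """Kernel-weighted mean of the sample values: sum_j w_j(x0) * y_j."""
    k = [gaussian_kernel((xj - x0) / b) / b for xj in xs]
    return sum(kj * yj for kj, yj in zip(k, ys)) / sum(k)


def dorfman_total(x_pop, y_sample, sample_idx, b):
    """Observed sample sum plus kernel predictions for the non-sampled units."""
    in_sample = set(sample_idx)
    xs = [x_pop[i] for i in sample_idx]
    predicted = sum(
        nw_predict(x_pop[i], xs, y_sample, b)
        for i in range(len(x_pop))
        if i not in in_sample
    )
    return sum(y_sample) + predicted


def horvitz_thompson_total(y_sample, pi_sample):
    """Design based total: sample values weighted by 1 / inclusion probability."""
    return sum(y / p for y, p in zip(y_sample, pi_sample))


# Hypothetical population: N = 500 units, simple random sample of n = 50.
rng = random.Random(0)
N, n = 500, 50
x_pop = [rng.random() for _ in range(N)]
y_pop = [1.0 + 2.0 * x * x + rng.gauss(0.0, 0.1) for x in x_pop]
sample_idx = rng.sample(range(N), n)
y_sample = [y_pop[i] for i in sample_idx]

t_true = sum(y_pop)
t_dorfman = dorfman_total(x_pop, y_sample, sample_idx, b=0.1)
t_ht = horvitz_thompson_total(y_sample, [n / N] * n)
print(t_true, t_dorfman, t_ht)
```

The model based estimator only predicts the non-sampled part of the total, whereas the Horvitz-Thompson estimator inflates the sampled values by the design weights and makes no use of the auxiliary values.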
However, the theory developed in Breidt and Opsomer (2000) for the local
polynomial regression estimator applies only to direct element sampling designs
with auxiliary information available for all elements of the population.
Consequently, we offer more insight on the consistency of the coverage rates
using a general super population model in two stage sampling. Ji-Yeon et al.
(2009) recently extended the work of Breidt and Opsomer (2000) to two stage
cluster sampling where the estimators are linear combinations of estimators of
cluster totals with weights that are calibrated to known control totals. They
indicated that the local polynomial regression estimators are constructed by
modeling the $M$ points $(x_i, t_i)$ as a realization from an infinite super
population model $\zeta$ in which
$$t_i = \mu(x_i) + e_i \tag{1.5}$$
where $e_i \sim N(0, \mathrm{var}(x_i))$, $\mu(x)$ is a smooth function of $x$
and $\mathrm{var}(x)$ is also smooth and strictly positive. Their estimator is
defined as
$$\hat{t}_y = \sum_{i \in s} \frac{\hat{t}_i - \hat{\mu}_i}{\pi_i} + \sum_{i \in U} \hat{\mu}_i \tag{1.6}$$
where
$$\hat{\mu}_i = e_1' \left( X_{si}' W_{si} X_{si} \right)^{-1} X_{si}' W_{si}\, \hat{t}_s \tag{1.7}$$
In equation 1.7,
$$W_{si} = \mathrm{diag}\left\{ \frac{1}{\pi_j h_m}\, k\!\left(\frac{x_j - x_i}{h_m}\right) \right\}_{j \in s}, \qquad
X_{si} = \left[ 1,\ x_j - x_i,\ \ldots,\ (x_j - x_i)^q \right]_{j \in s}$$
and $e_1$ represents the first column of the identity matrix. In their
simulation results they concluded that the estimator 1.6 is more efficient than
the Horvitz-Thompson and the linear regression estimators when the mean
function of the super population model is non linear, while being nearly as
efficient when the model is linear.
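To make equation 1.7 concrete, the sketch below evaluates the local fit for the special case $q = 1$ (local linear), where the $2 \times 2$ system has a closed form. The Epanechnikov kernel, the equal inclusion probabilities and the function name `local_linear_fit` are assumptions for illustration only.

```python
def epanechnikov(u):
    """k(u) = 0.75 * (1 - u^2) on |u| <= 1, zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear_fit(x0, xs, ts, pis, h):
    """Intercept of the design-weighted local linear fit at x0.

    This is e_1' (X'WX)^{-1} X'W t for q = 1, with
    W = diag{ k((x_j - x0) / h) / (pi_j * h) }.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xj, tj, pj in zip(xs, ts, pis):
        w = epanechnikov((xj - x0) / h) / (pj * h)
        d = xj - x0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * tj
        t1 += w * d * tj
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("bandwidth too small: fewer than two points carry weight")
    return (s2 * t0 - s1 * t1) / det


# Hypothetical cluster totals that lie exactly on a line.
xs = [i / 10 for i in range(11)]
ts = [3.0 + 2.0 * x for x in xs]
pis = [0.2] * len(xs)
fit_at_half = local_linear_fit(0.5, xs, ts, pis, h=0.3)
print(fit_at_half)
```

A local linear fit reproduces a linear mean function without smoothing bias, which is consistent with the reported finding that estimator 1.6 is nearly as efficient as linear regression when the model is linear.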
Recently, Jan and Elinor (2008) considered the problem of estimating the
population total in two-stage cluster sampling when cluster sizes are known
only for the sampled clusters, making use of a population model arising from a
variance component model. They considered the application of the predictive
likelihood technique in estimation of the unknown part of the population total
$$T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij} \tag{1.8}$$
where $N$ is the number of primary sampling units or clusters and each cluster
consists of $m_i$ units which are only known for the sampled clusters, and
$y_{ij}$ is the value of the variable of interest for unit $j$ of the $i$th
cluster. They assumed the population model defined by the equations
$$E(M_i) = \beta x_i, \quad \mathrm{var}(M_i) = \sigma^2\, \mathrm{var}(x_i), \quad \mathrm{cov}(M_i, M_j) = 0 \tag{1.9}$$
$$E(Y_{ij}) = \mu, \quad \mathrm{var}(Y_{ij}) = \tau^2, \quad \mathrm{cov}(Y_{ij}, Y_{ik}) = \rho\tau^2 \tag{1.10}$$
in cases where $j \neq k$ and $\rho \geq 0$. To predict the unobserved value of
$Z$ in the estimate of the population total $T$ given by
$$\hat{T} = \sum_{i=1}^{N} \sum_{j=1}^{M_i} Y_{ij} + \hat{Z} \tag{1.11}$$
they developed a partial likelihood for $Z$, $L(z, y)$, from the generalized
joint likelihood for the unknown quantities $z$ and $\theta$ given by
$$L(z, y) = f_\theta(z, y) \tag{1.12}$$
They applied both the model and the design based Horvitz-Thompson estimator of
the population total,
$$T_{HT} = \frac{x}{n_0} \sum_{i \in s} \frac{m_i y_i}{x_i} \tag{1.13}$$
where $n_0$ represents the number of primary sampling units selected in the
first stage. In their simulation, they considered three coverage measures of
$Z$: the model based coverage over the joint distribution of $Y$ and $Z$, the
design based coverage over the sampling design, and the unconditional coverage
regarding the total sample as a stochastic variable. They concluded that for a
small number $n_0$ of sampled clusters, the three intervals differ
significantly, but for large $n_0$, the three intervals are practically
identical.
Further, a comprehensive simulation study of the model based and the design
coverage properties of the prediction intervals indicates that for large sample
sizes, the coverage measures achieve approximately the nominal level
$1 - \alpha$ and are slightly less than $1 - \alpha$ for moderately large
samples. For small sample sizes, the coverage measures are about
$1 - 2\alpha$, being raised to $1 - \alpha$ for a modified interval based on
the $t_{n_0 - 2}$ distribution. We note that the models 1.9 and 1.10 assume
that the regression line passes through the origin and that the auxiliary
values $X_i$ are considered independent. The questions raised in subsection 1.3
therefore remain unanswered.
1.5 Review of confidence intervals in survey sampling
Confidence intervals are usually constructed around point estimators in order to
provide a properly scaled measure of uncertainty associated with the estimator.
The conventional method is based on the assumption that the sample size is
large enough for the Central Limit Theorem to hold. This is however not always
true in practice.
As a consequence, Do and Kokic (2001) and Chambers and Dorfman (2002) applied
the bootstrap method to develop model based confidence intervals to address
situations where the sample sizes are not large. They also proposed
modifications of the procedure to account for misspecifications in a working
model. They further noted that successive model refinements and estimators
obtained using the bootstrap approach are more efficient than their competing
estimators. However, the evidence of the extended simulation study on the beef
population showed that the procedure did not fully attain its goal. They
therefore recommended the construction of sounder confidence intervals using
the bootstrap approach.
Ouma and Wafula (2005) suggested the use of a general super population model
$$Y_i = m(x_i) + e_i \tag{1.14}$$
where $i = 1,2,3,\ldots,N$, $m(\cdot)$ is a smooth function, and $e_i$ is an
independent random variable with mean zero and constant variance. They used a
bandwidth of 1.5 and simple random sampling with replacement to generate the
values of the survey variable $Y$. In their empirical study, they established
that their coverage rates were higher than those of Chambers and Dorfman
(2002). We now extend this to two stage cluster sampling.
1.6 Review of two stage cluster sampling
Let $U$ be a finite population of $N$ primary sampling units (psus) or clusters
labeled $1, 2, \ldots, N$, so that $U = \{1, 2, \ldots, N\}$ where $N$ is a
known number, and let $M_i$, $i = 1, 2, \ldots, N$, be the number of secondary
sampling units (ssus) in the $i$th psu. Let $y_{ij}$, $i = 1, 2, \ldots, N$,
$j = 1, 2, \ldots, M_i$, be the value of the response variable $Y$ for the ssu
$j$ belonging to the psu $i$. In previous works, an assumption has been made
that the element specific auxiliary data $x_{ij}'$, $i = 1, 2, \ldots, N$,
$j = 1, 2, \ldots, M_i$, are known for all clusters and population elements,
respectively.
For our case, we assume that the cluster sizes are known only for the sampled
clusters and therefore the survey values $y_{ij}$, $i = 1, 2, \ldots, N$,
$j = 1, 2, \ldots, M_i$, are generated using the model
$$\hat{Y}_{ij} = \hat{m}(x_{ij}) + \hat{e}_{ij} \tag{1.15}$$
with $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M_i$.
2. Proposed Estimator for population total.
Jan and Elinor (2008) used the model 1.15 to define the population total as
$$T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij} \tag{2.1}$$
where $N$ is the number of primary sampling units or clusters and each cluster
consists of $m_i$ units which are only known for the sampled clusters, and
$y_{ij}$ is the value of the variable of interest for unit $j$ of the $i$th
cluster. Referring to the same model 1.15 we may write that
$$\hat{T} = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij} + \sum_{i=1}^{N} \sum_{j=m_i+1}^{M_i} \hat{y}_{ij}
= \sum_{i \in s} \sum_{j \in s_i} Y_{ij} + \hat{Z} \tag{2.2}$$
and it follows that the problem is now reduced to that of predicting the
unobserved values $z$ of the random variable $Z$. To do this, we apply the
general model 1.15 to predict the values of the unobserved survey variables
$y_{ij}$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M_i$. Therefore the
estimate of the population total is given by
$$\hat{T} = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij} + \sum_{i=1}^{N} \sum_{j=m_i+1}^{M_i} \left( \hat{m}(x_{ij}) + \hat{e}_{ij} \right) \tag{2.3}$$
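A minimal sketch of how equation 2.3 could be evaluated is given below. It assumes a Gaussian kernel smoother pooled over all sampled units for $\hat{m}$, and draws each $\hat{e}_{ij}$ at random from the sample residuals; both choices, the data layout and the names `kernel_smoother` and `two_stage_total` are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def kernel_smoother(x0, xs, ys, b):
    """Kernel-weighted estimate of m(x0); the Gaussian kernel is assumed."""
    w = [math.exp(-0.5 * ((xj - x0) / b) ** 2) for xj in xs]
    return sum(wj * yj for wj, yj in zip(w, ys)) / sum(w)


def two_stage_total(clusters, b, rng):
    """Observed values plus m_hat(x) + e_hat for every unobserved unit.

    clusters: list of dicts with keys 'x_obs', 'y_obs' (sampled units) and
    'x_unobs' (auxiliary values of units whose y is not observed).
    """
    xs = [x for c in clusters for x in c["x_obs"]]
    ys = [y for c in clusters for y in c["y_obs"]]
    residuals = [y - kernel_smoother(x, xs, ys, b) for x, y in zip(xs, ys)]
    predicted = 0.0
    for c in clusters:
        for x in c["x_unobs"]:
            # e_hat is resampled from the sample residuals (an assumption)
            predicted += kernel_smoother(x, xs, ys, b) + rng.choice(residuals)
    return sum(ys) + predicted


# Hypothetical data: 5 clusters, 8 observed and 12 unobserved units each.
rng = random.Random(3)
clusters = []
for _ in range(5):
    x_obs = [rng.random() for _ in range(8)]
    clusters.append({
        "x_obs": x_obs,
        "y_obs": [1.0 + 2.0 * x + rng.gauss(0.0, 0.2) for x in x_obs],
        "x_unobs": [rng.random() for _ in range(12)],
    })
t_hat = two_stage_total(clusters, b=0.2, rng=rng)
print(t_hat)
```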
3. Proposed Bootstrap Confidence Interval
Under the model based approach, the sampling distribution of the estimator
corresponds to the distribution of possible alternative point estimates that
could arise given the selection of the same sample $S$ from populations similar
to the actual underlying population of the observed data. To construct a
confidence interval for $T$ that reflects the actual finite sample and finite
population characteristics of the distribution of $\hat{T}$, we estimate such a
distribution from the sample data. For our case, we make use of the sample data
and the working model 1.15 to generate a sequence of alternative realizations
of $Y$ using non-parametric estimates of $m(x_{ij})$. Let $Y^*_{ij}$ be an
estimator of the values of $Y$, where
$$Y^*_{ij} = \hat{m}(x_{ij}) + e_{ij} \tag{3.1}$$
In equation 3.1, $e_{ij}$ is selected via two stage cluster sampling with
replacement from $\{ e_{ij} : i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, m_i \}$.
Having obtained the bootstrap population, the bootstrap version
$\hat{T}^*_1$ of $\hat{T}_1$, using the same sample as the parent sample, is
calculated. The process is then repeated a large number, $B$, of times to
obtain $\hat{T}^*_{i1}, \hat{T}^*_{i2}, \ldots, \hat{T}^*_{iB}$. Then the
bootstrap confidence interval is obtained using
$$\left( Q^*\!\left(\tfrac{\alpha}{2}\right),\ Q^*\!\left(1 - \tfrac{\alpha}{2}\right) \right)$$
where $Q^*(p)$ is the $p$th quantile of the bootstrap distribution.
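The procedure described above can be sketched roughly as follows. The Gaussian kernel smoother, the simple order-statistic quantile rule, the simulated data and all function names are assumptions; residuals are drawn in two stages (a cluster, then a residual within it), both with replacement.

```python
import math
import random


def smooth(x0, xs, ys, b):
    """Kernel-weighted estimate of m(x0); the Gaussian kernel is assumed."""
    w = [math.exp(-0.5 * ((xj - x0) / b) ** 2) for xj in xs]
    return sum(wj * yj for wj, yj in zip(w, ys)) / sum(w)


def predicted_total(xs, ys, x_unobs, b):
    """Sample sum plus kernel predictions for the unobserved units."""
    return sum(ys) + sum(smooth(x, xs, ys, b) for x in x_unobs)


def quantile(values, p):
    """p-th quantile as an order statistic of the bootstrap replicates."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(p * len(ordered)) - 1))
    return ordered[idx]


def bootstrap_interval(x_by_cluster, y_by_cluster, x_unobs, b, B, alpha, rng):
    xs = [x for c in x_by_cluster for x in c]
    ys = [y for c in y_by_cluster for y in c]
    fitted = [smooth(x, xs, ys, b) for x in xs]
    # keep the residuals cluster by cluster for the two-stage draw
    resid, pos = [], 0
    for c in y_by_cluster:
        resid.append([ys[pos + j] - fitted[pos + j] for j in range(len(c))])
        pos += len(c)
    replicates = []
    for _ in range(B):
        # Y* = m_hat(x) + e*, with e* drawn from a random cluster's residuals
        y_star = [f + rng.choice(rng.choice(resid)) for f in fitted]
        replicates.append(predicted_total(xs, y_star, x_unobs, b))
    lower = quantile(replicates, alpha / 2)
    upper = quantile(replicates, 1 - alpha / 2)
    return lower, upper, replicates


# Hypothetical sample: 4 clusters of 10 observed units, 100 unobserved units.
rng = random.Random(1)
x_by_cluster = [[rng.random() for _ in range(10)] for _ in range(4)]
y_by_cluster = [[1.0 + 2.0 * x + rng.gauss(0.0, 0.2) for x in c]
                for c in x_by_cluster]
x_unobs = [rng.random() for _ in range(100)]
lower, upper, reps = bootstrap_interval(
    x_by_cluster, y_by_cluster, x_unobs, b=0.2, B=200, alpha=0.05, rng=rng)
print(lower, upper)
```

The sample and its auxiliary values are held fixed across replicates; only the survey values are regenerated, which matches the model based view that the sampling distribution arises from alternative populations rather than alternative samples.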
4. Properties of the proposed estimator and resulting confidence interval
4.1 Unbiasedness of the model
Considering the model 1.15 we may write that
$$\hat{e}_{ij} = Y_{ij} - \hat{m}(x_{ij}) \tag{4.1}$$
and
$$W^*(x_{ij}) = W_{ij} \left[ 1 - \left(\frac{m}{n-1}\right)^{1/2} + \left(\frac{m}{n-1}\right)^{1/2} \frac{n}{m}\, r_i \right] \tag{4.2}$$
where $m$ is the bootstrap sample size, $r_i$ is the number of times the $i$th
primary sampling unit is selected, $x_{ij}$ is the $j$th observation made from
the $i$th cluster, and $W(x_{ij})$ is the initial sampling weight of the
secondary sampling unit, equal to the inverse of its selection probability,
that is,
$$W(x_{ij}) = \frac{1}{\pi_{ij}} \tag{4.3}$$
with $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, n_i$.
However, there is considerable benefit and little loss in choosing
$m = n - 1$ (Rao and Wu, 1998). Therefore,
$$W^*(x_{ij}) = \left(\frac{n}{n-1}\right)^{1/2} r_i\, \frac{1}{\pi_{ij}} \tag{4.4}$$
$i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m_i$, and
$$E(\hat{e}_{ij}) = E\left( Y_{ij} - \hat{m}(x_{ij}) \right) \tag{4.5}$$
So
$$\hat{m}(x_{ij}) = \sum_{i,j \in s} W^*(x_{ij})\, Y_{ij} \tag{4.6}$$
($i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n_i$), yielding
$$E\left[\hat{m}(x_{ij})\right] = E\left[ \left(\frac{n}{n-1}\right)^{1/2} \sum_{i} \sum_{j} r_i\, \frac{Y_{ij}}{\pi_{ij}} \right] \tag{4.7}$$
Now, let the initial sampling weights of the secondary sampling units,
$W(x_{ij}) = 1/\pi_{ij}$, be the kernel based weights. Then we have
$$W(x_{ij}) = \frac{K_b(x_{ij} - x_{ik})}{\sum K_b(x_{ij} - x_{ik})} \tag{4.8}$$
with $\sum_{ij \in s} w(x_{ij}) = 1$. Further, $b$ being a scaling factor,
$K_b(u) = b^{-1} K(u/b)$ and $k(u)$ is a symmetric density function defined
for $-\infty < u < \infty$, with the symbols bearing their usual meanings, then
(a) $\int k(u)\, du = 1$, (b) $\int k^2(u)\, du < \infty$,
(c) $\int |u|^3 k^2(u)\, du < \infty$ and (d) $k(u) = k(-u)$.
Therefore,
$$E(\hat{e}_{ij}) = m(x_{ij}) - E\left[\hat{m}(x_{ij})\right] \tag{4.9}$$
But as $b \to 0$ and $nb \to \infty$, $\hat{m}(x_{ij}) \to m(x_{ij})$, meaning
that $E[\hat{e}_{ij}] = 0$, which is the mean of $e_{ij}$ in model 1.15,
completing the proof that the proposed model is unbiased.
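The limiting argument above can be illustrated numerically. The sketch below computes $E[\hat{m}(x_0)]$ exactly on a fixed, equally spaced design (the kernel-weighted average of the true mean function) and shows the smoothing bias shrinking as $b \to 0$ with $nb \to \infty$. The Gaussian kernel, the mean function $m(x) = x^2$, the design and the sequence of $(n, b)$ pairs are assumptions chosen for the example.

```python
import math


def expected_fit(x0, n, b, m):
    """E[m_hat(x0)] on the fixed design x_j = (j + 0.5) / n.

    Because E[Y_j] = m(x_j), the expectation of the kernel smoother is the
    kernel-weighted average of the true mean function.
    """
    xs = [(j + 0.5) / n for j in range(n)]
    w = [math.exp(-0.5 * ((xj - x0) / b) ** 2) for xj in xs]
    return sum(wj * m(xj) for wj, xj in zip(w, xs)) / sum(w)


def m(x):
    """Assumed smooth mean function."""
    return x * x


x0 = 0.5
settings = [(50, 0.20), (200, 0.10), (1000, 0.05)]  # b -> 0 while n*b grows
biases = [abs(expected_fit(x0, n, b, m) - m(x0)) for n, b in settings]
for (n, b), bias in zip(settings, biases):
    print(n, b, bias)
```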
4.2 Asymptotic variance of the error term
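From subsection 4.1, it follows that
$$\mathrm{Var}\left(\hat{e}_{ij}\right) = E\left(\hat{e}_{ij}^{\,2}\right) \tag{4.10}$$
Therefore
$$\mathrm{Var}\left(\hat{e}_{ij}\right) = E\left(\hat{e}_{ij}^{\,2}\right)
= E\left[ Y_{ij} - \hat{m}(x_{ij}) \right]^2
= E\left[Y_{ij}^2\right] - 2 E\left[Y_{ij}\right] E\left[\hat{m}(x_{ij})\right] + E\left[\hat{m}(x_{ij})^2\right] \tag{4.11}$$
which leads to
$$E\left(\hat{e}_{ij}^{\,2}\right) = \sigma^2(x_{ij}) + \mathrm{Var}\left[\hat{m}(x_{ij})\right] \tag{4.12}$$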
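But
$$\mathrm{Var}\left[\hat{m}(x_{ij})\right] = \mathrm{Var}\left[ \left(\frac{n}{n-1}\right)^{1/2} (n-1)^{-1} b^{-1} \sum_{i \in s} k\!\left(\frac{x_{ij} - x_{ik}}{b}\right) Y_{ij}\ \hat{d}_s(x_{ij})^{-1} \right] \tag{4.13}$$
where $d_s(x_{ij})^{-1} = \left[ \sum k_b(x_{ij} - x_{ik}) \right]^{-1}$, so that
$$\mathrm{Var}\left[\hat{m}(x_{ij})\right] = \left(\frac{n}{n-1}\right) (n-1)^{-2} b^{-2} \sum_{i \in s} k^2\!\left(\frac{x_{ij} - x_{ik}}{b}\right) \sigma^2(x_{ij})\ \hat{d}_s(x_{ij})^{-2} \tag{4.14}$$
But
$$\hat{d}_s(x_{ij})^{-2} = d_s(x_{ij})^{-2} \left[ 1 + \frac{b^2 k_2\, d_s''(x_{ij})}{d_s(x_{ij})} + O\!\left( b^3 + (n-1)^{-1/2} b^{-1/2} \right) \right] \tag{4.15}$$
and
$$k\!\left(\frac{x_{ij} - x_{ik}}{b}\right) \sigma(x_{ij}) = b\, \sigma(x_{ij})\, d_s(x_{ij}) + O\!\left( b^3 + b^{1/2} \right) \tag{4.16}$$
So using equations 4.15 and 4.16 in equation 4.14 we have that
$$\mathrm{Var}\left[\hat{m}(x_{ij})\right] = \left(\frac{n}{n-1}\right) (n-1)^{-2} b^{-2} \sum_{i \in s} \sum_{k \in i} k^2\!\left(\frac{x_{ij} - x_{ik}}{b}\right) \sigma^2(x_{ij})\ \hat{d}_s(x_{ij})^{-2} \tag{4.17}$$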
Next we obtain the asymptotic expansion of $\mathrm{Var}[\hat{m}(x_{ij})]$
using the following theorem as a basis.
Theorem
Let $k(u)$ be a symmetric density function with $\int u\, k(u)\, du = 0$ and
$k_2 = \int u^2 k(u)\, du > 0$. Assume that $n$ and $N$ increase together such
that $n/N \to \pi$ with $0 < \pi < 1$. If further the sampled and non sampled
values of $x$ are in the interval $[c, d]$ and are generated by densities
$d_s$ and $d_{p-s}$ respectively, both bounded away from zero on $[c, d]$ and
with continuous second derivatives, and if for any expression of $Z$, it can be
shown explicitly that $E[Z/U] = A(U) + O(B)$ and $\mathrm{Var}[Z/U] = O(C)$,
then $Z = A(U) + O_p\!\left(B + C^{1/2}\right)$.
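Using this theorem, we may write equation 4.17 as
$$\mathrm{Var}\left[\hat{m}(x_{ij})\right] = \left(\frac{n}{n-1}\right) (n-1)^{-1} \sigma^2(x_{ij}) \left[ 1 + O\!\left( (nb)^{-1/2} \right) \right] \tag{4.18}$$
which reduces to
$$\mathrm{Var}\left[\hat{m}(x_{ij})\right] = \frac{n}{(n-1)^2}\, \sigma^2(x_{ij}) \left[ 1 + O\!\left( (nb)^{-1/2} \right) \right] \tag{4.19}$$
and noting that as the number, $m_i = n$, of the second stage samples tends to
be large, $n/(n-1) \to 1$, so that we have
$$\mathrm{Var}\left[\hat{m}(x_{ij})\right] = \frac{\sigma^2(x_{ij})}{n-1} \left[ 1 + O\!\left( (nb)^{-1/2} \right) \right] \tag{4.20}$$
Again, as $nb \to \infty$,
$$\mathrm{Var}\left[\hat{m}(x_{ij})\right] = \frac{\sigma^2(x_{ij})}{n-1} \tag{4.21}$$
hence
$$\mathrm{Var}\left[\hat{e}_{ij}\right] = \sigma^2(x_{ij}) + \frac{\sigma^2(x_{ij})}{n-1} \tag{4.22}$$
which, as $n \to \infty$, reduces to
$$\mathrm{Var}\left[\hat{e}_{ij}\right] = \sigma^2(x_{ij}) \tag{4.23}$$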
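As an informal check of the limit in equation 4.23, the sketch below fits a kernel smoother to one simulated sample and compares the mean squared residual with the error variance used to generate the data. The Gaussian kernel, the mean function, $\sigma = 0.5$, $n = 400$ and $b = 0.1$ are assumptions, and the single-stage setting is a simplification of the two stage design.

```python
import math
import random


def smooth(x0, xs, ys, b):
    """Kernel-weighted estimate of m(x0); the Gaussian kernel is assumed."""
    w = [math.exp(-0.5 * ((xj - x0) / b) ** 2) for xj in xs]
    return sum(wj * yj for wj, yj in zip(w, ys)) / sum(w)


rng = random.Random(42)
n, b, sigma = 400, 0.1, 0.5
xs = [rng.random() for _ in range(n)]
ys = [1.0 + 2.0 * x * x + rng.gauss(0.0, sigma) for x in xs]

# residuals e_hat = Y - m_hat(x), as in equation 4.1
resid = [y - smooth(x, xs, ys, b) for x, y in zip(xs, ys)]
resid_var = sum(e * e for e in resid) / n
print(resid_var, sigma ** 2)
```

For a sample of this size the mean squared residual should sit close to $\sigma^2 = 0.25$, with the gap reflecting the variance of $\hat{m}$ and the smoothing bias near the boundaries.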
4.3 Conditional relative bias of the estimator for the population total
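From its definition, the conditional relative bias of using $\hat{T}$ as an
estimator of $T$ is
$$E\left[ \frac{\hat{T} - T}{T} \right] = \frac{E(\hat{T}) - E(T)}{T} \tag{4.24}$$
In this case,
$$\hat{T} = \sum_{i=1}^{N} \sum_{j=1}^{m_i} Y_{ij} + \sum_{i=1}^{N} \sum_{j=m_i+1}^{M_i} \left[ \hat{m}(x_{ij}) + \hat{e}_{ij} \right] \tag{4.25}$$
meaning that
$$\frac{E(\hat{T}) - E(T)}{T} = \frac{1}{T} \left\{ E\left[ \sum_{i} \sum_{j} \left( \hat{m}(x_{ij}) + \hat{e}_{ij} \right) \right] - E\left[ \sum_{i} \sum_{j} \left( m(x_{ij}) + e_{ij} \right) \right] \right\} \tag{4.26}$$
Using equation 4.9 in equation 4.26, it can be seen that as $n \to \infty$
this bias, $\left( E(\hat{T}) - E(T) \right)/T$, asymptotically tends to zero.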
5. Empirical Study
5.1 Description of the simulation experiment
Simulation experiments were performed in order to compare the performance of
the model based regression estimator with that of the model assisted local
polynomial regression estimator in two-stage element sampling due to Ji-Yeon et
al. (2009).
To obtain the model based estimator for the population total, the $X_i$ are
generated as independent and identically distributed uniform (0, 1) random
variables. The population consists of 100 clusters. In stage one a sample of
$n_i = 20$ clusters, which form the primary sampling units, is taken from the
total of $N_i = 100$ clusters using simple random sampling with replacement.
In stage two, from each selected cluster, say $i$ $(i = 1, 2, \ldots, n_i)$, we
select a sample $m_{ij}$, $j = 1, 2, \ldots, 50$, from
$j = 1, 2, \ldots, m_k, \ldots, 1{,}000$, that is, the $j$th sample from a
fixed selected $i$th cluster, using simple random sampling with replacement
from the total of $M_k = 1000$ elements. We consider the variable of interest
$Y_{ij}$, $j = 1, 2, \ldots, m_k, \ldots, M_k$, which is known only for the
sample, and using the known auxiliary variables $x_{ij}$,
$j = 1, 2, \ldots, M_k$, we generate the non sample values using the model
given in 1.14.
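The sampling design just described can be sketched as follows: 100 clusters of 1000 elements with uniform (0, 1) auxiliary values, 20 clusters drawn with replacement, and 50 elements drawn with replacement inside each selected cluster. The mean function and error standard deviation used to generate $Y$ are not stated in the text, so the ones below, like the function names, are placeholders.

```python
import random


def build_population(n_clusters, cluster_size, m, sigma, rng):
    """Each cluster is a pair (x values, y values) with y = m(x) + e."""
    population = []
    for _ in range(n_clusters):
        x = [rng.random() for _ in range(cluster_size)]
        y = [m(xi) + rng.gauss(0.0, sigma) for xi in x]
        population.append((x, y))
    return population


def two_stage_sample(population, n_psu, n_ssu, rng):
    """Simple random sampling with replacement at both stages.

    Returns a list of (cluster index, [unit indices within that cluster]).
    """
    chosen = [rng.randrange(len(population)) for _ in range(n_psu)]
    return [
        (i, [rng.randrange(len(population[i][0])) for _ in range(n_ssu)])
        for i in chosen
    ]


rng = random.Random(2010)
population = build_population(
    n_clusters=100, cluster_size=1000,
    m=lambda x: 1.0 + 2.0 * x,   # placeholder mean function
    sigma=0.1,                   # placeholder error standard deviation
    rng=rng,
)
sample = two_stage_sample(population, n_psu=20, n_ssu=50, rng=rng)
true_total = sum(sum(y) for _, y in population)
print(len(sample), true_total)
```

Because sampling is with replacement, the same cluster or element can appear more than once in `sample`; a with-replacement bootstrap then mirrors the original design at each stage.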
To simulate bootstrap values of $M_k$ independent samples of size $m_k$ we use
simple random sampling with replacement within cluster $i$ in order to obtain
the bootstrap population values, and the model
$Y^*_{ij} = \hat{m}(x_{ij}) + \hat{e}_{ij}$ to obtain
$y^*_{i1}, y^*_{i2}, y^*_{i3}, \ldots, y^*_{iM_k}$. Further, we let
$K(u) \sim U[0, 1]$ so that the model based regression estimator for the
population total is given by equation 2.3, where $\hat{m}(x_{ij})$ is defined
by equation 1.14 and $\hat{e}_{ij} \sim \left(0, \sigma^2(x_{ij})\right)$.
This procedure is repeated a large number, 1000, of times such that we have
$\hat{T}^*_{i1}, \hat{T}^*_{i2}, \hat{T}^*_{i3}, \ldots, \hat{T}^*_{i1000}$. We
then construct the 95% confidence intervals for the population total
$\hat{T}^*_i$, $i = 1, 2, \ldots, N$. Similarly, we compute the local
polynomial estimator for the population total suggested by Ji-Yeon et al.
(2009) given in equation 1.6. For each mean function value of $x_{ij}$, each
study variable $y_{ij}$ $(j = 1, 2, \ldots)$ for $m_k$ values from $M_k$
elements is generated as
$$y_{jk} = \frac{\mu_j(x_{ij})}{M_k} + \frac{e_{jk}}{M_k^{1/2}} \tag{5.1}$$
where $y_{jk}$ is the $k$th observation made from the $j$th cluster, and
$e_{ij} \sim N\left(0, \sigma^2(x_{ij})\right)$.
Using the above bootstrap procedure, the bootstrap estimate of the population
total
$$\hat{T}^* = \sum_{i \in c} \hat{m}(x_{ij}) + \sum_{i \in s} \frac{y_i - m_i}{\pi_i} \tag{5.2}$$
is calculated, where $\pi_i = \Pr\{i \in s\}$,
$$\hat{m}_i = \mathrm{diag}\left\{ \frac{1}{\pi_{ij}}\, \frac{1}{h}\, k\!\left(\frac{x_{ij} - x_{ik}}{h}\right) \right\}, \quad i, j \in s,$$
and $m(x_{ij})$ is as defined in equation 1.14. Similarly, we construct the
confidence interval for the population total $T$ and compare the performance of
the developed model in estimating $T$ with that due to Ji-Yeon et al. (2009) on
local polynomial regression estimation in two-stage sampling.
In computing the model assisted local polynomial regression estimators, we
again adopt a method due to Ji-Yeon et al. (2009), in order to have a realistic
comparative study. We therefore apply the Epanechnikov kernel
$$k(u) = \frac{3}{4}\left(1 - u^2\right) I_{\{|u| \leq 1\}}$$
and different bandwidths for computation of the local polynomial regression
estimator of population totals. This helps to compare the bias, mean squared
errors and confidence interval lengths of the estimators using both the model
based and model assisted local polynomial regression approaches.
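For reference, the Epanechnikov kernel can be coded and checked numerically as below. The midpoint-rule integrator and its step count are illustrative choices; the checks confirm that the kernel integrates to one, is symmetric with zero first moment, and has second moment $k_2 = 1/5$.

```python
def epanechnikov(u):
    """k(u) = (3/4) * (1 - u^2) for |u| <= 1 and zero otherwise."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def integrate(f, lo, hi, steps=20000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


mass = integrate(epanechnikov, -1.0, 1.0)
first_moment = integrate(lambda u: u * epanechnikov(u), -1.0, 1.0)
second_moment = integrate(lambda u: u * u * epanechnikov(u), -1.0, 1.0)
print(mass, first_moment, second_moment)
```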
5.2 Simulation Results
Table 1 gives the results of Mean Squared Error of the model based MSEmb and
the Local Linear Polynomial regression estimator of the population total in two
stage cluster sampling.
Table 1: Mean Squared Error (MSE)
Bandwidth    LP           MB
0.005        0.7845631    1.2684580
0.006        0.7961103    1.7980100
0.007        0.7528211    1.4856560
0.008        0.7523094    1.4641440
0.009        0.7909287    1.3896480
0.010        0.7740691    0.1187740
0.020        0.7615400    0.0846216
0.030        0.7653660    0.0752471
0.040        0.7639690    0.0720307
0.050        0.7621990    0.0672295
0.060        0.7740690    1.1703404
0.070        0.7615400    0.7534386
0.080        0.7551610    0.7534386
It can be seen that at bandwidths below 0.010 the MSE of the model based
estimator is higher than that of the local polynomial regression estimator.
From a bandwidth of 0.010 up to 0.050, the MSEmb drops sharply and remains low,
before rising again at 0.060. It is important to note that an increase in the
bandwidth does not significantly change the MSE of the local polynomial
regression estimator (LPRE). Over the bandwidths 0.010 to 0.050, the model
based estimator is more efficient than the local polynomial estimator of the
total. Table 2 is a summary of bias for the model based estimator of the
population total and the local polynomial regression estimator.
Table 2: Summary Results of Bias
Bandwidth    LP            MB
0.005        -0.0443350     0.00399122
0.006        -0.0381211     0.00200561
0.007        -0.0380143    -0.0004676
0.008        -0.0384639    -0.0036316
0.009        -0.0404816    -0.0001817
0.010        -0.0985294    -0.0032099
0.020        -0.0360501    -0.0046855
0.030        -0.0396870    -0.0265196
0.040        -0.0536516    -0.0156076
0.050        -0.0552965    -0.0201331
0.060        -0.0985294    -0.0384731
0.070        -0.0360501    -0.0538332
0.080        -0.0496699    -0.0538332
For bandwidths up to 0.060, the bias of the model based estimator is smaller in
magnitude than that of the local polynomial regression estimator, and it is
much smaller for bandwidths up to 0.020. The large bias associated with the
local polynomial estimator is reflected in the values of its estimates, which
are much lower than the true simulated population total of 99.5078. This can
best be attributed to the choice of the variance. The precision of estimation
can be improved by choosing a smaller value of the variance. Table 3 presents a
summary of the estimated population totals for the model based and the local
polynomial regression estimators in two stage sampling.
Table 3: Summary Results of Estimated Population Totals
Bandwidth    LP          MB
0.005        60.81535    99.27500
0.006        61.38900    99.01420
0.007        61.49330    99.92000
0.008        60.93400    99.20200
0.009        59.02400    99.32200
0.010        58.64900    97.47400
0.020        60.24100    93.09100
0.030        59.81800    88.08000
0.040        59.34000    83.90000
0.050        59.30700    79.47900
0.060        58.64900    59.49500
0.070        60.24000    59.82600
0.080        59.82400    59.82600
Table 4 gives the lengths of the 95% confidence intervals for the model based
and local polynomial regression estimators.
Table 4: Confidence Interval Lengths
Bandwidth    LP          MB
0.01         38.97447    1.531668
0.02         40.74652    1.315887
0.03         39.52278    1.227330
0.04         38.87707    1.212220
0.05         38.72655    1.148093
0.06         38.93319    38.85571
0.07         38.86739    40.20200
0.08         38.54396    38.60400
The confidence intervals generated by the model based method are much tighter
than those generated by the local polynomial method at lower bandwidths, but at
larger bandwidths both the LPRE and the model based estimators of the
population total perform poorly. We note that the best performing confidence
interval is one whose coverage rate is close to the nominal level and whose
length is small. Consequently, the model based estimators are far better than
their local polynomial regression counterparts. The results in general show
that the model based approach outperforms the model assisted method at the 95%
confidence level. The bias under the model based approach is also much lower.
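The quantities reported in Tables 1, 2 and 4 can be computed from simulation replicates along the following lines. The exact definitions used for the tables (for instance, whether bias is scaled by the true total) are not spelled out in the text, so the unscaled versions below, and the function name `summarise`, are assumptions.

```python
def summarise(estimates, intervals, truth):
    """Bias, MSE, mean interval length and coverage over replicates.

    estimates: point estimates of the total, one per replicate
    intervals: (lower, upper) confidence limits, one pair per replicate
    truth: the true (simulated) population total
    """
    r = len(estimates)
    bias = sum(t - truth for t in estimates) / r
    mse = sum((t - truth) ** 2 for t in estimates) / r
    length = sum(hi - lo for lo, hi in intervals) / len(intervals)
    coverage = sum(1 for lo, hi in intervals if lo <= truth <= hi) / len(intervals)
    return {"bias": bias, "mse": mse, "length": length, "coverage": coverage}


# Tiny hand-made example with a known answer.
summary = summarise(
    estimates=[9.0, 11.0],
    intervals=[(8.0, 12.0), (10.5, 11.0)],
    truth=10.0,
)
print(summary)
```

Reporting coverage alongside length guards against favouring an interval that is short only because it misses the true total.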
References
1. Breidt, F. and Opsomer, J. (2000). Local polynomial regression estimators in
survey sampling. The Annals of Statistics, 28:1026–1053.
2. Chambers, R. and Dorfman, A. (2002). Robust sample survey inference via
bootstrapping and bias correction-the case of the ratio estimator. Technical
report, Southampton Statistical Sciences Research Institute, University of
Southampton.
3. Cheng, P. (1994). Nonparametric estimation of mean functional with data
missing at random. Journal of the American Statistical Association, 89:
81–87.
4. Do, K. and Kokic, P. (2001). Bootstrap Variance and confidence interval
estimation for model- based surveys. Australia National University.
5. Dorfman, R. (1992). Nonparametric regression for estimating totals in finite
population. In Section on Survey Research Methods, Journal of the
American Statistical Association, pages 622–625.
6. Jan, F. and Elinor, Y. (2008). Two stage sampling from a predictive point of
view when the cluster sizes are unknown. Biometrika, 95 (1): 187–204.
7. Ji-Yeon, K., Breidt, F., and Opsomer, J. (2009). Nonparametric regression
estimation of finite population totals under two-stage cluster sampling.
Technical report, Department of Statistics, Colorado State University.
8. Ouma, C. and Wafula, C. (2005). Bootstrap confidence interval for model
based surveys. East African Journal of Statistics, 1: 14–18.
9. Rao, J. and Wu, C. (1998). Re sampling inference with complex survey data.
Journal of the American Statistical Association, 83: 231–241.
10. Smith, T. (1976). The foundations of survey sampling a review. Journal of
the Royal Statistical Society, Series A, 139: 183–198.
11. Smith, T. (1994). Sample surveys 1975, 1990. an age reconciliation.
International Review, 62: 3–34.
... Assuming non-response in the second stage of sampling, the problem is therefore to estimate the values of ˆi j Y . To do this, a linear regression model applied by [4] and [5] given below is used; ...
... A detailed work done by [5] proved that [6]. Thus we get the following: ...
... ij E e = , for details see [5]. ...
Article
Full-text available
Developing finite population estimators of parameters such as mean, variance, and asymptotic mean squared error has been one of the core objectives of sample survey theory and practice. Sample survey practitioners need to assess the properties of these estimators so that better ones can be adopted. In survey sampling, the occurrence of nonresponse affects inference and optimality of the estimators of finite population parameters. It introduces bias and may cause samples to deviate from the distributions obtained by the original sampling technique. To compensate for random nonresponse, imputation methods have been proposed by various researchers. However, the asymptotic bias and variance of the finite population mean estimators are still high under this technique. In this paper, transformation of data weighting technique is suggested. The proposed estimator is observed to be asymptotically consistent under mild assumptions. Simulated data show that the estimator proposed is much better than its rival estimators for all the different mean functions simulated.
Non-response is a regular occurrence in sample surveys. Developing estimators when non-response exists may result in large biases during estimation of population parameters. In this paper, a finite population mean is estimated when non-response occurs at random under two stage cluster sampling with replacement. It is assumed that non-response arises in the survey variable in the second stage of cluster sampling. A weighting method of compensating for non-response is applied. Asymptotic properties of the proposed estimator of the population mean are derived. Under mild assumptions, the estimator is shown to be asymptotically consistent.
The bootstrap approach to statistical inference is described in Efron (1982). The method has wide applicability and has seen considerable development in recent years. However, use of the bootstrap in sample survey inference has been somewhat limited. Rao and Wu (1988) describe an application of the bootstrap under the design-based approach to sample survey inference. Sitter (1992a, 1992b) has extended their results to more complex survey designs. More recently, Booth, Butler and Hall (1991) and Booth and Murison (1992) describe a rather different approach to constructing a design-based bootstrap. In this paper we describe how this approach to the bootstrap can be applied under model-based sample survey inference, focussing on an application where the popular ratio estimator is the estimator of choice.
This article considers a distribution-free estimation procedure for a basic pattern of missing data that often arises from the well-known double sampling in survey methodology. Without parametric modeling of the missing mechanism or the joint distribution, kernel regression estimators are used to estimate mean functionals through empirical estimation of the missing pattern. A generalization of the method of Cheng and Wei is verified under the assumption of missing at random. Asymptotic distributions are derived for estimating the mean of the incomplete data and for estimating the mean treatment difference in a nonrandomized observational study. The nonparametric method is compared with a naive pairwise deletion method and a linear regression method via the asymptotic relative efficiencies and a simulation study. The comparison shows that the proposed nonparametric estimators attain reliable performances in general.
A main shortcoming of the conventional method of constructing a confidence interval for a finite population parameter, e.g. the mean or total, is that it assumes that the sample size is large enough for the central limit theorem to apply to the estimation error. This is not always the case in practice. To deal with the problem, Chambers and Dorfman (1994) suggested an alternative method based on the bootstrap methodology. Their method is meant for model-based surveys. It starts by assuming a simple linear regression model as a working model in which the ratio estimator is optimal for estimating the population total. To achieve robustness in their results, a series of modifications is carried out on the ratio estimator. This makes their method cumbersome to apply. In this paper we suggest an alternative bootstrap approach that is simpler to implement.
The historical development of the twin problems of survey design and of finite population inference is discussed by reference to key papers. No attempt is made to be totally comprehensive, rather broad trends are identified and evaluated. The final conclusion is that finite population inference should pay more attention to population structure and less to survey design.
Methods for standard errors and confidence intervals for nonlinear statistics, such as ratios, regression, and correlation coefficients, have been extensively studied for stratified multistage designs in which the clusters are sampled with replacement, in particular, the important special case of two sampled clusters per stratum. These methods include the customary linearization (or Taylor) method and resampling methods based on the jackknife and balanced repeated replication (BRR). Unlike the jackknife or the BRR, the linearization method is applicable to general sampling designs, but it involves a separate variance formula for each nonlinear statistic, thereby requiring additional programming efforts. Both the jackknife and the BRR use a single variance formula for all nonlinear statistics, but they are more computing-intensive. The resampling methods developed here retain these features of the jackknife and the BRR, yet permit extension to more complex designs involving sampling without replacement. The sampling designs studied include (a) stratified cluster sampling in which the clusters are sampled with replacement, (b) stratified simple random sampling without replacement, (c) unequal probability sampling without replacement, and (d) two-stage cluster sampling with equal probabilities and without replacement. Our proposed resampling methods may be viewed as extensions to complex survey samples of the bootstrap, and in the case of design (c), of the BRR as well. We obtain and study the properties of variance estimators of θ̂ and confidence intervals for the parameter of interest θ, based on the bootstrap histogram of the t statistic. The variance estimators reduce to the standard ones in the special case of linear statistics. These confidence intervals take account of the skewness in the distribution of θ̂, unlike the intervals based on the normal approximation.
For case (a), the sampled clusters are resampled in each stratum independently by simple random sampling with replacement, and this procedure is replicated many times. The estimate for each resampled cluster is properly scaled such that the resulting variance estimator of θ̂ reduces to the standard unbiased variance estimator in the linear case. For case (b), the sampled units are resampled in each stratum as in case (a), but a different scaling is used. Different resampling procedures and scalings are employed for case (c). A two-stage resampling procedure is developed for case (d). Results of a simulation study under a stratified simple random sampling design show that the bootstrap intervals track the nominal error rate in each tail better than the intervals based on the normal approximation, but the bootstrap variance estimators are less stable than those based on the linearization or the jackknife.
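The rescaling idea described for case (a) can be illustrated with a minimal Python sketch. It is not taken from the abstract above: it assumes the linear statistic is a stratified mean with known stratum weights, a resample size of m_h = n_h - 1 clusters per stratum, and the commonly quoted scale factor sqrt(m_h / (n_h - 1)). The function names `rescaled_bootstrap_variance` and `linear_variance`, and all the data values, are hypothetical.

```python
import numpy as np


def rescaled_bootstrap_variance(strata, weights, n_boot=2000, seed=0):
    """Bootstrap variance of a stratified mean, clusters resampled with replacement.

    strata  : list of 1-D arrays of cluster-level values (at least 2 per stratum)
    weights : list of stratum weights W_h (assumed known)
    """
    rng = np.random.default_rng(seed)
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        theta = 0.0
        for y, w in zip(strata, weights):
            y = np.asarray(y, dtype=float)
            n_h = y.size
            m_h = n_h - 1  # assumed resample size; the scale factor is then 1
            draw = rng.choice(y, size=m_h, replace=True)
            # Rescale the resampled values around the stratum mean so that the
            # bootstrap variance matches s_h^2 / n_h for a linear statistic.
            scale = np.sqrt(m_h / (n_h - 1))
            rescaled = y.mean() + scale * (draw - y.mean())
            theta += w * rescaled.mean()
        replicates[b] = theta
    # Monte Carlo approximation of the bootstrap variance.
    return replicates.var()


def linear_variance(strata, weights):
    """Standard unbiased variance estimator: sum of W_h^2 * s_h^2 / n_h."""
    return sum(w ** 2 * np.var(y, ddof=1) / len(y) for y, w in zip(strata, weights))


# Hypothetical data: two strata with four and three sampled clusters.
strata = [np.array([10.0, 12.0, 9.0, 14.0]), np.array([20.0, 25.0, 22.0])]
weights = [0.6, 0.4]
v_boot = rescaled_bootstrap_variance(strata, weights, n_boot=20000, seed=1)
v_lin = linear_variance(strata, weights)
```

For a linear statistic such as this one, `v_boot` should agree with `v_lin` up to Monte Carlo error, which is the "reduces to the standard unbiased variance estimator" property stated above. The bootstrap becomes useful when the statistic is nonlinear and no simple closed-form variance is available.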
A nonparametric regression estimator for the finite population total in two-stage sampling with complete stage-one auxiliary information is developed. The estimator, based on local polynomial regression, is a linear combination of cluster total estimators, with weights that are calibrated to known control totals. The estimator is asymptotically design-unbiased and design consistent under mild assumptions, and its variance can be consistently estimated. Simulation results indicate that the nonparametric estimator dominates several parametric estimators when the model regression function is incorrectly specified, while being nearly as efficient when the parametric specification is correct. The methodology is illustrated using data from a study of land use and erosion.
We consider the problem of estimating the population total in two-stage cluster sampling when cluster sizes are known only for the sampled clusters, making use of a population model arising from a variance component model. The problem can be considered as one of predicting the unobserved part Z of the total, and the concept of predictive likelihood is studied. Prediction intervals and a predictor for the population total are derived for the normal case, based on predictive likelihood. For a more general distribution-free model, by application of an analysis of variance approach instead of maximum likelihood for parameter estimation, the predictor obtained from the predictive likelihood is shown to be approximately uniformly optimal for large sample size and large number of clusters, in the sense of uniformly minimizing the mean-squared error in a partially linear class of model-unbiased predictors. Three prediction intervals for Z based on three similar predictive likelihoods are studied. For a small number n₀ of sampled clusters, they differ significantly, but for large n₀, the three intervals are practically identical. Model-based and design-based coverage properties of the prediction intervals are studied based on a comprehensive simulation study. The simulation study indicates that for large sample sizes, the coverage measures achieve approximately the nominal level 1 - α and are slightly less than 1 - α for moderately large sample sizes. For small sample sizes, the coverage measures are about 1 - 2α, being raised to 1 - α for a modified interval based on the distribution. Copyright 2008, Oxford University Press.