Available via license: CC BY 4.0
Pak.j.stat.oper.res. Vol.VI No.2 2010 pp. 101-115
Generalised Model Based Confidence Intervals
in Two Stage Cluster Sampling
Christopher Ouma Onyango
Center for Mathematics
Strathmore University
Nairobi, Kenya
Chrisouma2004@yahoo.co.uk
Romanus Odhiambo Otieno
Department of Statistics
Jomo Kenyatta University
Nairobi, Kenya
romanusemod@yahoo.com
George Otieno Orwa
Department of Statistics
Jomo Kenyatta University
Nairobi, Kenya
orwagoti@yahoo.com
Abstract
Chambers and Dorfman (2002) constructed bootstrap confidence intervals in model based
estimation for finite population totals assuming that auxiliary values are available throughout a
target population and that the auxiliary values are independent. They also assumed that the
cluster sizes are known throughout the target population. We now extend to two stage sampling
in which the cluster sizes are known only for the sampled clusters, and we therefore predict the
unobserved part of the population total. Jan and Elinor (2008) have done similar work, but unlike
them, we use a general model, in which the auxiliary values $X_i$ are not necessarily independent.
We demonstrate that the asymptotic properties of our proposed estimator and its coverage rates
are better than those constructed under the model assisted local polynomial regression model.
Keywords and Phrases: Model Based Surveys, Bootstrapping, Two Stage
Sampling
AMS 2000 subject classifications. Primary 60K35; Secondary 60K35.
1. Introduction
1.1 Background
In specifying a sampling strategy in survey sampling, there exist different
approaches: the design based approach, the model assisted approach, the
model-based approach and the randomization-assisted model-based approach. For
a detailed review of these approaches, see Smith (1976), Smith (1994). Our
concern is the model based approach. Ouma and Wafula (2005) reviewed the
work of Chambers and Dorfman (2002) and modified the conditions. However,
they limited their work to simple random sampling. Suppose that P is a finite
population of $N$ identifiable units, $Y$ denotes a survey variable having population
values $Y_i$ $(i = 1, 2, 3, \ldots, N)$ and $X$ denotes an auxiliary variable with
corresponding population values $X_i$ $(i = 1, 2, 3, \ldots, N)$. If the values
$X_i$ $(i = 1, 2, 3, \ldots, N)$ are all known but the characteristic values
$Y_i$ $(i = 1, 2, 3, \ldots, N)$ are known only for a sample, say $s$, of $n \le N$
of the population elements, one way of characterizing the sample selection of the
survey variable is to assume that for every unit on the sampling frame, a new
variable, say $S_i$, takes a value equal to the number of times that particular
population unit's value is observed. The distribution of these values defines the
design of the sample survey.
Once the sample has been chosen, the values $Y_i$ $(i \in s)$ are known. Now, let the
distribution of $S_i$ depend on the known population values of $X$ and suppose that
one wishes to use the sample values together with the known values of $X$ to
make an inference about the unknown but finite population total
$T = \sum_{i=1}^{N} Y_i$ of $Y$. A major concern in the model based approach to
statistical survey inference has been finding robust estimators for the population
parameters under model misspecifications.
1.2 Outline of the paper
This paper is organised as follows. In Subsections 1.3, 1.4, 1.5 and 1.6, we give a
brief overview of model based estimation, local polynomial estimation,
confidence intervals, and two stage cluster sampling respectively, in each case
pointing out some gaps that our proposed estimator attempts to fill. In Section 2,
we propose an estimator for the finite population total and suggest a bootstrap
confidence interval for it in Section 3. In Section 4, we derive the properties of our
proposed estimator. We conclude this paper in Section 5 with a simulation
experiment and some discussions.
1.3 Review of model based estimation
The model based approach to statistical survey sampling has been developed to
detailed extents. In particular, we build up on the work of Dorfman (1992) who
proposed a nonparametric regression estimator for the population total under a
model based approach. He illustrated that the developed estimator of the
population total performs better when compared to the corresponding design
based estimators and linear regression estimators. The model due to Dorfman
(1992) relies on the assumption that the regression line passes through the origin
and that the auxiliary values $X_i$ are independent. Suppose one or all of these
assumptions are incorrect. Will the prediction intervals still retain their
nominal properties? And will the estimator of the population total still be design
unbiased?
1.4 Review of the local polynomial estimation
Dorfman (1992) considered a nonparametric regression model for estimating
population totals in finite populations. He proposed a nonparametric regression
based estimator for the population total. To develop the estimator, he assumed
that the population values were generated by a model defined as
$$Y_i = m(x_i) + e_i \qquad (1.1)$$
where $i = 1, 2, 3, \ldots, N$, $m(\cdot)$ is a smooth function and $e_i$ is an
independent random variable with mean zero and constant variance. The nonparametric population
total estimator due to Dorfman (1992) is defined as
$$\hat T_D = \sum_{i \in s} Y_i + \sum_{i \notin s} \hat m(x_i) \qquad (1.2)$$
where $\hat m(x) = \sum_{i \in s} w_i(x)\, y_i$ and
$w_i(x) = k_b(x - x_i) / \sum_{j \in s} k_b(x - x_j)$ is the weight associated
with the $i$th unit of the sample. Further, $k(u)$ is a symmetric density function,
$b$ a scaling factor and $k_b(u) = b^{-1} k(u/b)$. In his empirical study, Dorfman (1992)
illustrated that the estimator $\hat T_D$ performs better when compared to the
corresponding design based and linear regression estimators. These results
were also confirmed by Cheng (1994) who applied nonparametric regression in
estimating population parameters under conditions of missing data. Breidt and
Opsomer (2000), also assumed model 1.1 and developed a new class of model
assisted non parametric regression estimators for the population total, based on
local polynomial smoothing, a kernel method. Their estimator is defined as
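$$\hat T_{OB} = \sum_{i=1}^{N} \hat m_i + \sum_{i \in s} \frac{y_i - \hat m_i}{\pi_i} \qquad (1.3)$$
where $i = 1, 2, 3, \ldots, N$, $\pi_i = \Pr(i \in s)$, $\hat m_i = w_{si}' y_s$ and
$w_{si} = \mathrm{diag}\left\{ \frac{1}{\pi_j h}\, k\!\left(\frac{x_i - x_j}{h}\right),\ j \in s \right\}$,
with $h$ denoting the bandwidth. In their simulation study, $\hat T_{OB}$ performs
better than the Horvitz-Thompson estimator defined as
$$\hat T_{HT} = \sum_{i \in s} \frac{Y_i}{\pi_i} \qquad (1.4)$$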
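The estimators in equations 1.2 and 1.4 can be sketched in a few lines of code. The following is a minimal illustration only, not the authors' implementation: the Gaussian kernel, the function names and the toy inputs are assumptions introduced here, and the kernel smoother is the usual kernel-weighted average that the weights $w_i(x)$ describe.

```python
import math

def gaussian_kernel(u):
    """A symmetric density k(u); the Gaussian choice is an assumption for illustration."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nw_predict(x, xs, ys, b, kernel=gaussian_kernel):
    """m_hat(x) = sum_i w_i(x) y_i with w_i(x) = k_b(x - x_i) / sum_j k_b(x - x_j)."""
    # The 1/b factor of k_b cancels between numerator and denominator.
    k = [kernel((x - xi) / b) for xi in xs]
    total = sum(k)
    if total == 0.0:
        raise ValueError("no sample point receives positive kernel weight at x")
    return sum(ki * yi for ki, yi in zip(k, ys)) / total

def dorfman_total(x_sample, y_sample, x_nonsample, b):
    """Observed sample sum plus kernel predictions for the non-sampled units (eq. 1.2)."""
    predicted = sum(nw_predict(x, x_sample, y_sample, b) for x in x_nonsample)
    return sum(y_sample) + predicted

def horvitz_thompson_total(y_sample, pi_sample):
    """Design based estimator: sum of y_i / pi_i over the sample (eq. 1.4)."""
    return sum(y / p for y, p in zip(y_sample, pi_sample))
```

With a constant response the kernel prediction returns that constant, so `dorfman_total` reduces to the constant times the population size, which is a convenient sanity check.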
However, the theory developed in Breidt and Opsomer (2000) for the local
polynomial regression estimator applies only to direct element sampling designs
with auxiliary information available for all elements of the population.
Consequently, we offer more insight on the consistency of the coverage rates
using a general super population model in two stage sampling. JiYeon et al.
(2009) recently extended the work of Breidt and Opsomer (2000) to two stage
cluster sampling where the estimators are linear combinations of estimators of
cluster totals with weights that are calibrated to known control totals. They
indicated that the local polynomial regression estimators are constructed by
modeling the $M$ points $(x_i, t_i)$ as a realization from an infinite super population
model $\zeta$ in which
$$t_i = \mu(x_i) + e_i \qquad (1.5)$$
where $e_i \sim N(0, \mathrm{var}(x))$, $\mu(x)$ is a smooth function of $x$ and
$\mathrm{var}(x)$ is also smooth and strictly positive. Their estimator is defined as
$$\hat t_y = \sum_{i=1}^{M} \hat\mu_i + \sum_{i \in s} \frac{\hat t_i - \hat\mu_i}{\pi_i} \qquad (1.6)$$
where
$$\hat\mu_i = e_i' \left( X_{si}' W_{si} X_{si} \right)^{-1} X_{si}' W_{si}\, \hat t_s \qquad (1.7)$$
In equation 1.7,
$W_{si} = \mathrm{diag}\left\{ \frac{1}{\pi_j h}\, k\!\left(\frac{x_i - x_j}{h}\right) \right\}_{j \in s}$,
$X_{si} = \left[ 1,\ x_j - x_i,\ \ldots,\ (x_j - x_i)^q \right]_{j \in s}$
and $e_i$ represents the first column of the identity matrix. In their simulation
results they concluded that the estimator 1.6 is more efficient than the
Horvitz-Thompson and the linear regression estimators when the mean function of the
super population model is non linear while being nearly as efficient when the
model is linear.
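The local polynomial fit in equation 1.7 can be illustrated for the simplest non-trivial case, degree $q = 1$. The sketch below is an assumption-laden illustration rather than the cited authors' code: the function names are hypothetical, the Epanechnikov kernel is one possible choice of $k$, and the weights follow the $1/(\pi_j h)$ kernel weighting described above. For $q = 1$ the intercept of the kernel-weighted least squares line through the points $(x_j - x_0, t_j)$ has a closed form, which is what the function returns.

```python
def epanechnikov(u):
    """Epanechnikov kernel, used here only as an example of a symmetric density."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def local_linear_fit(x0, xs, ts, pis, h, kernel=epanechnikov):
    """Local linear (q = 1) estimate of mu(x0) by kernel-weighted least squares."""
    s0 = s1 = s2 = r0 = r1 = 0.0
    for xj, tj, pj in zip(xs, ts, pis):
        w = kernel((xj - x0) / h) / (pj * h)   # weight 1/(pi_j h) * k((x_j - x0)/h)
        d = xj - x0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        r0 += w * tj
        r1 += w * d * tj
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few distinct points inside the bandwidth window")
    # Intercept of the weighted regression of t on (1, x_j - x0).
    return (s2 * r0 - s1 * r1) / det
```

Because a local linear fit reproduces straight lines exactly, feeding it data generated from a linear mean function is a quick way to check an implementation.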
Recently, Jan and Elinor (2008) considered the problem of estimating the
population total in two-stage cluster sampling when cluster sizes are known only
for the sampled clusters, making use of a population model arising from a
variance component model. They considered the application of the predictive
likelihood technique in estimation of the unknown part of the population total
$$T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij} \qquad (1.8)$$
where $N$ is the number of primary sampling units or clusters and each cluster
consists of $m_i$ units which are only known for the sampled clusters, $y_{ij}$ is the
value of the variable of interest for unit $j$ of the $i$th cluster. They assumed the
population model defined by the equations
$$E(M_i) = \beta x_i, \quad \mathrm{var}(M_i) = \sigma^2\, \mathrm{var}(x_i), \quad \mathrm{cov}(M_i, M_j) = 0 \qquad (1.9)$$
$$E(Y_{ij}) = \mu, \quad \mathrm{var}(Y_{ij}) = \tau^2, \quad \mathrm{cov}(Y_{ij}, Y_{ik}) = \rho\tau^2 \qquad (1.10)$$
in cases where $j \ne k$ and $\rho \ge 0$. To predict the unobserved value of $Z$ in the
estimate of the population total $T$ given by
$$\hat T = \sum_{i=1}^{N} \sum_{j=1}^{M_i} Y_{ij} + \hat Z \qquad (1.11)$$
they developed a partial likelihood for $Z$, $L(z, y)$, from the generalized joint
likelihood for the unknown quantities $z$ and $\theta$ given by
$$L(z, y) = f_\theta(z, y) \qquad (1.12)$$
Alongside the model, they applied the design based Horvitz-Thompson estimator of
the population total,
$$\hat T_{HT} = \frac{x}{n_0} \sum_{i \in s} \frac{m_i y_i}{x_i} \qquad (1.13)$$
where $n_0$ represents the number of primary sampling units selected in the first
stage. In their simulation, they considered three coverage measures of $Z$: the
model based coverage over the joint distribution of $Y$ and $Z$, the design based
coverage over the sampling design, and the unconditional coverage regarding the
total sample as a stochastic variable. They concluded that for a small number $n_0$
of sampled clusters, the three intervals differ significantly, but for large $n_0$, the
three intervals are practically identical.
Further, a comprehensive simulation study of the model based and the design
coverage properties of the prediction intervals indicates that for large sample
sizes, the coverage measures achieve approximately the nominal level $1 - \alpha$ and
are slightly less than $1 - \alpha$ for moderately large samples, and for small sample
sizes, the coverage measures are about $1 - 2\alpha$, being raised to $1 - \alpha$ for a
modified interval based on the $t_{n_0 - 2}$ distribution. We note that the models 1.9 and
1.10 assume that the regression line passes through the origin and that the
auxiliary values $X_i$ are considered independent. The questions raised in
subsection 1.3 therefore remain unanswered.
1.5 Review of confidence intervals in survey sampling
Confidence intervals are usually constructed around point estimators in order to
provide a properly scaled measure of uncertainty associated with the estimator.
The conventional method is based on the assumption that the sample size is
large enough for the Central Limit Theorem to hold. This is however not always
true in practice.
As a consequence, Do and Kokic (2001) and Chambers and Dorfman (2002) applied
the bootstrap method to develop model based confidence intervals to address
situations where the sample sizes are not large. They also proposed
modifications of the procedure to account for misspecifications in a working
model. They further noted that using successive model refinements and estimators
obtained by the bootstrap approach is more efficient than using their competing
estimators. However, the evidence of the extended simulation study on the beef
population showed that the procedure did not fully attain its goal. They therefore
recommended the construction of sounder confidence intervals using the bootstrap
approach.
Ouma and Wafula (2005) suggested the use of a general super population model
$$Y_i = \hat m(x_i) + \hat e_i \qquad (1.14)$$
where $i = 1, 2, 3, \ldots, N$, $m(\cdot)$ is a smooth function and $e_i$ is an
independent random variable with mean zero and constant variance. They used a
bandwidth of 1.5 and simple random sampling with replacement to generate the values
of the survey variable $Y$. In their empirical study, they established that their coverage rates
were higher compared to that of Chambers and Dorfman (2002). We now extend
this to two stage cluster sampling.
1.6 Review of two stage cluster sampling
Let $U$ be a finite population of $N$ primary sampling units (psus) or clusters
labeled $1, 2, \ldots, N$, $U = \{1, 2, \ldots, N\}$, where $N$ is a known number, and let
$M_i$, $i = 1, 2, \ldots, N$, be the number of secondary sampling units (ssus) in the
$i$th psu. Let $y_{ij}$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M_i$, be the value of the
response variable $Y$ for the ssu $j$ belonging to the psu $i$. In the previous works,
an assumption has been made that the element specific auxiliary data $x_{ij}$,
$i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M_i$, are known for all clusters and
population elements, respectively.
For our case, we assume that the cluster sizes are known only for the sampled
clusters and therefore the survey values $y_{ij}$, $i = 1, 2, \ldots, N$,
$j = 1, 2, \ldots, M_i$, are generated using the model
$$\hat Y_{ij} = \hat m(x_{ij}) + \hat e_{ij} \qquad (1.15)$$
with $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M_i$.
2. Proposed Estimator for population total.
Jan and Elinor (2008) used the model 1.15 to define the population total as
$$T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij} \qquad (2.1)$$
where $N$ is the number of primary sampling units or clusters and each cluster
consists of $m_i$ units which are only known for the sampled clusters, $y_{ij}$ is the
value of the variable of interest for unit $j$ of the $i$th cluster. Referring to the same
model 1.15 we may write that
$$\hat T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij} + \sum_{i=1}^{N} \sum_{j=m_i+1}^{M_i} \hat y_{ij} = \sum_{i \in s} \sum_{j \in s_i} Y_{ij} + \hat Z \qquad (2.2)$$
and it follows that the problem is now reduced to that of predicting the
unobserved values $z$ of the random variable $Z$. To do this, we apply the general
model 1.15 to predict the values of the unobserved survey variables
$y_{ij}$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M_i$. Therefore the estimate of the
population total is given by
$$\hat T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} y_{ij} + \sum_{i=1}^{N} \sum_{j=m_i+1}^{M_i} \left( \hat m(x_{ij}) + \hat e_{ij} \right) \qquad (2.3)$$
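The structure of equation 2.3, an observed part plus a model based prediction for every unobserved unit, can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the dictionary layout (`y_obs`, `x_unobs`), the function name and the zero default residual are all hypothetical choices made for the example, and `m_hat` stands for whatever smoother has been fitted to the sample.

```python
def predicted_total(clusters, m_hat, residual=lambda: 0.0):
    """Observed sum plus m_hat(x) + e_hat for each unobserved unit, in the spirit of eq. 2.3.

    clusters : list of dicts with keys
        'y_obs'   -> observed survey values in the cluster (empty if not sampled)
        'x_unobs' -> auxiliary values of the cluster's unobserved units
    m_hat    : callable returning the fitted mean at x
    residual : callable returning one residual draw (defaults to 0 for a point prediction)
    """
    total = 0.0
    for cluster in clusters:
        total += sum(cluster["y_obs"])                                    # observed part
        total += sum(m_hat(x) + residual() for x in cluster["x_unobs"])   # predicted part
    return total
```

Passing a residual generator that resamples fitted residuals turns the same function into one step of a bootstrap replicate.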
3. Proposed Bootstrap Confidence Interval
Under the model based approach, the sampling distribution of the estimator
corresponds to the distribution of possible alternative point estimates that could
arise given the selection of the same sample $S$ from populations similar to the
actual underlying population of the observed data. To construct a confidence
interval for $T$ that reflects the actual finite sample and finite population
characteristics of the distribution of $\hat T$, we estimate such a distribution from
the sample data. For our case, we make use of the sample data and the working
model 1.15 to generate a sequence of alternative realizations of $Y$ using
nonparametric estimates of $m(x_{ij})$. Let $Y^*_{ij}$ be an estimator of the values
of $Y$, where
$$Y^*_{ij} = \hat m(x_{ij}) + \hat e_{ij} \qquad (3.1)$$
In equation 3.1, $\hat e_{ij}$ is selected via two stage cluster sampling with replacement
from $\{ \hat e_{ij} : i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, m_i \}$.
Having obtained the bootstrap population, the bootstrap version $\hat T^*_1$ of $\hat T$,
using the same sample as the parent sample, is calculated. The process is then
repeated a large number, $B$, of times to obtain
$\hat T^*_{i1}, \hat T^*_{i2}, \ldots, \hat T^*_{iB}$. Then the
bootstrap confidence interval is obtained using
$$\left( Q^*\!\left(\frac{\alpha}{2}\right),\ Q^*\!\left(1 - \frac{\alpha}{2}\right) \right)$$
where $Q^*(p)$ is the $p$th quantile of the bootstrap distribution.
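The percentile interval described above can be sketched as follows. This is a simplified illustration under explicit assumptions: residuals are resampled by simple random sampling with replacement from one pooled list (a stand-in for the two stage resampling of equation 3.1), the quantile uses linear interpolation between order statistics, and all function and argument names are hypothetical.

```python
import random

def quantile(values, p):
    """p-th quantile by linear interpolation between order statistics."""
    v = sorted(values)
    pos = p * (len(v) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (pos - lo) * (v[hi] - v[lo])

def bootstrap_interval(y_observed, fitted_unobserved, residuals,
                       B=1000, alpha=0.05, seed=0):
    """Percentile interval (Q*(alpha/2), Q*(1 - alpha/2)) from B bootstrap totals."""
    rng = random.Random(seed)
    observed_sum = sum(y_observed)
    totals = []
    for _ in range(B):
        # Y*_ij = m_hat(x_ij) + resampled residual, summed over the unobserved units.
        predicted = sum(f + rng.choice(residuals) for f in fitted_unobserved)
        totals.append(observed_sum + predicted)
    return quantile(totals, alpha / 2.0), quantile(totals, 1.0 - alpha / 2.0)
```

When all residuals are zero the interval collapses to the point estimate, which gives a simple check that the bookkeeping is right.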
4. Properties of the proposed estimator and resulting confidence interval
4.1 Unbiasedness of the model
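Considering the model 1.15 we may write that
$$\hat e_{ij} = Y_{ij} - \hat m(x_{ij}) \qquad (4.1)$$
and
$$W^*(x_{ij}) = W_{ij} \left[ 1 - \left( \frac{m}{n-1} \right)^{1/2} + \left( \frac{m}{n-1} \right)^{1/2} \left( \frac{n}{m} \right)^{1/2} r_i \right] \qquad (4.2)$$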
where $m$ is the bootstrap sample size, $r_i$ is the number of times the $i$th primary
sampling unit is selected, $x_{ij}$ is the $j$th observation made from the $i$th cluster,
and $W(x_{ij})$ is the initial sampling weight of the secondary sampling unit, equal to
the inverse of its selection probability, that is,
$$W(x_{ij}) = \frac{1}{\pi_{ij}} \qquad (4.3)$$
with $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, n_i$.
However, there is considerable benefit and little loss in choosing $m = n - 1$
(Rao and Wu, 1998).
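Therefore,
$$W^*(x_{ij}) = \frac{1}{\pi_{ij}} \left( \frac{n}{n-1} \right)^{1/2} r_i \qquad (4.4)$$
$i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m_i$, and
$$E(\hat e_{ij}) = E\left( Y_{ij} - \hat m(x_{ij}) \right) \qquad (4.5)$$
So
$$\hat m(x_{ij}) = \sum_{i,j \in s} W^*(x_{ij})\, Y_{ij} \qquad (4.6)$$
($i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n_i$) yielding
$$E\left( \hat m(x_{ij}) \right) = E\left[ \sum_{i \ne j} \left( \frac{n}{n-1} \right)^{1/2} r_i\, \frac{Y_{ij}}{\pi_{ij}} \right] \qquad (4.7)$$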
Now, let the initial sampling weights of the secondary sampling units,
$W(x_{ij}) = 1/\pi_{ij}$, be the kernel based weights. Then we have
$$W(x_{ij}) = \frac{K_b(x_{ij} - x_{ik})}{\sum K_b(x_{ij} - x_{ik})} \qquad (4.8)$$
with $\sum_{ij \in s} w(x_{ij}) = 1$. Further, $b$ is a scaling factor,
$K_b(u) = b^{-1} K(u/b)$ and $k(u)$ is a symmetric density function which is such that,
$\forall u \in \Re$ with the symbols bearing their usual meanings,
(a) $\int_{-\infty}^{\infty} k(u)\, du = 1$,
(b) $\int_{-\infty}^{\infty} k^2(u)\, du < \infty$,
(c) $\int_{-\infty}^{\infty} u^3 k^2(u)\, du < \infty$ and
(d) $k(u) = k(-u)$.
Therefore,
$$E(\hat e_{ij}) = m(x_{ij}) - E[\hat m(x_{ij})] \qquad (4.9)$$
But as $b \to 0$ and $nb \to \infty$, $\hat m(x_{ij}) \to m(x_{ij})$, meaning that
$E(\hat e_{ij}) = 0$, which is the mean of $\hat e_{ij}$ in model 1.15, completing the
proof that the proposed model is unbiased.
4.2 Asymptotic variance of the error term
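From subsection 4.1, it follows that
$$\mathrm{Var}(\hat e_{ij}) = E(\hat e_{ij}^2) \qquad (4.10)$$
Therefore
$$\mathrm{Var}(\hat e_{ij}) = E(\hat e_{ij}^2) = E\left[ \left( Y_{ij} - \hat m(x_{ij}) \right)^2 \right] = E(Y_{ij}^2) - 2E\left( Y_{ij}\, \hat m(x_{ij}) \right) + E\left( \hat m(x_{ij})^2 \right) \qquad (4.11)$$
which leads to
$$E(\hat e_{ij}^2) = \sigma^2(x_{ij}) + \mathrm{Var}(\hat m(x_{ij})) \qquad (4.12)$$
But
$$\mathrm{Var}(\hat m(x_{ij})) = \mathrm{Var}\left[ \left( \frac{n}{n-1} \right)^{1/2} (n-1)^{-1} b^{-1} \sum_{i \in s} k\!\left( \frac{x_{ij} - x_{ik}}{b} \right) Y_{ij}\, \hat d_s(x_{ij})^{-1} \right] \qquad (4.13)$$
where $\hat d_s(x_{ij})^{-1} = 1 / \sum k_b(x_{ij} - x_{ik})$, and hence
$$\mathrm{Var}(\hat m(x_{ij})) = \frac{n}{n-1} (n-1)^{-2} b^{-2} \sum_{i \in s} k^2\!\left( \frac{x_{ij} - x_{ik}}{b} \right) \sigma^2(x_{ij})\, \hat d_s(x_{ij})^{-2} \qquad (4.14)$$
But
$$\hat d_s(x_{ij})^{-2} = d_s(x_{ij})^{-2} \left[ 1 - b^2 k_2\, \frac{d_s''(x_{ij})}{d_s(x_{ij})} + O\!\left( b^3 + (n-1)^{-1/2} b^{-1/2} \right) \right] \qquad (4.15)$$
and
$$k\!\left( \frac{x_{ij} - x_{ik}}{b} \right) \sigma(x_{ij}) = b\, \sigma(x_{ij})\, d_s(x_{ij}) + O\!\left( b^3 + b^{1/2} \right) \qquad (4.16)$$
So using equations 4.15 and 4.16 in equation 4.14 we have that
$$\mathrm{var}(\hat m(x_{ij})) = \frac{n}{n-1} (n-1)^{-2} b^{-2} \sum_{i \in s,\ i \ne k} k^2\!\left( \frac{x_{ij} - x_{ik}}{b} \right) \sigma^2(x_{ij})\, \hat d_s(x_{ij})^{-2} \qquad (4.17)$$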
Next we obtain the asymptotic expansion of $\mathrm{var}(\hat m(x_{ij}))$ using the
following theorem as a basis.
Theorem
Let $k(u)$ be a symmetric density function with $\int u\, k(u)\, du = 0$ and
$k_2 = \int u^2 k(u)\, du > 0$. Assume that $n$ and $N$ increase together such
that $n/N \to \pi$ with $0 < \pi < 1$. If further the sampled and non sampled values of
$x$ are in the interval $[c, d]$ and are generated by densities $d_s$ and $d_{p-s}$
respectively, both bounded away from zero on $[c, d]$ and with continuous second
derivatives, and if for any expression of $Z$, it can be shown explicitly that
$E[Z/U] = A(U) + O(B)$ and $\mathrm{Var}[Z/U] = O(C)$, then
$Z = A(U) + O_p\!\left( B + C^{1/2} \right)$.
Using this theorem, we may write equation 4.17 as
$$\mathrm{var}(\hat m(x_{ij})) = \frac{n}{n-1} (n-1)^{-1} \sigma^2(x_{ij}) + O\!\left( (n-1)^{-1/2} b^{-1/2} \right) \qquad (4.18)$$
which reduces to
$$\mathrm{var}(\hat m(x_{ij})) = \frac{n}{(n-1)^2}\, \sigma^2(x_{ij}) + O\!\left( (n-1)^{-1/2} b^{-1/2} \right) \qquad (4.19)$$
and noting that as the number, $m_i = n$, of the second stage samples tends to be
large, $n - 1 \cong n$ so that we have
$$\mathrm{var}(\hat m(x_{ij})) = \frac{\sigma^2(x_{ij})}{n-1} + O\!\left( (n-1)^{-1/2} b^{-1/2} \right) \qquad (4.20)$$
again as $nb \to \infty$,
$$\mathrm{var}(\hat m(x_{ij})) = \frac{\sigma^2(x_{ij})}{n-1} \qquad (4.21)$$
hence
$$\mathrm{var}(\hat e_{ij}) = \sigma^2(x_{ij}) + \frac{\sigma^2(x_{ij})}{n-1} \qquad (4.22)$$
which, as $n \to \infty$, reduces to
$$\mathrm{var}(\hat e_{ij}) = \sigma^2(x_{ij}) \qquad (4.23)$$
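As a worked illustration of how quickly equation 4.22 approaches equation 4.23, the finite sample inflation factor can be evaluated for a specific $n$. The value $n = 20$ is taken from the number of clusters sampled in the simulation of Section 5; treating it as the relevant $n$ here is an assumption made only for this example.

```latex
\mathrm{var}(\hat e_{ij})
  = \sigma^2(x_{ij}) + \frac{\sigma^2(x_{ij})}{n-1}
  = \frac{n}{n-1}\,\sigma^2(x_{ij}),
\qquad
n = 20:\quad \frac{20}{19}\,\sigma^2(x_{ij}) \approx 1.0526\,\sigma^2(x_{ij}).
```

So with 20 sampled clusters the residual variance exceeds its limiting value by a little over 5%, and the excess shrinks at the rate $1/(n-1)$.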
4.3 Conditional relative bias of the estimator for the population total
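From its definition, the conditional relative bias of using $\hat T$ as an estimator of
$T$ is
$$E\left[ \frac{\hat T - T}{T} \right] = \frac{E(\hat T) - E(T)}{T} \qquad (4.24)$$
In this case,
$$\hat T = \sum_{i=1}^{N} \sum_{j=1}^{m_i} Y_{ij} + \sum_{i=1}^{N} \sum_{j=m_i+1}^{M_i} \left( \hat m(x_{ij}) + \hat e_{ij} \right) \qquad (4.25)$$
meaning that
$$\frac{E(\hat T) - E(T)}{T} = \frac{1}{T} \left[ E \sum_i \sum_j \left( \hat m(x_{ij}) + \hat e_{ij} \right) - E \sum_i \sum_j \left( m(x_{ij}) + e_{ij} \right) \right] \qquad (4.26)$$
Using equation 4.9 in equation 4.26, it can be seen that as $n \to \infty$ this bias,
$\left( E(\hat T) - E(T) \right) / T$, asymptotically tends to zero.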
5. Empirical Study
5.1 Description of the simulation experiment
Simulation experiments were performed in order to compare the performance of
the model based regression estimator with that of the model assisted local
polynomial regression estimator in two-stage element sampling due to JiYeon et
al. (2009).
To obtain the model based estimator for the population total, the $X_i$ are generated
as independent and identically distributed uniform (0, 1) random variables. The
population consists of 100 clusters. In stage one, a sample of $n = 20$ clusters,
which forms the primary sampling units, is taken from the total of $N = 100$
clusters using simple random sampling with replacement.
In stage two, from each selected cluster, say $i$ $(i = 1, 2, \ldots, n)$, we select
the sample $m_{ij}$, $j = 1, 2, \ldots, 50$, from $j = 1, 2, \ldots, m_k, \ldots, 1{,}000$,
that is, the $j$th sample from a fixed selected $i$th cluster, using simple random
sampling with replacement from the total of $M_k = 1000$ elements. We consider the
variable of interest $Y_{ij}$, $j = 1, 2, \ldots, m_k, \ldots, M_k$, which is known only
for the sample, and using the known auxiliary variables $x_{ij}$,
$j = 1, 2, \ldots, M_k$, we generate the non sample values using the
model given in 1.14.
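The two stage selection just described can be sketched as below. The sizes (100 clusters of 1000 elements, 20 clusters and 50 elements per cluster drawn with replacement, uniform (0, 1) auxiliary values) come from the text; the function name, the return layout and the seeding are assumptions made for illustration and are not the authors' simulation code.

```python
import random

def draw_two_stage_sample(N=100, M=1000, n=20, m=50, seed=0):
    """Two stage simple random sampling with replacement.

    Returns (x, sample) where x[i][j] is the Uniform(0, 1) auxiliary value of
    element j in cluster i, and sample is a list of (cluster index, element indices).
    """
    rng = random.Random(seed)
    # Auxiliary values for every element of every cluster.
    x = [[rng.random() for _ in range(M)] for _ in range(N)]
    # Stage one: n clusters drawn with replacement from the N clusters.
    psus = [rng.randrange(N) for _ in range(n)]
    # Stage two: m elements drawn with replacement within each selected cluster.
    sample = [(i, [rng.randrange(M) for _ in range(m)]) for i in psus]
    return x, sample
```

Because both stages sample with replacement, a cluster or an element may appear more than once; the count of repeats for a cluster plays the role of $r_i$ in Section 4.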
To simulate bootstrap values of $M_k$ independent samples of size $m_k$ we use
simple random sampling with replacement within cluster $i$ in order to obtain the
bootstrap population values and the model $Y^*_{ij} = \hat m(x_{ij}) + \hat e_{ij}$ to
obtain $y^*_{i1}, y^*_{i2}, y^*_{i3}, \ldots, y^*_{iM_k}$. Further, we let
$K(u) \sim U[0, 1]$ so that the model based regression estimator for the population
total is given by equation 2.3, where $\hat m(x_{ij})$ is defined by equation 1.14 and
$\hat e_{ij} \sim \left( 0, \sigma^2(x_{ij}) \right)$.
This procedure is repeated a large number, 1000, of times such that we have
$\hat T^*_{i1}, \hat T^*_{i2}, \hat T^*_{i3}, \ldots, \hat T^*_{i1000}$. We then construct
the 95% confidence intervals for the population total $\hat T^*_i$,
$i = 1, 2, \ldots, N$. Similarly, we compute the local polynomial
estimator for the population total suggested by JiYeon et al. (2009) given in
equation 1.6. For each mean function and the values of $x_{ij}$, each study
variable $y_{ij}$ $(j = 1, 2)$ for $m_k$ values from $M_k$ elements is generated as
$$y_{jk} = \frac{\mu_j(x_i)}{M_k} + \frac{e_{jk}}{M_k^{1/2}} \qquad (5.1)$$
where $y_{jk}$ is the $k$th observation made from the $j$th cluster, and
$e_{ij} \sim N\!\left( 0, \sigma^2(x_{ij}) \right)$.
Using the above bootstrap procedure, the bootstrap estimate of the population total
$$\hat T^* = \sum_{i \in c} \hat m(x_{ij}) + \sum_{i \in s} \frac{y_i - \hat m_i}{\pi_i} \qquad (5.2)$$
is calculated, where $\pi_i = \Pr\{ i \in s \}$,
$\hat m_i = \mathrm{diag}\left\{ \frac{1}{\pi_{ij} h}\, k\!\left( \frac{x_{ij} - x_{ik}}{h} \right) \right\}$, $i, j \in s$,
and $\hat m(x_{ij})$ is as defined in equation 1.14. Similarly, we construct the
confidence interval for the population total $T$ and compare the performance of the
developed model on estimation of $T$ with that due to JiYeon et al. (2009) on Local
Polynomial regression estimation in two-stage sampling.
In computation of the model assisted local polynomial regression estimators, we
again adopt a method due to JiYeon et al. (2009). This we do as a means to
having a realistic comparative study. We therefore apply the Epanechnikov
kernel $k(u) = \frac{3}{4} (1 - u^2)\, I_{\{ |u| \le 1 \}}$ and different bandwidths for
computation of the Local polynomial regression estimator of population totals. This
helps to compare the
bias, mean squared errors and confidence interval lengths of the estimators
using both the model based and model assisted local polynomial regression
approaches.
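For reference, the Epanechnikov kernel used in the comparison can be written directly from its definition, $k(u) = \frac{3}{4}(1 - u^2)$ for $|u| \le 1$ and zero otherwise. The snippet below is a small sketch; the function names and the midpoint-rule check of the unit integral are assumptions added for illustration.

```python
def epanechnikov(u):
    """k(u) = 3/4 (1 - u^2) on |u| <= 1, zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def integrate_kernel(kernel, lower=-1.0, upper=1.0, steps=2000):
    """Midpoint-rule approximation of the integral of the kernel over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(kernel(lower + (i + 0.5) * width) for i in range(steps)) * width
```

The kernel is symmetric, vanishes outside $[-1, 1]$ and integrates to one, so it satisfies the symmetric density requirement placed on $k(u)$ in Section 4.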
5.2 Simulation Results
Table 1 gives the results of Mean Squared Error of the model based MSEmb and
the Local Linear Polynomial regression estimator of the population total in two
stage cluster sampling.
Table 1: Mean Squared Error (MSE)
Bandwidth    LP           MB
0.005        0.7845631    1.2684580
0.006        0.7961103    1.7980100
0.007        0.7528211    1.4856560
0.008        0.7523094    1.4641440
0.009        0.7909287    1.3896480
0.010        0.7740691    0.1187740
0.020        0.7615400    0.0846216
0.030        0.7653660    0.0752471
0.040        0.7639690    0.0720307
0.050        0.7621990    0.0672295
0.060        0.7740690    1.1703404
0.070        0.7615400    0.7534386
0.080        0.7551610    0.7534386
It can be seen that at the lowest bandwidths (0.005 to 0.009) the MSE for the model
based estimator is higher than that of the Local Polynomial regression estimator.
For bandwidths from 0.010 to 0.050, the MSEmb drops sharply and remains low, before
rising again at 0.060. It is important to note that an increase in the bandwidth
does not significantly change the MSE for the Local Polynomial Regression
Estimators (LPRE). Generally, for bandwidths of 0.010 and above, the model based
estimator is more efficient than the Local Polynomial estimators of the totals.
Table 2 is a summary of bias for the model based estimator of the population total
and the Local Polynomial regression estimator.
Table 2: Summary Results of Bias
Bandwidth    LP            MB
0.005        -0.0443350     0.00399122
0.006        -0.0381211     0.00200561
0.007        -0.0380143    -0.0004676
0.008        -0.0384639    -0.0036316
0.009        -0.0404816    -0.0001817
0.010        -0.0985294    -0.0032099
0.020        -0.0360501    -0.0046855
0.030        -0.0396870    -0.0265196
0.040        -0.0536516    -0.0156076
0.050        -0.0552965    -0.0201331
0.060        -0.0985294    -0.0384731
0.070        -0.0360501    -0.0538332
0.080        -0.0496699    -0.0538332
The bias for the model based estimator is much lower than that of the Local
Polynomial regression estimators. The large bias associated with the Local
Polynomial estimators is reflected in the values of the estimates, which are
much lower than the true simulated population total of 99.5078. This can best be
attributed to the choice of the variance. The precision of estimation can be
improved by choosing a smaller value of the variance. Table 3 now presents a
summary of the estimated population totals for the model based and the Local
Polynomial regression estimators in two stage sampling.
Table 3: Summary Results of Estimated Population Totals
Bandwidth    LP          MB
0.005        60.81535    99.27500
0.006        61.38900    99.01420
0.007        61.49330    99.92000
0.008        60.93400    99.20200
0.009        59.02400    99.32200
0.010        58.64900    97.47400
0.020        60.24100    93.09100
0.030        59.81800    88.08000
0.040        59.34000    83.90000
0.050        59.30700    79.47900
0.060        58.64900    59.49500
0.070        60.24000    59.82600
0.080        59.82400    59.82600
Table 4 now gives the lengths of the 95% confidence intervals for the
model based and Local Polynomial Regression models.
Table 4: Confidence Interval Lengths
Bandwidth    LP          MB
0.01         38.97447    1.531668
0.02         40.74652    1.315887
0.03         39.52278    1.227330
0.04         38.87707    1.212220
0.05         38.72655    1.148093
0.06         38.93319    38.85571
0.07         38.86739    40.20200
0.08         38.54396    38.60400
The confidence intervals generated by the model based method are much tighter
than those generated by the Local polynomial method at lower bandwidths, but at
larger bandwidths both the LPRE and the model based estimators of the population
total perform poorly. We note that the best performing confidence interval is one
whose coverage rate is close to the nominal level and whose length is small.
Consequently, the model based estimators are far better than their local
polynomial regression estimators. The results in general show that the model
based approach outperforms the model assisted method at 95% coverage rate.
The bias under model based approach is also much lower.
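The summary measures reported in Tables 1 to 4 can be computed from a set of replicate estimates along the following lines. This is a sketch under assumptions: the paper does not state its exact formulas, so bias is taken here as the mean estimate minus the true total, MSE as the mean squared deviation from the true total, and coverage as the share of intervals containing the true total; the function name and return layout are hypothetical.

```python
def summarise(estimates, intervals, true_total):
    """Bias, MSE, mean interval length and coverage over simulation replicates.

    estimates : list of replicate estimates of the total
    intervals : list of (lower, upper) confidence limits, one per replicate
    """
    reps = len(estimates)
    mean_estimate = sum(estimates) / reps
    bias = mean_estimate - true_total
    mse = sum((t - true_total) ** 2 for t in estimates) / reps
    mean_length = sum(hi - lo for lo, hi in intervals) / len(intervals)
    coverage = sum(1 for lo, hi in intervals if lo <= true_total <= hi) / len(intervals)
    return {
        "mean_estimate": mean_estimate,
        "bias": bias,
        "mse": mse,
        "mean_length": mean_length,
        "coverage": coverage,
    }
```

Reporting coverage next to interval length makes it possible to tell a short interval that covers the true total from a short interval that misses it.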
References
1. Breidt, F. and Opsomer, J. (2000). Local polynomial regression estimators in
survey sampling. The Annals of Statistics, 28: 1026–1053.
2. Chambers, R. and Dorfman, A. (2002). Robust sample survey inference via
bootstrapping and bias correction: the case of the ratio estimator. Technical
report, Southampton Statistical Sciences Research Institute, University of
Southampton.
3. Cheng, P. (1994). Nonparametric estimation of mean functionals with data
missing at random. Journal of the American Statistical Association, 89:
81–87.
4. Do, K. and Kokic, P. (2001). Bootstrap variance and confidence interval
estimation for model based surveys. Australia National University.
5. Dorfman, R. (1992). Nonparametric regression for estimating totals in finite
population. In Section on Survey Research Methods, Journal of the
American Statistical Association, pages 622–625.
6. Jan, F. and Elinor, Y. (2008). Two stage sampling from a predictive point of
view when the cluster sizes are unknown. Biometrika, 95 (1): 187–204.
7. JiYeon, K., Breidt, F., and Opsomer, J. (2009). Nonparametric regression
estimation of finite population totals under two-stage cluster sampling.
Technical report, Department of Statistics, Colorado State University.
8. Ouma, C. and Wafula, C. (2005). Bootstrap confidence interval for model
based surveys. East African Journal of Statistics, 1: 14–18.
9. Rao, J. and Wu, C. (1998). Resampling inference with complex survey data.
Journal of the American Statistical Association, 83: 231–241.
10. Smith, T. (1976). The foundations of survey sampling: a review. Journal of
the Royal Statistical Society, Series A, 139: 183–198.
11. Smith, T. (1994). Sample surveys 1975-1990; an age of reconciliation?
International Statistical Review, 62: 3–34.