3. (2Se, GREG)
4. (2Sc, GREG)
Two-stage epsem is an intuitive choice when the survey variable can be expressed by a super-population model that assumes constant variance of the y-values across the finite population.
In this chapter, the above four sampling strategies involving two-stage epsem are compared systematically. The comparison is based on their AMSE's under the two-level regression model rather than on their MSE's for a given finite population; a comparison of these sampling strategies with respect to AMSE has not been seen in the literature. The AMSE represents the average performance of a sampling strategy for the finite population parameters under a given super-population model. The aim of this comparison is to single out a preferred sampling strategy that may be useful from a practical viewpoint.
Cluster sampling is usually less efficient than element sampling. Therefore, two-stage epsem based on two-stage cluster sampling (i.e. 2Sc) is expected to be less efficient than two-stage epsem based on two-stage element sampling (i.e. 2Se) when the sub-clusters have positive intra-cluster correlation.
Also, equal-sized sub-clusters are rarely found in practice. This may motivate the custom formation of equal-sized sub-clusters. A simple case is when equal-sized sub-clusters are formed at random within each PSU. When a set of auxiliary variables is available, one may also try to form even better sub-clusters using the known values of the auxiliary variables. The following toy example demonstrates that there exist ways of cluster sampling for which the sampling variance is zero.
Example 2.1. Let the response values be $y_i = 1, \ldots, 16$ and let the sample size be $n = 4$. Consider the following arrangement of the $y_i$'s, with rows as clusters:

\[
\begin{pmatrix}
1 & 8 & 9 & 16 \\
2 & 7 & 10 & 15 \\
3 & 6 & 11 & 14 \\
4 & 5 & 12 & 13
\end{pmatrix}
\]

where the rows of the matrix are the four possible samples. All four have the same sample mean, 8.5 (and total 34), but different sample variances. Since every possible sample reproduces the population mean exactly, the sampling variance of the row-sample mean is 0. This particular cluster sampling is the most efficient design for this population; in particular, it is more efficient than SRS of elements.
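The claim in Example 2.1 can be checked directly. The following is a minimal sketch, assuming that one row is selected with probability 1/4 and that the benchmark is SRS without replacement of $n = 4$ elements, with the usual variance $(1 - n/N) S^2 / n$; the variable names are illustrative only.

```python
from fractions import Fraction

# Rows are the four clusters of Example 2.1.
clusters = [
    [1, 8, 9, 16],
    [2, 7, 10, 15],
    [3, 6, 11, 14],
    [4, 5, 12, 13],
]
population = [y for row in clusters for y in row]
N, n = len(population), 4
pop_mean = Fraction(sum(population), N)

# Design 1: select one row (cluster) with probability 1/4.
cluster_means = [Fraction(sum(row), len(row)) for row in clusters]
var_cluster = sum((m - pop_mean) ** 2 for m in cluster_means) / len(clusters)

# Design 2: SRS without replacement of n elements, variance (1 - n/N) * S^2 / n.
S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)
var_srs = (1 - Fraction(n, N)) * S2 / n

print(var_cluster, var_srs)  # prints: 0 17/4
```

Under these assumptions the cluster design has variance 0, whereas SRS of four elements has variance 17/4 = 4.25 for the sample mean.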
2.1. INTRODUCTION
To obtain cluster sampling as in the above example, the unknown y-values would be required, which is not possible in a sample survey. However, auxiliary variables related to the y-variable can be used for this purpose, as in other model-assisted approaches in survey sampling. In this chapter, one way of forming sub-clusters is explored, which aims to obtain sub-clusters that are equal-sized and have equal means of the auxiliary variables. Assuming that such sub-clusters exist and can be formed, a preliminary analysis indicates that SRS of these sub-clusters is the same as equal-probability balanced sampling (with respect to the auxiliary variables) of elements in terms of AMSE.
Furthermore, these sub-clusters might be more difficult to achieve in practice than balanced sampling. However, SRS of such sub-clusters may have one advantage over balanced sampling: unbiased variance estimation is available, because the design is SRS of sub-clusters.
The rest of the chapter is arranged as follows. In Section 2.2, the formation of sub-clusters using auxiliary variables is briefly described; this turns out to be the same as balanced equal-probability sampling of elements. In Section 2.3, the four sampling strategies involving two-stage epsem are compared with respect to their AMSE's under a two-level super-population model. In Section 2.4, a simulation study provides further insight into the comparison. Finally, Section 2.5 gives some conclusions of this chapter. In the following two subsections, the formulas of the HT-estimator and the GREG-estimator under two-stage sampling designs are given.
2.1.1 HT-estimators under two-stage sampling
Let a random sample $s_I$ of $n_I$ PSU's be selected from the population $U_I$ of $N_I$ PSU's ($U_I$ is defined in Section 1.2.1). Let $\pi_g$ and $\pi_{gh}$ denote the first- and second-order inclusion probabilities of the $g$th PSU and of the pair of PSU's $(g, h)$, respectively, where $g, h \in U_I$ and $g \neq h$. In two-stage element sampling, a random sample $s_g$ of $n_g$ elements is selected from each PSU in the first-stage sample, i.e. for each $g \in s_I$. Let $\pi_{i|g}$ and $\pi_{ij|g}$ denote the first- and second-order inclusion probabilities of the $i$th element and of the pair of elements $(i, j)$ in the $g$th PSU, where $i, j \in U_g$ and $i \neq j$. The HT-estimator of the population total $Y$ and its sampling variance under two-stage element sampling are given by
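\[
\hat{Y}_{HT} = \sum_{g=1}^{n_I} \frac{\hat{Y}_{gHT}}{\pi_g} \tag{2.1}
\]

\[
V(\hat{Y}_{HT}) = \sum_{g,h=1}^{N_I} \left( \frac{\pi_{gh}}{\pi_g \pi_h} - 1 \right) Y_g Y_h + \sum_{g=1}^{N_I} \frac{1}{\pi_g} \sum_{i,j=1}^{N_g} \left( \frac{\pi_{ij|g}}{\pi_{i|g} \pi_{j|g}} - 1 \right) y_{gi} y_{gj} \tag{2.2}
\]

where $\hat{Y}_{gHT} = \sum_{i=1}^{n_g} y_{gi}/\pi_{i|g}$ is the HT-estimator of the $g$th PSU total $Y_g = \sum_{i \in U_g} y_{gi}$, and the double sums use the conventions $\pi_{gg} = \pi_g$ and $\pi_{ii|g} = \pi_{i|g}$; see (Särndal et al., 1992, p. 137).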
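Eq. (2.1) and (2.2) can be checked numerically on a small population by enumerating every possible two-stage sample. The following is a minimal sketch under assumptions that are not part of the text: a hypothetical toy population, and SRS without replacement at both stages (rather than πps at the first stage), so that all inclusion probabilities have simple closed forms. Exact rational arithmetic is used so that the comparison is free of rounding error.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

# Hypothetical toy population: y-values grouped by PSU.
psus = [[1, 4, 6], [2, 7, 9], [3, 5, 8, 10]]
n_I, n_g = 2, 2                      # PSUs sampled, elements per sampled PSU
N_I = len(psus)
Y = sum(sum(ys) for ys in psus)      # true population total

def pi1(n, N):
    """First-order inclusion probability under SRS without replacement."""
    return Fraction(n, N)

def pi2(n, N, same):
    """Second-order inclusion probability under SRS (pi_kk = pi_k)."""
    return Fraction(n, N) if same else Fraction(n * (n - 1), N * (N - 1))

def ht_estimate(sample):
    """Eq. (2.1); sample is a list of (g, indices of sampled elements in PSU g)."""
    est = Fraction(0)
    for g, idx in sample:
        y_g_hat = sum(psus[g][i] / pi1(n_g, len(psus[g])) for i in idx)
        est += y_g_hat / pi1(n_I, N_I)
    return est

def all_samples():
    """Yield (probability, sample) for every possible two-stage sample."""
    for s_I in combinations(range(N_I), n_I):
        stage2 = [list(combinations(range(len(psus[g])), n_g)) for g in s_I]
        prob = Fraction(1, comb(N_I, n_I))
        for g in s_I:
            prob *= Fraction(1, comb(len(psus[g]), n_g))
        for choice in product(*stage2):
            yield prob, list(zip(s_I, choice))

def variance_formula():
    """Eq. (2.2) evaluated with the SRS inclusion probabilities."""
    Y_g = [sum(ys) for ys in psus]
    pg = pi1(n_I, N_I)
    v = Fraction(0)
    for g in range(N_I):
        for h in range(N_I):
            v += (pi2(n_I, N_I, g == h) / (pg * pg) - 1) * Y_g[g] * Y_g[h]
    for ys in psus:
        N_g, pi = len(ys), pi1(n_g, len(ys))
        inner = sum(
            (pi2(n_g, N_g, i == j) / (pi * pi) - 1) * ys[i] * ys[j]
            for i in range(N_g) for j in range(N_g)
        )
        v += inner / pg
    return v

# Design expectation and variance obtained by brute-force enumeration.
expectation = sum(p * ht_estimate(s) for p, s in all_samples())
variance = sum(p * (ht_estimate(s) - Y) ** 2 for p, s in all_samples())
```

For this design the enumerated expectation equals the population total, and the enumerated variance coincides with the value returned by `variance_formula()`, which illustrates the unbiasedness of Eq. (2.1) and the two-component structure of Eq. (2.2).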
Under two-stage epsem by πps-SRS (i.e. 2Se), the sample size of elements is the same in all sampled PSU's, i.e. $n_g \equiv n_0$. The sampling variance under 2Se can be written as
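\[
V(\hat{Y}_{HT2Se}) = \sum_{g,h=1}^{N_I} \left( \frac{\pi_{gh}}{\pi_g \pi_h} - 1 \right) Y_g Y_h + \sum_{g=1}^{N_I} \frac{1}{\pi_g} \left( \frac{1 - f_g}{n_0} S_g^2 \right) \tag{2.3}
\]

where $f_g = n_0/N_g$, $S_g^2 = \sum_{i \in U_g} (y_{gi} - \bar{Y}_g)^2$ and $\bar{Y}_g = Y_g/N_g$.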
For two-stage cluster sampling, let each PSU be further divided into $N_{IIg}$ sub-clusters of unequal sizes $N_{gk}$, where $g \in U_I$ and $k = 1, \ldots, N_{IIg}$. Let $U_{g1}, \ldots, U_{gN_{IIg}}$ denote the second-stage clusters in the $g$th PSU. Let a random sample $s_{IIg}$ of $n_{IIg}$ second-stage clusters (or sub-clusters) be selected from each PSU in the first-stage sample, i.e. for each $g \in s_I$. Let $\pi_{k|g}$ and $\pi_{kl|g}$ denote the first- and second-order inclusion probabilities of the $k$th sub-cluster and of the pair of sub-clusters $(k, l)$ in the $g$th PSU, respectively. The HT-estimator of the population total $Y$ and its sampling variance under two-stage cluster sampling can be written as
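\[
\hat{Y}_{HT} = \sum_{g=1}^{n_I} \frac{1}{\pi_g} \sum_{k=1}^{n_{IIg}} \frac{Y_{gk}}{\pi_{k|g}} \tag{2.4}
\]

\[
V(\hat{Y}_{HT}) = \sum_{g,h=1}^{N_I} \left( \frac{\pi_{gh}}{\pi_g \pi_h} - 1 \right) Y_g Y_h + \sum_{g=1}^{N_I} \frac{1}{\pi_g} \sum_{k,l=1}^{N_{IIg}} \left( \frac{\pi_{kl|g}}{\pi_{k|g} \pi_{l|g}} - 1 \right) Y_{gk} Y_{gl} \tag{2.5}
\]

where $Y_{gk} = \sum_{i \in U_{gk}} y_{gi}$ is the population total of the $k$th sub-cluster in the $g$th PSU.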
Under two-stage epsem by πps-SRS (2Sc), assume that all the sub-clusters are of equal size, i.e. $N_{gk} \equiv N_0$, and that an equal number of sub-clusters is sampled from each selected PSU, i.e. $n_{IIg} \equiv n_{II0}$. The sampling variance under 2Sc can be written as
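\[
V(\hat{Y}_{HT2Sc}) = \sum_{g,h=1}^{N_I} \left( \frac{\pi_{gh}}{\pi_g \pi_h} - 1 \right) Y_g Y_h + \sum_{g=1}^{N_I} \frac{1}{\pi_g} \left( \frac{1 - f_{IIg}}{n_{II0}} S_{IIg}^2 \right) \tag{2.6}
\]

where $f_{IIg} = n_{II0}/N_{IIg}$, $S_{IIg}^2 = \sum_{k=1}^{N_{IIg}} (Y_{gk} - \bar{Y}_{IIg})^2$ and $\bar{Y}_{IIg} = Y_g/N_{IIg}$ is the mean per sub-cluster.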
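The equal-probability property of the two designs can be illustrated numerically. The following is a minimal sketch, assuming (this is not stated in the present section) that the first-stage πps inclusion probabilities take the standard form $\pi_g = n_I N_g / N$ with $N = \sum_g N_g$, and that the second stage is SRS without replacement; the PSU sizes and sample sizes are hypothetical.

```python
from fractions import Fraction

# Hypothetical PSU sizes N_g (in elements); every sub-cluster has N_0 elements.
N_g = [12, 20, 8, 40]
N_0 = 4
n_I, n_0, n_II0 = 2, 4, 1   # n_0 = n_II0 * N_0, so both designs sample 4 elements per PSU
N = sum(N_g)

def pi_element_2Se(Ng):
    """Element inclusion probability pi_g * pi_{i|g} under 2Se."""
    pi_g = Fraction(n_I * Ng, N)       # assumed pi-ps first stage
    pi_i_g = Fraction(n_0, Ng)         # SRS of n_0 elements within the PSU
    return pi_g * pi_i_g

def pi_element_2Sc(Ng):
    """Element inclusion probability pi_g * pi_{k|g} under 2Sc."""
    N_IIg = Ng // N_0                  # number of equal-sized sub-clusters
    pi_g = Fraction(n_I * Ng, N)       # assumed pi-ps first stage
    pi_k_g = Fraction(n_II0, N_IIg)    # SRS of n_II0 sub-clusters
    return pi_g * pi_k_g

# Distinct element inclusion probabilities across all PSUs.
probs_2Se = {pi_element_2Se(Ng) for Ng in N_g}
probs_2Sc = {pi_element_2Sc(Ng) for Ng in N_g}
```

Under these assumptions each set contains the single value 1/10, i.e. $n_I n_0 / N$ for 2Se and $n_I n_{II0} N_0 / N$ for 2Sc, so every element has the same inclusion probability regardless of the size of its PSU.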
2.1.2 GREG-estimators under two-stage sampling
For the GREG-estimator, (Särndal et al., 1992, p. 304) describe three scenarios for the availability of auxiliary data in a two-stage sampling design, given in the following.
Case A: (PSU Auxiliaries). The auxiliary totals are available for all PSU's in the population.
Case B: (Complete Element Auxiliaries). The auxiliary values are available for all elements in the entire population.
Case C: (Limited Element Auxiliaries). The auxiliary values are available for all the elements in selected PSU’s only.
Case A leads to a regression model at the PSU level, whereas Cases B and C lead to regression modelling of the element values. Accordingly, the three cases of auxiliary information lead to three different GREG-estimators under a two-stage design. Here, we consider Case B, in which complete auxiliary data are available.
Let $x_{gi}$ denote a column vector of $q$ auxiliary values corresponding to the $i$th element in the $g$th PSU. Let the relationship between $y$ and the auxiliary variables be expressed by an element-level general linear model, $y_{gi} = \mu_{gi} + \epsilon_{gi}$, where $\mu_{gi} = \mu(x_{gi}) = x_{gi}^\top \beta$ is the linear predictor, $\beta$ is a vector of $q$ regression coefficients and $\epsilon_{gi}$ is the random error term associated with the value of the $i$th element in the $g$th PSU, which is normally distributed with mean zero and variance $\sigma_{gi}^2$. A population-based estimate $B$ of the unknown vector $\beta$ is given in Eq. (1.8); $B$ is itself an unknown finite population quantity. Let $s = \{s_1, \ldots, s_{n_I}\}$ denote all the elements in the two-stage sample. A sample estimate $\hat{B}$ of $B$ is given in Eq. (1.7), where $\pi_i = \pi_g \pi_{i|g}$ under 2Se, $\pi_i = \pi_g \pi_{k|g}$ for all $i \in U_{gk}$ under 2Sc, and it is assumed that $\sigma_{gi}^2 = 1$.
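To make the role of the inclusion probabilities in $\hat{B}$ concrete, the following is a minimal sketch. Eq. (1.7) is not reproduced in this section, so the sketch assumes the usual π-weighted least-squares form $\hat{B} = (\sum_{i \in s} x_i x_i^\top / \pi_i)^{-1} \sum_{i \in s} x_i y_i / \pi_i$ with $\sigma_{gi}^2 = 1$, specialised to $q = 2$ with $x_{gi} = (1, x_{1,gi})^\top$. The function name `pi_weighted_ols` and the data are hypothetical.

```python
from fractions import Fraction

def pi_weighted_ols(x1, y, pi):
    """Solve the pi-weighted normal equations for y = b0 + b1 * x1.

    Assumes x_gi = (1, x1_gi)^T, sigma_gi^2 = 1 and weights 1 / pi_i.
    """
    w = [1 / Fraction(p) for p in pi]
    s_w = sum(w)
    s_x = sum(wi * xi for wi, xi in zip(w, x1))
    s_xx = sum(wi * xi * xi for wi, xi in zip(w, x1))
    s_y = sum(wi * yi for wi, yi in zip(w, y))
    s_xy = sum(wi * xi * yi for wi, xi, yi in zip(w, x1, y))
    # Closed-form inverse of the 2 x 2 matrix [[s_w, s_x], [s_x, s_xx]].
    det = s_w * s_xx - s_x * s_x
    b0 = (s_xx * s_y - s_x * s_xy) / det
    b1 = (s_w * s_xy - s_x * s_y) / det
    return b0, b1

# Hypothetical 2Se sample: pi_i = pi_g * pi_{i|g} for each sampled element.
pi_g = [Fraction(1, 2), Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
pi_i_g = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 2), Fraction(1, 2)]
pi = [a * b for a, b in zip(pi_g, pi_i_g)]

x1 = [1, 2, 4, 7]
y = [3 + 2 * x for x in x1]          # exactly linear, so the fit should be exact
b0, b1 = pi_weighted_ols(x1, y, pi)
```

Because the hypothetical y-values are exactly linear in $x_1$, the sketch returns the intercept 3 and slope 2 whatever the weights; with noisy data the weights $1/\pi_i$ would affect the estimate.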
The GREG-estimator of the finite population total $Y$ under two-stage element sampling, from (Särndal et al., 1992, p. 323), is given by
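\[
\hat{Y}_{GR2Se} = \sum_{i \in U} \hat{y}_i + \sum_{i \in s} \frac{\hat{e}_i}{\pi_i} = \sum_{g=1}^{N_I} \sum_{i=1}^{N_g} \hat{y}_{gi} + \sum_{g=1}^{n_I} \frac{1}{\pi_g} \sum_{i=1}^{n_g} \frac{\hat{e}_{gi}}{\pi_{i|g}} \tag{2.7}
\]

where $\hat{y}_{gi} = x_{gi}^\top \hat{B}$ and $\hat{e}_{gi} = y_{gi} - \hat{y}_{gi}$. An approximate variance of the GREG-estimator in Eq. (2.7) is given by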
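\[
V(\hat{Y}_{GR2Se}) \approx \sum_{g=1}^{N_I} \sum_{h=1}^{N_I} \left( \frac{\pi_{gh}}{\pi_g \pi_h} - 1 \right) e_g e_h + \sum_{g=1}^{N_I} \frac{1}{\pi_g} \sum_{i,j=1}^{N_g} \left( \frac{\pi_{ij|g}}{\pi_{i|g} \pi_{j|g}} - 1 \right) e_{gi} e_{gj}
\]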