
Improving the cube method

3.2 Proposed procedure: two-step cube

It becomes negligible for large $n$ and a fixed value of $q$ (Deville and Tillé, 2004). In the landing phase, the cube method does not control the total imbalance explicitly. Moreover, when the observed $D_l(s)$ from a sample $s$ selected by the cube method is non-negligible, it is not clear whether the given sample was an unlucky draw or whether the total imbalance under the cube method is actually large. One can get an empirical estimate of the total imbalance by repeated sampling under the cube method. When many samples are selected by the cube method, each sample has a different observed value of $\sum_{l=1}^{q} D_l$. A sample with the smallest value of $\sum_{l=1}^{q} D_l$ would be a good choice, but always choosing it does not yield a sampling design with fixed first-order inclusion probabilities. Selecting a sample at random from these samples would again amount to sampling by the cube method. A sensible choice would be to select a random sample out of the many cube samples with a different sampling distribution such that the total imbalance is smaller than that of the cube method. Implementation of this idea is given in the next section.

The rest of the chapter is arranged as follows. In Section 3.2, the proposed sampling procedure is given. Its theoretical properties are discussed in Section 3.3. In Section 3.4, a simulation study compares the proposed sampling procedure with the cube method. A methodology for the estimation of sampling variance under balanced sampling is proposed in Section 3.5. A simulation study in Section 3.6 assesses the performance of the proposed variance estimators and compares them with an estimator from the literature. Section 3.7 gives conclusions and some directions for future work.

3.2 Proposed procedure: two-step cube

The proposed procedure consists of two steps: first, a finite number of samples are selected using the cube method and the empirical value of the total imbalance is computed; second, a sampling distribution for the selected cube samples is determined by minimizing the empirical total imbalance. However, only a tentative solution is given for the minimization problem in the second step. The finite set of samples selected using the cube method is called the realized cube sample space, and the empirical distribution over the realized cube sample space is calculated. The first-order inclusion probabilities under the new sampling distribution, determined by minimization, are the same as those calculated under the empirical distribution implied by the cube method. When a sample is selected from the realized cube sample space using the sampling distribution that minimizes the empirical total imbalance, the total imbalance is expected to be equal to or smaller than that of the cube method.

Let $c(s)$ be the set of selection probabilities associated with samples under the cube method; it denotes the sampling distribution implied by the cube method. Let $\Lambda_c$ and $\operatorname{tr}(\Lambda_c)$ denote the variance-covariance matrix of the vector $\hat{X}$ and the total imbalance under the cube method, respectively. Expressions for $\Lambda_c$ and $\operatorname{tr}(\Lambda_c)$ can be obtained by substituting $c(s)$ for $p(s)$ in Eq. (3.2) and Eq. (3.3), respectively. The two steps of the proposed procedure are as follows.

Step 1: Let $K$ samples of fixed size $n$ be selected from $U$ using the cube method, and let $\Omega_K = \{s_k;\ k = 1, \ldots, K\}$ denote the realized cube sample space of size $K$. $\Omega_K$ may contain a sample multiple times; therefore, it is not a set of distinct samples. The minimum size of $\Omega_K$ (or minimum value of $K$) is such that all the population units are contained in it.

The impact of small and large values of $K$ will become clearer in the next section, where theoretical properties of the proposed procedure are discussed. Let the vector $\lambda = [\lambda_k = 1/K]$ denote the empirical distribution over $\Omega_K$, which assigns a probability mass of $1/K$ to each sample $s_k$ in $\Omega_K$. An empirical estimate of the variance-covariance matrix $\Lambda_c$ based on $\lambda$ is given by

$$\hat{\Lambda}_\lambda = E_\lambda\left\{ (\hat{X} - X)^\top (\hat{X} - X) \right\}$$

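where $E_\lambda$ denotes expectation with respect to the empirical distribution $\lambda$. Similarly, an empirical estimate of total imbalance under the cube method is given by

$$\operatorname{tr}(\hat{\Lambda}_\lambda) = \sum_{l=1}^{q} \hat{\Lambda}_{ll}(\lambda) = \sum_{l=1}^{q} \sum_{k=1}^{K} \lambda_k \left( \hat{X}_l(s_k) - X_l \right)^2 = \sum_{l=1}^{q} \frac{1}{K} \sum_{k=1}^{K} \left( \hat{X}_l(s_k) - X_l \right)^2$$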

where $\hat{X}_l(s_k)$ is the HT-estimate of the auxiliary total $X_l$ based on the sample $s_k$, and $\hat{\Lambda}_{ll}(\lambda)$ is the $l$th diagonal element of $\hat{\Lambda}_\lambda$, which simply represents the empirical estimate of the sampling variance of the HT-estimator $\hat{X}_l$ under the cube method. The first-order inclusion probabilities implied by the empirical distribution $\lambda$ are given by

$$\pi_i(\lambda) = \sum_{k=1}^{K} I_{i \in s_k} \lambda_k = \frac{1}{K} \sum_{k=1}^{K} I_{i \in s_k}$$

where $I_{i \in s_k}$ is an indicator variable which takes the value 1 if $i \in s_k$ and 0 otherwise, and $i \in U$.
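The two empirical quantities of Step 1 can be illustrated with a minimal Python sketch. It is not code from the source: the function names, the toy population of four units with one auxiliary variable, and the four hand-picked samples standing in for cube samples are all assumptions made for illustration. The HT-estimate is computed in the usual way as the sum of $x_{il}/\pi_i$ over the sampled units.

```python
def ht_estimate(sample, x, pi):
    """Horvitz-Thompson estimates of the q auxiliary totals from one sample.

    sample: iterable of unit indices; x[i]: list of the q auxiliary values of
    unit i; pi[i]: first-order inclusion probability of unit i under the design.
    """
    q = len(x[0])
    return [sum(x[i][l] / pi[i] for i in sample) for l in range(q)]


def empirical_total_imbalance(samples, x, pi, weights=None):
    """tr(Lambda_hat) = sum_l sum_k w_k * (X_hat_l(s_k) - X_l)^2.

    With weights=None the empirical distribution lambda_k = 1/K is used.
    """
    K = len(samples)
    if weights is None:
        weights = [1.0 / K] * K
    q = len(x[0])
    totals = [sum(row[l] for row in x) for l in range(q)]  # true totals X_l
    trace = 0.0
    for w, s in zip(weights, samples):
        est = ht_estimate(s, x, pi)
        trace += w * sum((est[l] - totals[l]) ** 2 for l in range(q))
    return trace


def inclusion_probabilities(samples, N, weights=None):
    """pi_i = sum_k I(i in s_k) * w_k for every unit i = 0, ..., N-1."""
    K = len(samples)
    if weights is None:
        weights = [1.0 / K] * K
    probs = [0.0] * N
    for w, s in zip(weights, samples):
        for i in set(s):
            probs[i] += w
    return probs


# Hypothetical toy data: N = 4 units, q = 1 auxiliary variable, n = 2.
x = [[1.0], [4.0], [2.0], [3.0]]
pi = [0.5, 0.5, 0.5, 0.5]
samples = [(0, 1), (2, 3), (0, 2), (1, 3)]  # stand-in for K = 4 cube samples

imbalance = empirical_total_imbalance(samples, x, pi)
pi_lambda = inclusion_probabilities(samples, N=len(x))
covers_population = all(p > 0 for p in pi_lambda)  # minimum-K condition of Step 1
print(imbalance, pi_lambda, covers_population)
# 8.0 [0.5, 0.5, 0.5, 0.5] True
```

In this toy case two samples reproduce the total exactly and two miss it by 4, so the empirical total imbalance is $(0 + 0 + 16 + 16)/4 = 8$. The last check mirrors the requirement that every population unit appears in the realized cube sample space.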

Step 2: Let $\lambda^* (\neq \lambda)$ denote another sampling distribution (which needs to be determined here) over the realized cube sample space $\Omega_K$ such that $\operatorname{tr}(\hat{\Lambda}_{\lambda^*}) \le \operatorname{tr}(\hat{\Lambda}_\lambda)$ and $\pi_i(\lambda^*) = \pi_i(\lambda)$ for all $i \in U$, where the $\pi_i(\lambda^*)$ are the first-order inclusion probabilities and $\operatorname{tr}(\hat{\Lambda}_{\lambda^*})$ is the total imbalance implied by the sampling distribution $\lambda^*$ for the given realized cube sample space $\Omega_K$. The expression for $\operatorname{tr}(\hat{\Lambda}_{\lambda^*})$ is given by

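$$\operatorname{tr}(\hat{\Lambda}_{\lambda^*}) = \sum_{l=1}^{q} \hat{\Lambda}_{ll}(\lambda^*) = \sum_{l=1}^{q} \sum_{k=1}^{K} \lambda^*_k \left( \hat{X}_l(s_k) - X_l \right)^2$$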

where the matrix $\hat{\Lambda}_{\lambda^*}$ is the variance-covariance matrix of the vector $\hat{X}$ under the sampling distribution $\lambda^*$. In order to obtain the sampling distribution $\lambda^*$ over $\Omega_K$, an optimization problem is formulated as follows:

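Minimize:

$$C(\lambda^*) = \operatorname{tr}(\hat{\Lambda}_{\lambda^*}) = \sum_{l=1}^{q} \sum_{k=1}^{K} \lambda^*_k \left( \hat{X}_l(s_k) - X_l \right)^2$$

Subject to:

(i). $0 < \lambda^*_k < 1, \ \forall k$,

(ii). $\sum_{k=1}^{K} \lambda^*_k = 1$,

(iii). $\pi_i(\lambda^*) = \pi_i(\lambda), \ \forall i \in U$.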

where $C(\lambda^*)$ denotes the cost function for the solution vector $\lambda^*$ in the above optimization problem. The first and second constraints ensure that $\lambda^*$ is a sampling distribution, while the third constraint ensures that the first-order inclusion probabilities under the proposed procedure are the same as those computed under the cube method.
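To make the constraints concrete, the following Python sketch evaluates the cost function and checks constraints (i) to (iii) for a candidate distribution. It is an illustration under assumptions, not the source's implementation: the function names, the tolerance, the four toy samples, and the per-sample squared deviations `sq_dev` (standing for $d_k = \sum_{l=1}^{q} (\hat{X}_l(s_k) - X_l)^2$) are hypothetical.

```python
def cost(lam, sq_dev):
    """C(lambda*) = sum_k lambda*_k * d_k, with d_k the squared deviation of sample k."""
    return sum(w * d for w, d in zip(lam, sq_dev))


def is_feasible(lam_star, samples, N, tol=1e-9):
    """Check constraints (i)-(iii) for a candidate distribution over the samples."""
    K = len(samples)
    if not all(0.0 < w < 1.0 for w in lam_star):   # (i) strictly inside (0, 1)
        return False
    if abs(sum(lam_star) - 1.0) > tol:             # (ii) probabilities sum to one
        return False
    for i in range(N):                             # (iii) same inclusion probabilities
        pi_star = sum(w for w, s in zip(lam_star, samples) if i in s)
        pi_emp = sum(1.0 / K for s in samples if i in s)
        if abs(pi_star - pi_emp) > tol:
            return False
    return True


# Hypothetical realized sample space: N = 4 units, K = 4 samples of size n = 2.
samples = [(0, 1), (2, 3), (0, 2), (1, 3)]
sq_dev = [0.0, 0.0, 16.0, 16.0]        # assumed d_k values, for illustration only

lam = [0.25, 0.25, 0.25, 0.25]         # empirical distribution, lambda_k = 1/K
lam_star = [0.4, 0.4, 0.1, 0.1]        # candidate that shifts mass to balanced samples
lam_bad = [0.7, 0.1, 0.1, 0.1]         # changes the inclusion probabilities

print(is_feasible(lam_star, samples, N=4), cost(lam, sq_dev), cost(lam_star, sq_dev))
print(is_feasible(lam_bad, samples, N=4))
```

Here `lam_star` keeps every inclusion probability at 0.5 while lowering the cost from 8 to about 3.2, whereas `lam_bad` raises the inclusion probability of unit 0 to 0.8 and is rejected by constraint (iii).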

For the above optimization problem, a stochastic global optimization algorithm known as the simulated annealing algorithm is used. Mullen (2014) discussed implementations of different optimization algorithms, including the simulated annealing algorithm, which is available in two different packages of the R statistical software (R Core Team, 2022): first, the R function optim(method="SANN") in the R package stats; second, the R function and package GenSA (Xiang et al., 2013). These implementations of the simulated annealing algorithm rarely produce a sensible solution for this particular problem because of its complexity. The simulated annealing algorithm is easy to program and manipulate manually in R. R code was written for the simulated annealing algorithm; it gives a reasonable solution but has a scalability problem as the population size becomes large. If a better algorithm is available, it can be plugged in, but the idea of the proposed procedure does not change.
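For readers unfamiliar with the technique, a generic simulated annealing loop for this kind of problem might look like the Python sketch below. It is not the R code referred to above, whose details are not given here. The quadratic penalty used to handle constraint (iii), the pairwise mass-transfer move, the linear cooling schedule, and all parameter values are assumptions chosen for illustration, and the toy inputs are hypothetical.

```python
import math
import random


def anneal(sq_dev, samples, N, penalty=1e3, steps=5000, t0=1.0, seed=0):
    """Generic simulated annealing over distributions on K samples.

    Minimizes sum_k lam_k * d_k plus a quadratic penalty on deviations from
    the empirical inclusion probabilities (an assumed way to treat (iii)).
    """
    rng = random.Random(seed)
    K = len(samples)
    member = [[1.0 if i in s else 0.0 for s in samples] for i in range(N)]
    target = [sum(row) / K for row in member]      # pi_i under lambda_k = 1/K

    def objective(lam):
        c = sum(w * d for w, d in zip(lam, sq_dev))
        viol = sum((sum(m * w for m, w in zip(row, lam)) - t) ** 2
                   for row, t in zip(member, target))
        return c + penalty * viol

    lam = [1.0 / K] * K                            # start from the empirical distribution
    cur = objective(lam)
    best, best_val = lam[:], cur
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-12   # linear cooling towards zero
        a, b = rng.sample(range(K), 2)
        delta = rng.uniform(0.0, 0.5 * lam[a])     # at most half of lam[a], so it stays > 0
        cand = lam[:]
        cand[a] -= delta                           # moving mass from a to b keeps the sum at 1
        cand[b] += delta
        val = objective(cand)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if val <= cur or rng.random() < math.exp((cur - val) / temp):
            lam, cur = cand, val
            if cur < best_val:
                best, best_val = lam[:], cur
    return best, best_val


# Hypothetical inputs: four samples of size 2 from N = 4 units and assumed d_k values.
samples = [(0, 1), (2, 3), (0, 2), (1, 3)]
sq_dev = [0.0, 0.0, 16.0, 16.0]
best, best_val = anneal(sq_dev, samples, N=4)
print(best, best_val)
```

Because constraint (iii) is enforced only through a penalty, the returned distribution may match the target inclusion probabilities approximately rather than exactly; a larger `penalty` tightens the match at the price of slower exploration.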

From the document Balanced Two-Stage Equal Probability Sampling (pages 78-81)