Improving the cube method
3.2 Proposed procedure: two-step cube
It becomes negligible for large $n$ and a fixed value of $q$ (Deville and Tillé, 2004). In the landing phase, the cube method does not control the total imbalance explicitly. Moreover, when the observed $D_l(s)$ from a sample $s$ selected by the cube method is non-negligible, it is not clear whether the sample was simply an unlucky draw or whether the total imbalance under the cube method is actually large. One can obtain an empirical estimate of the total imbalance by repeated sampling under the cube method. When many samples are selected by the cube method, each sample has a different observed value of $\sum_{l=1}^{q} D_l$. A sample with the smallest value of $\sum_{l=1}^{q} D_l$ would be a good choice, but choosing it does not yield a sampling design with fixed first-order inclusion probabilities. Selecting a sample at random from these samples again amounts to sampling by the cube method. A sensible choice is to select a sample at random from the many cube samples using a different sampling distribution, chosen so that the total imbalance is smaller than that of the cube method. The implementation of this idea is given in the next section.
The rest of the chapter is arranged as follows. In Section 3.2, the proposed sampling procedure is given. Its theoretical properties are discussed in Section 3.3. In Section 3.4, a simulation study compares the proposed sampling procedure with the cube method. A methodology for estimating the sampling variance under balanced sampling is proposed in Section 3.5. A simulation study in Section 3.6 assesses the performance of the proposed variance estimators and compares them with an estimator from the literature. Section 3.7 gives conclusions and some directions for future work.
3.2 Proposed procedure: two-step cube
The proposed procedure consists of two steps: first, a finite number of samples are selected using the cube method and the empirical value of the total imbalance is computed; second, a sampling distribution for the selected cube samples is determined by minimizing the empirical total imbalance. For the minimization problem in the second step, however, only a tentative solution is given. The finite set of samples selected using the cube method is called the realized cube sample space, and the empirical distribution over the realized cube sample space is calculated. The first-order inclusion probabilities under the new sampling distribution, determined by minimization, are the same as those calculated under the empirical distribution implied by the cube method. When a sample is selected from the realized cube sample space using the sampling distribution that minimizes the empirical total imbalance, the total imbalance is expected to be equal to or smaller than that of the cube method.
Let $c(s)$ denote the set of selection probabilities associated with the samples under the cube method, that is, the sampling distribution implied by the cube method. Let $\Lambda_c$ and $\mathrm{tr}(\Lambda_c)$ denote the variance-covariance matrix of the vector $\hat{X}$ and the total imbalance under the cube method, respectively. Expressions for $\Lambda_c$ and $\mathrm{tr}(\Lambda_c)$ can be obtained by substituting $c(s)$ for $p(s)$ in Eq. (3.2) and Eq. (3.3), respectively. The two steps of the proposed procedure are as follows:
Step 1: Let $K$ samples of fixed size $n$ be selected from $U$ using the cube method, and let $\Omega_K = \{s_k;\ k = 1, \ldots, K\}$ denote the realized cube sample space of size $K$. $\Omega_K$ may contain the same sample multiple times; it is therefore not a set of distinct samples. The minimum size of $\Omega_K$ (that is, the minimum value of $K$) is such that every population unit is contained in at least one sample of $\Omega_K$.
The impact of small and large values of $K$ will become clearer in the next section, where the theoretical properties of the proposed procedure are discussed. Let the vector $\lambda = [\lambda_k = 1/K]$ denote the empirical distribution over $\Omega_K$, which assigns a probability mass of $1/K$ to each sample $s_k$ in $\Omega_K$. An empirical estimate of the variance-covariance matrix $\Lambda_c$ based on $\lambda$ is given by

$$\hat{\Lambda}_{\lambda} = E_{\lambda}\left\{ (\hat{X} - X)^{\top} (\hat{X} - X) \right\}$$
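where $E_{\lambda}$ denotes expectation with respect to the empirical distribution $\lambda$. Similarly, an empirical estimate of the total imbalance under the cube method is given by

$$\mathrm{tr}(\hat{\Lambda}_{\lambda}) = \sum_{l=1}^{q} \hat{\Lambda}_{ll}(\lambda) = \sum_{l=1}^{q} \sum_{k=1}^{K} \lambda_k \left( \hat{X}_l(s_k) - X_l \right)^2 = \sum_{l=1}^{q} \frac{1}{K} \sum_{k=1}^{K} \left( \hat{X}_l(s_k) - X_l \right)^2$$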
where $\hat{X}_l(s_k)$ is the HT-estimate of the auxiliary total $X_l$ based on the sample $s_k$, and $\hat{\Lambda}_{ll}(\lambda)$ is the $l$th diagonal element of $\hat{\Lambda}_{\lambda}$, which is simply the empirical estimate of the sampling variance of the HT-estimator $\hat{X}_l$ under the cube method. The first-order inclusion probabilities implied by the empirical distribution $\lambda$ are given by

$$\pi_i(\lambda) = \sum_{k=1}^{K} I_{i \in s_k} \lambda_k = \frac{1}{K} \sum_{k=1}^{K} I_{i \in s_k}$$
where $I_{i \in s_k}$ is an indicator variable that takes the value 1 if $i \in s_k$ and 0 otherwise, for $i \in U$.
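The quantities defined in Step 1 can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the toy population of four units with a single auxiliary variable ($q = 1$), the design inclusion probabilities of 0.5, and the four hard-coded samples standing in for cube-method draws are all assumptions made for illustration.

```python
def ht_estimates(sample, x, pi):
    """HT-estimates of the q auxiliary totals from one sample.

    sample: unit indices in the sample; x[i]: the q auxiliary values of unit i;
    pi[i]: design inclusion probability of unit i (assumed known).
    """
    q = len(x[0])
    return [sum(x[i][l] / pi[i] for i in sample) for l in range(q)]


def empirical_total_imbalance(samples, x, pi):
    """Empirical total imbalance with lambda_k = 1/K for every sample s_k."""
    K = len(samples)
    q = len(x[0])
    totals = [sum(row[l] for row in x) for l in range(q)]  # true totals X_l
    acc = 0.0
    for s in samples:
        est = ht_estimates(s, x, pi)
        acc += sum((est[l] - totals[l]) ** 2 for l in range(q))
    return acc / K


def empirical_inclusion_probs(samples, N):
    """pi_i(lambda): share of the K realized samples that contain unit i."""
    K = len(samples)
    counts = [0] * N
    for s in samples:
        for i in set(s):
            counts[i] += 1
    return [c / K for c in counts]


# Hypothetical toy data: these samples are NOT produced by the cube method.
x = [[1.0], [2.0], [3.0], [4.0]]             # one auxiliary variable per unit
pi = [0.5, 0.5, 0.5, 0.5]                    # assumed design probabilities
samples = [(0, 1), (2, 3), (0, 2), (1, 3)]   # realized sample space, K = 4

imbalance = empirical_total_imbalance(samples, x, pi)
pi_hat = empirical_inclusion_probs(samples, N=4)
print(imbalance, pi_hat)
```

In this toy case every unit appears in two of the four samples, so each $\pi_i(\lambda)$ equals 0.5. A unit that appeared in no sample would receive $\pi_i(\lambda) = 0$, which is the situation the minimum value of $K$ in Step 1 is meant to rule out.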
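Step 2: Let $\lambda^{*} (\neq \lambda)$ denote another sampling distribution (to be determined here) over the realized cube sample space $\Omega_K$ such that $\mathrm{tr}(\hat{\Lambda}_{\lambda^{*}}) \le \mathrm{tr}(\hat{\Lambda}_{\lambda})$ and $\pi_i(\lambda^{*}) = \pi_i(\lambda)$ for all $i \in U$, where the $\pi_i(\lambda^{*})$ are the first-order inclusion probabilities and $\mathrm{tr}(\hat{\Lambda}_{\lambda^{*}})$ is the total imbalance implied by the sampling distribution $\lambda^{*}$ for the given realized cube sample space $\Omega_K$. The expression for $\mathrm{tr}(\hat{\Lambda}_{\lambda^{*}})$ is given by

$$\mathrm{tr}(\hat{\Lambda}_{\lambda^{*}}) = \sum_{l=1}^{q} \hat{\Lambda}_{ll}(\lambda^{*}) = \sum_{l=1}^{q} \sum_{k=1}^{K} \lambda^{*}_k \left( \hat{X}_l(s_k) - X_l \right)^2$$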
where the matrix $\hat{\Lambda}_{\lambda^{*}}$ is the variance-covariance matrix of the vector $\hat{X}$ under the sampling distribution $\lambda^{*}$. To obtain the sampling distribution $\lambda^{*}$ over $\Omega_K$, the following optimization problem is formulated:
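Minimize:

$$C(\lambda^{*}) = \mathrm{tr}(\hat{\Lambda}_{\lambda^{*}}) = \sum_{l=1}^{q} \sum_{k=1}^{K} \lambda^{*}_k \left( \hat{X}_l(s_k) - X_l \right)^2$$

Subject to:

(i) $0 < \lambda^{*}_k < 1$, $\forall k$,
(ii) $\sum_{k=1}^{K} \lambda^{*}_k = 1$,
(iii) $\pi_i(\lambda^{*}) = \pi_i(\lambda)$, $\forall i \in U$,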
where $C(\lambda^{*})$ denotes the cost function for the solution vector $\lambda^{*}$ in the above optimization problem. The first and second constraints ensure that $\lambda^{*}$ is a sampling distribution, while the third constraint ensures that the first-order inclusion probabilities under the proposed procedure are the same as those computed under the cube method.
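As a rough illustration of the optimization problem, the Python sketch below evaluates the cost function and checks constraints (i) to (iii) for a candidate $\lambda^{*}$. It does not solve the problem, and it is not the author's code: the function names, the tolerance, the toy population (four units, one auxiliary variable, design inclusion probabilities of 0.5), the hard-coded samples and the hand-picked candidate weights are all assumptions.

```python
def cost(lam, sq_dev):
    """C(lambda*): weighted sum over samples of their squared HT deviations."""
    return sum(w * d for w, d in zip(lam, sq_dev))


def inclusion_probs(lam, samples, N):
    """pi_i under the distribution lam: total mass of samples containing unit i."""
    pi = [0.0] * N
    for w, s in zip(lam, samples):
        for i in set(s):
            pi[i] += w
    return pi


def is_feasible(lam, samples, N, tol=1e-9):
    """Check constraints (i)-(iii) against the uniform empirical distribution."""
    K = len(samples)
    target = inclusion_probs([1.0 / K] * K, samples, N)   # pi_i(lambda)
    current = inclusion_probs(lam, samples, N)            # pi_i(lambda*)
    return (all(0.0 < w < 1.0 for w in lam)                # (i)
            and abs(sum(lam) - 1.0) <= tol                 # (ii)
            and all(abs(a - b) <= tol                      # (iii)
                    for a, b in zip(current, target)))


# Hypothetical toy data: these samples are NOT produced by the cube method.
x = [1.0, 2.0, 3.0, 4.0]                     # single auxiliary variable (q = 1)
design_pi = 0.5                              # assumed design probability
samples = [(0, 1), (2, 3), (0, 2), (1, 3)]   # realized sample space, K = 4
X = sum(x)
# Squared deviation of the HT-estimate from the true total, per sample.
sq_dev = [(sum(x[i] / design_pi for i in s) - X) ** 2 for s in samples]

lam_uniform = [0.25, 0.25, 0.25, 0.25]       # empirical distribution lambda
lam_star = [0.1, 0.1, 0.4, 0.4]              # hand-picked candidate lambda*

print(cost(lam_uniform, sq_dev), cost(lam_star, sq_dev))
print(is_feasible(lam_star, samples, N=4))
```

In this toy case the candidate shifts mass towards the two better-balanced samples while leaving every $\pi_i$ at 0.5, so the cost falls from 10 to about 6.4. A real instance would need a search over $\lambda^{*}$ rather than a hand-picked candidate.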
For the above optimization problem, a stochastic global optimization algorithm known as simulated annealing is used. Mullen (2014) discussed implementations of different optimization algorithms, including simulated annealing, that are available in two packages of the R statistical software (R Core Team, 2022): first, the R function optim(method="SANN") in the R package stats; second, the R function and package GenSA (Xiang et al., 2013). These implementations of simulated annealing rarely produce a sensible solution for this particular problem because of its complexity. The simulated annealing algorithm is easy to program and to adapt by hand in R. R code was therefore written for the simulated annealing algorithm; it gives a reasonable solution but scales poorly as the population size becomes large. If a better algorithm is available, it can be plugged in without changing the idea of the proposed procedure. A