Journal of Statistical Theory and Practice
ISSN: 1559-8608 (Print) 1559-8616 (Online) Journal homepage: http://www.tandfonline.com/loi/ujsp20
Recent Developments in Systematic Sampling: A Review
Sayed A. Mostafa & Ibrahim A. Ahmad
To cite this article: Sayed A. Mostafa & Ibrahim A. Ahmad (2017): Recent Developments in Systematic Sampling: A Review, Journal of Statistical Theory and Practice, DOI: 10.1080/15598608.2017.1353456
To link to this article: http://dx.doi.org/10.1080/15598608.2017.1353456
Accepted author version posted online: 13 Jul 2017.
Accepted
Manuscript
Recent Developments in Systematic Sampling: A Review
Sayed A. Mostafa
Department of Statistics, Indiana University, Bloomington, Indiana 47408, USA
[email protected]
Ibrahim A. Ahmad
Department of Statistics, Oklahoma State University, Stillwater, Oklahoma 74078, USA
[email protected]
Abstract
Systematic sampling is one of the most prevalent sampling techniques. The popularity of the systematic design is mainly due to its practicality. Compared with simple random sampling, it is easier to draw a systematic sample, especially when the selection of sample units is done in the field. In addition, systematic sampling can provide more precise estimators than simple random sampling when explicit or implicit stratification is present in the sampling frame.
However, the systematic design has two major drawbacks. First, if the population size is not an integral multiple of the desired sample size, the actual sample size will be random.
Second, a single systematic sample cannot provide an unbiased estimator for the sampling variance. Another limitation in the systematic design is that for populations with a periodic component, the efficiency of systematic sampling estimators will be highly dependent on the relation between the length of the period and the sampling interval. In the literature, one can find that many attempts have been made towards handling one or more of these issues. The present paper offers a review of the recent work in this area and provides some recommendations for survey practitioners using the systematic design for different sampling situations.
AMS Subject Classification: 62D05
Key words: Systematic sampling; sampling variance; superpopulation; systematic sampling with probability proportional to size; spatial surveys.
1. Introduction and the basic systematic design
The sampling design in which only the first unit is randomly selected and the rest are automatically selected according to a predetermined pattern is known as systematic sampling.
In its simplest form, called linear systematic sampling (LSS) (Table 2 at the end of this paper gives a glossary for all abbreviations used), the systematic design can be described as follows. In order to choose a systematic sample of size $n$ from a population with $N$ units $(Y_1, Y_2, \ldots, Y_N)$, the population is first divided into $n$ groups each of size $k$ units, where $k = N/n$ is called the sampling interval. A random start $R$ is chosen from the first group of $k$ units. The $R$-th unit in each of the remaining groups is selected in the sample. The sample obtained will hence comprise the units with labels $R + (j-1)k$ $(j = 1, 2, \ldots, n)$. This systematic design is obviously an equal probability of selection sampling method. Systematic sampling with unequal probabilities is discussed in Section 5.
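The LSS selection rule described above can be sketched in a few lines of Python. This is only an illustration under the assumption that $N$ is an exact multiple of $n$; the function name `linear_systematic_sample` and its arguments are hypothetical and do not come from the paper.

```python
import random


def linear_systematic_sample(N, n, rng=None):
    """Return the 1-based labels R + (j - 1)k, j = 1..n, of one LSS draw.

    Illustrative sketch: assumes N is an integer multiple of n.
    """
    if N % n != 0:
        raise ValueError("this sketch assumes N is an integer multiple of n")
    rng = rng or random.Random()
    k = N // n                 # sampling interval
    R = rng.randint(1, k)      # random start from the first group of k units
    return [R + (j - 1) * k for j in range(1, n + 1)]


labels = linear_systematic_sample(N=20, n=5, rng=random.Random(1))
```

Every unit appears in exactly one of the $k$ possible samples, which is why each unit has inclusion probability $1/k = n/N$.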
Levy and Lemeshow (2008, p.83) started their chapter about systematic sampling with the statement: "Systematic sampling, either by itself or in combination with some other method, may be the most widely used method of sampling." Additionally, systematic sampling can provide implicit stratification, and hence produce better estimates, if the sampling frame is in some order (Cochran, 1977). Systematic sampling is both practically convenient and efficient in sampling some natural populations like forests (Finney, 1948; Zinger, 1964).
In fact, the early developments in systematic sampling were closely related to problems of forestry. This led Buckland (1951), in his first review paper about systematic sampling, to note that "just as the methods of Factor Analysis grew up in the atmosphere of the psychologist, so has Systematic Sampling grown up alongside problems of forestry and land use." One example of using the systematic design for sampling such natural populations is the Global Forest Resources Assessment Survey (GFRAS, 2010) conducted by the Food and Agriculture Organization (FAO) of the United Nations. Systematic sampling is also commonly used in household surveys such as the Current Population Survey (CPS) in the U.S. (U.S. Census Bureau, 2006) and the Demographic and Health Surveys (DHS) conducted in many developing countries (ICF International, 2012).
The theory of systematic sampling was first studied by Madow and Madow (1944).
However, a long time before the Madows' paper, systematic sampling was being utilized in many applications such as forestry, agriculture and meteorology. Some of these applications are reviewed in Buckland (1951), Cochran (1977) and Iachan (1982). Madow (1949; 1953) continued working on the theory of systematic sampling. The main results of these three seminal papers were reviewed in Buckland (1951) and Iachan (1982). The last review of the work in the area of systematic sampling was done by Iachan (1982), about thirty years after the review made by Buckland (1951). Since 1982 no review, except the comprehensive discussion of Bellhouse (1988), has been available about the work in this area, although a lot of work has been done on systematic sampling during the last few decades. Therefore, in this paper we try to present a critical review of the recent contributions made in this area of research. In our presentation, we try to avoid reviewing the work that has already been reviewed by Buckland (1951) and Iachan (1982). However, we will go over some of this work, especially if it is needed to understand the new developments in this area.
The rest of this paper is organized as follows. In Section 2, we discuss the main drawbacks of the systematic design in its crude form. Section 3 presents the attempts made toward overcoming these drawbacks. A comparative analysis of some of these attempts is introduced in Section 4. In Section 5, we describe the unequal probability version of systematic sampling. Section 6 briefly discusses the role of systematic sampling in spatial surveys. We conclude the paper with a brief discussion.
2. Problematic issues in systematic sampling
Systematic sampling has been found in practice to suffer from two main limitations. These limitations are listed in the following.
I. The systematic sampling technique assumes that the sampling interval $k$ is always an integer. However, most of the time $N \neq nk$. If this is the case, systematic sampling will result in a variable sample size which depends on the random start $R$. Consequently, the sample mean will become a ratio estimator, which is known to be a biased estimator for the population mean (Kish, 1965).
II. Regardless of the problem of non-integer sampling intervals, systematic sampling can be viewed as cluster sampling where only one cluster is chosen randomly from $k$ clusters each of size $n$ units (Cochran, 1977). Therefore, a single systematic sample alone cannot be used to estimate the design variance of the systematic sample mean, henceforth referred to as the sampling variance or $V_{sys}(\bar{y})$ (Madow and Madow, 1944). This problem arises mainly because most of the second order inclusion probabilities $\pi_{ij}$, $i \neq j$, the probability of including the two units $i$ and $j$ in the same sample, are zero. To tackle this problem, one common practice in applied surveys is to regard LSS as a simple random sample. However, such practice typically provides highly biased estimators of the sampling variance unless the population is randomly ordered (Osborne, 1942).
Moreover, the properties of the estimators from systematic samples depend on the order of the units in the frame, and these estimators can be more/less efficient under some arrangements. For instance, the existence of a linear or parabolic trend could produce more precise estimators if systematic sampling is used. On the other hand, if the order of population units in the sampling frame has a periodic trend like a sine curve, and the sampling interval, k, is equal to the period of the curve or an integral multiple of the period, the efficiency of a systematic sample of size n will be very close to that of only one observation taken randomly from the population. Under such periodic populations, if k is an odd multiple of the half-period, the systematic sample mean will coincide with the true population mean (Sukhatme, 1984).
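The effect of periodicity described above can be checked numerically. The sketch below builds a hypothetical sine-shaped population (period 10, $N = 200$; both values are arbitrary choices for illustration) and compares the $k$ possible systematic sample means for two sampling intervals. The helper name `systematic_means` is an assumption of this example.

```python
import math
import statistics


def systematic_means(y, k):
    """Means of the k possible linear systematic samples with interval k."""
    return [statistics.fmean(y[r::k]) for r in range(k)]


period = 10
N = 200
# Hypothetical purely periodic population: y_i = sin(2*pi*i / period)
y = [math.sin(2 * math.pi * i / period) for i in range(1, N + 1)]
pop_mean = statistics.fmean(y)

# k equal to the period: every sample repeats a single value, so the
# variance of the sample mean equals the variance of one observation.
var_k_period = statistics.pvariance(systematic_means(y, period))
var_single_unit = statistics.pvariance(y)

# k equal to the half-period: values cancel in pairs within each sample.
means_half_period = systematic_means(y, period // 2)
```

With $k$ equal to the period the $n = 20$ sampled values carry no more information than one unit, while with $k$ equal to the half-period every sample mean reproduces the population mean.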
3. Variations of systematic sampling
Since the theory of systematic sampling became available in 1944, scholars have introduced modifications to the systematic technique to overcome its main drawbacks.
Some of these modifications have been reviewed by Iachan (1982). However, several developments and solutions introduced after 1982 still need to be reviewed. These developments are presented in the following subsections.
3.1. Fixing the sample size
Three modifications on the linear systematic sampling design have been proposed to tackle the problem of random sample size when having non-integer sampling intervals. These modifications are reviewed below.
3.1.1. Circular systematic sampling
Lahiri (1951) suggested a sampling design where the units of the population are considered to be arranged around a circle. In such a case, a random number $R$ is selected between 1 and $N$. Every $k$th unit is then chosen in a cyclic manner to be in the sample, where $k = [N/n]$, the integer part of $N/n$, is the sampling interval. This design is based on the convention that for any $i = 1, 2, \ldots, N$, the unit with index $i + N$ stands for the unit with index $i$, and hence this design is known as circular systematic sampling (CSS). Under this design the sample size is fixed and hence the sample mean is an unbiased estimator for the population mean.
Bellhouse (1984) suggested that k should be taken as the greatest integer in N n/ when
= ( 1)
N n k and the integer nearest to N n/ when N = (n1)k. Sengupta and Chattopadhyay (1987) argued that a necessary and sufficient condition for the CSS of size $n$, drawn from a population of $N$ units with sampling interval $k$, to contain all distinct units is that $N/(N,k) \geq n$ or equivalently, $[N,k]/k \geq n$, where $(N,k)$ and $[N,k]$ denote, respectively, the greatest common divisor (g.c.d.) and the least common multiple (l.c.m.) of $N$ and $k$. See also Sudakar (1978) and Murthy and Rao (1988). Under populations with perfect linear trend, Subramani and Singh (2014) used an empirical study to argue that the optimum choice for the sampling interval $k$, in the sense that it makes the circular systematic sample mean have minimum variance, is to take $k$ such that $kn \bmod N = 1$. Subramani et al. (2014) derived the variance of the sample mean under CSS from populations with perfect linear trend.
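As a minimal sketch of the CSS draw and of the distinctness condition $N/(N,k) \geq n$ quoted above, the following Python functions may help. The names `circular_systematic_sample` and `css_units_distinct` are hypothetical, and the default $k = [N/n]$ follows Lahiri's convention as described in the text.

```python
import math
import random


def circular_systematic_sample(N, n, k=None, rng=None):
    """Draw one CSS of size n: labels R, R+k, ..., R+(n-1)k taken circularly."""
    rng = rng or random.Random()
    if k is None:
        k = N // n                     # integer part of N/n
    R = rng.randint(1, N)              # random start anywhere on the circle
    # The unit with index i + N stands for the unit with index i.
    return [(R - 1 + j * k) % N + 1 for j in range(n)]


def css_units_distinct(N, n, k):
    """Check the condition N / gcd(N, k) >= n for all sampled units to be distinct."""
    return N // math.gcd(N, k) >= n


sample = circular_systematic_sample(N=14, n=4, rng=random.Random(0))
```

For example, with $N = 12$, $n = 5$ and $k = 6$ the condition fails ($12/6 = 2 < 5$), and the draw indeed cycles through only two distinct units.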
In this same direction, Särndal et al. (1992, p.77) report another method called the fractional interval method (FIM). This method randomly selects a number $\xi$ from the uniform $(0, k)$ distribution, where $k = N/n$, and uses it to draw the entire sample as follows: the $j$-th population unit is selected in the sample if $j - 1 < \xi + (l-1)k \leq j$ for some $l = 1, 2, \ldots, n$. Apparently the fractional interval method is a special case of probability proportional to size ($\pi ps$) systematic sampling, which will be discussed in detail in Section 5, where the size-measure variable $X$ takes the value 1 for all population units.
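The selection rule of the fractional interval method can be sketched as follows. This is an illustration only: the function name is hypothetical, the random number is drawn on the half-open interval $(0, k]$ as a practical stand-in for the uniform $(0, k)$ distribution, and the final `min` guards against floating-point rounding.

```python
import math
import random


def fractional_interval_sample(N, n, rng=None):
    """Select unit j whenever j - 1 < xi + (l - 1)k <= j for some l = 1..n."""
    rng = rng or random.Random()
    k = N / n                            # possibly fractional sampling interval
    xi = k * (1.0 - rng.random())        # random number in (0, k]
    labels = []
    for l in range(1, n + 1):
        point = xi + (l - 1) * k
        labels.append(min(math.ceil(point), N))  # the j with j - 1 < point <= j
    return labels


sample = fractional_interval_sample(N=10, n=4, rng=random.Random(2))
```

Because consecutive selection points are $k \geq 1$ apart, the $n$ selected labels are distinct, so the sample size is fixed even though $N/n = 2.5$ is not an integer.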
3.1.2. Remainder linear systematic sampling
Chang and Huang (2000) proposed another sampling procedure that can be used when $N = nk + r$; $0 \leq r < n$. This procedure is based on the fact that the population size can be written as $N = (n-r)k + r(k+1)$. This means that the population can be divided into two strata, the first consisting of the front $(n-r)k$ units and the second containing the remaining $r(k+1)$ units. A linear systematic sample of size $(n-r)$ units is selected from the first stratum with $k$ as the sampling interval, and another linear systematic sample of size $r$ units is selected from the second stratum with $(k+1)$ as the sampling interval. Combining the two samples together, we get a sample of size $n$ as desired. Remainder linear systematic sampling (RLSS) reduces to LSS if the remainder is zero, $r = 0$. Under the RLSS design, an unbiased estimator for the population mean can be obtained as a weighted sum of the two subsample means where the weights are the relative sizes of the two subpopulations (strata).
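A small sketch of the RLSS construction and of the weighted estimator described above is given below. The function names and the toy population are assumptions made for illustration; the enumeration at the end simply averages the estimator over all possible pairs of random starts.

```python
import random


def rlss_labels(N, n, R1, R2):
    """Labels of the two RLSS subsamples for given random starts R1 and R2."""
    k, r = divmod(N, n)                      # N = nk + r with 0 <= r < n
    first = [R1 + j * k for j in range(n - r)]
    offset = (n - r) * k                     # size of the first stratum
    second = [offset + R2 + j * (k + 1) for j in range(r)]
    return first, second


def rlss_mean(y, n, R1, R2):
    """Weighted sum of the two subsample means, weights = relative stratum sizes."""
    N = len(y)
    k, r = divmod(N, n)
    first, second = rlss_labels(N, n, R1, R2)
    estimate = (n - r) * k / N * sum(y[i - 1] for i in first) / len(first)
    if second:
        estimate += r * (k + 1) / N * sum(y[i - 1] for i in second) / len(second)
    return estimate


# Toy population with N = 14 and n = 4, so that k = 3 and r = 2.
y = [float(i * i) for i in range(1, 15)]
n = 4
k, r = divmod(len(y), n)
rng = random.Random(0)
one_draw = rlss_labels(len(y), n, rng.randint(1, k), rng.randint(1, k + 1))

# Average the estimator over every possible (R1, R2) pair.
all_estimates = [rlss_mean(y, n, R1, R2)
                 for R1 in range(1, k + 1) for R2 in range(1, k + 2)]
average_estimate = sum(all_estimates) / len(all_estimates)
population_mean = sum(y) / len(y)
```

In this toy case the average over all starts matches the population mean, which is consistent with the unbiasedness stated in the text.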
3.1.3. Generalized modified linear systematic sampling
This version of systematic sampling was proposed by Subramani and Gupta (2014) as a generalization of the modified linear systematic sampling design of Subramani (2013a; 2013b).
This generalized design, abbreviated as GMLSS, produces a fixed sample size even with a fractional sampling interval and provides good estimates of the population mean, especially when a linear trend is present among the population units. The idea of the GMLSS design is similar to that of the RLSS scheme in the sense that under GMLSS we divide the population into two subpopulations and draw two subsamples from these populations using two different sampling intervals; $N = n_1 k_1 + n_2 k_2$ where $n = n_1 + n_2$, and $k_1$ and $k_2$ are assumed to be positive integers. Subramani and Gupta (2014) showed that under populations with perfect linear trend, the RLSS design is a special case of the GMLSS design when $n_1 = (n - r)$, $n_2 = r$, $k_1 = k$, and $k_2 = k + 1$, where $r$ and $k$ are as defined in the RLSS design.
3.2. Variance estimation
Sampling statisticians have worked in two parallel directions toward estimating the sampling variance of systematic sampling, and many estimators have been suggested. One way is to find a model-based estimator for the sampling variance assuming a certain model for the sampled population. On the other hand, many authors suggested modifying the design itself in a way that enables the derivation of unbiased estimators for the sampling variance. Here is a review of the recent work in each direction.
3.2.1. Model-based variance estimators
This approach is based on assigning a model that best characterizes the nature of the values of the variable of interest when they are arranged in a certain order. For example, several model-based estimators are given in Cochran (1977, p.223). Each of these estimators is approximately unbiased under a specific model but can be highly biased under the other models. Wolter (1984; 2007) studied eight model-free estimators. Specific guidelines were then introduced based on the comparison of the mean square errors of the eight estimators under several model scenarios. None of these estimators proved to be best overall, and there is a clear interaction between the behaviour of the estimators and the underlying data model.
Two specific estimators have been singled out as good general-purpose estimators when little is known about the population. These two estimators, denoted by $v_2$ and $v_3$ in Wolter (1984; 2007), take the following forms:
$$v_2 = \frac{(1-f)}{n} \sum_{j=2}^{n} \frac{(y_j - y_{j-1})^2}{2(n-1)},$$
$$v_3 = \frac{(1-f)}{n} \sum_{j=1}^{n/2} \frac{(y_{2j} - y_{2j-1})^2}{n},$$
where $f = n/N$ is the sampling fraction.
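The two difference-based estimators can be computed directly from the sample values taken in their order of selection. The sketch below is illustrative only; the function names `v2` and `v3` mirror the notation above, and the sample values in the usage lines are made up.

```python
def v2(y, f):
    """Overlapping-differences estimator: uses all n - 1 successive differences."""
    n = len(y)
    ss = sum((y[j] - y[j - 1]) ** 2 for j in range(1, n))
    return (1 - f) * ss / (2 * n * (n - 1))


def v3(y, f):
    """Non-overlapping-differences estimator: uses the n/2 disjoint pairs."""
    n = len(y)
    ss = sum((y[2 * j + 1] - y[2 * j]) ** 2 for j in range(n // 2))
    return (1 - f) * ss / n ** 2


# Made-up systematic sample of size n = 4 with sampling fraction f = 0.25.
sample_values = [1.0, 3.0, 6.0, 10.0]
estimate_v2 = v2(sample_values, f=0.25)   # 0.75 * 29 / 24
estimate_v3 = v3(sample_values, f=0.25)   # 0.75 * 20 / 16
```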
Montanari and Bartolucci (1998) proposed a model-based estimator of the variance of the systematic sample mean. Their estimator was derived based on the idea that the sampling variance consists of two components. One takes into account the systematic component of the model while the other is due to the stochastic nature of this model. This approach leads to an estimator which is approximately unbiased with respect to any superpopulation model that follows a linear trend with homoscedastic and uncorrelated errors. This estimator was shown to outperform both the overlapping difference estimator, $v_2$, and the simple random sample estimator under several superpopulation models. Later, Montanari and Bartolucci (2006), following the same idea, derived two new estimators that are unbiased under the same linear trend model. The first is based on moving averages to estimate the systematic component of the model. The second utilizes local polynomials to estimate the systematic component of the model when it is not linear.
Wolter (2007, sec 8.2) proposed a general methodology for constructing model-based estimators for $V_{sys}(\bar{y})$. In this methodology, the model dependence is explicitly recognized.
The proposed general estimator of the variance is defined as a conditional expectation of $V_{sys}(\bar{y})$ given the data $y_i$ from the observed sample; $v_i = \mathcal{E}[V_{sys}(\bar{y}) \mid y_i]$ where $\mathcal{E}$ denotes the expectation over the assumed model. In this context, Wolter (2007) notes that the survey practitioner must make a professional judgment about the form of the model, as it is never known exactly, and then derive $v_i$ under the selected form. Hence, the variance estimator will be subject to errors of estimation as well as to errors of model specification. Therefore, the applicability of the model-based approach is viewed, in practice, as being hampered by lack of robustness.
This lack of robustness led Opsomer et al. (2012) to use a nonparametric model specification, which makes much less restrictive assumptions on the shape of the relationship between variables. Hence, it significantly reduces the risk of model misspecification. In their paper, a new estimator of the model-based expectation of the design variance was proposed under a nonparametric model. This model takes the following form:
$$y_i = m(x_i) + [v(x_i)]^{1/2} e_i, \quad i = 1, 2, \ldots, N,$$
where $m(x) = \mathcal{E}(Y \mid X = x)$ is a continuous function, $v(x)$ is a continuous bounded variance function, $x$ is a univariate auxiliary variable and the errors $e_i$ are independent random variables with mean 0 and variance 1.
3.2.2. Modifying the sampling design
An important variant of the systematic sampling design is found in Gautschi (1957). The idea of this method is to use multiple random starts when drawing a systematic sample to ensure the availability of an unbiased estimator for the sampling variance of systematic sample estimators. According to this approach, in order to select a systematic sample of size $n$, one chooses $t$ independent systematic subsamples. For $N = nk$ and $n/t$ an integer, we first choose $t$ random numbers from the front $tk$ units, say $\{R_1, \ldots, R_t\}$. Then for each chosen random start, a systematic subsample is selected by taking every $tk$-th unit thereafter.
Finally, the sample will contain units with the labels
$$\{R_1, R_1 + tk, \ldots, R_1 + [(n/t) - 1]tk, \ldots, R_t, R_t + tk, \ldots, R_t + [(n/t) - 1]tk\}.$$
This design is called multi-start systematic sampling (MSSS). An unbiased estimator for the sampling variance can be obtained as
$$\hat{V}(\bar{y}_{msss}) = \frac{(1-f)}{t(t-1)} \sum_{i=1}^{t} (\bar{y}_i - \bar{y})^2 = \frac{(k-1)}{t(t-1)k} \sum_{i=1}^{t} (\bar{y}_i - \bar{y})^2,$$
where $f = n/N$, $\bar{y} = t^{-1} \sum_{i=1}^{t} \bar{y}_i$ and $\bar{y}_i$ is the mean of the $i$-th subsample. The importance of this approach has increased recently. Many authors have incorporated it into different systematic sampling methods in order to derive unbiased estimators for the sampling variance of their proposed estimators of the population mean (see more details in Section 3.4).
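The multi-start construction and its variance estimator can be illustrated with a short enumeration. The sketch below assumes that the $t$ random starts are distinct (drawn without replacement from $1, \ldots, tk$), which is what the factor $(1-f)$ in the estimator suggests; the function names and the toy population are hypothetical.

```python
import itertools
import statistics


def msss_subsamples(N, n, t, starts):
    """Labels of the t systematic subsamples, each taking every tk-th unit."""
    k = N // n
    return [[R + j * t * k for j in range(n // t)] for R in starts]


def msss_mean_and_variance(y, subsamples):
    """Overall sample mean and the variance estimator based on subsample means."""
    N = len(y)
    t = len(subsamples)
    n = sum(len(s) for s in subsamples)
    f = n / N
    sub_means = [statistics.fmean([y[i - 1] for i in s]) for s in subsamples]
    ybar = statistics.fmean(sub_means)
    v_hat = (1 - f) / (t * (t - 1)) * sum((m - ybar) ** 2 for m in sub_means)
    return ybar, v_hat


# Toy population: N = 12, n = 4, t = 2, hence k = 3 and tk = 6.
y = [float((7 * i) % 13 + i) for i in range(1, 13)]
N, n, t = 12, 4, 2
k = N // n

# Enumerate every set of t distinct starts among 1..tk.
results = [msss_mean_and_variance(y, msss_subsamples(N, n, t, starts))
           for starts in itertools.combinations(range(1, t * k + 1), t)]
all_means = [m for m, _ in results]
true_variance = statistics.pvariance(all_means)
average_estimate = statistics.fmean([v for _, v in results])
```

Averaged over all possible sets of starts, the estimator reproduces the exact design variance of the sample mean in this toy example.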
Zinger (1980) introduced a partially systematic sampling (PSS) design under which a systematic sample of size $n'$ is selected using sampling interval $k'$, where $1 < n' < n$ and $k'$ are such that $N = n'k'$. Then a simple random sample of size $n - n'$ is taken from the remaining $N - n'$ units. Zinger (1980) estimated the population mean using a weighted sum of the two sample means and suggested unbiased estimators for the sampling variance of this estimate. He was unable to show that the suggested variance estimators are non-negative except in the case of equally weighting the two sample means. Wu (1984) suggested using systematic sampling in selecting the second subsample in PSS, making the whole sample composed of two systematic subsamples and hence giving a generalization of the two-start systematic sampling design. Wu proposed both unbiased and biased estimators for the sampling variance under the new design and investigated the non-negativity of these estimators.
Singh and Singh (1977) proposed the new systematic sampling (NSS) design where choosing a sample of size $n$ involves two steps. First, a sample of $u$ consecutive units is selected by choosing a random number $R$ between 1 and $N$. A circular systematic sample of size $(n-u)$ units is then selected using sampling interval $k = N/n$. A modification to this technique was introduced by Leu and Tsui (1996), who suggested the new partially systematic sampling (NPSS) design. The NPSS design modifies the NSS by choosing a random sample of size $a$ from the $u$ consecutive units, $\{R, R+1, \ldots, R+u-1\}$, where $u = N - (n-a)k$ and $a = 2$ if $N = nk$; otherwise, $k = [N/(n-1)]$ and $2 \leq a \leq [N/2] + 1$. A systematic sample of size $(n-a)$ is then chosen, in a circular manner, which includes the units with labels $\{R + u - 1 + jk;\ j = 1, 2, \ldots, (n-a)\}$. Under this design, all the second order inclusion probabilities are non-zero provided that $a \geq 2$ and $u \geq k$. Hence, an unbiased estimator for the sampling variance can be obtained by utilizing the well known general purpose estimator due to Yates and Grundy (1953) and Sen (1953).
Huang (2004) proposed a mixed random systematic sampling (MRSS) design from which the sampling variance can be estimated unbiasedly. Considering the population as if arranged in a circular manner, the proposed design involves two steps. In the first step an index $R$ is selected randomly between 1 and $N$. Then the population is divided into two subpopulations; the first consists of $(n-r)k$ units with indices $\{R, R+1, \ldots, R+(n-r)k-1\}$ and the second subpopulation contains $r(k+1)$ units with the remaining indices. In the second step, a simple random sample of size $(n-r)$ is drawn from the first subpopulation and a sample of $r$ units, with indices $\{R + (n-r)k - 1 + j(k+1);\ j = 1, 2, \ldots, r\}$, is selected systematically from the second one. The final sample is then the union of the two samples. It is clear that if $N = nk$, MRSS will be equivalent to SRS. Also, if $(n-r) = a$ and $(n-r)k = u$, MRSS and NPSS are equivalent, where $a$ and $u$ are the NPSS design parameters. Under the proposed procedure, the inclusion probabilities were derived and the Horvitz-Thompson (HT) (Horvitz and Thompson 1952) estimator was used to estimate the population mean. The second order inclusion probabilities are shown to be non-zero for all pairs of units, and are used to define an unbiased estimator for the variance of the HT estimator.
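The two MRSS steps can be sketched as below, assuming $N = nk + r$ with $0 \leq r < n$ as in the RLSS design and indices taken circularly. The function name `mrss_sample` is hypothetical, and the sketch covers the selection only, not the inclusion probabilities or the HT estimator.

```python
import random


def mrss_sample(N, n, rng=None):
    """Draw one MRSS sample; returns (random part, systematic part) as labels."""
    rng = rng or random.Random()
    k, r = divmod(N, n)                       # assumes N = nk + r, 0 <= r < n
    R = rng.randint(1, N)                     # random index on the circle

    def wrap(i):
        return (i - 1) % N + 1                # index i + N stands for index i

    # First subpopulation: (n - r)k consecutive units starting at R.
    first_block = [wrap(R + i) for i in range((n - r) * k)]
    random_part = rng.sample(first_block, n - r)

    # Second subpopulation: every (k + 1)-th unit after the first block.
    base = R + (n - r) * k - 1
    systematic_part = [wrap(base + j * (k + 1)) for j in range(1, r + 1)]
    return random_part, systematic_part


random_part, systematic_part = mrss_sample(N=14, n=4, rng=random.Random(5))
```

When $r = 0$ the systematic part is empty and the first block is the whole population, so the draw is a simple random sample, in line with the remark above.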
Sampath and Uthayakumaran (1998) introduced a new systematic sampling scheme with Markovian behavior which yields positive inclusion probabilities for all pairs of units.
This sampling scheme overcomes the difficulties in Markov sampling proposed by Chandra et al. (1991). The Markov systematic sampling procedure assumes that the sample size is even and the population size is a multiple of the sample size (i.e. $k = N/n$ is an integer). The population is divided into $n/2$ groups in a systematic manner, say $S_1, \ldots, S_m$ where $m = n/2$ and the $i$th group $S_i$ includes those units with indices $\{2(i-1)k + j;\ j = 1, 2, \ldots, 2k\}$. For each group $S_i$, define a transition probability matrix (TPM) $A_i$ of a Markov chain with state space $\{2(i-1)k + j;\ j = 1, 2, \ldots, 2k\}$. To guarantee that all the sampled units will be distinct, the diagonal elements in each $A_i$ should be zero, $i = 1, 2, \ldots, m$. Selecting a sample of size $n$ using this procedure involves two steps. First, a random number $R$ is drawn from 1 to $2k$. A systematic sample of size $m = n/2$ is then selected using $2k$ as the sampling interval. The units with indices $\{R, R + 2k, \ldots, R + 2(m-1)k\}$ will be the selected units. A single unit is then drawn from each group independently using the elements of $A_i$ as conditional probabilities. They assumed that the non-diagonal elements $a_{pq}$ of the TPM $A_i$ are selected so that for each $p$;
$$a_{pq} \propto |p - q|^{\alpha}, \quad q = 1, 2, \ldots, 2k,$$
where $\alpha$ is a predetermined positive number which can be chosen either to be the same for all TPMs or to be different for the different TPMs. If the same $\alpha$ is used, we will have a common TPM, say $A$. The sampling variance has been estimated with the help of the Horvitz-Thompson estimator since all the pairs of units have non-zero inclusion probability.
3.3. Systematic sampling from populations with trend component
There are several ways in the literature to tackle the problem that may arise from the presence of a linear or parabolic trend in the population. One way is to use Yates type end-corrections (Yates, 1948) which give different weights for the two most extreme sample units (the first and last sampled units $y_R$ and $y_{R+(n-1)k}$ with $R$ being the random start in the systematic sample). If the population consists solely of a linear trend and $N = nk$, the systematic sample mean with end-corrections
$$\bar{y}^{*}_{LSS} = \bar{y} + \frac{2R - k - 1}{2(n-1)k} \left( y_R - y_{R+(n-1)k} \right)$$
coincides with the true population mean. On the other hand, one may modify the method of selection so that the sample mean is not affected by the presence of a certain trend. Most such modified methods are reviewed in Iachan (1982), including the balanced systematic sampling (BSS) of Sethi (1965), the modified systematic sampling (MSS) of Singh et al. (1968), and the centered systematic sampling due to Madow (1953). Sampath et al. (2009) considered estimating the population total under LSS, BSS, and MSS schemes for parabolic populations.
Subramani et al. (2014) proposed using Yates end-corrections to improve the efficiency of the sample mean under CSS from populations with linear trend.
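The end-corrected mean for LSS with $N = nk$ can be checked on a purely linear population. The following sketch is an illustration with a made-up population $y_i = 3 + 2i$; the function name `end_corrected_mean` is hypothetical.

```python
def end_corrected_mean(y, n, R):
    """LSS sample mean with Yates-type end-corrections, assuming N = nk."""
    N = len(y)
    k = N // n
    sample = [y[R - 1 + j * k] for j in range(n)]   # units R, R+k, ..., R+(n-1)k
    ybar = sum(sample) / n
    correction = (2 * R - k - 1) / (2 * (n - 1) * k) * (sample[0] - sample[-1])
    return ybar + correction


# Made-up population consisting solely of a linear trend: y_i = 3 + 2i, N = 20.
y = [3.0 + 2.0 * i for i in range(1, 21)]
population_mean = sum(y) / len(y)
corrected = [end_corrected_mean(y, n=5, R=R) for R in range(1, 5)]
uncorrected = [sum(y[R - 1::4]) / 5 for R in range(1, 5)]
```

For every random start the corrected mean equals the population mean, whereas the uncorrected sample means vary with $R$.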
Additionally, Subramani (2000) introduced a new systematic sampling scheme that works well for populations that exhibit a linear trend. The new design is called diagonal systematic sampling (DSS). Given a population of size $N = nk$, where $n \leq k$, the population units are first arranged in an $n \times k$ matrix, say, $M$. Then, using a random start $1 \leq R \leq k$, $n$ units are drawn from the matrix $M$ systematically such that the selected $n$ units are the diagonal elements or broken diagonal elements of the matrix $M$. This method has been generalized by Subramani (2009) where the assumption $n \leq k$ is relaxed to make the DSS scheme applicable for any sample size. Furthermore, Subramani and Tracy (1999) proposed the determinant sampling scheme and showed that it outperforms both simple random sampling and LSS when a linear trend is present in the population (see the reference for the details of the sampling scheme itself).
3.4. Simultaneous solutions
Many scholars tried to handle multiple problems associated with the LSS design using a combination of the ideas we presented in the previous sections.
Uthayakumaran (1998) extended both balanced systematic sampling and centered systematic sampling by introducing balanced circular systematic sampling (BCSS) and centered circular systematic sampling (CCSS). These suggested methods can be used for non-integer $k$'s where the population exhibits linear trend. The following is a brief description of BCSS and CCSS. Let the sampling interval $k = [N/n]$, then a BCSS of $n$ units, where $n$ is even, with random start $1 \leq R \leq N$, consists of units with labels
$$\begin{cases}
R + 2(l-1)k & \text{if } 1 \leq R + 2(l-1)k \leq N \\
R + 2(l-1)k - N & \text{if } N < R + 2(l-1)k \\
2lk - R + 1 + N & \text{if } 2lk - R + 1 < 1 \\
2lk - R + 1 & \text{if } 1 \leq 2lk - R + 1 \leq N
\end{cases}$$
where $l = 1, 2, \ldots, n/2$. Similarly, using a fixed random start, $R = N/2$ if $N$ is even and $R = (N+1)/2$ if $N$ is odd, and $k = [N/n]$, a CCSS of size $n$ includes units with labels
$$\begin{cases}
R + (l-1)k & \text{if } 1 \leq R + (l-1)k \leq N \\
R + (l-1)k - N & \text{if } N < R + (l-1)k,
\end{cases}$$
where $l = 1, 2, \ldots, n$. Additionally, Leu and Kao (2006) modified the new methods of Uthayakumaran (1998) by introducing modified balanced circular systematic sampling (MBCSS) and modified centered circular systematic sampling (MCCSS). These two designs, MBCSS and MCCSS, enable estimators to coincide with the population means in the presence of linear and parabolic trends. These two methods can be described briefly as follows. For $n$ even and random start $1 \leq R \leq N$, a MBCSS will include units with labels
$$\begin{cases}
R + 2(l-1)k & \text{if } 1 \leq R + 2(l-1)k \leq N \\
R + 2(l-1)k - N & \text{if } N < R + 2(l-1)k \\
N - R - 2(l-1)k + 1 & \text{if } 1 \leq R + 2(l-1)k \leq N \\
2N - R - 2(l-1)k + 1 & \text{if } N < R + 2(l-1)k,
\end{cases}$$
where $l = 1, 2, \ldots, n/2$. Sampath and Ammani (2010) applied the multi-start approach to the BSS due to Sethi (1965) and the MSS of Singh et al. (1968) to provide an unbiased estimator for the sampling variances under these designs. The resulting designs can be described as follows:
I. Balanced Systematic Sampling with Multiple Random Starts (BSSM):
To select a sample of size $n$ using the BSSM design with $t$ random starts, one can proceed as follows. First, the population is divided into $n/(2t)$ groups each of $2tk$ units. Then $t$ random numbers are selected from 1 to $tk$. Corresponding to every random number chosen, pairs of units equidistant from the group ends are selected in the sample. Clearly, each random number contributes $n/t$ units to the sample. Therefore, the sample will contain $n$ units. Under this design, the sample mean was proved to be an unbiased estimator for the population mean. It was also proved to coincide with the population mean in the presence of linear trend.
II. Modified Systematic Sampling with Multiple Random Starts (MSSM):
When the sample is to be selected using the MSSM design, then instead of choosing one random start, $t$ random starts are chosen between 1 and $tk$. Corresponding to every random start selected, pairs of units equidistant from the population ends are selected in the sample in a systematic manner. The sample corresponding to the random start $R_i$; $i = 1, 2, \ldots, t$, will consist of the $n/t$ units with labels
$$[R_i + jtk,\ N - R_i - jtk + 1]; \quad j = 0, 1, \ldots, (n/(2t)) - 1.$$
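The MSSM labels given above can be generated with a few lines of Python. This sketch assumes $N = nk$ with $n/(2t)$ an integer and takes the random starts as an argument; the function name `mssm_labels` is hypothetical.

```python
def mssm_labels(N, n, t, starts):
    """Labels R_i + j*t*k and N - R_i - j*t*k + 1 for each start R_i."""
    k = N // n
    labels = []
    for R in starts:
        for j in range(n // (2 * t)):
            labels.append(R + j * t * k)            # unit counted from the front
            labels.append(N - R - j * t * k + 1)    # its mirror from the back
    return labels


# N = 24, n = 8, t = 2, so k = 3 and the starts are chosen from 1..6.
labels = mssm_labels(N=24, n=8, t=2, starts=[2, 5])
mean_label = sum(labels) / len(labels)
```

Each selected pair of labels sums to $N + 1$, so for a population that is exactly linear in the label the sample mean equals the population mean whatever the starts are.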
The DSS of Subramani (2000) has been modified by Sampath and Varalakshmi (2008) to be used for $N \neq nk$. The new design, called diagonal circular systematic sampling (DCSS), is obtained by incorporating the idea of CSS into the DSS design. Also, Sampath and Varalakshmi (2011) introduced a DSS design with two random starts to be able to estimate the sampling variance under the DSS design. More generally, Subramani and Singh (2014) studied DSS with multiple random starts (DSSM) in an attempt to provide an unbiased estimate for the sampling variance.
Kao et al. (2011) proposed the remainder Markov systematic sampling design that extends the RLSS and Markov systematic sampling in an attempt to solve the two main statistical problems of the LSS simultaneously. According to their design, selecting a sample of size n involves the following two steps.
i. Divide the population into two strata; the first stratum contains the front $(n-r)k$ units and the second stratum contains the remaining $r(k+1)$ units.
ii. Apply the Markov systematic sampling method to each stratum.
An extension of the RLSS design of Chang and Huang (2000) has been proposed by Mostafa and Ahmad (2016), who introduced a design they called remainder linear systematic sampling with multiple random starts (RLSSM). The idea behind the RLSSM design is to choose a multi-start systematic sample from each of the two subpopulations in the RLSS scheme. This new design is shown to provide unbiased estimators for both the population mean and the sampling variance. Moreover, they showed that the RLSSM can be seen as a generalized systematic sampling design from which many other designs can be obtained as special cases, hence unifying several designs in one. They also investigated the stability of the proposed variance estimator and used it to construct approximate confidence intervals for the finite population mean.
4. Performance comparisons
From the preceding sections, it is evident that many alternative systematic sampling schemes are available. Each of these designs outperforms the others in certain situations and for specific populations. Thus, it may be useful to give practitioners directions about the recommended designs in different sampling situations. This can be done through the following performance comparisons. Since the efficiency of the systematic sampling techniques depends on the characteristics of the sampled population, following Cochran (1946), the comparisons below are based on comparing the expected variances, where the expectation is taken with respect to the assumed super-population model. Four super-population models are commonly used when comparing systematic sampling schemes. Models (1), (2), and (4) can be found in Cochran (1977, pp. 212-219), and Model (3) is in Bellhouse and Rao (1975). These models are defined in the following for convenience.
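Model 1. Populations in random order
$$\mathcal{E}(y_i) = \mu, \quad \mathcal{E}(y_i - \mu)^2 = \sigma_i^2, \quad i = 1, 2, \ldots, N, \qquad \mathcal{E}(y_i - \mu)(y_j - \mu) = 0, \quad i \ne j. \tag{1}$$
Model 2. Populations with linear trend
$$y_i = \alpha + \beta i + e_i, \tag{2}$$
where $\mathcal{E}(e_i) = 0$, $\mathcal{E}(e_i^2) = \sigma^2 i^g$, $\mathcal{E}(e_i e_j) = 0$ for $i \ne j$, and $g$ is a predetermined constant.
Model 3. Populations with parabolic trend
$$y_i = \alpha + \beta i + \gamma i^2 + e_i, \tag{3}$$
where $\mathcal{E}(e_i) = 0$, $\mathcal{E}(e_i^2) = \sigma^2 i^g$, $\mathcal{E}(e_i e_j) = 0$ for $i \ne j$, and $g$ is a predetermined constant.
Model 4. Auto-correlated populations
$$\mathcal{E}(y_i) = \mu, \quad \mathcal{E}(y_i - \mu)^2 = \sigma^2, \quad \mathcal{E}[(y_i - \mu)(y_j - \mu)] = \rho_d \sigma^2, \quad i \ne j, \tag{4}$$
where $d = |j - i|$ and the correlogram $\rho_d$ can take one of three forms:
1. Linear correlogram: $\rho_d = 1 - d/L$, $L \ge N - 1$.
2. Exponential correlogram: $\rho_d = e^{-\lambda d}$.
3. Hyperbolic correlogram: $\rho_d = \tanh(d^{-3/5})$.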
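For numerical work with these models it can help to have the correlograms and a trend population available as small functions. The following is a minimal sketch, assuming the standard forms in Cochran (1977): $1 - d/L$, $e^{-\lambda d}$ and $\tanh(d^{-3/5})$. The function names, the parameter name `lam`, and the use of normal errors in the Model 2 generator are assumptions made for illustration only.

```python
import math
import random


def rho_linear(d, L):
    """Linear correlogram; L should be at least N - 1 so that rho >= 0."""
    return 1.0 - d / L


def rho_exponential(d, lam):
    """Exponential correlogram with decay rate lam > 0."""
    return math.exp(-lam * d)


def rho_hyperbolic(d):
    """Hyperbolic correlogram; defined for lags d >= 1."""
    return math.tanh(d ** (-3.0 / 5.0))


def model4_correlation_matrix(N, rho):
    """N x N correlation matrix implied by a correlogram rho(d), d = |i - j|."""
    return [[1.0 if i == j else rho(abs(i - j)) for j in range(N)]
            for i in range(N)]


def linear_trend_population(N, alpha, beta, sigma=0.0, g=0.0, rng=random):
    """One realization of Model 2 with Var(e_i) = sigma^2 * i^g.

    Normal errors are an assumption; the model itself only fixes moments.
    """
    return [alpha + beta * i + rng.gauss(0.0, sigma * i ** (g / 2.0))
            for i in range(1, N + 1)]
```

Setting `sigma=0` gives the perfect linear trend used in several of the comparisons below.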
Under these populations, the performance comparisons are carried out among three different groups of designs: designs that fix the sample size only, designs that provide unbiased estimates for the sampling variance, and multi-purpose designs.
4.1. Procedures fixing the sample size
Chang and Huang (2000), Subramani and Gupta (2014) and Mostafa and Ahmad (2016) assessed the performance of four designs (CSS, GMLSS, RLSS and SRS) under various types of populations and presented the following conclusions:
a. For populations in random order, the RLSS is more efficient than both CSS and SRS if and only if the sample proportion of the second stratum is larger than the second stratum variance proportion.
b. For populations exhibiting perfect linear trend, $y_i = \alpha + \beta i$, $i = 1, 2, \ldots, N$, both GMLSS and RLSS outperform both CSS and SRS in all cases. The two designs GMLSS and RLSS are competitive in most cases, with the GMLSS design being superior in some cases.
c. The RLSS is more efficient than SRS for the three types of autocorrelated populations, namely, populations with linear, exponential and hyperbolic correlogram. Also, the RLSS outperforms the CSS for populations with linear correlogram and they are equally efficient for the other types.
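The effect of a linear trend in item b can be illustrated with plain LSS as a stand-in for the fixed-size designs, since its design variance is easy to enumerate when $N = nk$. This is only a sketch of the kind of comparison involved, not the computations of the cited studies; the function names are hypothetical.

```python
def lss_variance(y, n):
    """Exact design variance of the LSS sample mean, assuming N = n * k."""
    N = len(y)
    if N % n != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    k = N // n
    # One sample mean per possible random start 0..k-1.
    means = [sum(y[r::k]) / n for r in range(k)]
    center = sum(means) / k      # equals the population mean when N = n * k
    return sum((m - center) ** 2 for m in means) / k


def srs_variance(y, n):
    """Design variance of the sample mean under SRS without replacement."""
    N = len(y)
    mu = sum(y) / N
    S2 = sum((v - mu) ** 2 for v in y) / (N - 1)
    return (1 - n / N) * S2 / n


y = [float(i) for i in range(1, 41)]   # perfect linear trend, N = 40
v_lss = lss_variance(y, 8)             # k = 5
v_srs = srs_variance(y, 8)
```

For $y_i = i$ these agree with the classical closed forms $(k^2 - 1)/12$ for LSS and $(k - 1)(N + 1)/12$ for SRS, so the systematic sample is far more precise here even before any trend correction is applied.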
4.2. Procedures providing unbiased estimators for the sampling variance
Gautschi (1957) compared the MSSS with the LSS under different types of populations and concluded that the MSSS is more efficient in many cases except for populations with exponential correlogram. Hence, the researcher is better off choosing MSSS as it provides an unbiased estimator for the sampling variance. On the other hand, if the underlying population has an exponential correlogram, it may be worthwhile to use systematic sampling with one random start and try to find at least a consistent estimator for the sampling variance. The consistency mentioned here can be taken to be in the sense of Isaki and Fuller (1982), for example.
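The reason a multi-start design yields an unbiased variance estimator can be made concrete. The sketch below assumes the common formulation in which $N = nk$, the interval is enlarged to $K = kt$, and $t$ starts are drawn without replacement from the $K$ possible ones, so the $t$ systematic subsamples form a simple random sample of equal-sized clusters. The function names are illustrative and this is not claimed to be the exact scheme analysed by Gautschi (1957).

```python
import random
from itertools import combinations


def msss_starts(k, t, rng=random):
    """Draw t distinct random starts from the K = k * t possible ones."""
    return sorted(rng.sample(range(k * t), t))


def msss_estimates(y, K, starts):
    """Sample mean and unbiased variance estimate from t >= 2 starts.

    Assumes len(y) is a multiple of K, so all subsamples have equal size.
    """
    t = len(starts)
    # Mean of each systematic subsample (every K-th unit from its start).
    means = [sum(y[s::K]) / len(y[s::K]) for s in starts]
    ybar = sum(means) / t
    # Between-subsample variance with the finite population correction.
    s2_between = sum((m - ybar) ** 2 for m in means) / (t - 1)
    v_hat = (1 - t / K) * s2_between / t
    return ybar, v_hat
```

Averaging `v_hat` over all possible sets of starts, for example with `combinations(range(K), t)`, reproduces the true design variance of `ybar`, which is what a single-start sample cannot offer.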
Combining the comparative studies given by Leu and Tsui (1996) and Huang (2004), we have the following results about the performance of the MSSS, NPSS, and MRSS designs.
a. For populations with perfect linear trend and $N = nk$, the NPSS is more efficient than the MSSS if (i) $t = 2$ and $m \ge 5$ or (ii) $t \ge 3$ and $m \ge 2$, where $t$ is the number of random starts and $m = k/t$.
b. For populations with exponential correlogram, the NPSS outperforms the MSSS in all cases.
c. For populations with linear correlogram, the NPSS outperforms the MSSS except when $t = 2$ and $n = 4$ or $6$.
d. For populations with hyperbolic correlogram, the NPSS outperforms the MSSS except when $n = 4$.
e. For populations with perfect linear trend, the MRSS is more efficient than the NPSS if $(n - r) < a$.
f. For different types of auto-correlated populations, the MRSS is more efficient than the NPSS in some cases, and in general the MRSS is more efficient for small $(n - r)$, i.e. when the part of the sample that is chosen randomly is small.
4.3. Procedures with different purposes
Mostafa and Ahmad (2016) provided a comprehensive numerical performance study for six sampling designs, namely, RLSSM, RLSS, CSS, NPSS, MRSS, and SRS. The following conclusions are drawn from their study:
a. For randomly ordered populations and fixed sample size $n$, the CSS, NPSS, MRSS, and SRS are equally efficient due to the fact that these designs have identical first order inclusion probabilities $n/N$. Also, RLSS and RLSSM have the same efficiency, but the latter still has the merit of handling problems I and II simultaneously.
b. For populations with perfect linear trend, RLSS and RLSSM outperform the other four designs whatever the number of random starts. As a trade-off between efficiency and handling the two main problems simultaneously, the RLSS scheme has somewhat higher efficiency than the RLSSM. Additionally, it is found that CSS outperforms MRSS in most cases. The last result can be found also in Huang (2004). Compared to NPSS, the CSS is more efficient in most cases. Also, CSS, MRSS, and NPSS outperform the SRS in all cases.
c. For the auto-correlated populations (linear, exponential and hyperbolic correlogram), the same trade-off between efficiency and handling the two main problems simultaneously appears and makes the RLSS outperform the RLSSM. Comparing CSS with NPSS and MRSS, the CSS has higher performance than both NPSS and MRSS for almost all cases under the three types of correlograms. The last result may arise from the fact that both NPSS and MRSS offer unbiased estimators for the sampling variance and can also be used for non-integer sampling intervals, while CSS does not provide any unbiased estimator for the variance. Moreover, the RLSSM is superior to both NPSS and MRSS for populations with linear correlogram. For populations with hyperbolic correlogram, the last result holds only in the case of two random starts from each subpopulation in RLSSM. On the other hand, both NPSS and MRSS are superior to RLSSM in most cases under the populations with exponential correlogram. These results remain the same when comparing RLSSM with CSS.
The comparison among another group of systematic designs has been carried out by Bellhouse and Rao (1975) for four different super-population models. This group mainly includes the trend free systematic methods, namely, BSS, MSS, centered systematic sampling, and LSS with end corrections. The following results were obtained:
a. For populations with linear trend, given by Model (2), all these designs are approximately equally efficient and superior to the usual LSS.
b. The MSS is preferable to the other schemes under populations with both linear trend and periodic variation, given in Madow and Madow (1944).
c. For populations with parabolic trend, given by Model (3), centered systematic sampling is found to be the most efficient, if k is odd, followed by the LSS with end corrections.
d. Also, they considered the performance of these methods under a super-population model for the autocorrelated populations, as given by Model (4), where $\rho_d$ is a monotonically decreasing function of $d$ and concave upwards. In the presence of a trend and under the previous assumptions, the centered systematic sampling has the best performance while the BSS is the least efficient method among the four methods.
On the other hand, Sampath and Varalakshmi (2008) showed the superiority of the DSS over the usual LSS under the super-population model defined in Model (2) above.
Although these methods are trend free methods, they still suffer from the same two major problems of the LSS, stated in I and II. Thus, it seems important to introduce the relative performance of the extensions of these methods which were given by Uthayakumaran (1998), Sampath and Varalakshmi (2008), Sampath and Ammani (2010), and Sampath and Varalakshmi (2011).
Uthayakumaran (1998) compared three systematic designs, namely, CSS, CCSS, and BCSS under super-population models with linear and parabolic trends (Models (2) & (3)).
The CCSS strategy is found to dominate both BCSS and CSS under the two models. On the other hand, the BCSS scheme performs as well as the CSS scheme. In the same context, Sampath and Ammani (2010) assessed the performance of both BSSM and MSSM relative to each other and relative to the MSSS under the linear trend super-population model. It has been concluded that for all choices of $g$ and $n$, BSSM and MSSM are equally efficient as long as only two random starts are being used. Moreover, as expected, BSSM and MSSM are superior to MSSS for all choices of $g$ and $n$.
Additionally, the DCSS, as an extension of the DSS, has been compared with CSS by Sampath and Varalakshmi (2008). For populations with linear trend, the usual CSS design dominates the DCSS in all cases. The same result is valid for autocorrelated populations with linear or hyperbolic correlogram. However, the performance of DCSS is higher than that of CSS for populations with symmetric and skewed distributions such as Normally and Exponentially distributed populations. Table 1 summarizes the main results presented in Section 4 to help in choosing the best design.
5. Systematic πps sampling
When the units in the systematic sample are selected with probabilities proportional to an auxiliary size variable denoted by $x_i$ for $i \in U$, the sampling design is called systematic πps sampling. This sampling scheme, first introduced by Madow (1949), is widely used in many sampling surveys. Define $T_x = \sum_{i \in U} x_i$ and let $k = T_x/n$ be an integer. Under systematic πps sampling, one unit is randomly drawn from the first $k$ units in the sampling frame, then the every $k$th rule is applied to the cumulated total of the size variable. Consequently, the resulting first order inclusion probabilities are proportional to the size variable; $\pi_i = n x_i / T_x$ for all $i \in U$. Systematic πps sampling is effective when the size variable is strongly correlated with the study variable $Y$. This version of the systematic design is practical and easy to implement. Also, the Horvitz-Thompson estimator can be easily used to provide an unbiased estimator for the finite population total/mean. However, like LSS, systematic πps sampling does not offer any unbiased estimator for the sampling variance. Several attempts have been made towards estimating the sampling variance under this design. Most of this work has been reviewed in Iachan (1982) and Bellhouse (1988).

Another version of systematic πps sampling is known as circular systematic πps sampling. For the details of this design one may refer to Murthy (1967). This design can be briefly described as follows. Assume that the values of an auxiliary variable $X$ are available for all units in the population, $x_i$ ($i = 1, 2, \ldots, N$), with total $T_x$. To select a sample of size $n$ from this population, the integer nearest to $T_x/n$ is taken as the sampling interval $k$. Using a random integer $R$ selected between 1 and $T_x/n$, we form the group of $n$ integers $a_l = (R + lk) \bmod T_x$, $l = 0, 1, \ldots, n - 1$. The sampled units $y_i$ ($i = 1, 2, \ldots, n$) are those units such that $C_{i-1} < a_l \le C_i$, $l = 0, 1, \ldots, n - 1$, and the unit $y_N$ if $a_l = 0$, where $C_i = \sum_{j=1}^{i} x_j$, $i = 1, 2, \ldots, N$, and $C_0 = 0$ are the cumulated sizes. Under circular πps systematic sampling, if the number of distinct sampled units is $n$, which is often not the case in practice, $\pi_i = n p_i$ with $p_i = x_i / T_x$, provided that $n p_i < 1$ for all $i = 1, \ldots, N$. Sometimes $n p_i > 1$, which makes obtaining the $\pi_i$'s a difficult task. Similar to systematic πps sampling, the circular version does not offer any unbiased estimator for the sampling variance as many of the second order inclusion probabilities are zero under this design. To tackle this issue, one may, following Ray and Das (1997), choose $k$ at random as an integer between 1 and $T_x - 1$. Now, it can be verified easily that $\pi_{ij} > 0$ for all $i \ne j$. Also, the HT estimator can be calculated for this technique. Chaudhuri (2000) developed the formula for the unbiased estimator of the variance of the HT estimator under this design to be
$$\hat{V}(\hat{t}_{HT}) = \sum_{i < j \in s} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2 + \sum_{i \in s} \frac{y_i^2}{\pi_i^2} \beta_i,$$
where $\beta_i = 1 + \frac{1}{\pi_i} \sum_{j \ne i} \pi_{ij} - \sum_{j \in U} \pi_j$, $i = 1, 2, \ldots, N$. To avoid instability in the terms $\pi_{ij}^{-1}(\pi_i \pi_j - \pi_{ij})$, one may use the recommendations of Särndal (1996).
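The linear (non-circular) systematic πps selection described at the start of this section can be sketched as follows. The sketch assumes positive integer sizes, an integer interval $k = T_x/n$, and a random integer start $R$ between 1 and $k$; unit $i$ is selected when a selection point falls in $(C_{i-1}, C_i]$. The function names are hypothetical, and the Horvitz-Thompson weights use $\pi_i = n x_i / T_x$, which requires every $x_i \le k$.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def systematic_pips(x, n, start=None, rng=random):
    """Return 0-based indices selected by systematic pi-ps sampling."""
    T = sum(x)
    if T % n != 0:
        raise ValueError("this sketch assumes T_x / n is an integer")
    k = T // n
    if start is None:
        start = rng.randint(1, k)            # random start R in 1..k
    cum = list(accumulate(x))                # cumulated sizes C_1..C_N
    points = [start + l * k for l in range(n)]
    # bisect_left finds the first i with C_i >= a, i.e. C_{i-1} < a <= C_i.
    return [bisect_left(cum, a) for a in points]


def ht_total(y, x, n, sample):
    """Horvitz-Thompson estimate of the total with pi_i = n * x_i / T_x."""
    T = sum(x)
    return sum(y[i] * T / (n * x[i]) for i in sample)


x = [3, 5, 2, 6, 4, 4]                       # sizes, T_x = 24
y = [10.0, 4.0, 7.0, 1.0, 9.0, 6.0]
sample = systematic_pips(x, n=3, start=4)    # selection points 4, 12, 20
estimate = ht_total(y, x, 3, sample)
```

Enumerating the $k$ possible starts shows that unit $i$ is selected in exactly $x_i$ of them, which is the inclusion probability $n x_i / T_x$ stated above, and that the Horvitz-Thompson estimate averages to the true total.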
6. Systematic sampling in spatial surveys
Spatial surveys such as strip sampling, quadrat sampling, and distance sampling from lines or points are commonly used to estimate density of animals or plants in a given area (see Wang et al., 2012, for a review of spatial surveys). Systematic sampling, as described in Section 1, provides an easy procedure for locating sampling points in space using a grid of equally spaced strips, lines, points, or quadrats, with a random start point. It also yields estimates with lower variance, relative to other random designs, by spreading the sample more evenly through the entire target region. This has made systematic sampling widely used in spatial surveys. However, relatively little work has been done to investigate theoretical and practical properties of systematic spatial sampling.
Quenouille (1949) and Das (1950) initiated the work on systematic spatial sampling.
Using the same notation as in Das (1950), Bellhouse (1977) and Flores et al. (2003), in spatial sampling, the target region $B$ is assumed to be partitioned into a set of $MN$ sampling units (quadrats),
$$B = \{A_i^s;\ s_i \in \mathbb{R}^d,\ i = 1, 2, \ldots, MN\},$$
where the dimension $d = 1, 2, 3$, and the sampling units, $A_i^s$, are considered arranged in $M = ml$ rows and $N = nk$ columns and grouped in $mn$ strata each with $lk$ elements (quadrats). Each sampling unit $A_i^s$ is identified by its coordinates: its row $r_i$ and its column $c_i$. This representation of the population, or target region, is common in agricultural and ecological surveys.
In the two-dimensional case, $d = 2$, Bellhouse (1977) employed a superpopulation model that describes the correlation between population units to study the optimality, in the sense of minimum average sampling variance, of systematic sampling when estimating the finite population mean using the sample mean. He gave useful guidelines about appropriate sampling strategies for different forms of correlation but did not deal with the problem of estimating the systematic sampling variance. One approach to estimating the systematic sampling variance, long considered common practice in both spatial and non-spatial surveys, is to approximate the systematic design by a random design or by a stratified design. However, it has been shown that this approach tends to overestimate the systematic sampling variance in many situations. Using a model-based approach to model the spatial correlation structure, Flores et al. (2003) considered estimating the sampling variance of the systematic sample mean and discussed the sample size estimation problem. Flores et al. (2003) suggested estimating the systematic sampling variance by
$$\hat{V}_{sys}(\bar{y}) = \frac{\hat{V}_{srs}(\bar{y})}{\mathrm{RE}_{sys/srs}},$$
where $\hat{V}_{srs}(\bar{y})$ is the estimated sampling variance under SRS and
$$\mathrm{RE}_{sys/srs} = \frac{\mathcal{E}[V_{srs}(\bar{y})]}{\mathcal{E}[V_{sys}(\bar{y})]} = \frac{1 - \rho(M, N)}{1 + \dfrac{MN - lk}{lk - 1}\,\rho_{sys}(m, n) - \dfrac{MN - 1}{lk - 1}\,\rho(M, N)},$$
with $\rho(M, N)$ being the average correlation between all pairs of elements of the population while $\rho_{sys}(m, n)$ is the average correlation between a pair of elements of the same systematic sample. They argued that $\hat{V}_{sys}(\bar{y})$ is both model-unbiased and model-consistent in the sense that $\hat{V}_{sys}(\bar{y}) \to V_{srs}(\bar{y}) / \mathrm{RE}_{sys/srs}$ by the design-consistency of $\hat{V}_{srs}(\bar{y})$.
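To see how such a relative efficiency could be evaluated in practice, the sketch below computes the two average correlations on an $M \times N$ grid and plugs them into the ratio of model-expected variances. Several pieces are assumptions made for illustration: an isotropic exponential correlation in Euclidean distance with rate `lam`, a constant model mean and variance, and an aligned systematic design in which every stratum contributes the unit at the same within-stratum position. The closed form in `relative_efficiency` follows from those assumptions and is not copied from Flores et al. (2003).

```python
import math
from itertools import product


def grid_units(M, N):
    """All (row, column) coordinates of an M x N grid of quadrats."""
    return list(product(range(M), range(N)))


def corr(u, v, lam):
    """Assumed correlation model: exponential decay in Euclidean distance."""
    return math.exp(-lam * math.dist(u, v))


def avg_pair_corr(units, lam):
    """Average correlation over all ordered pairs of distinct units."""
    size = len(units)
    total = sum(corr(units[a], units[b], lam)
                for a in range(size) for b in range(size) if a != b)
    return total / (size * (size - 1))


def systematic_samples(m, n, l, k):
    """The l * k aligned systematic samples, each holding m * n units."""
    return [[(a + l * i, b + k * j) for i in range(m) for j in range(n)]
            for a in range(l) for b in range(k)]


def relative_efficiency(m, n, l, k, lam):
    """Model-expected V_srs / V_sys for the sample mean (needs m * n >= 2)."""
    MN, lk = (m * l) * (n * k), l * k
    rho = avg_pair_corr(grid_units(m * l, n * k), lam)
    samples = systematic_samples(m, n, l, k)
    rho_sys = sum(avg_pair_corr(s, lam) for s in samples) / len(samples)
    denom = 1 + (MN - lk) / (lk - 1) * rho_sys - (MN - 1) / (lk - 1) * rho
    return (1 - rho) / denom


def v_sys_hat(sample_values, MN, re):
    """SRS variance estimate of the mean, deflated by the relative efficiency."""
    size = len(sample_values)
    ybar = sum(sample_values) / size
    s2 = sum((v - ybar) ** 2 for v in sample_values) / (size - 1)
    return (1 - size / MN) * s2 / size / re
```

In this setup a relative efficiency above one means the systematic design is expected to be more precise than SRS, so the SRS variance estimate is scaled down accordingly.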
Fewster (2011) developed a new striplet variance estimator based on modeling the encounter process over space and showed that the new estimator has negligible bias and good precision under several simulation scenarios. Using data from the Spotted Hyena Survey, Serengeti Plains, Tanzania, she compared the striplet estimator with the variance estimator obtained by assuming the sampling design was SRS and with stratification variance estimators, like 2 and 3 in Section 3, as approximations to the systematic sampling variance, and reported the following coefficients of variation for estimated density: 11%, 20% and 17% from the three variance estimation strategies, respectively.
Recently, Lundberg and Strand (2014) studied several estimators, not including the striplet estimator of Fewster (2011), for the sampling variance of the two-dimensional systematic sampling design when applied in land use surveys. They concluded that variance estimation by stratification gives good overall results but may underestimate the variance when spatial autocorrelation is absent, while treating the systematic sample as an SRS is safe and conservative when spatial autocorrelation is absent or unknown. There appears to be a growing literature in this specific area and in the area of spatial sampling in general.
7. Discussion
Systematic sampling remains an open research area due to the practicality of the systematic design in the field along with the issues associated with this design. A fair amount of research has been done in this area, with the main focus directed to handling the problems that arise when using the systematic design in practice. The main theme of the recent research in this area is merging the multi-start idea with one of the schemes that assures a fixed sample size. Among all designs reviewed in this article, no single systematic design can be declared as an optimal design in all situations (see Table 1). This observation results from the fact that developing a new design that tackles the drawbacks of the systematic scheme may come at the expense of a reduction in the efficiency of the resulting estimators or of partially losing the operational convenience of the systematic design. Therefore, choosing the appropriate version of the systematic design to implement should be based on a careful investigation of the characteristics of the population under sampling.
Going beyond estimating the finite population mean under systematic sampling, some scholars tried to estimate the finite population variance $V(y)$ from a systematic sample via the multi-start approach. Sampath (2009) tried to estimate $V(y)$ under LSS with two random starts. Under different superpopulations, Sampath (2009) compared his variance estimator with the usual simple random sample estimator. More recently, Sampath (2012) proposed estimates for $V(y)$ under three designs, namely, the linear, balanced, and modified systematic sampling with multiple random starts. The proposed estimators outperform the simple random sample estimator for different superpopulations. A new area of application for systematic sampling is the area of industrial process quality control. Subramani (2004) showed that when a linear trend exists in the process, taking a systematic sample to monitor the process is recommended over simple random sampling.
Zhang (2008) studied systematic sampling from a statistical decision point of view using the sampling variance as a loss function. He compared efficiency of systematic sampling with that of alternative random sampling methods under various population configurations including homogeneous and ratio regression populations. Zhang (2008) argued that in some situations where the application of systematic sampling is a common practice, this application may be accompanied with large fluctuations in the sampling variance both in cross-sectional and longitudinal surveys. A similar study is needed to investigate the performance of systematic sampling in spatial surveys where the underlying population model may contain correlations over time and/or space.
Despite all this work on the systematic design, there is still room for more research in this area. For instance, the asymptotic theory needs to be developed for the new versions of the systematic technique. Iachan (1983) introduced the asymptotic theory for LSS and mentioned that it would be straightforward to develop similar theory for other designs such as circular systematic sampling and πps systematic sampling. Moreover, getting beyond the estimation of the population mean toward estimating other parameters such as the median or any other population quantile and the finite population distribution function under systematic sampling schemes could be a fruitful direction for research in this area.
Acknowledgment: The authors are grateful to the editor and an anonymous referee for their suggestions.
References
Bellhouse, D. R. (1977). Some optimal designs for sampling in two dimensions. Biometrika 64, 605-611.
Bellhouse, D. R. (1988). Systematic sampling. In: P.R. Krishnaiah and C. R. Rao (eds.), Handbook of Statistics, Vol. 6. North Holland, Amsterdam, pp. 125-145.
Bellhouse, D. R. and Rao, J. N. K. (1975). Systematic sampling in the presence of a trend. Biometrika 62, 694-697.
Buckland, W.R. (1951). A review of the literature of systematic sampling. Journal of the Royal Statistical Society B 13, 208-215.
Chandra, K. S., Sampath, S., and Balasubramani, G. K. (1991). Markov sampling for finite populations. Biometrika 79, 210-213.
Chang, H. and Huang, K. (2000). Remainder linear systematic sampling. Sankhyā B 62, 249-256.
Chaudhuri, A. (2000). Network and adaptive sampling with unequal probabilities. Bulletin of the Calcutta Statistical Association 50:199-200, 237-254.
Cochran, W.G. (1946). Relative accuracy of systematic and stratified random samples for a certain class of populations. Annals of Mathematical Statistics 17, 164-177.
Cochran, W.G. (1977). Sampling Techniques. New York: John Wiley & Sons.
Das, A. C. (1950). Two dimensional systematic sampling and the associated stratified and random sampling. Sankhyā 10, 95-108.
Finney, D.J. (1948). Random and systematic sampling in timber surveys. Forestry 22, 64-99.
Food and Agriculture Organization of the United Nations (2010). Global Forest Resources Assessment 2010: Main Report. Rome 163.
Flores, L. A., Martínez, L. I. and Ferrer, C. M. (2003). Systematic sample design for the estimation of spatial means. Environmetrics 14, 45-61.
Fewster, R. M. (2011). Variance estimation for systematic designs in spatial surveys. Biometrics 67, 1518-1531.
Gautschi, W. (1957). Some remarks on systematic sampling. Annals of Mathematical Statistics 28, 385-394.
Horvitz, D.G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 663–685.
Huang, K. (2004). Mixed random systematic sampling designs. Metrika 59, 1-11.
Iachan, R. (1982). Systematic sampling: A critical review. International Statistical Review 50, 293-303.
Iachan, R. (1983). Asymptotic theory of systematic sampling. Annals of Statistics 11, 959-969.
ICF International. (2012). Demographic and Health Survey Sampling and Household Listing Manual. MEASURE DHS, Calverton, Maryland, U.S.A.: ICF International.
Isaki, C. T. and Fuller, W. A. (1982). Survey Design under the Regression Superpopulation Model. Journal of the American Statistical Association, 77, 89-96.
Kao, F., Leu, C. and Ko, C. (2011). Remainder Markov systematic sampling. Journal of Statistical Planning and Inference 141, 3595-3604.
Kish, L. (1965). Survey Sampling. New York: John Wiley & Sons.
Lahiri, D.B. (1954). On the question of bias in systematic sampling in population censuses. Proceedings of the World Population Conference II, 349-361.
Leu, C. and Kao, F. (2006). Modified balanced circular systematic sampling. Statistics and Probability Letters 76, 373-383.
Leu, C. and Tsui, K. (1996). New partially systematic sampling. Statistica Sinica 6, 617-630.
Levy, S. and Lemeshow, S. (2008). Sampling of populations: Methods and Applicat