

22 Non-probability Sampling

Vasja Vehovar, Vera Toepoel and Stephanie Steinmetz

A sample is a subset of a population, and we survey the units from the sample with the aim of learning about the entire population.

However, sampling theory was developed primarily for probability sampling, where all units in the population have known and positive probabilities of inclusion. This definition implicitly involves randomization, a process resembling lottery drawing, in which units are selected according to their inclusion probabilities. In probability sampling, randomized selection is used instead of the arbitrary or purposive sample selection of the researcher, or instead of various self-selection processes run by respondents. Within this context, the notion of non-probability sampling denotes the absence of a probability sampling mechanism.

In this chapter we first reflect on the practice of non-probability samples. Second, we introduce probability sampling principles and observe their approximate usage in the non-probability setting, and we also discuss some other strategies. Third, we provide a closer look at two contemporary – and perhaps also the most exposed – aspects of non-probability sampling: online panels and weighting. Finally, we summarize recommendations for deciding on probability–non-probability sampling dilemmas and provide concluding remarks.

TYPOLOGY, PRACTICE AND PROBLEMS OF NON-PROBABILITY SAMPLES

We initially defined non-probability sampling as a deviation from probability sampling principles. This usually means that units are included with unknown probabilities, or that some of these probabilities are known to be zero. There are countless examples of such deviations. The most typical ones have received their own labels, and we present below a non-exhaustive list of illustrations:


Purposive sampling (also judgemental sampling) – the selection follows some judgement or arbitrary ideas of the researchers looking for a kind of ‘representative’ sample, or researchers may even explicitly seek diversity (deviant case sampling); sometimes units are added sequentially until researchers satisfy some criteria.

Expert selection – subject experts pick the units, for example when, say, only the two most typical settlements are to be selected from a region, or when the units with the highest level of a certain characteristic are selected (modal instance sampling).

Case study – an extreme example of expert selection, where the selected unit is then studied in depth, often with qualitative techniques.

Convenience sampling – the prevailing non-probability approach, where units at hand are selected; the notion roughly overlaps with accidental, availability, opportunity, haphazard or unrestricted sampling; the most typical formats are recruitment at events (e.g. sports, music, etc.) and other locations (e.g. mall intercept, where customers are approached in shopping malls, or street recruitment, where people on the street are invited into a survey) or at virtual venues (e.g. web intercept, where web visitors are exposed to pop-up invitations); particularly frequent is also the recruitment of social ties (e.g. friends, colleagues and other acquaintances).

Volunteer sampling – a type of convenience sampling, where the decision to participate relies strongly on respondents due to the non-individualized nature of the invitations (e.g. a general plea for participation appears in the media, on posters, leaflets, the web, etc.).

Mail-in surveys – a type of volunteer sampling with paper-and-pencil questionnaires, which are exposed as leaflets at public locations (e.g. hotels, restaurants, administration offices) or enclosed with magazines, journals, newspapers, etc.

Tele-voting (or SMS voting) – a type of volunteer sampling where people are invited to express their vote by calling in to some (toll-free or chargeable) number or by sending an SMS; most typically this is related to TV shows (particularly entertainment and reality) and various contests (music, talent, performance), but also to political polling.

Self-selection in web surveys – a type of volunteer sampling, where a general invitation is posted on the web, such as invitations on web pages (particularly on social media), question-of-the-day polls and invitations to online panels, all of which is typically done via banner recruitment; a considerable overlap exists here with the notion of river sampling, where users are recruited with various approaches while surfing the web.

Network sampling – a convenience sample where some units form the starting ‘seeds’, which then sequentially lead to additional units selected from their networks; a specific subtype is snowball sampling, where the number of included network ties at each step is further specified.

Quota sampling – a convenience sample can be ‘improved’ with some socio-demographic quotas (e.g. region, gender, age) in order to reflect the population.

In addition, we may also speak of non-probability sampling when we have probability sampling, but:

a) we previously omitted some (non-negligible) segments of the target population, so that the inclusion probabilities for the corresponding units are zero (i.e. we have non-coverage);

b) we face very low response rates, which – among other situations – occurs with almost any traditional public media recruitment, including the corresponding advertising, from TV and radio spots to printed ads and out-of-home (OOH) advertising, whether digital (e.g. media screens in public places) or non-digital (e.g. billboards).

The borderline at which a missing part of the population counts as non-negligible (a) or a response rate as very low (b) is a practical question, which is conceptually blurred, disputable and dependent on numerous circumstances. In particular, it depends on whether we know the corresponding missing data mechanism and control the probabilities of inclusion, as well as on whether the missing segments differ substantially in their characteristics. The opposite is also true: whenever we have low non-coverage and high response rates – especially with relatively small populations (e.g. hotel visitors, attendants of an event) – some of the above types of non-probability sampling can turn into probability sampling, particularly in the case of volunteer samples (e.g. mail-in, web self-selection).


There are countless examples of disasters where results from non-probability sampling were painfully wrong. There also exist some well-documented historical cases, which have already become textbook material. One is related to early purposive sampling attempts at the end of the 19th century in official statistics in Norway (Kiaer, 1997 [1897]), which ended in serious errors. Two cases from US elections have also become popular warnings against deviations from probability sampling. The first one is the 1936 Literary Digest poll, where around two million mail-in responses failed to predict the winner of the US presidential election. The failure was basically due to the fact that disproportionately more Republicans were in the sampling frame. The second example refers to the 1948 US election prediction failure, which was – among other reasons – due to a lack of sufficient randomization in quota sampling.

We should underline that wrong results based on non-probability samples – including the above three examples – are typically linked to situations where standard statistical inference assuming probability sampling was used with non-probability samples. For this reason we first review the probability sampling principles and their approximate implementation in non-probability settings.

PRINCIPLES OF PROBABILITY SAMPLING¹

It was only when the corresponding sampling theory was implemented in the 1930s that a major breakthrough in the development of modern surveys occurred. We should also recall that only a few decades before, at the end of the 19th century, the statistical profession (i.e. the International Statistical Institute) rejected the practice of the Norwegian statistical office (Kiaer, 1997 [1897]) of using parts of the population to infer about the entire population. One direct consequence of the changes of the 1930s was huge cost savings, because the characteristics (e.g. attitudes) of a population with millions of units could be measured with a sample of just a few hundred. Correspondingly, sampling practice expanded enormously in the commercial and public sectors.

Sampling Design Principles

We can select a probability sample using a rich array of sampling techniques. Most textbooks start with the simple random sample (SRS), which fully reflects lottery drawing. In SRS, all units in the population have the same probability of selection, and all samples of a certain size also have the same probability of appearing. We can also use complex designs with strata, clusters, stages, phases, etc. A general overview of probability sampling designs is provided elsewhere in this Handbook (see Tillé and Mattei, Chapter 21).
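For illustration, a minimal sketch of SRS in Python (the population list and the sizes are hypothetical):

```python
import random

# Simple random sampling (SRS) from a hypothetical population of
# N = 100,000 units; random.sample draws n distinct units without
# replacement, so every unit has the same inclusion probability n/N.
population = list(range(100_000))
n = 400
srs = random.sample(population, n)
print(len(srs), len(set(srs)))  # 400 distinct units
```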

For further reading we also point to general survey textbooks (e.g. Bethlehem, 2009; Fowler, 2014; Groves et al., 2009) or statistical textbooks (e.g. Agresti and Franklin, 2013; Healey, 2011). A rich sampling literature also exists, either as general introductions (e.g. Kalton, 1983; Lohr, 2010) or as advanced statistical treatments (Särndal et al., 2003).

Statistical Inference Principles

When we infer from a probability sample to the entire population, we face the risk that, due to randomization, the selected sample is by chance different from the population. For example, it is theoretically possible, although extremely unlikely, that we draw an SRS from the general population of a country (where men have roughly a 50% share) which contains only men and no women. The advantage of probability sampling is that this risk (e.g. of selecting such extreme samples) can be quantified, which is then often evaluated at a level of 5% using so-called confidence intervals. A risk of 5% means that the corresponding confidence interval contains the population value in 19 out of 20 randomly selected samples of a certain type. The width of the confidence interval communicates the precision of the estimate, which strongly depends on the sample size; larger samples produce more precise estimates. Another advantage of probability samples – which is of extreme practical value – is that the confidence interval can be estimated from the sample itself. However, the construction of the confidence interval is only one aspect of the process of statistical inference, which is a set of procedures for generalizing results from a sample to a population; it also deals with other inferential diagnostics, as well as with statistical hypothesis testing.
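For illustration, a minimal sketch of computing such an interval from the sample itself, assuming SRS and hypothetical data (the normal approximation with z = 1.96 corresponds to the 5% risk level):

```python
import math
import random

# Hypothetical SRS of n = 400 measurements; in practice these would
# be the observed survey responses.
random.seed(1)
sample = [random.gauss(50, 10) for _ in range(400)]

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
se = math.sqrt(var / n)                               # standard error of the mean
low, high = mean - 1.96 * se, mean + 1.96 * se        # 95% confidence interval
print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```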

Various inferential approaches exist. We have referred here only to the most popular, frequentist one, which (a) takes into account the sampling design, (b) assumes that the unknown population values are fixed and (c) builds on a sampling distribution of the estimates across all possible samples. Other inferential approaches can differ in one or more of the above-mentioned aspects; however, for the majority of practical situations they provide very similar results. More differences between approaches may appear with unusual distributions and with specific modelling requests.

The related inferential procedures can be very simple, as in the case of estimating the sample mean in an SRS setting, while the estimation of some other parameters (e.g. ratios, various coefficients, etc.) in complex samples (e.g. with stages, strata, phases, etc.) or with complicated estimators (e.g. the regression estimator) may require additional statistical resources.

APPROACHES AND STRATEGIES IN NON-PROBABILITY SAMPLING

We now turn to the non-probability context, which is, in principle, very problematic, because sampling theory was developed for probability sampling. Consequently, we first discuss approaches where probability sampling principles are used in non-probability settings, although – due to unfulfilled assumptions (i.e. the lack of known probabilities) – formally this could (and should) not be done. For this reason we will call it an approximate usage of probability sampling principles. Next, we provide an overview of some other strategies, developed to deal specifically with non-probability samples.

Approximations of Standard Probability Sampling Principles

The approximate usage of probability sampling in a non-probability setting is sometimes understood in the sense that we first introduce certain modelling assumptions, e.g. we assume that there is actually some randomization in the non-probability sample. We will discuss these assumptions and the corresponding implementations separately for the sampling design and for the statistical inference part.

Approximations in Sampling Design

The very basic idea here is to approximate and resemble, as much as possible, the probability sampling selection of units. This will then hopefully – but with no guarantee – also assure other features of probability sampling, particularly randomization. This goal can be pursued indirectly or directly.

Indirect measures to approximate probability sample designs

The most essential recommendation is to spread the non-probability sample as broadly as possible. In practice this predominantly means we need to combine various recruitment channels, which may also appear as media planning strategies. A successful example is the WageIndicator survey (see WageIndicator.org), implemented in over 85 countries worldwide, which uses an elaborate recruitment strategy of posting invitations to its volunteer web survey across a wide range of websites (Steinmetz et al., 2013). With this strategy they approximate a randomized spread of recruited units.

Similarly, the Vehovar et al. (2012) study of online hate speech, based on a non-probability web survey, successfully applied well-spread recruitment procedures, ranging from offline strategies (e.g. delivering leaflets with the survey URL at public places) to systematic publishing of invitation posts across numerous online forums (covering a broad array of topics, from cooking to cycling). The systematic attempt to include a variety of recruitment channels assured a considerable level of spread in this study, which approximated a good level of randomization. Of course, there was no guarantee in advance that this approach would work, but the results showed that the corresponding estimates were very close to those from a control probability sample.

Another indirect strategy is to shape non-probability samples so that they reflect, as much as possible, the structure of the survey population. Where possible, of course, the match needs to be established on control variables that are related to the target ones. Quota sampling is a typical approach here, usually based on controls for socio-demographics (e.g. gender, age, education, region, etc.). The quotas can be applied immediately at the recruitment or at the screening phase of a survey. For example, if we use gender quotas, this implies that once we have reached the quota for women, we reject any further women from the survey. Alternatively, we may also intensify the recruitment channels that attract more men (e.g. advertising on sports media).
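A minimal sketch of this screening logic (the quota targets are hypothetical):

```python
# Hypothetical gender quotas for a target sample of 400 units.
quotas = {"male": 200, "female": 200}
accepted = {"male": 0, "female": 0}

def screen(gender: str) -> bool:
    """Accept a respondent only while the corresponding quota cell is open."""
    if accepted[gender] < quotas[gender]:
        accepted[gender] += 1
        return True
    return False  # quota reached: reject further respondents of this type

# Usage: screen("female") returns True until 200 women have been accepted.
```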

In the case of a sampling frame where we have additional information about the units (e.g. as in online panels), we can use this same strategy already at sample selection. A fine selection can then be made not only according to the target population structure of socio-demographics (age, gender, education, region, etc.), but also according to attitudes, media consumption, purchase behaviour, etc. Sophisticated algorithms can be used to select an optimal sample for a certain purpose. An extreme case is individual matching, which is fully effective only when we have large databases of units available (as in online panels): we first design a ‘normal’ sample from some ideal or high-quality control source, and then search for the units in our database, using some distance measure (e.g. nearest neighbour), which most closely resemble the ideally desired units (Rivers and Bailey, 2009).
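A minimal sketch of such individual matching (the background variables and the squared Euclidean distance are hypothetical choices; a real implementation would standardize the variables first):

```python
def sq_distance(a, b):
    """Squared Euclidean distance over background variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical units described by (age, years of education).
ideal_sample = [(34, 12), (61, 9)]      # designed from a control source
panel = [(30, 12), (66, 10), (45, 16)]  # database of available panel members

matched = []
for unit in ideal_sample:
    nearest = min(panel, key=lambda member: sq_distance(unit, member))
    matched.append(nearest)
    panel.remove(nearest)               # match without replacement
print(matched)                          # [(30, 12), (66, 10)]
```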

Direct incorporation of probability sampling design principles

Whenever possible, we can also directly include probability sampling principles. Sometimes components of probability sample selection can be incorporated already in the early steps of the sampling process. For instance, in probability sampling with quotas we may first select a probability sample of schools, where we then recruit pupils with unrestricted banner invitations on the web according to some quotas (e.g. age, gender, year of schooling). Another example is the direct inclusion of randomization, which can be incorporated into various ‘random walk’ selections related to recruitment across streets and buildings, or into randomization across time (e.g. days, hours). In specific types of convenience sampling (mall intercept, street recruitment, invitations on the web) we can also introduce randomization by including every ith (e.g. every tenth) passenger, customer or visitor.
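A minimal sketch of this every-ith selection with a random start (the visitor stream is hypothetical):

```python
import random

def every_ith(stream, step=10):
    """Yield every step-th unit, starting at a random offset."""
    start = random.randrange(step)
    for index, unit in enumerate(stream):
        if index % step == start:
            yield unit

visitors = range(1, 101)                   # hypothetical stream of 100 visitors
print(list(every_ith(visitors, step=10)))  # ten units, evenly spaced
```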

Of course, none of the above-described approaches, direct or indirect, assures with some known accuracy that the corresponding statistical inference will be equivalent to the situation with probability samples. However, it is also true that these approximations usually contribute to certain improvements, while they rarely cause any serious damage. In the meantime, practitioners have developed elaborate skills which often work well for their specific circumstances, research problems and target variables. Again, there is no formal guarantee (in advance) that their solutions will also work in modified circumstances or in other situations.

Approximations in Statistical Inference

As already mentioned, with non-probability samples, by definition, the inclusion probabilities are unknown or zero, so without further assumptions this very fact formally prevents any statistical inference calculations (e.g. estimates, variances, confidence intervals, hypothesis testing, etc.).

Uncontrolled inclusion probabilities can also introduce a certain selection bias, which Bethlehem and Biffignandi (2012: 311) showed depends on (a) the variability of the inclusion probabilities and (b) the correlation between the target variable and the inclusion probabilities. For example, Internet-savvy respondents typically participate more frequently in self-selected web surveys. As they differ from other Internet users not only in socio-demographics (e.g. being younger) but also in web skills or altruism (Couper et al., 2007; Fricker, 2008; Malhotra and Krosnick, 2007), the results from such samples may show a bias towards the characteristics of these users.
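In compact form – a sketch using our own notation rather than necessarily that of the cited authors, with $\rho$ the unknown participation propensity of a unit, $y$ the target variable and $\bar{\rho}$ the mean propensity – the bias of the unadjusted sample mean can be written as

$$\mathrm{Bias}(\bar{y}_s) \approx \frac{\mathrm{Cov}(\rho, y)}{\bar{\rho}} = \frac{R_{\rho y}\, S_{\rho}\, S_{y}}{\bar{\rho}},$$

where $R_{\rho y}$ is the correlation between the propensities and the target variable, and $S_{\rho}$, $S_{y}$ are their standard deviations. The bias thus vanishes only when the propensities do not vary ($S_{\rho} = 0$) or are unrelated to the target variable ($R_{\rho y} = 0$), which expresses points (a) and (b) above.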

Against this background, in survey practice the same statistical inference procedures developed for probability sampling are routinely applied in the non-probability setting as well, although there is no guarantee whether and how this will work. For instance, if we apply a standard procedure for calculating confidence intervals to a certain non-probability sample, we implicitly assume that probability sample selection (e.g. SRS) was used. However, we are then no longer confident that the interval actually contains the population value with the pre-assumed, say, 5% risk. It is very likely that the actual risk is way above 5%, but we have no possibility of calculating it from the sample. Of course, we can still pretend and use the same interpretation as in a probability sampling context, although this is not true. Applying standard statistical inference approaches as an approximation in non-probability samples can thus ‘seduce’ users into believing that they have ‘real’ confidence intervals. Therefore, it seems reasonable to consider using a distinct terminology to formally separate a non-probability from a probability setting. For example, as proposed by Baker et al. (2013), we might rather use the term ‘indications’ instead of ‘estimates’. This is also the direction towards which the hard-core statistical stance against any calculation of confidence intervals in a non-probability setting is slowly softening in the recent AAPOR (2015) Code of Professional Ethics and Practices.

In practice the situation is not that bad, and this approximate usage often provides reasonable results, which is mainly due to the following reasons:

a) In almost all non-probability samples there still exists a certain level of ‘natural’ randomization, which, of course, varies across circumstances and from sample to sample.

b) This can be further enhanced by the measures researchers undertake to improve non-probability sampling designs (see the above discussion on spread, randomization, quotas and matching).

c) Powerful techniques also exist in the post-survey adjustment steps, such as imputations, weighting, complex estimators and statistical models (see the discussion further below).

All three streams (a, b, c) together can be very effective, at least according to the perceptions of users and clients, who judge the results by using some external controls and their own knowledge. The above-mentioned study on online hate speech (Vehovar et al., 2012) is a typical example here, because comparisons with a parallel telephone probability sample confirmed the results from the non-probability sample:

Estimated means on rating scales fully matched between the two samples.

A similar result was true for the correlations, ranks and subgroup analyses.


The estimates for the shares (percentages) were skewed towards more intensive Internet users, but the ranks and relations between categories were still usually preserved.

The above pattern is very typical of many studies with ‘well-spread’ and ‘well-weighted’ non-probability samples. Successful results from non-probability samples are thus often reported, particularly for online panels, as shown in Rivers (2007) and in Callegaro et al. (2014). However, the latter reference also points to many exceptions. Similarly, scholars often compare the two approaches, and as a result they typically expose the advantages of the probability setting (e.g. Yeager et al., 2011).

We may add that advocates (e.g. Rivers, 2010) of using probability sampling principles in non-probability settings also emphasise that papers in leading scholarly journals often calculate confidence intervals for non-probability samples, and that the non-probability sampling approach is already accepted in many neighbouring fields (e.g. experimental and quasi-experimental studies, evaluation research and observational studies). On the other hand, many statisticians basically insist that without sound and explicit assumptions or models – which are usually missing in non-probability sampling – there are no solid grounds to infer from any sample.

In any case, whenever we use, run or discuss non-probability samples, we should clearly differentiate among them, because huge differences exist. While some carefully recruit and spread across the entire spectrum of the population, using elaborate sampling design procedures (e.g. strata, quotas, matching), others may simply self-select only within one narrow sub-segment. Similar differences also appear in the (non-)use of advanced procedures in post-survey adjustment (imputation, weighting, calibration, estimation, modelling).

Moreover, the specifics of the target variables and the research problem are usually also very important here. Some variables are simply more robust in the sense that they are less affected by deviations from probability sampling principles. Consequently, they can be successfully dealt with even in low-quality non-probability samples; this is often the case in marketing research. On the other hand, certain variables, such as employment status, health status, poverty measures or the study of rare characteristics (e.g. diseases, minorities, sensitive behaviour), are less robust to deviations and require a strict probability sampling design and an elaborate estimation treatment.

Specific Modelling Approaches

Being very formal and strict, we should acknowledge that randomization – as well as the related probability sampling – is not a necessary precondition for valid statistical inference (Valliant et al., 2000: 19). If we have clear and valid assumptions, a specific modelling approach can also provide corresponding statistical inference. In such a situation, we first assume that the data – or at least specific relations among the variables – are generated according to some statistical model. Additional external data are then particularly valuable (e.g. socio-demographic background variables in the case of online panels), so that certain model-based approaches to statistical inference can be used to build the model and then estimate the values of interest. However, the corresponding statistical work (i.e. assumption testing, model development, data preparation, etc.) can be complicated and time-consuming, particularly because it typically needs to be done for each variable separately.

Another stream of approaches is related to sample matching, mentioned above in relation to quota selection (which is a type of aggregated matching) and to individual matching (where we may use some ‘nearest neighbour’ techniques). However, we can also use some advanced models related to propensity score matching. The latter is closely related to causal inference techniques (Rubin, 1974), developed for quasi-randomized and non-randomized settings, and in particular to propensity score procedures developed for causal inference in observational studies (Rosenbaum and Rubin, 1983). An extensive overview of sample matching implementations for non-probability survey settings is provided in Baker et al. (2013: 34).

The achievements of some modelling approaches are in fact admirable. For example, the successes of various voting predictions based on results from non-probability samples are very notable. Sometimes they are perhaps even shocking for many traditional statisticians, who believe that no statistical inference exists outside probability sampling. It is true, however, that the weak documentation of the corresponding procedures (particularly in the commercial sector) may violate the principles of research transparency (AAPOR, 2014).

Nevertheless, due to their specific and narrow targeting, these approaches are often too demanding in skills and resources to be generally used. As a consequence, whenever we need a straightforward practical implementation of statistical inference, such as running descriptive statistics on all variables in the sample, we usually still run standard statistical inference procedures (which assume probability-based samples). Of course, as indicated above, this is formally correct only if we can prove that the necessary assumptions hold, which is usually not the case. Otherwise, we run them as approximations or indications, together with the related risks, which then need to be explicitly stated, warned about and documented.

SELECTED TOPICS IN NON-PROBABILITY SAMPLING

When discussing non-probability sampling, two issues deserve special attention: non-probability online panels, because they are perhaps the most frequent and most advanced example of contemporary non-probability sampling, and weighting, because it is the subject of high expectations about solving the problems arising from non-probability samples.

Non-probability Online Panels

When the same units are repeatedly surveyed at various points in time, we talk about longitudinal survey designs. Important advantages over independent cross-sectional studies are efficiency gains in recruiting and reduced sampling variation, because we can observe change at the individual level (Toepoel et al., 2009). In traditional longitudinal designs, all units in the sample are repeatedly surveyed with questions belonging to a certain topic (e.g. the labour market). However, we will use the notion of a panel somewhat differently here, according to the understanding which actually prevails within the non-probability sampling context: panels are large databases of units which initially agreed to cooperate, provided background information and were recruited to serve for occasional or regular selection into specific sample surveys addressing related or unrelated topics.

With the rise of the Internet, so-called online panels of general populations – sometimes also labelled access panels – emerged in developed countries, and they are rapidly becoming the prevailing survey data collection method, particularly in marketing research (Macer and Wilson, 2014). Our focus here is on non-probability online panels, which are the dominant type of online panels, while probability online panels are relatively rare. We discuss general population panels, but the same principles can be extended to other populations, e.g. customers, professionals, communities, etc.

In general, non-probability online panels face the issues, challenges and solutions (i.e. approximations) discussed in the previous sections. Therefore, conceptually, there is not much to add. On the other hand, the practices of online panels are very interesting and illustrative of the non-probability sampling context. One specific feature is that recruitment (i.e. sample selection) actually occurs at two points: first, when units are recruited into the panel from, say, the general population, where we need to focus on a good approximation of probability sampling (i.e. spread and coverage); and second, when we recruit units from the panel into a specific sample, where we need to focus on selecting an optimal structure by using quotas, stratification, matching, etc.

Careful management is required to ensure the representation of essential subgroups in each sample, but also to control the workload for each unit. A panel management system thus needs to record and optimize when and how units are recruited, how often they participate in surveys, and how much time it takes them to complete a survey. It also has to detect various types of ‘cheaters’, who may be recruited from among ‘professional respondents’ who maintain fake profiles to participate in online panels. After the data are collected, another step of optimization occurs through editing, imputing and weighting, which can be specially tailored to certain circumstances.

Results from the NOPVO study (Van Ossenbruggen et al., 2006), in which 19 non-probability online panels in the Netherlands were compared, showed that panels use a broad array of recruitment methods, with the main approach relying on online recruitment and various network procedures. Sometimes a subset of panel members – particularly hard-to-reach subgroups – is also recruited via probability-based sampling using traditional modes (telephone, face-to-face). The incentives also vary, from lotteries, charity donations, vouchers or coupons for books or other goods, to the prevailing method of collecting points, which can then be monetized. Given these variations, it is not surprising that considerable differences were also found with respect to the estimates. Some panels in the NOPVO study performed much better than others, but sometimes serious discrepancies from benchmark values were found for all panels (e.g. political party preference).

A similar study, in which 17 online panels in the US contributed samples of respondents (Thomas and Barlas, 2014), showed that online panels actually provided consistent estimates on a range of general topics, such as life satisfaction and purchasing behaviour. The corresponding variations often resembled variations in estimates from probability samples. However, there were also specific topics, particularly those related to small population shares, where larger variations were found, such as the occurrence of binge drinking (i.e. drinking more than 4–5 glasses of alcohol in one evening).

An exhaustive overview of studies evaluating results from non-probability samples can be found in Callegaro et al. (2014). The overview reveals that professional online panels often provide results which rarely differ significantly from the corresponding benchmarks. On the other hand, it also shows that exceptions do exist. Similarly, this overview demonstrates that the estimates from probability samples consistently outperform the ones from non-probability online panels. Nevertheless, in practice, the dramatic differences in costs between the two alternatives usually (but not always) outbalance the differences in the estimates.

We may add here that cost-error optimization is a very complex and much under-researched issue even for probability-based samples, where we can minimize the product of the total survey error (usually approximated with the mean squared error) and the costs (see Vehovar et al., 2010; Vannieuwenhuyze, 2014). The cost-error optimization in non-probability samples could, in principle, approximate the optimization in the probability-based approach, although in reality these judgements are often much more arbitrary.

With non-probability online panels one has to be careful with statistical inference, as noted previously. Differently from probability samples, in non-probability online panels – as in non-probability samples in general – the corresponding quality relies on a successful mix of approximations. The quality of a certain panel can be checked by monitoring its documentation and past results. Both the ISO 20252 (Market, Opinion and Social Research) and ISO 26362 (Access Panels in Market, Opinion and Social Research) standards, for instance, specify key requirements to be met and reported. ESOMAR (2015), the global organization for market, opinion and social research, also regularly issues practical guidelines to help researchers make fit-for-purpose decisions about selecting online panels.

We can summarize that contemporary non-probability online panels successfully serve for studying numerous research topics. However, here too, we cannot escape the usual limitations of non-probability samples. Correspondingly, for certain research problems these panels may not be the right approach. Still, they are perhaps the best example of how the limitations of the non-probability setting can be overcome, or at least minimized, through carefully applied methodological approaches.

Weighting

In probability samples, so-called base weights are used first to compensate for the inclusion probabilities. In a second step, specific non-response weights are applied to reduce the bias – the difference between the estimate and the true value – resulting from non-response. In a similar way, specific weights can be developed for other methodological problems (e.g. non-coverage). In addition, with so-called population weighting we usually correct for any discrepancies in the available auxiliary variables. Typically, these are socio-demographic variables (e.g. age, gender, region), but when available, other variables are also considered (e.g. media consumption). A good overview of the general treatment of weights can be found in Bethlehem (2009), Bethlehem and Biffignandi (2012) and Valliant et al. (2013).

A core requirement for weighting is that the variables used have to be measured in the sample and also known for the population. Moreover, for weighting to be effective, these variables also need to be correlated with the target variables.

Different weighting approaches exist; one is based on generalized regression estimation (linear weighting), where linear regression models are used (Kalton and Flores-Cervantes, 2003). Poststratification is a special and also the most frequently used example of this approach, where auxiliary variables simply divide the population into several strata (e.g. regions). All units in a certain stratum are then assigned the same weight. Problems appear if the strata are too small and we have insufficient or no observations in a stratum, which consequently hinders the computation of weights for such strata. We may also lack the corresponding population control data.

The second stream of approaches is based on raking ratio estimation. This approach is closely related to iterative proportional fitting (Bethlehem and Biffignandi, 2012), where we sequentially adjust for each auxiliary variable separately (e.g. age, gender, etc.) and then repeat the process until convergence is reached. This solves the above-described problems of small numbers of units in stratification and of lacking population controls, and it is thus very convenient when more auxiliary variables are considered. However, we are limited here to categorical auxiliary variables; problems also appear with variance estimation. In most situations linear weighting and raking provide asymptotically similar results. Still, differences may appear, depending on whether the underlying relation between the auxiliary (control) and target variables is linear or loglinear (as assumed in raking).
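A minimal sketch of raking on a 2×2 table of sample counts, assuming hypothetical population margins; rows and columns are rescaled in turn until both sets of margins match:

```python
def rake(table, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Iterative proportional fitting of a two-way table to given margins."""
    for _ in range(max_iter):
        for i, target in enumerate(row_targets):   # fit row margins
            row_sum = sum(table[i])
            table[i] = [cell * target / row_sum for cell in table[i]]
        for j, target in enumerate(col_targets):   # fit column margins
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
        if all(abs(sum(row) - t) < tol for row, t in zip(table, row_targets)):
            break
    return table

# Hypothetical sample counts (gender x age group) raked to population margins.
sample_table = [[40.0, 60.0], [80.0, 20.0]]
print(rake(sample_table, row_targets=[100.0, 100.0], col_targets=[90.0, 110.0]))
```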

We should also mention general calibration approaches, which make it possible to simultaneously include auxiliary information as well as potential restrictions on the range of the weights. In addition, they produce the estimates together with the corresponding variance calculations. Linear weighting and raking can then be treated as special cases of these general calibration approaches.

We should be aware that weights may (or may not) remove the biases in target variables, but they surely increase the sampling variance. However, the underlying expectation is, of course, that the gains in reducing the bias outweigh the corresponding loss due to increased variance. This is not necessarily true, particularly in the case of weak correlations between auxiliary and target variables, where weights may have no impact on bias removal. The bias–variance trade-off is usually observed in the context of the so-called mean squared error (MSE), which is typically reduced to the sum of the sampling variance and the squared bias. In addition, the sampling variance and potentially also the bias might further increase when the auxiliary information cannot be obtained from the population, but only from some reference survey.
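In symbols, for an estimator $\hat{\theta}$ of a population value $\theta$,

$$\mathrm{MSE}(\hat{\theta}) = E\bigl[(\hat{\theta} - \theta)^2\bigr] = \mathrm{Var}(\hat{\theta}) + \bigl[\mathrm{Bias}(\hat{\theta})\bigr]^2,$$

so weighting pays off only when the squared bias it removes exceeds the additional sampling variance it introduces.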

With probability samples, weighting typically removes only a minor part of the bias (Tourangeau et al., 2013), while the differences between the weighting approaches themselves are usually rather small (as long as they use the same amount of auxiliary information, of course).

The above-described procedures (and their characteristics), developed for probability sampling, can also be used in a non-probability setting. Again, as we do not know the probabilities of inclusion, this usage is conditional and should be treated only as an approximation. Correspondingly, the potential of weighting in the case of non-probability sampling hardly surpasses its benefits in probability sampling. Still, in survey practice, basically the same procedures are applied in the non-probability setting, in the hope that they will somehow work. With non-probability samples we often have only one weighting step, because we cannot separate the sampling and non-response mechanisms. The weighting process thus simply assigns weights to the units in the sample, so that the underrepresented ones get a weight larger than 1, while the opposite happens to the overrepresented ones, similar to population weighting. For example, if the share of units from a certain segment is 25% in our sample but 20% in the population, these units receive a weight proportional to 20/25 = 0.8.
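A minimal sketch of this one-step weighting (the segment shares are hypothetical):

```python
# Hypothetical population and sample shares per segment.
population_share = {"segment_a": 0.20, "segment_b": 0.80}
sample_share = {"segment_a": 0.25, "segment_b": 0.75}

# Weight = population share / sample share, as in the 20/25 example above.
weights = {seg: population_share[seg] / sample_share[seg]
           for seg in population_share}
print(weights)  # segment_a: 0.8 (overrepresented), segment_b: ~1.07
```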

Within this framework, propensity score adjustment (PSA) has been proposed as a specific approach to overcome the problems of non-probability samples (e.g. Lee, 2006; Steinmetz et al., 2014; Terhanian et al., 2000; Valliant and Dever, 2011), particularly in situations where weighting based on standard auxiliary variables (socio-demographics) fails to improve the estimates. One reason for this is the fact that the attitudes of Internet users might differ substantially from those of non-users (Bandilla et al., 2003; Schonlau et al., 2007). Therefore, additional behavioural, attitudinal or lifestyle characteristics are considered. Observational and process data (so-called paradata) can also be useful (Kreuter et al., 2011). PSA then adjusts for the combined effects of selection probabilities, coverage errors and non-response (Bethlehem and Biffignandi, 2012; Schonlau et al., 2009; Steinmetz et al., 2014). To estimate the conditional probability of response in the non-probability sample, a small probability-based reference survey is required, ideally conducted in the same mode with the same questionnaire. After combining both samples, a logistic (or probit) model estimates the probability of participating in the non-probability sample using the selected set of available covariates. There are several ways of using propensity scores in estimation (Lee, 2006). For instance, while Harris Interactive uses a kind of poststratified estimator based on the propensity score (Terhanian et al., 2000), YouGovPolimetrix uses a variation of matching (Valliant and Dever, 2011). Again, we should repeat that differences between weighting approaches are usually very small, which also means that PSA rarely significantly outperforms alternative weighting procedures, given that both approaches use the same auxiliary information (Tourangeau et al., 2013).
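A minimal sketch of one simple PSA variant – inverse-propensity weighting; as noted above, poststratification on the scores or matching are other options. The data are synthetic, the covariates hypothetical, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_ref = rng.normal(0.0, 1.0, size=(500, 3))  # probability-based reference survey
X_web = rng.normal(0.5, 1.0, size=(500, 3))  # non-probability web sample

# Stack both samples and model membership in the non-probability sample.
X = np.vstack([X_ref, X_web])
z = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = web sample
model = LogisticRegression().fit(X, z)

# Weight web units by the inverse of their fitted propensity and
# normalize the weights to the sample size.
propensity = model.predict_proba(X_web)[:, 1]
weights = 1.0 / propensity
weights *= len(weights) / weights.sum()
```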

To summarize, weighting is by no means a magic wand to solve the problems of non-probability sampling. Still, it usually does provide a certain improvement; at the very least it assures a cosmetic resemblance to population controls. In practice it is thus routinely implemented, although it might sometimes also cause certain damage to some estimates.

DECIDING ABOUT PROBABILITY SAMPLING APPROXIMATIONS

When we implement approximations from probability samples in a non-probability setting, there is, by the very definition, little theoretical basis for running sound statistical inference. Therefore, the justification and validation of these procedures rely solely on the accumulation of anecdotal practical evidence, which is sometimes ironically called ‘faith-based’ sampling. Only when a certain non-probability approach has repeatedly worked sufficiently well in specific circumstances can this become a basis for its justification.

We may reflect here on two comments on the initially mentioned Literary Digest poll failure. First, despite a long series of successful previous election forecasts (from 1916 onwards), the applied non-probability sampling approach unexpectedly failed in 1936. This is an excellent illustration of the risk of relying on anecdotal evidence of past success. Second, with today’s knowledge and approaches it is very likely that this failure could have been prevented. In other words, an elaborate non-probability sampling strategy would have been less likely to fail so unexpectedly, although it perhaps still could not provide the same assurance as a probability sample.

In this context, we should also point out again that in some circumstances certain non-probability sampling approaches may repeatedly work well, while in others they do not. Similarly, sometimes even a very deficient probability sample (e.g. with a very low response rate) outperforms high-quality non-probability samples, but the opposite can also happen. In any case, only repeated empirical evidence within specific circumstances may gradually justify a certain non-probability sampling approach. Within this framework, the justification of the New York Times (NYT) for why they switched from probability to non-probability sampling is very informative (Desilver, 2014): they simply observed the repeated empirical success of the new supplier (which relied on non-probability sampling) and were persuaded by that past success (and, perhaps, also by lower costs). Of course, with this decision they also took the above-mentioned risk that past success might not guarantee success in the future.

When deciding between probability and non-probability sampling, the first stream of considerations is thus to profoundly check all past evidence on approaches suitable for a certain type of survey research. Whenever possible, for available past non-probability samples we should inspect the distributions of the relevant target variables, together with the effects of the accompanying weights, and also compare them with external controls, e.g. administrative data or data from official statistics. This will give us an impression of how well the non-probability sample resembled the probability ones in terms of randomization and deviations from population controls (i.e. ‘representativeness’).

Another stream of considerations is to systematically evaluate the broader survey setting. Namely, we are rarely in the situation of deciding only about the probability vs non-probability dilemma, but rather between whole alternative packages, each containing an array of specifics related also to non-coverage, non-response, mode effects and other measurement errors, as well as those related to costs, timing and comparability. Extreme care is often required to separate the genuine problem of non-probability sampling (e.g. selection bias, lack of randomization) from other components. For example, it would be of little help to reject a non-probability sample if the problems actually stem from a mode effect. In such a case replacing the non-probability web survey with a probability web survey brings no improvement, only lots of costs.

We thus rarely decide separately about the probability–non-probability component alone, but rather simultaneously judge over a broad spectrum of features. For instance, in a survey on youth tourism we may choose between a probability-based telephone, mail or web survey, and a non-probability online panel or a sample recruited through online social media. The decision involves the simultaneous consideration of many criteria, where all the above-mentioned components (non-response, non-coverage, selection bias, mode effect, costs, timing, comparability, accuracy, etc.) are often inseparably embedded in each alternative (e.g. the social media option perhaps comes with high non-coverage and selection bias, but also with low costs and convenient timing, etc.). Within this context, the selection bias itself, arising from the use of a non-probability setting, is only one specific element in a broad spectrum of factors and cannot be isolated and treated separately.

Sometimes, combining the approaches can be a good solution. We may thus obtain population estimates for key target variables with some (expensive) probability survey, while in-depth analysis is done via some (inexpensive) non-probability surveys. Again, the study by Vehovar et al. (2012) on online hate speech might serve as a good example, where key questions were put in a probability-based omnibus telephone survey, while extensive in-depth questioning was implemented within a broadly promoted non-probability web survey.

CONCLUDING REMARKS

When introducing sampling techniques for probability samples we recalled the major breakthrough of the 1930s, which brought spectacular changes to survey practice. After the elaboration of non-probability sampling approaches we can perhaps say the same about the gains arising from modern non-probability sampling. The advent of non-probability web surveys, in particular, revolutionized the survey landscape. Numerous survey projects were enabled which otherwise could not have been done, and many others were made dramatically faster, simpler and less expensive.

We have continuously indicated in this chapter – explicitly or implicitly – that in a non-probability setting statistical inference based on probability sampling principles formally cannot be applied. Still, practitioners routinely use the corresponding ‘estimates’, which should rather be labelled ‘indications’ or even ‘approximations’.

Of course, the corresponding evaluation of such ‘estimates’ cannot be done in advance (ex-ante), as in probability settings, but only afterwards (ex-post). Still, such evaluations are very precious, as we can improve the related procedures or at least recognize their limitations and restrict their future usage to only the narrow settings for which they work. After many successful repetitions, with corresponding trials and errors, practitioners often develop very useful practical solutions. Further discussion of these so-called ‘fit-for-purpose’ and ‘fitness-to-use’ strategies can be found in the AAPOR report on non-probability samples (Baker et al., 2013), which systematically and cautiously addresses selection, weighting, estimation and quality measures for non-probability sampling.

We should thus be clearly aware that the application of standard statistical inference procedures in a non-probability setting implicitly assumes that inclusion probabilities actually exist (which is of course not true). For example, calculating the mean for a target variable in a non-probability sample of size n automatically assumes that the probabilities are all equal (e.g. SRS). Consequently, inclusion probabilities n/N are implicitly assigned to all units in the sample, where N is the population size. It then depends on the corresponding ex-post evaluation procedures to judge the quality of such an ‘estimate’, which, as mentioned, should perhaps better be called an ‘indication’ or ‘approximation’.

By abandoning probability sampling principles we usually also abandon the science of statistical inference and enter instead the art and craft of shaping optimal practical procedures. Experience based on trial and error is thus essential here, as is the intuition of the researcher. Correspondingly, with complex problems expert advice or a panel of experts with rich experience can be extremely valuable.

Of course, hard-core statisticians may feel doubts or even strong professional opposition towards procedures where inferential risks cannot be quantified in advance. On the other hand, is this not also the very essential characteristic of applied statisticians – which separates them from ‘out-of-reality’ mathematical statisticians – to stop and accept the procedures which bring them close enough for all practical purposes?

We should also add that the above-discussed issues are nothing really new. More than a hundred years ago researchers were already trying various inexpensive alternatives to replace studying the entire population or to replace expensive probability sampling procedures. Wherever these alternatives worked sufficiently well, they were simply preserved.

Therefore, on the one (conceptual) side we have serious objections to non-probability samples, summarized in the profound remarks of Leslie Kish (1998) in his reply to the repeated requests from the marketing research industry for a review of his attitude to quota samples:

My section 13.7 in Survey Sampling expresses what I thought then (1965) and I still think. I cannot add much, because quota methods have not changed much, except that it is done by many more people and also by telephone. Quota sampling is difficult to discuss precisely, because it is not a scientific method with precise definition. It is more of an art practiced widely with different skills and diverse success by many people in different places. There exists no textbook on the subject to which we can refer to base our discussion. This alone should be a warning signal against its use.

On the other (practical) side, it is hard to deny the advances in modern approaches for dealing with non-probability samples. It is also hard to object to the countless examples of successful and cost-effective implementations in research practice.

We thus recommend more openly accepting the reality of using standard statistical inference approaches as approximations in non-probability settings. However, this comes with two warnings. First, the sample selection procedure should be clearly described, documented, presented and critically evaluated. We thus join the AAPOR recommendation that the methods used to draw the sample, collect the data, adjust it and make inferences should be documented in even more detail than for probability samples (Baker et al., 2013).

The same should also apply to reporting on standard data quality indicators. Within this context, the AAPOR statement related to the NYT introduction of non-probability sampling is very illustrative (AAPOR, 2014).

Second, we should also elaborate on the underlying assumptions (e.g. the models used) and provide explanations of the conceptual divergences, dangers, risks and limitations of the interpretations (AAPOR, 2015). When standard statistical inference is applied to any non-probability sample, the minimum should thus be to clearly acknowledge that estimates, confidence intervals, model fitting and hypothesis testing may not work properly or may not work at all. Of course, the entire approach should be accompanied by corresponding survey profession integrity, which means reporting exhaustively and transparently.

Even though these concluding remarks predominantly relate to the prevailing situation, where we use probability sampling principles as approximations in a non-probability setting, they are also highly relevant to dedicated statistical modelling approaches, which have particular potential for future developments in this area.

NOTE

1 Some parts of this chapter rely on Section 2.2, which Vasja Vehovar wrote for Callegaro et al. (2015).

RECOMMENDED READING

Report of the AAPOR Taskforce on Non-Probability Sampling by Baker et al. (2013) provides an exhaustive and systematic insight into issues related to non-probability sampling.

Web Survey Methodology by Callegaro et al. (2015) provides an extensive treatment of non-probability sampling approaches in web surveys.

Online Panel Research: A Data Quality Perspective by Callegaro et al. (2014) provides a state-of-the-art insight into the non-probability nature of contemporary online panels.

REFERENCES

AAPOR (2014). AAPOR Response to New York Times/CBS News Poll: The Critical Role of Transparency and Standards in Today’s World of Polling and Opinion Research. Retrieved July 1, 2015, from http://www.aapor.org/AAPORKentico/Communications/Public-Statements.aspx

AAPOR (2015). The Code of Professional Ethics and Practices. Retrieved July 1, 2015, from http://www.aapor.org/AAPORKentico/Standards-Ethics/AAPOR-Code-of-Ethics.aspx

Agresti, A. and Franklin, C. (2013). Statistics: The Art and Science of Learning from Data (3rd edn). Oak View, CA: Pearson.

Baker, R. P., Brick, J. M., Bates, N., Battaglia, M. P., Couper, M. P., Dever, J. A., Gile, K. J. and Tourangeau, R. (2013). Report of the AAPOR Taskforce on Non-Probability Sampling. Retrieved from https://www.aapor.org/AAPORKentico/AAPOR_Main/media/MainSiteFiles/NPS_TF_Report_Final_7_revised_FNL_6_22_13.pdf

Bandilla, W., Bosnjak, M. and Altdorfer, P.

(2003). Survey administration effects? A comparison of web-based and traditional written self-administered surveys using the ISSP environment module. Social Science Computer Review, 21, 235–243.

Bethlehem, J. (2009). Applied Survey Methods.

A Statistical Perspective. (Wiley Series in Survey Methodology). Hoboken, NJ: Wiley.

Bethlehem, J. and Biffignandi, S. (2012).

Handbook of Web Surveys. Hoboken, NJ:

Wiley.

Callegaro, M., Baker, R., Bethlehem, J., Göritz, A., Krosnick, J. and Lavrakas, P. (eds) (2014). Online Panel Research: A Data Quality Perspective. Chichester: Wiley.

Callegaro, M., Lozar Manfreda, K. and Vehovar, V. (2015). Web Survey Methodology. London: Sage.

Couper, M., Kapteyn, A., Schonlau, M. and Winter, J. (2007). Noncoverage and nonresponse in an Internet survey. Social Science Research, 36, 131–148.

Desilver, D. (2014). Q/A: What the New York Times' polling decision means. Retrieved July 1, 2015, from http://www.pewresearch.org/fact-tank/2014/07/28/qa-what-the-new-york-times-polling-decision-means/

ESOMAR (2015). ESOMAR/GRBN Guideline for Online Research. Retrieved June 25, 2015, from https://www.esomar.org/knowledge-and-standards/codes-and-guidelines/esomar-grbn-online-research-guideline-draft.php

Fowler, F. J. (2014). Survey Research Methods (5th edn). Thousand Oaks, CA: Sage.

Fricker, R. (2008). Sampling methods for web and e-mail surveys. In N. Fielding, R. Lee and G. Blank (eds), The SAGE Handbook of Online Research Methods (pp. 195–216). London: Sage.

Groves, R., Fowler, F., Couper, M., Lepkowski, J., Singer, E. and Tourangeau, R. (2009). Survey Methodology. Hoboken, NJ: Wiley.

Healey, J. F. (2011). Statistics: A Tool for Social Research. Belmont, CA: Wadsworth Publishing.

Kalton, G. (1983). Introduction to Survey Sampling (SAGE University Paper Series on Quantitative Applications in the Social Sciences, no. 07-035). Beverly Hills, CA and London: Sage.

Kalton, G. and Flores-Cervantes, I. (2003). Weighting methods. Journal of Official Statistics, 19(2), 81–97.

Kiaer, A. N. (1997). The Representative Method of Statistical Surveys (English translation by Christiania Videnskabsselskabets Skrifter). Oslo, Norway: Statistics Norway. (Reprinted from Den repræsentative undersøkelsesmetode, II. Historisk-filosofiske klasse, No. 4, 1897).

Kish, L. (1965). Survey Sampling. New York: Wiley.

Kish, L. (1998). Quota sampling: Old Plus New Thought. Personal communication. Retrieved June 25, 2015, from http://www.websm.org/db/12/11504/Web%20Survey%20Bibliography/Quota_sampling_Old_Plus_New_Thought/

Kreuter, F., Olson, K., Wagner, J., Yan, T., Ezzati-Rice, T., Casas-Cordero, C., Lemay, M., Peytchev, A., Groves, R. and Raghunathan, T. (2011). Using proxy measures and other correlates of survey outcomes to adjust for non-response: examples from multiple surveys. Journal of the Royal Statistical Society, Series A, 173(2), 389–407.

Lee, S. (2006). Propensity score adjustment as a weighting scheme for volunteer panel web surveys. Journal of Official Statistics, 22, 329–349.

Lohr, S. L. (2010). Sampling: Design and Analysis. Pacific Grove, CA: Brooks/Cole Publishing Company.

Macer, T. and Wilson, S. (2014). The 2013 Market Research Technology Survey. London: Meaning Ltd.

Malhotra, N. and Krosnick, J. (2007). The effect of survey mode and sampling on inferences about political attitudes and behaviour: comparing the 2000 and 2004 ANES to Internet surveys with nonprobability samples. Political Analysis, 15, 286–324.

Rivers, D. (2007). Sampling for web surveys. In Proceedings of the Joint Statistical Meetings, Survey Research Methods Section. Salt Lake City, UT: AMSTAT.

Rivers, D. (2010). AAPOR report on online panels. Presented at the American Association for Public Opinion Research (AAPOR) 65th Annual Conference, Chicago, Illinois, and at the PAPOR 2010 Annual Conference, San Francisco.

Rivers, D. and Bailey, D. (2009). Inference from matched samples in the 2008 U.S. national elections. Presented at the American Association for Public Opinion Research (AAPOR) 64th Annual Conference, Hollywood, Florida.

Rosenbaum, P. and Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.

Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701.

Särndal, C.-E., Swensson, B. and Wretman, J. (2003). Model Assisted Survey Sampling (Series in Statistics). New York: Springer.

Schonlau, M., van Soest, A. and Kapteyn, A. (2007). Are 'Webographic' or attitudinal questions useful for adjusting estimates from web surveys using propensity scoring? Survey Research Methods, 1, 155–163.

Schonlau, M., van Soest, A., Kapteyn, A. and Couper, M. (2009). Selection bias in web surveys and the use of propensity scores. Sociological Methods & Research, 37, 291–318.

Steinmetz, S., Bianchi, A., Biffignandi, S. and Tijdens, K. (2014). Improving web survey quality – potentials and constraints of propensity score weighting. In M. Callegaro, R. Baker, J. Bethlehem, A. Göritz, J. Krosnick and P. Lavrakas (eds), Online Panel Research: A Data Quality Perspective (chapter 12, pp. 273–298) (Series in Survey Methodology). New York: Wiley.

Steinmetz, S., Raess, D., Tijdens, K. and de Pedraza, P. (2013). Measuring wages worldwide – exploring the potentials and constraints of volunteer web surveys. In N. Sappleton (ed.), Advancing Social and Business Research Methods with New Media Technologies. Hershey, PA: IGI Global.

Terhanian, G., Bremer, J., Smith, R. and Thomas, R. (2000). Correcting Data from Online Surveys for the Effects of Nonrandom Selection and Nonrandom Assignment. Research paper, Harris Interactive, Rochester, NY.

Thomas, R. and Barlas, F. M. (2014). Respondents playing fast and loose? Antecedents and consequences of respondent speed of completion. Presented at the American Association for Public Opinion Research (AAPOR) 69th Annual Conference, Anaheim, California.

Toepoel, V., Das, M. and van Soest, A. (2009). Relating question type to panel conditioning: a comparison between trained and fresh respondents. Survey Research Methods, 3(2), 73–80.

Tourangeau, R., Conrad, F. and Couper, M. (2013). The Science of Web Surveys. New York: Oxford University Press.

Valliant, R. and Dever, J. (2011). Estimating propensity adjustments for volunteer web surveys. Sociological Methods & Research, 40(1), 105–137.

Valliant, R., Dever, J. and Kreuter, F. (2013). Practical Tools for Designing and Weighting Survey Samples (Statistics for Social and Behavioral Sciences Series). New York: Springer.

Valliant, R., Dorfman, A. H. and Royall, R. M. (2000). Finite Population Sampling and Inference: A Prediction Approach. New York: John Wiley.

Van Ossenbruggen, R., Vonk, T. and Willems, P. (2006). Dutch Online Panel Comparison Study (NOPVO). Retrieved July 1, 2015, from http://www.nopvo.nl/

Vannieuwenhuyze, J. (2014). On the relative advantage of mixed-mode versus single-mode surveys. Survey Research Methods, 8(1), 31–42.

Vehovar, V., Berzelak, N. and Lozar Manfreda, K. (2010). Mobile phones in an environment of competing survey modes: applying metric for evaluation of costs and errors. Social Science Computer Review, 28(3), 303–318.

Vehovar, V., Motl, A., Mihelič, L., Berčič, B. and Petrovčič, A. (2012). Zaznava sovražnega govora na slovenskem spletu [Perception of hate speech on the Slovenian web]. Teorija in Praksa, 49(1), 171–189.

Yeager, D., Krosnick, J., Chang, L., Javitz, H., Levendusky, M., Simpser, A. and Wang, R. (2011). Comparing the accuracy of RDD telephone surveys and Internet surveys conducted with probability and non-probability samples. Public Opinion Quarterly, 75(4), 709–747.
