This brings us to the critical notion of a sample. A sample is the part of the population we actually measure. Sampling is the process of selecting those members of the population we will measure. Different ways of sampling lead to different types of samples. The types of statistical error we can encounter in our study depend on how our sample differs from the population we are interested in. Understanding the limits of how confident we can be about the results of our study is critically tied to the types of statistical error created.
KEY POINT
Choosing the right sampling procedure and knowing the errors it creates is critical to the design and execution of any statistical study.
The relationship between sampling and error is not as hard as it seems. We begin by wanting to know general facts about a situation: What were last year’s sales like? How will our current customers react to a price increase? Which job applicants will make the best employees? How many rejects will result from a new manufacturing process? If we can measure all of last year’s sales, all of our current customers, all of our future job applicants, etc., we will have a comprehensive sample and we will only have to worry about measurement error. But to the degree that our sample does not include someone or something in the population, any statistics we calculate will have errors. General descriptions of some of last year’s sales, some of our current customers, or just the current crop of job applicants will be different from general descriptions of all of the sales, customers, or applicants, respectively. Which members of the population get left out of our measurements determines what the error will be.
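As a rough illustration of the point above, the following Python sketch compares a statistic computed from a comprehensive sample with the same statistic computed from two partial samples. The monthly sales figures are invented for this example, and the variable names are hypothetical. It is only meant to show that the error depends on which members are left out.

```python
from statistics import mean

# Hypothetical monthly sales for last year (invented numbers, for illustration only).
monthly_sales = [120, 115, 130, 140, 150, 160, 170, 165, 155, 180, 210, 250]

# Comprehensive sample: every month is measured, so there is no sampling error.
population_mean = mean(monthly_sales)

# Two partial samples that leave out different members of the population.
first_half_mean = mean(monthly_sales[:6])    # leaves out July through December
second_half_mean = mean(monthly_sales[6:])   # leaves out January through June

# Sampling error here is the sample statistic minus the population value.
error_first_half = first_half_mean - population_mean
error_second_half = second_half_mean - population_mean

print(f"All twelve months: {population_mean:.2f}")
print(f"First half only:   {first_half_mean:.2f} (error {error_first_half:+.2f})")
print(f"Second half only:  {second_half_mean:.2f} (error {error_second_half:+.2f})")
```

In this made-up data, leaving out the strong end-of-year months understates average sales, while leaving out the weaker early months overstates it by the same amount.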
HANDY HINTS
Note that sampling error is a question of validity, not reliability. That is, sampling error introduces bias. Differences between the sample and the population will create statistical results that are different from what the results would have been for the entire population, which is what we started out wanting to know. On the other hand, our choice of sample size affects reliability. The larger the sample size in proportion to the population, the more reliable the statistics will be, whether they are biased or not.
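The distinction between bias and reliability can be sketched with a small simulation. Everything in it is an assumption made for illustration: the population is 10,000 simulated order values, and the "biased" procedure can only reach customers whose orders exceed 90. The sketch draws many samples of each size and reports how far the sample means sit from the true mean on average (bias) and how much they vary from sample to sample (spread).

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated customer order values.
population = [rng.gauss(100, 20) for _ in range(10_000)]
true_mean = fmean(population)

# Hypothetical biased procedure: only customers with orders above 90 can be reached.
biased_frame = [value for value in population if value > 90]

def repeated_sample_means(frame, n, repeats=200):
    """Draw `repeats` random samples of size n from frame; return each sample's mean."""
    return [fmean(rng.sample(frame, n)) for _ in range(repeats)]

results = {}
for n in (10, 100, 1000):
    for label, frame in (("random", population), ("biased", biased_frame)):
        means = repeated_sample_means(frame, n)
        results[(label, n)] = {
            "bias": fmean(means) - true_mean,  # average distance from the truth
            "spread": pstdev(means),           # sample-to-sample variation
        }
        r = results[(label, n)]
        print(f"{label:>6}  n={n:<5} bias={r['bias']:+6.2f}  spread={r['spread']:5.2f}")
```

Under these assumptions, the spread shrinks as the sample size grows for both procedures, but the bias of the restricted procedure does not go away. A larger sample makes the biased answer more repeatable, not more correct.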
Here are some of the most common types of samples:
. Comprehensive sample. This is when the sample consists of the entire population, at least in principle. Most often, this kind of sample is not possible and when it is possible, it is rarely practical.
. Random sample. This is when the sample is selected randomly from the population. In this context, randomly means that every member of the population has an equal chance of being selected as part of the sample. In most situations, this is the best kind of sample to use.
. Convenience sample. This is usually the worst kind of sample to use, but, as its name implies, it is also the easiest. Convenience sampling means selecting the sample by the easiest and/or least costly method available. Whatever kinds of sampling error happen, happen. Convenience sampling is used very often, especially in small studies. The most important thing about using a convenience sample is to understand the types of errors most likely to happen, given the particular sampling procedure used and the particular population being sampled. Each convenience sampling process is unique and the types of sampling error created need to be understood and stated clearly in the statistical report.
. Systematic sample. This is when the sample is selected by a non-random procedure, such as picking every tenth product unit off of the assembly line for testing or every 50th customer off of a mailing list. The trick to systematic sampling is that, if the list of items is ordered in a way that is unrelated to the statistical questions of interest, a systematic sample can be just as good as, or even better than, a random sample. For example, if the customers are listed alphabetically by last name, it may be that every customer of a particular type will have an equal chance of being selected, even if not every customer has a chance of being selected. The problem is that it is not often easy to determine whether the order really is unrelated to what we want to know. If the stamping machine produces product molds in batches of ten, choosing every tenth item may miss defects in some part of the stamping mold.
. Stratified sample. Also called a stratified random sample. This is a sophisticated technique used when there are possible problems with ordinary random sampling, most often due to small sample size. It uses known facts about the population to systematically select subpopulations and then random sampling is used within each subpopulation. Stratified sampling requires an expert to plan and execute it.
. Quota sample. This is a variant on the convenience sample common in surveys. Each person responsible for data collection is assigned a quota and then uses convenience sampling, sometimes with restrictions. An advantage of quota sampling is that different data collectors may find different collection methods convenient. This can prevent the bias created by using just one convenient sampling method. The biggest problem with a quota sample is that a lot of folks find the same things convenient. In general, the problems of convenience samples apply to quota samples.
. Self-selected sample. This is a form of convenience sample where the subjects determine whether or not to be part of the sample. There are degrees of self-selection and, in general, the more self-selection the more problems and potential bias. Any sampling procedure that is voluntary for the subjects is contaminated with some degree of self-selection. (Sampling invoices from a file or products from an assembly line involves no self-selection because invoices and products lack the ability to refuse to be measured.) One of the most drastic forms of self-selection is used in the Internet polls common to TV news shows. Everyone is invited to log onto the Web and vote for this or that. But the choice to view the show is self-selection, and others do not get the invitation. Not everyone who gets the invitation has Internet access. Since having Internet access is a personal choice, there is self-selection there, as well. And lots and lots of folks with Internet access don’t vote on that particular question. The people who make choices that lead to hearing the invitation, being able to vote, and voting, are self-selected in at least these three different ways. On TV, we are told these polls are “not scientific.” That is polite. Self-selection tends to create very dangerous and misleading bias and should be minimized whenever possible.
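Three of the procedures in the list above (random, systematic, and stratified sampling) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a recommended implementation: the function names, the 100-unit production run, and the defective cavity number 3 are all hypothetical, chosen to mirror the stamping-mold example. It shows how a systematic sample can go wrong when the sampling step matches the batch size.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, rng):
    """Random sample: every member has an equal chance of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, step, start=0):
    """Systematic sample: every `step`-th member, beginning at index `start`."""
    return population[start::step]

def stratified_sample(population, stratum_of, per_stratum, rng):
    """Stratified sample: random sampling within each known subpopulation."""
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], per_stratum))
    return sample

def defect_rate(sample):
    """Fraction of units in the sample that are defective."""
    return sum(unit["defective"] for unit in sample) / len(sample)

# Hypothetical production run: 100 units stamped in batches of ten.
# Assumption for illustration: mold cavity 3 always produces a defective unit.
units = [{"id": i, "cavity": i % 10, "defective": i % 10 == 3} for i in range(100)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

true_rate = defect_rate(units)
every_tenth = systematic_sample(units, step=10)                  # only cavity 0
every_tenth_from_3 = systematic_sample(units, step=10, start=3)  # only cavity 3
random_20 = simple_random_sample(units, 20, rng)
stratified_20 = stratified_sample(units, lambda u: u["cavity"], 2, rng)

print("population:           ", true_rate)
print("every 10th, start 0:  ", defect_rate(every_tenth))
print("every 10th, start 3:  ", defect_rate(every_tenth_from_3))
print("random, n=20:         ", defect_rate(random_20))
print("stratified, 2/cavity: ", defect_rate(stratified_20))
```

Because the step of ten lines up with the batch size, each systematic sample sees only one cavity, so it either misses every defect or sees nothing but defects. Stratifying by cavity relies on knowing in advance that cavity matters, which is the kind of known fact about the population that stratified sampling depends on.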
We will have much more to say about exactly what kinds of errors result from sampling in Chapters 3, 8, and 11. There is always more to learn about sampling. Note that, although we discussed measurement first, the practical order is: Define the population; Select the sample; Take the measurements.
When we have that, we have our data. Once we clean up our data (see Chapter 6, “Getting the Data,” about that), we are ready to analyze the data.
Analysis
Analysis is the process that follows measurement. In Chapter 1, “Statistics for Business,” we discussed the difference between descriptive and inferential