Statistics deals not only with the organization and analysis of data once it has been collected but also with the development of techniques for collecting the data. If data is not properly collected, an investigator may not be able to answer the questions under consideration with a reasonable degree of confidence. One common problem is that the target population (the one about which conclusions are to be drawn) may be different from the population actually sampled. For example, advertisers would like various kinds of information about the television-viewing habits of potential customers. The most systematic information of this sort comes from placing monitoring devices in a small number of homes across the United States. It has been conjectured that placement of such devices in and of itself alters viewing behavior, so that characteristics of the sample may be different from those of the target population.
When data collection entails selecting individuals or objects from a frame, the simplest method for ensuring a representative selection is to take a simple random sample. This is one for which any particular subset of the specified size (e.g., a sample of size 100) has the same chance of being selected. For example, if the frame consists of 1,000,000 serial numbers, the numbers 1, 2, ..., 1,000,000 could be placed on identical slips of paper. After placing these slips in a box and thoroughly mixing, slips could be drawn one by one until the requisite sample size has been obtained. Alternatively (and much to be preferred), a table of random numbers or a computer’s random number generator could be employed.
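The random number generator approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name simple_random_sample and the seed value are hypothetical, and the frame is taken to be the serial numbers 1 through 1,000,000 from the example above.

```python
import random

def simple_random_sample(frame_size, n, seed=None):
    """Draw n distinct serial numbers from 1..frame_size.

    random.sample selects without replacement, so every subset of
    size n has the same chance of being chosen.
    """
    rng = random.Random(seed)  # seed is optional; fixing it makes the draw repeatable
    return rng.sample(range(1, frame_size + 1), n)

# Frame of 1,000,000 serial numbers, sample of size 100 (as in the text)
sample = simple_random_sample(1_000_000, 100, seed=1)
print(sorted(sample)[:5])
```

Sampling from a range object avoids building a list of a million entries, which is the software analogue of not having to write out a million slips of paper.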
Sometimes alternative sampling methods can be used to make the selection process easier, to obtain extra information, or to increase the degree of confidence in conclusions. One such method, stratified sampling, entails separating the population units into nonoverlapping groups and taking a sample from each one. For example, a manufacturer of DVD players might want information about customer satisfaction for units produced during the previous year. If three different models were manufactured and sold, a separate sample could be selected from each of the three corresponding strata. This would result in information on all three models and ensure that no one model was over- or underrepresented in the entire sample.
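A stratified selection can be sketched as a simple random sample taken separately within each stratum. The model names, frame sizes, and per-stratum sample size below are hypothetical placeholders, not figures from the DVD player example.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum units from each stratum.

    strata maps a stratum name to the list of units in that stratum;
    the strata are assumed to be nonoverlapping.
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

# Hypothetical frame: serial numbers of units sold, grouped by model
strata = {
    "model_A": list(range(1, 501)),
    "model_B": list(range(501, 801)),
    "model_C": list(range(801, 1001)),
}

picked = stratified_sample(strata, 20, seed=1)
for model, units in picked.items():
    print(model, len(units))
```

Drawing the same number from each stratum is only one possible allocation; sample sizes proportional to stratum size are another common choice.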
Frequently a “convenience” sample is obtained by selecting individuals or objects without systematic randomization. As an example, a collection of bricks may be stacked in such a way that it is extremely difficult for those in the center to be selected. If the bricks on the top and sides of the stack were somehow different from the others, resulting sample data would not be representative of the population. Often an investigator will assume that such a convenience sample approximates a random sample, in which case a statistician’s repertoire of inferential methods can be used; however, this is a judgment call. Most of the methods discussed herein are based on a variation of simple random sampling described in Chapter 5.
Engineers and scientists often collect data by carrying out some sort of designed experiment. This may involve deciding how to allocate several different treatments (such as fertilizers or coatings for corrosion protection) to the various experimental units (plots of land or pieces of pipe). Alternatively, an investigator may systematically vary the levels or categories of certain factors (e.g., pressure or type of insulating material) and observe the effect on some response variable (such as yield from a production process).
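One way to allocate treatments to experimental units is to randomize the assignment. The sketch below shuffles the units and then deals the treatments out in rotation, which gives equal group sizes when the number of units is a multiple of the number of treatments. The plot and fertilizer labels are hypothetical, and this is one simple allocation scheme rather than the only possibility.

```python
import random

def randomize_treatments(units, treatments, seed=None):
    """Randomly assign treatments to units with (nearly) equal group sizes."""
    rng = random.Random(seed)
    shuffled = list(units)   # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)    # random order of experimental units
    # Deal treatments out in rotation over the shuffled units
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

# Hypothetical experiment: 12 plots of land, 3 fertilizers
plots = [f"plot_{k}" for k in range(1, 13)]
fertilizers = ["fertilizer_A", "fertilizer_B", "fertilizer_C"]

allocation = randomize_treatments(plots, fertilizers, seed=1)
for plot in plots:
    print(plot, allocation[plot])
```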
An article in the New York Times (Jan. 27, 1987) reported that heart attack risk could be reduced by taking aspirin. This conclusion was based on a designed experiment involving both a control group of individuals that took a placebo having the appearance of aspirin but known to be inert and a treatment group that took aspirin according to a specified regimen. Subjects were randomly assigned to the groups to protect against any biases and so that probability-based methods could be used to analyze the data. Of the 11,034 individuals in the control group, 189 subsequently experienced heart attacks, whereas only 104 of the 11,037 in the aspirin group had a heart attack. The incidence rate of heart attacks in the treatment group was only about half that in the control group (roughly 0.9% versus 1.7%). One possible explanation for this result is chance variation: that aspirin really doesn’t have the desired effect and the observed difference is just typical variation, in the same way that tossing two identical coins would usually produce different numbers of heads. However, in this case, inferential methods suggest that chance variation by itself cannot adequately explain the magnitude of the observed difference. ■
Example 1.5
An engineer wishes to investigate the effects of both adhesive type and conductor material on bond strength when mounting an integrated circuit (IC) on a certain substrate. Two adhesive types and two conductor materials are under consideration. Two observations are made for each adhesive-type/conductor-material combination, resulting in the accompanying data:
Adhesive Type    Conductor Material    Observed Bond Strength    Average
1                1                     82, 77                    79.5
1                2                     75, 87                    81.0
2                1                     84, 80                    82.0
2                2                     78, 90                    84.0
Figure 1.3 Average bond strengths in Example 1.5 (average strength plotted against conducting material, with one line for each adhesive type)
The resulting average bond strengths are pictured in Figure 1.3. It appears that adhesive type 2 improves bond strength as compared with type 1 by about the same amount whichever one of the conducting materials is used, with the 2, 2 combination being best. Inferential methods can again be used to judge whether these effects are real or simply due to chance variation.
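The averages and the "about the same amount" comparison can be checked directly from the observations in Example 1.5. In the sketch below the variable names (observations, cell_means, adhesive_effect) are illustrative; only the eight bond strength values come from the example.

```python
# Bond strength observations from Example 1.5,
# keyed by (adhesive type, conductor material)
observations = {
    (1, 1): [82, 77],
    (1, 2): [75, 87],
    (2, 1): [84, 80],
    (2, 2): [78, 90],
}

# Average bond strength for each adhesive/conductor combination
cell_means = {cell: sum(values) / len(values) for cell, values in observations.items()}

# Improvement of adhesive type 2 over type 1, separately for each conductor material
adhesive_effect = {
    conductor: cell_means[(2, conductor)] - cell_means[(1, conductor)]
    for conductor in (1, 2)
}

print(cell_means)
print(adhesive_effect)
```

The improvement from adhesive type 2 is 2.5 with conductor material 1 and 3.0 with conductor material 2. Whether differences of this size exceed what chance variation alone would produce is the inferential question raised in the text; this calculation does not answer it.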
Suppose additionally that there are two cure times under consideration and also two types of IC post coating. There are then 2 · 2 · 2 · 2 = 16 combinations of these four factors, and our engineer may not have enough resources to make even a single observation for each of these combinations. In Chapter 11, we will see how the careful selection of a fraction of these possibilities will usually yield the desired information. ■
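The count of combinations can be confirmed by listing them. The sketch below enumerates every combination of the four two-level factors; the factor names and the level labels 1 and 2 are illustrative choices, since the text does not name specific cure times or coatings.

```python
from itertools import product

# Four factors, each at two levels (level labels are placeholders)
factors = {
    "adhesive_type": (1, 2),
    "conductor_material": (1, 2),
    "cure_time": (1, 2),
    "ic_post_coating": (1, 2),
}

# One dict per combination of levels: the full factorial layout
combinations = [
    dict(zip(factors, levels)) for levels in product(*factors.values())
]

print(len(combinations))  # 16
for combo in combinations[:3]:
    print(combo)
```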
EXERCISES Section 1.1 (1–9)
1. Give one possible sample of size 4 from each of the following populations:
a. All daily newspapers published in the United States
b. All companies listed on the New York Stock Exchange
c. All students at your college or university
d. All grade point averages of students at your college or university
2. For each of the following hypothetical populations, give a plausible sample of size 4:
a. All distances that might result when you throw a football
b. Page lengths of books published 5 years from now
c. All possible earthquake-strength measurements (Richter scale) that might be recorded in California during the next year
d. All possible yields (in grams) from a certain chemical reaction carried out in a laboratory
3. Consider the population consisting of all computers of a certain brand and model, and focus on whether a computer needs service while under warranty.
a. Pose several probability questions based on selecting a sample of 100 such computers.
b. What inferential statistics question might be answered by determining the number of such computers in a sample of size 100 that need warranty service?
4. a. Give three different examples of concrete populations and three different examples of hypothetical populations.
b. For one each of your concrete and your hypothetical populations, give an example of a probability question and an example of an inferential statistics question.
5. Many universities and colleges have instituted supplemental instruction (SI) programs, in which a student facilitator meets regularly with a small group of students enrolled in the course to promote discussion of course material and enhance subject mastery. Suppose that students in a large statistics course (what else?) are randomly divided into a control group that will not participate in SI and a treatment group that will participate. At the end of the term, each student’s total score in the course is determined.
a. Are the scores from the SI group a sample from an existing population? If so, what is it? If not, what is the relevant conceptual population?
b. What do you think is the advantage of randomly dividing the students into the two groups rather than letting each student choose which group to join?
c. Why didn’t the investigators put all students in the treatment group? Note: The article “Supplemental Instruction: An Effective Component of Student Affairs Programming” (J. of College Student Devel., 1997: 577–586) discusses the analysis of data from several SI programs.
6. The California State University (CSU) system consists of 23 campuses, from San Diego State in the south to Humboldt State near the Oregon border. A CSU administrator wishes to make an inference about the average distance between the hometowns of students and their campuses. Describe and discuss several different sampling methods that might be employed. Would this be an enumerative or an analytic study? Explain your reasoning.
7. A certain city divides naturally into ten district neighborhoods. How might a real estate appraiser select a sample of single-family homes that could be used as a basis for developing an equation to predict appraised value from characteristics such as age, size, number of bathrooms, distance to the nearest school, and so on? Is the study enumerative or analytic?
8. The amount of flow through a solenoid valve in an automobile’s pollution-control system is an important characteristic. An experiment was carried out to study how flow rate depended on three factors: armature length, spring load, and bobbin depth. Two different levels (low and high) of each factor were chosen, and a single observation on flow was made for each combination of levels.
a. The resulting data set consisted of how many observations?
b. Is this an enumerative or analytic study? Explain your reasoning.
9. In a famous experiment carried out in 1882, Michelson and Newcomb obtained 66 observations on the time it took for light to travel between two locations in Washington, D.C. A few of the measurements (coded in a certain manner) were 31, 23, 32, 36, 22, 26, 27, and 31.
a. Why are these measurements not identical?
b. Is this an enumerative study? Why or why not?
1.2 Pictorial and Tabular Methods in Descriptive Statistics
Descriptive statistics can be divided into two general subject areas. In this section, we consider representing a data set using visual techniques. In Sections 1.3 and 1.4, we will develop some numerical summary measures for data sets. Many visual techniques may already be familiar to you: frequency tables, tally sheets, histograms, pie charts, bar graphs, scatter diagrams, and the like. Here we focus on a selected few of these techniques that are most useful and relevant to probability and inferential statistics.