
Distribution of a Variable

The first step in analyzing data collected on a variable is to look at the observed values by using graphs and numerical summaries. The goal is to describe key features of the distribution of a variable.

In Practice: Data Analysis Depends on Type of Variable

Why do we care whether a variable is quantitative or categorical, or whether a quantitative variable is discrete or continuous? We’ll see that the method used to analyze a data set will depend on the type of variable the data represent.

In Words

A discrete variable is usually a count (“the number of . . .”). A continuous variable has a continuum of infinitely many possible values (such as time, distance, or physical measurements such as weight and height).

Distribution

The distribution of a variable describes how the observations fall (are distributed) across the range of possible values.

For a categorical variable, the possible values are the different categories, and each observation falls in one of the categories. The distribution for a categorical variable then simply shows all possible categories and the number (or proportion) of observations falling into each category. For a quantitative variable, the entire range of possible values is split up into separate intervals, and the number (or proportion) of observations falling in each interval is given.

Section 2.1 Different Types of Data 55

The distribution can be displayed by a graph (see next section) or a table.

Features to look for in the distribution of a categorical variable are the category with the largest frequency, called the modal category, and more generally how frequently each category was observed. Features to look for in the distribution of a quantitative variable are its shape (do observations cluster in certain intervals and/or are they spread thin in others?), center (where does a typical observation fall?), and variability (how tightly are the observations clustering around a center?). We will learn more about these features and how to visualize them in the next section.

Frequency Table

A frequency table displays the distribution of a variable numerically.

Frequency Table

A frequency table is a listing of possible values for a variable, together with the number of observations for each value.

For a categorical variable, a frequency table lists the categories and the number of times each category was observed. A frequency table can also display the proportion or percentage of observations falling in each category.

Proportion and Percentage (Relative Frequencies)

The proportion of observations falling in a certain category is the number of observations in that category divided by the total number of observations. The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies and serve as a way to summarize the distribution of a categorical variable numerically.

Example 2

Shark Attacks

Picture the Scenario

The International Shark Attack File (ISAF) collects data on unprovoked shark attacks worldwide. When a shark attack is reported, the region where it took place is recorded. For the ten-year span from 2004 to 2013, a total of 689 unprovoked shark attacks were reported, with the largest number, 203, occurring in Florida. The frequency table in Table 2.1 shows the count for Florida and counts for other regions of the world (other U.S. states and some other countries with frequent shark attacks). For each region, the table lists the number (or frequency) of reported shark attacks in that region. The proportion is found by dividing the frequency by the total count of 689. The percentage equals the proportion multiplied by 100.

56 Chapter 2 Exploring Data with Graphs and Numerical Summaries

Questions to Explore

a. What is the variable that was observed? Is it categorical or quantitative?

b. How many observations were there? Show how to find the proportion and percentage for Florida.

c. Identify the modal category for this variable.

d. Describe the distribution of shark attacks.

Think It Through

a. For each observation (a reported shark attack), the region where the attack occurred was recorded. Each time a shark attack was reported, this created a new data point for the variable. Region of attack is the variable. It is categorical, with the categories being the regions shown in the first column of Table 2.1.

b. There were a total of 689 observations (shark attack reports) for this variable, with 203 reported in Florida, giving a proportion of 203/689 = 0.295. This tells us that roughly 3 out of 10 shark attacks were reported in Florida. The percentage is 100(0.295) = 29.5%.

c. For the regions listed, the greatest number of attacks occurred in Florida, with three-tenths of all reported attacks. Florida is the modal category because it shows the greatest frequency of attacks.

d. The relative frequencies displayed in Table 2.1 are numerical summaries of the variable region. They describe how shark attacks are distributed across the various regions: The largest share of the attacks reported in the International Shark Attack File (29.5%) occurred in Florida, followed by Australia (18.1%), Hawaii (7.4%), and South Africa (6.2%). The remaining 38.8% of attacks are distributed across several other U.S. states and international regions, with no single region having more than 5% of all attacks.

Table 2.1 Frequency of Shark Attacks in Various Regions for 2004–2013*

Region            Frequency   Proportion   Percentage
Florida                 203        0.295         29.5
Hawaii                   51        0.074          7.4
South Carolina           34        0.049          4.9
California               33        0.048          4.8
North Carolina           23        0.033          3.3
Australia               125        0.181         18.1
South Africa             43        0.062          6.2
Réunion Island           17        0.025          2.5
Brazil                   16        0.023          2.3
Bahamas                   6        0.009          0.9
Other                   138        0.200         20.0
Total                   689        1.000        100.0

* Source: Data from www.flmnh.ufl.edu/fish/sharks/statistics/statsw.htm. Current as of March 2013.
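The arithmetic behind Table 2.1 can be sketched in a few lines of Python. This is a minimal illustration rather than anything prescribed by the text: the counts are copied from Table 2.1, while the helper names `frequency_table` and `modal_category` are hypothetical and chosen only for this example.

```python
# Counts copied from Table 2.1 (reported unprovoked attacks, 2004-2013).
counts = {
    "Florida": 203, "Hawaii": 51, "South Carolina": 34, "California": 33,
    "North Carolina": 23, "Australia": 125, "South Africa": 43,
    "Réunion Island": 17, "Brazil": 16, "Bahamas": 6, "Other": 138,
}


def frequency_table(counts):
    """Return (total, rows); each row is (category, count, proportion, percentage)."""
    total = sum(counts.values())
    rows = []
    for category, count in counts.items():
        proportion = count / total  # relative frequency
        rows.append((category, count, round(proportion, 3), round(100 * proportion, 1)))
    return total, rows


def modal_category(counts):
    """Return the category with the largest frequency."""
    return max(counts, key=counts.get)


total, rows = frequency_table(counts)
for category, count, proportion, percentage in rows:
    print(f"{category:<15} {count:>5} {proportion:>7.3f} {percentage:>6.1f}")
print("Total:", total)
print("Modal category:", modal_category(counts))
```

Here the aggregated Other row is treated like any other category when searching for the mode; that is harmless for these counts because Florida is larger, but it is an assumption worth revisiting for other data sets.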


Insight

Don’t mistake the frequencies or counts for values of the variable. They merely summarize how many times the observation (a reported shark attack) occurred in each category (the various regions). The variable summarized here is the region in which the attack took place. For tables that summarize frequencies, the total proportion is 1.0 and the total percentage is 100%, as the last row of Table 2.1 shows. In practice, the separate numerical summaries may sum to a slightly different number (such as 99.9% or 100.1%) because of rounding.
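The rounding effect mentioned in this Insight can be checked directly on the Table 2.1 counts. The short sketch below (variable names are illustrative assumptions) compares the sum of the exact percentages with the sum of the percentages after each has been rounded to one decimal place, as they are printed in the table.

```python
# Frequencies from Table 2.1, in the order the regions are listed.
counts = [203, 51, 34, 33, 23, 125, 43, 17, 16, 6, 138]
total = sum(counts)

exact = [100 * c / total for c in counts]   # unrounded percentages
rounded = [round(p, 1) for p in exact]      # percentages as printed in the table

sum_exact = sum(exact)
sum_rounded = round(sum(rounded), 1)        # round again to hide float noise

print(f"Sum of exact percentages:   {sum_exact:.1f}")    # 100.0
print(f"Sum of rounded percentages: {sum_rounded:.1f}")  # 99.9
```

The printed entries add up to 99.9 rather than 100.0, which is the kind of small discrepancy the Insight describes.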

Try Exercises 2.8 and 2.9

Table 2.1 showed the distribution (as a frequency table) for a categorical variable. To show the distribution for a discrete quantitative variable, we would similarly list the distinct values and the frequency of each one occurring. (The table in the margin next to Example 6 on page 65 shows such a frequency table.) For a continuous quantitative variable (or when the number of possible outcomes is very large for a discrete variable), we divide the numeric scale on which the variable is measured into a set of nonoverlapping intervals and count the number of observations falling in each interval. The frequency table then shows these intervals together with the corresponding count. For example, in Section 2.5, we show the distribution of the waiting time (measured in minutes) between eruptions of the Old Faithful geyser in Yellowstone National Park. A frequency table for this variable follows; it uses six nonoverlapping intervals.

Frequency Table: Waiting Time Between Two Consecutive Eruptions of the Old Faithful Geyser

Minutes    Frequency   Percentage
< 50              21          7.7
50–60             56         20.6
60–70             26          9.6
70–80             77         28.3
80–90             80         29.4
> 90              12          4.4
Total            272        100.0
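Counting observations by interval can be sketched as follows. Two things here are assumptions rather than facts from the text: the sample `waits` is invented purely for illustration (it is not the Old Faithful data), and each interval is taken to include its lower endpoint, since the table does not say where a value of exactly 60 or 90 minutes belongs. The function names are likewise hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Interval boundaries (minutes) taken from the Old Faithful frequency table.
EDGES = [50, 60, 70, 80, 90]
LABELS = ["< 50", "50-60", "60-70", "70-80", "80-90", ">= 90"]


def interval_label(value):
    """Map one measurement to its interval (assumption: lower endpoint included)."""
    return LABELS[bisect_right(EDGES, value)]


def binned_frequency_table(values):
    """Return (label, count, percentage) rows, one per interval, in order."""
    tally = Counter(interval_label(v) for v in values)
    n = len(values)
    return [(label, tally[label], round(100 * tally[label] / n, 1)) for label in LABELS]


# Hypothetical waiting times, invented for illustration (NOT the real data).
waits = [43, 51, 54, 58, 62, 71, 75, 78, 79, 81, 82, 85, 88, 91]
for label, count, pct in binned_frequency_table(waits):
    print(f"{label:<6} {count:>3} {pct:>5.1f}")
```

Because the intervals do not overlap and together cover the whole scale, every observation is counted exactly once, so the counts always add up to the number of observations.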

2.1 Practicing the Basics

2.1 Categorical/quantitative difference

a. Explain the difference between categorical and quantitative variables.

b. Give an example of each.

2.2 Common types of cancer in 2012 Of all cancer cases around the world in 2012, 13% were lung cancer, 11.9% were breast cancer, 9.7% were colorectal cancer, 7.9% were prostate cancer, 6.8% were stomach cancer, and 50.7% were other types of cancer (www.wcrf.org/int/cancer-facts-figures/worldwide-data). Is the variable “cancer type” categorical or quantitative? Explain.

2.3 Classify the variable type Classify each of the following variables as categorical or quantitative.

a. The number of social media accounts you have (Facebook, Twitter, LinkedIn, Instagram, etc.)

b. Preferred soccer team

c. Choice of smartphone model to buy

d. Distance (in kilometers) of commute to work

2.4 Categorical or quantitative? Identify each of the following variables as either categorical or quantitative.

a. Choice of diet (vegan, vegetarian, neither)

b. Time spent shopping online per week

c. Ownership of a tablet (yes, no)

d. Number of siblings

2.5 Discrete/continuous

a. Explain the difference between a discrete variable and a continuous variable.

b. Give an example of each type.

2.6 Discrete or continuous? Identify each of the following variables as continuous or discrete.

a. The upload speed of an Internet connection

b. The number of apps installed on a tablet

c. The height of a tree

d. The number of emails you send in a day

2.7 Discrete or continuous 2 Repeat the previous exercise for the following:

a. The total playing time of a CD

b. The number of courses for which a student has received credit

c. The amount of money in your pocket (Hint: You could regard a number such as $12.75 as 1275 in terms of “the number of cents.”)

d. The distance between where you live and your statistics classroom, when you measure it precisely with values such as 0.5 miles, 2.4 miles, 5.38 miles


2.8 Number of children In the 2008 General Social Survey (GSS), 2020 respondents answered the question, “How many children have you ever had?” The results were

No. children     0     1     2     3     4     5     6     7    8+   Total
Count          521   323   524   344   160    77    30    19    22    2020

a. Is the variable, number of children, categorical or quantitative?

b. Is the variable, number of children, discrete or continuous?

c. Add proportions and percentages to this frequency table.

2.9 Fatal Shark Attacks Few of the shark attacks listed in Table 2.1 are fatal. Overall, 63 fatal shark attacks were recorded in the ISAF from 2004 to 2013, with 2 reported in Florida, 2 in Hawaii, 4 in California, 15 in Australia, 13 in South Africa, 6 in Réunion Island, 4 in Brazil, and 6 in the Bahamas. The rest occurred in other regions.

a. Construct the frequency table for the regions of the reported fatal shark attacks.

b. Identify the modal category.

c. Describe the distribution of fatal shark attacks across the regions.

2.2 Graphical Summaries of Data

Looking at a graph often gives you more of a feel for a variable and its distribution than looking at the raw data or a frequency table. In this section, we’ll learn about graphs for categorical variables and then graphs for quantitative variables. We’ll find out what we should look for in a graph to help us understand the distribution better.

Graphs for Categorical Variables

The two primary graphical displays for summarizing a categorical variable are the pie chart and the bar graph.

• A pie chart is a circle having a slice of the pie for each category. The size of a slice corresponds to the percentage of observations in the category.

• A bar graph displays a vertical bar for each category. The height of the bar is the percentage of observations in the category. Typically, the bars are separated by gaps rather than placed side by side.

Example 3

Shark Attacks in the United States

Picture the Scenario

For the United States alone, a total of 387 unprovoked shark attacks were reported between 2004 and 2013. Table 2.2 shows the breakdown by state; states such as Oregon, Alabama, or Georgia with only a few attacks are summarized in the Other category.

Questions to Explore

a. Display the distribution of shark attacks across U.S. states in a pie chart and a bar graph.

b. What percentage of attacks occurred in Florida and the Carolinas?

c. Describe the distribution of shark attacks across U.S. states.

Think It Through

a. The state where the attack occurred is a categorical variable. Each reported attack for the United States falls in one of the categories listed in Table 2.2. Figure 2.1 shows the pie chart based on the frequencies listed in Table 2.2. States with more frequent attacks have larger slices of the pie. The percentages are included in the labels for each slice of the pie. Figure 2.2 shows the bar graph. The states with larger percentages have higher bars. The scale for the percentages is shown on the vertical axis. The width is the same for each bar.

Table 2.2 Unprovoked Shark Attacks in the U.S. Between 2004 and 2013*

U.S. State        Frequency   Proportion   Percentage
Florida                 203        0.525         52.5
Hawaii                   51        0.132         13.2
South Carolina           34        0.088          8.8
California               33        0.085          8.5
North Carolina           23        0.059          5.9
Texas                    16        0.041          4.1
Other                    27        0.070          7.0
Total                   387        1.000        100.0

*Source: http://www.flmnh.ufl.edu/fish/sharks/statistics/statsus.htm

b. Of all U.S. attacks, 67% (52.5 + 8.8 + 5.9) occurred in Florida and the Carolinas.

c. As the bar graph (or pie chart) shows, 52% of all shark attacks reported for the United States occurred in Florida. Florida is the modal category. Far fewer attacks were reported for Hawaii (13%), South Carolina (9%), California (9%), North Carolina (6%), and Texas (4%). The remaining 7% of attacks occurred in other U.S. states.
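The combined percentage in part (b) can be verified from the Table 2.2 counts. The sketch below, with illustrative variable names, adds the frequencies for Florida and the two Carolinas and divides by the U.S. total, which avoids accumulating rounding error from the three separate percentages.

```python
# Counts copied from Table 2.2 (U.S. attacks, 2004-2013).
us_counts = {
    "Florida": 203, "Hawaii": 51, "South Carolina": 34, "California": 33,
    "North Carolina": 23, "Texas": 16, "Other": 27,
}
total = sum(us_counts.values())

group = ["Florida", "South Carolina", "North Carolina"]
combined = sum(us_counts[state] for state in group)
combined_pct = round(100 * combined / total, 1)

print(f"{combined} of {total} attacks = {combined_pct}%")
```

Here the result, 67.2%, agrees with adding the rounded table entries (52.5 + 8.8 + 5.9), although in general the two approaches can differ slightly in the last digit.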

Insight

The pie chart and bar graph are both simple to construct using software. The bar graph is generally easier to read and more flexible. With a pie chart, when two slices are about the same size, it’s often unclear which value is larger. This distinction is clearer when comparing heights of bars in a bar graph. We’ll see that the bar graph can easily summarize how results compare for different groups (for instance, if we wanted to compare fatal and nonfatal attacks in the United States). Also, the bar graph is a better visual display when there are many categories.

Try Exercise 2.10

Figure 2.1 Pie Chart of Shark Attacks Across U.S. States. Slice labels: Florida (52%), Hawaii (13%), South Carolina (9%), California (9%), North Carolina (6%), Texas (4%), Other (7%). The label for each slice of the pie gives the category and the percentage of attacks in a state. The slice that represents the percentage of attacks reported in Hawaii is 13% of the total area of the pie. Question: Why is it beneficial to label the pie wedges with the percent? (Hint: Is it always clear which of two slices is larger and what percent a slice represents?)

Figure 2.2 Bar Graph of Shark Attacks Across U.S. States. The vertical axis shows Percent (%) on a scale from 0 to 60; the bars, from left to right, are Florida, Hawaii, South Carolina, California, North Carolina, Texas, and Other. Except for the Other category, which is shown last, the bars are ordered from largest to smallest based on the frequency of shark attacks. Question: What is the advantage of ordering the bars this way rather than alphabetically?

The bar graph in Figure 2.2 displays the categories in decreasing order of the category percentages except for the Other category. This order makes it easy to separate the categories with high percentages visually. In some applications, it is more natural to display them according to their alphabetical order or some other criterion. For instance, if the categories have a natural order, such as summarizing the percentages of grades (A, B, C, D, F) for students in a course, we’d use that order in listing the categories on the graph.
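The ordering used in Figure 2.2 can be sketched in plain Python: sort the named categories by decreasing frequency and keep the catch-all Other category last. The function names are hypothetical, and the text-only output draws horizontal bars of `#` characters for convenience, whereas the figure in the text uses vertical bars.

```python
# Counts copied from Table 2.2.
us_counts = {
    "Florida": 203, "Hawaii": 51, "South Carolina": 34, "California": 33,
    "North Carolina": 23, "Texas": 16, "Other": 27,
}


def ordered_categories(counts, last="Other"):
    """Sort categories by decreasing count, keeping the catch-all category last."""
    named = sorted((c for c in counts if c != last),
                   key=lambda c: counts[c], reverse=True)
    return named + ([last] if last in counts else [])


def text_bar_graph(counts, width=50):
    """Return one text line per category; bar length is proportional to percentage."""
    total = sum(counts.values())
    lines = []
    for category in ordered_categories(counts):
        pct = 100 * counts[category] / total
        bar = "#" * round(pct * width / 100)  # `width` characters represent 100%
        lines.append(f"{category:<15} {bar} {pct:.1f}%")
    return lines


for line in text_bar_graph(us_counts):
    print(line)
```

Sorting on the counts alone would place Other (27 attacks) ahead of North Carolina (23) and Texas (16), which is why the sketch handles it separately. For categories with a natural order, such as grades A through F, the sorting step would simply be skipped.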