• Tidak ada hasil yang ditemukan

Summarizing the Distribution of Two Discrete Variables

i.e. both judges give high ranks to good candidates and low ranks to not so good candidates. We are therefore interested in studying the association between the ranks given by the two judges. In all these examples, the intention lies in measuring the degree of association between two (or more) variables. We therefore need to study different ways of measuring the association level for different types of variables. In this chapter, we present measures and graphical summaries for the association of two variables—dependent on their scale.

Table 4.1 Contingency table for travel class and satisfaction

Overall rating of flight quality Total (row)

Poor Fair Good Very good

Travel class Economy business first

2 2 2 1 7

0 1 2 1 4

0 0 0 1 1

Total (column)

2 3 4 3 12

flying in economy class and rated the flight quality as poor, there were no passengers from both business class and first class who rated the flight quality as poor; there were 2 passengers who were flying in economy class and rated the quality as fair (2), and so on. Table4.1is a two-dimensional table summarizing this information.

Note that we not only summarize the joint frequency distribution of the two variables but also the distributions of the individual variables. Summing up the rows and columns of the table gives the respective frequency distributions. For example, the last column of the table demonstrates that 7 passengers were flying in economy class, 4 passengers were flying in business class and 1 passenger in first class.

Now we extend this example and discuss a general framework to summarize the absolute frequencies of two discrete variables in contingency tables. We use the following notations: Letx1,x2, . . . ,xk be thekclasses of a variableX and let y1,y2, . . . ,yl be thelclasses of another variableY. We assume that bothX andY are discrete variables. It is possible to summarize the absolute frequenciesni jrelated to(xi,yj),i=1,2, . . . ,k, j =1,2, . . . ,l, in ak×lcontingency tableas shown in Table4.2.

Table 4.2 k×lcontingency table

Y

y1 yj yl Total

(rows)

x1 n11 · · · n1j · · · n1l n1+

x2 n21 · · · n2j · · · n2l n2+

... ... ... ... ...

X xi ni1 · · · ni j · · · nil ni+

... ... ... ... ...

xk nk1 · · · nk j · · · nkl nk+

Total (columns)

n+1 · · · n+j · · · n+l n

We denote the sum of theith row asni+=l

j=1ni j and the sum over the jth column asn+j =k

i=1ni j. The total number of observations is therefore n=

k

i=1

ni+= l

j=1

n+j = k

i=1

l

j=1

ni j. (4.1)

Remark 4.1.1 Note that it is also possible to use the relative frequencies fi j =ni j/n instead of the absolute frequenciesni jin Table4.2, see Example4.1.2.

4.1.2 Joint, Marginal, and Conditional Frequency Distributions When the data on two variables are summarized in a contingency table, there are several concepts which can help us in studying the characteristics of the data. For example, how the values of both the variables behave jointly, how the values of one variable behave when another variable is kept fixed etc. These features can be studied using the concepts of joint frequency distribution, marginal frequency distri- bution, and conditional frequency distribution. If relative frequency is used instead of absolute frequency, then we speak of the joint relative frequency distribution, mar- ginal relative frequency distribution, and conditional relative frequency distribution.

Definition 4.1.1 Using the notations of Table4.2, we define the following:

The frequenciesni j represent thejoint frequency distributionofX andY. The frequenciesni+represent themarginal frequency distribution ofX. The frequenciesn+j represent themarginal frequency distributionofY. We define fiX|j|Y =ni j/n+j to be theconditional frequency distributionof X givenY =yj.

We define fjY|i|X =ni j/ni+to be theconditional frequency distributionofY givenX =xi

The frequencies fi j represent thejoint relative frequency distribution ofX andY.

The frequencies fi+=l

j=1 fi j represent themarginal relative frequency distribution of X.

The frequencies f+j =k

i=1 fi j represent themarginal relative frequency distributionofY.

We define fiX|j|Y = fi j/f+j to be theconditional relative frequency distrib- utionofXgivenY =yj.

We define fjY|i|X = fi j/fi+to be theconditional relative frequency distribu- tionofY givenX =xi.

Table 4.3 Contingency table for travel class and satisfaction Overall rating of flight quality

Poor Fair Good Very good Total (rows)

Travel class Economy 10 33 15 4 62

Business 0 3 20 2 25

First 0 0 5 8 13

Total (columns)

10 36 40 14 100

Note that for a bivariate joint frequency distribution, there will only be two mar- ginal (or relative) frequency distributions but possibly more than two conditional (or relative) frequency distributions.

Example 4.1.2 Recall the setup of Example4.1.1. We now collect and evaluate the responses of 100 customers (instead of 12 passengers as in Example4.1.1) regarding their choice of the travel class and their overall satisfaction with the flight quality.

The data is provided in Table4.3where each of the cell entries illustrates how many out of 100 passengers answeredxi and yj: for example, the first entry “10”

indicates that 10 passengers were flying in economy classanddescribed the overall service quality as poor.

• The marginal frequency distributions are displayed in the last column and last row, respectively. For example, the marginal distribution of X refers to the frequency table of “travel class” (X) and tells us that 62 passengers were flying in economy class, 25 in business class, and 13 in first class. Similarly, the marginal distribution of “overall rating of flight quality” (Y) tells us that 10 passengers rated the quality as poor, 36 as fair, 40 as good, and 14 as very good.

• The conditional frequency distributions give us an idea about the behaviour of one variable when the other one is kept fixed. For example, the conditional distribution of the “overall rating of flight quality” (Y) among passengers who were flying in economy class (fY|X=Economy) gives f1Y|1|X =10/62≈16 % which means that approximately 16 % of the customers in economy class are rating the quality as poor, f2Y|1|X =33/62≈53 % of the customers in economy class are rating the quality as fair, f3Y|1|X =15/62≈24 % of the customers in economy class are rating the quality as good and f4Y|1|X =4/62≈7 % of the customers in economy class are rating the quality as very good. Similarly, f3Y|2|X =20/25≈80 % which means that 80 % of the customers in business class are rating the quality as good and so on.

• The conditional frequency distribution of the “travel class” (X) of passengers given the “overall rating of flight quality” (Y) is obtained by fX|Y=Satisfaction level. For example, fX|Y=goodgives f1X|3|Y =15/40=37.5 % which means that 37.5 %

of the passengers who rated the flight to be good travelled in economy class, f2X|3|Y =20/40=50 % of the passengers who rated the flight to be good travelled in business class and f3X|3|Y =5/40=12.5 % of the passengers who rated the flight to be good travelled in first class.

• In total, we have 100 customers and hence n =

k

i=1

ni+=62+25+13= l

j=1

n+j =10+36+40+14

= k

i=1

l

j=1

ni j =10+33+15+4+ +3+20+2+5+8=100

• Alternatively, we can summarizeXandYusing the relative frequencies as follows:

Overall rating of flight quality

Poor Fair Good Very good Total (rows) Travel class Economy 10010 10033 10015 1004 10062

Business 0 1003 10020 1002 10025

First 0 0 1005 1008 10013

Total (columns) 10010 10036 10040 10014 1

To produce the frequency table without the marginal distributions, we can use the R commandtable(X,Y). To obtain the full contingency table including the marginal distributions inR, one can use the functionaddmargins(). For the relative frequencies, the functionprop.table()can be used. In summary, a full contingency table is obtained by using

addmargins(table(X,Y))

addmargins(prop.table(table(X,Y)))

4.1.3 Graphical Representation of Two Nominal or Ordinal Variables

Bar charts (see Sect.2.3.1) can be used to graphically summarize the association between two nominal or two ordinal variables. The bar chart is drawn forXand the categories ofY are represented by separated bars or stacked bars for each category ofX. In this way, we summarize the joint distribution of the contingency table.

Example 4.1.3 Consider Example4.1.2. There are 62 passengers flying in the econ- omy class. From these 62 passengers, 10 rated the quality of the flight as poor, 33 as fair, 15 as good, and 4 as very good. This means forX =x1(=Economy), we can

Class

absolute frequencies

0 10 20 30

1: Economy 2: Business 3: First

Rating

1=poor

2=fair 3=good

4=very good

Class

absolute frequencies

0 20 40 60

1: Economy 2: Business 3: First

Rating

1=poor

2=fair 3=good

4=very good

(a) Categories next to each other (b) Categories stacked Fig. 4.1 Bar charts for travel class and rating of satisfaction

either place 4 bars next to each other, as in Fig.4.1a, or we can stack them on top of each other, as in Fig.4.1b. The same can be done for the other categories of X, see Fig.4.1. Stacked and stratified bar charts are prepared inRby calling the library latticeand using the functionbar chart. In detail, one needs to specify:

Class <- c(rep('1: Economy',62),rep('2: Business',25), rep('3: First',13))

Rating <- c(rep('1=poor',10),rep('2=fair',33),...) library(lattice)

barchart(table(Class,Rating),horizontal=FALSE,stack=FALSE) barchart(table(Class,Rating),horizontal=FALSE,stack=TRUE)

Remark 4.1.2 There are several other options in R to specify stratified bar charts.

We refer the interested reader to Exercise 2.6 to explore how theRpackageggplot2 can be used to make such graphics. Sometimes it can also be useful to visualize the difference of two variables and not stack or stratify the bars, see Exercise 2.1.

Independence and Expected FrequenciesAn important statistical concept isinde- pendence. In this section, we touch upon its descriptive aspects, see Chaps.6 (Sect.6.5) and7(Sect.7.5) for more theoretical details. Two variables are considered to be independent if the observations on one variable do not influence the observa- tions on the other variable. For example, suppose two different persons roll a die separately; then, the outcomes of their rolls do not depend on each other. So we can say that the two observations are independent. In the context of contingency tables, two variables are independent of each other when the joint relative frequency equals the product of the marginal relative frequencies of the two variables, i.e. the

Table 4.4 Observed and expected absolute frequencies for the airline survey Overall rating of flight quality

Poor Fair Good Very good Total

Travel Economy 10 (6.2) 33 (22.32) 15 (24.8) 4 (8.68) 62

class Business 0 (2.5) 3 (9.0) 20 (10.0) 2 (3.5) 25

First 0 (1.3) 0 (4.68) 5 (5.2) 8 (1.82) 13

Total 10 36 40 14 100

following equation holds:

fi j = fi+f+j. (4.2)

Theexpected absolute frequencies under independenceare obtained by

˜

ni j =n fi j =nni+

n n+j

n =ni+n+j

n . (4.3)

Note that the absolute frequencies are always integers but the expected absolute frequencies may not always be integers.

Example 4.1.4 Recall Example 4.1.2. The expected absolute frequencies for the contingency table can be calculated using (4.3). For example,

˜

n11= 62·10

100 =6.2, n˜12= 62·36

100 =22.32 etc.

Table4.4lists both the observed absolute frequency and expected absolute frequency (in brackets).

To calculate the expected absolute frequencies inR, we can access the “expected”

object returned from aχ2-test applied to the respective contingency table as follows:

chisq.test(table(Class,Rating))$expected

A detailed motivation and explanation of this command is given in Sect.10.8.