
NONPARAMETRIC EFFECT SIZE INDEXES


The whole of science is nothing more than a refinement of everyday thinking.

—Albert Einstein (1973, p. 283)

Some outcomes are categorical instead of continuous. The levels of a categorical outcome are mutually exclusive, and each case is classified into just one level. Nonparametric effect size indexes for categorical outcomes that are widely used in areas such as medicine, epidemiology, and genetics are introduced in this chapter. They are also frequently analyzed in meta-analysis. Note that some of these indexes can also be estimated in techniques such as log-linear analysis or logistic regression (W. Rodgers, 1995; Wright, 1995). Doing so bases nonparametric effect size indexes on an underlying statistical model. In contrast, the same indexes computed with the methods described later should be considered descriptive statistics. Exercises with answers for this chapter are available on this book's Web site.

CATEGORICAL OUTCOMES

The simplest categorical outcomes are binary variables (dichotomies) with only two levels, such as relapsed or not relapsed. When two groups are compared on a dichotomy, the data are frequencies that are represented in a 2 × 2 contingency table, also known as a fourfold table. Categorical variables can also have more than two levels, such as agree, disagree, and uncertain. The size of the contingency table is larger than 2 × 2 if two groups are contrasted across more than two outcome categories. Only some effect size indexes for 2 × 2 tables can be extended to larger tables. These same statistics can also be used when three or more groups are compared on a categorical outcome.

The levels of a categorical variable are either unordered or ordered. Unordered categories do not imply a rank order. Examples of unordered categories include those for ethnicity or marital status. Ordered categories, also called multilevel ordinal categories, imply a rank order. The Likert-type response format strongly agree, agree, disagree, or strongly disagree is an example of ordered categories. There are specialized methods for ordered categories (Darlington, 1996), but they are not as well developed or as widely known as those for unordered categories. As a consequence, they are not discussed in detail.

One alternative is to rescale ordered categories to interval data and then apply methods for parametric variables. Another approach collapses multilevel categories into two clinically meaningful, mutually exclusive outcomes. Estimation of effect size magnitude is then conducted with methods for fourfold tables.

Another framework that analyzes data from fourfold tables is the sensitivity, specificity, and predictive value model. Although better known in medicine as a way to evaluate the accuracy of screening tests for disease, this approach can be fruitfully applied to psychological tests that screen for problems such as depression or learning disabilities (e.g., Glaros & Kline, 1988; Kennedy, Willis, & Faust, 1997). Because screening tests are not usually as accurate as more individualized and costly diagnostic methods, not all persons with a positive screening test result really have the disorder the test is intended to detect. Likewise, not everyone with a negative test result is actually free of the disorder. The 2 × 2 table analyzed is the cross-tabulation of screening test results (positive–negative) and true status (disorder–no disorder), and "effect sizes" concern the estimated accuracies of positive and negative test results.

EFFECT SIZE INDEXES FOR 2 × 2 TABLES

Various effect size indexes for fourfold tables are introduced next. I also discuss how to construct confidence intervals for the population parameters estimated by these indexes.

Parameters

A total of four parameters that reflect the degree of relative risk for an undesirable outcome across different populations are introduced next. These same parameters can also be defined when neither level of the outcome dichotomy corresponds to something undesirable, such as agree–disagree. The idea of "risk" is then simply replaced by that of comparing relative proportions for the two different outcomes. See Fleiss (1994) and Haddock, Rindskopf, and Shadish (1998) for more detailed presentations.

Suppose in a comparative study that treatment and control groups are to be compared on the dichotomy relapsed–not relapsed. The proportion of cases that relapse in the treatment population is π_T, and 1 − π_T is the proportion that do not relapse. The corresponding proportions in the control population are, respectively, π_C and 1 − π_C. The simple difference between the two probabilities, π_C − π_T, is the population risk difference, also called the proportion difference. So defined, π_C − π_T = .10 indicates a relapse rate 10% higher in the control population than in the treatment population. Likewise, π_C − π_T = −.20 indicates a relapse rate 20% higher in the treatment population.

The population risk ratio, also called the rate ratio, is the ratio of the proportions for the undesirable outcome, in this case relapse. It is defined as π_C/π_T. If this ratio equals 1.30, for example, the risk for relapse is 1.3 times higher in the control population than in the treatment population.

Likewise, if the risk ratio is .80, the relapse risk among untreated cases is only 80% as great as that among treated cases. The risk ratio thus compares the proportionate difference in relapse risk across the two populations.

The population odds ratio is designated below as ω, but note that the symbol ω² refers to a different parameter for a continuous outcome (see chap. 4, this volume). The parameter ω is the ratio of the within-populations odds for the undesirable outcome. The odds for relapse in the control population equal Ω_C = π_C/(1 − π_C), the corresponding odds in the treatment population equal Ω_T = π_T/(1 − π_T), and the odds ratio equals ω = Ω_C/Ω_T. Suppose that π_C = .60 and π_T = .40. The odds for relapse in the control population are Ω_C = .60/.40 = 1.50; that is, the chances of relapsing are 1½ times greater than of not relapsing. The odds for relapse in the treatment population are Ω_T = .40/.60 = .67; that is, the chances of relapsing are two thirds those of not relapsing. The odds ratio is ω = Ω_C/Ω_T = 1.50/.67 = 2.25, which means that the odds for relapse are 2¼ times higher in the control population than in the treatment population.
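
A minimal Python sketch of this arithmetic, using the hypothetical population proportions π_C = .60 and π_T = .40 from the example above (the code and variable names are illustrative, not part of the original presentation):

```python
# Comparative risk parameters for the hypothetical relapse example:
# pi_C and pi_T are the population relapse proportions from the text.
pi_C, pi_T = 0.60, 0.40

risk_difference = pi_C - pi_T          # .20
risk_ratio = pi_C / pi_T               # 1.50
odds_C = pi_C / (1 - pi_C)             # 1.50
odds_T = pi_T / (1 - pi_T)             # about .67
odds_ratio = odds_C / odds_T           # 2.25

print(f"RD = {risk_difference:.2f}, RR = {risk_ratio:.2f}, "
      f"odds_C = {odds_C:.2f}, odds_T = {odds_T:.2f}, OR = {odds_ratio:.2f}")
```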

The population Pearson correlation between the treatment–control and relapsed–not relapsed dichotomies is the φ coefficient, which equals

φ = (π_CR π_TNR − π_CNR π_TR) / √(π_C• π_T• π_•R π_•NR)          (5.1)

The subscripts C, T, R, and NR mean control, treatment, relapsed, and not relapsed. The proportions in the numerator represent the four possible outcomes and thus sum to 1.0. For example, π_CR is the probability of being in the control population and relapsing, and π_TNR is the probability of being in the treatment population and not relapsing. The subscript "•" indicates a total (marginal) proportion. For example, π_C• and π_T• are, respectively, the relative proportions of cases in the control and treatment populations, and they sum to 1.0. Likewise, π_•R and π_•NR are, respectively, the relative proportions of cases across both populations that relapsed or did not relapse, and they also sum to 1.0.
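
A short Python sketch of Equation 5.1 follows. The joint proportions are hypothetical (equal group sizes of .50 each, with relapse rates of .60 in the control population and .40 in the treatment population); they are not values given in the text.

```python
import math

# Hypothetical joint population proportions (cells of the fourfold table)
pi_CR, pi_CNR = 0.30, 0.20    # control & relapsed, control & not relapsed
pi_TR, pi_TNR = 0.20, 0.30    # treatment & relapsed, treatment & not relapsed

# Marginal (total) proportions
pi_C, pi_T = pi_CR + pi_CNR, pi_TR + pi_TNR     # row margins, sum to 1.0
pi_R, pi_NR = pi_CR + pi_TR, pi_CNR + pi_TNR    # column margins, sum to 1.0

# Population phi coefficient (Equation 5.1)
phi = (pi_CR * pi_TNR - pi_CNR * pi_TR) / math.sqrt(pi_C * pi_T * pi_R * pi_NR)
print(round(phi, 2))   # .20
```

Because all four marginal proportions equal .50 in this made-up example, φ equals the risk difference of .20, consistent with the equal-margins result noted later in this chapter.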

Statistics and Evaluation

Table 5.1 presents a 2 × 2 table for the contrast of treatment and control groups on the dichotomy relapsed–not relapsed. The letters in the table stand for observed frequencies in each cell. For example, the size of the control group is n_C = A + B, where A and B, respectively, stand for the number of untreated cases that relapsed or did not relapse. The size of the treatment group is n_T = C + D, where C and D stand for the number of treated cases that relapsed or did not relapse, respectively. The total sample size is thus N = A + B + C + D.

Table 5.2 presents definitions of sample estimators of the parameters described in the previous section. These definitions are expressed in terms of the observed cell frequencies represented in Table 5.1. The population proportions of cases that relapsed are estimated by the observed proportions p_C and p_T, respectively. The sample risk difference (RD) is computed as RD = p_C − p_T, and it estimates the population risk difference. The statistic RD is easy to interpret, but it has a significant limitation: Its range depends on the values of π_C and π_T. Specifically, the range of RD is greater when both π_C and π_T are closer to .50 than when they are closer to 0 or 1.0. The implication is that values of RD may not be comparable across different studies where the associated values of π_C and π_T are quite different.
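
The following Python sketch computes p_C, p_T, and RD from the cell frequencies of a fourfold table laid out as in Table 5.1; the counts used here are hypothetical illustrations.

```python
# Hypothetical cell counts in the Table 5.1 layout
A, B = 60, 40    # control group: relapsed, not relapsed
C, D = 40, 60    # treatment group: relapsed, not relapsed

p_C = A / (A + B)      # observed relapse proportion, control group
p_T = C / (C + D)      # observed relapse proportion, treatment group
RD = p_C - p_T         # sample risk difference
print(f"p_C = {p_C:.2f}, p_T = {p_T:.2f}, RD = {RD:.2f}")
```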

The sample risk ratio (RR) indicates the difference in the observed proportionate risk for relapse across the control and treatment groups. It is defined in Table 5.2 as RR = p_C/p_T. If RR > 1.0, relapse risk is higher among the untreated cases, and RR < 1.0 indicates higher risk among treated cases. The statistic RR is also easy to interpret, but it has some drawbacks too.

TABLE 5.1
A Fourfold Table for an Observed Group Contrast on a Dichotomy

Group         Relapsed    Not relapsed
Control       A           B
Treatment     C           D

Note. The letters A–D represent observed cell frequencies.

TABLE 5.2
Definitions of Effect Size Statistics for 2 × 2 Contingency Tables

Parameter                Statistic    Equation

Proportions of undesirable outcome
  π_C                    p_C          A/(A + B)
  π_T                    p_T          C/(C + D)

Comparative risk
  π_C − π_T              RD           p_C − p_T
  π_C/π_T                RR           [A/(A + B)] / [C/(C + D)]
  ω = Ω_C/Ω_T            OR           [p_C/(1 − p_C)] / [p_T/(1 − p_T)]

Measure of association
  φ                      φ̂            (AD − BC) / √[(A + B)(C + D)(A + C)(B + D)]

Note. RD = risk difference; RR = risk ratio; OR = odds ratio. The letters A–D represent observed cell frequencies in Table 5.1. If A, B, C, or D = 0 in the computation of OR, add .5 to all cells.

Only the finite interval 0–1.0 indicates lower risk in the group represented in the numerator, but the interval from 1.0 to infinity is theoretically available for describing higher risk in the other group. The range of possible values of RR thus varies according to its denominator. Suppose that p_T is .40 in one sample but .60 in another sample. The theoretical range of RR = p_C/p_T in the first sample is 0–2.50, but in the second sample it is 0–1.67. This characteristic limits the value of RR as a standardized index for comparing results across different samples. This problem can be addressed by analyzing logarithm transformations of RR and then converting the results back to RR units with antilog transformations. This point is elaborated momentarily.

The sample odds ratio (OR) is the ratio of the within-groups odds. It is defined in Table 5.2 as the ratio of the odds for relapse in the control group, o_C = p_C/(1 − p_C), over the odds in the treatment group, o_T = p_T/(1 − p_T). In fourfold tables where all margin totals are equal, the odds ratio equals the squared risk ratio, or OR = RR². The statistic OR shares with RR the limitation that the finite interval 0–1.0 indicates lower risk in the group represented in the numerator, but the interval from 1.0 to infinity describes higher risk for the other group. Analyzing logarithm transformations of OR and then taking antilogarithms of the results can deal with this limitation, just as for RR.
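
The Table 5.2 definitions of RR and OR can be sketched in a few lines of Python. The cell counts are again hypothetical, and the .5 adjustment for zero cells follows the note to Table 5.2.

```python
def risk_and_odds_ratios(A, B, C, D):
    """Sample RR and OR computed from fourfold-table cell counts (Table 5.2).

    If any cell is zero, .5 is added to all cells before computing OR,
    as the note to Table 5.2 recommends.
    """
    p_C = A / (A + B)
    p_T = C / (C + D)
    RR = p_C / p_T
    if 0 in (A, B, C, D):
        A, B, C, D = A + 0.5, B + 0.5, C + 0.5, D + 0.5
    OR = (A * D) / (B * C)   # equivalently [p_C/(1 - p_C)] / [p_T/(1 - p_T)]
    return RR, OR

RR, OR = risk_and_odds_ratios(60, 40, 40, 60)   # hypothetical counts
print(round(RR, 2), round(OR, 2))               # 1.5 2.25
```

Note that in this example all margin totals are equal, so OR = RR², as stated above.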

A convenient property of OR is that it can be converted to a kind of standardized mean difference for a fourfold table known as a logit d. A logit is the natural log (base e ≈ 2.7183) of OR, ln(OR). The logistic distribution is approximately normal with a standard deviation that equals π/√3, which is about 1.8138. The ratio of ln(OR) over π/√3 is a logit d that is directly comparable to a standardized mean difference for a contrast of the same two groups on a continuous outcome. Shadish, Robinson, and Lu (1999) showed that the logit d can also be expressed in basically the same form as a standardized mean difference:

logit d = ln(OR) / (π/√3) = [ln(o_C) − ln(o_T)] / (π/√3)          (5.2)

where o_C and o_T are, respectively, the relapse odds in the control and treatment groups. Suppose that p_C = .60 and p_T = .20, which implies o_C = 1.50, o_T = .25, and OR = 6.00. The logit d for the group contrast equals

logit d = ln(6.00)/1.8138 = [ln(1.50) − ln(.25)]/1.8138 = .9878

Thus, the finding that the odds for relapse are six times higher among untreated cases corresponds to a treatment effect size magnitude of about a full standard deviation in logistic units. The Effect Size (ES) program by Shadish et al. (1999) automatically calculates logit d for dichotomous outcomes. There are other ways to adjust for dichotomization of the outcome variable, including arcsine and probit transformations; see Lipsey and Wilson (2000, pp. 52–58) for more information.
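
A brief Python sketch of Equation 5.2 for this worked example (the value .9878 is reproduced up to rounding; the variable names are illustrative):

```python
import math

# Logit d for the worked example: p_C = .60, p_T = .20
p_C, p_T = 0.60, 0.20
o_C = p_C / (1 - p_C)                    # 1.50
o_T = p_T / (1 - p_T)                    # .25
OR = o_C / o_T                           # 6.00

logit_d = math.log(OR) / (math.pi / math.sqrt(3))   # Equation 5.2
print(round(logit_d, 4))                 # .9878, matching the text
```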

The sample odds ratio may be the least intuitive of the comparative risk indexes reviewed, but it probably has the best overall statistical properties, especially in epidemiological studies of risk factors for disease. This is because OR can be estimated in prospective studies, in studies that randomly sample from exposed and unexposed populations, and in retrospective studies where groups are first formed based on the presence or absence of a disease before their exposure to a supposed risk factor is determined (Fleiss, 1994). Other indexes may not be valid in retrospective studies, such as RR, or in studies without random sampling, such as φ̂, which is described next.

The estimator of the population Pearson correlation between two dichotomies, φ, is the sample correlation φ̂. It can be calculated using the standard equation for the Pearson correlation r if the levels of both dichotomies are coded as 0 or 1. It may be more convenient to calculate φ̂ directly from the cell frequencies and margin frequencies using the equation in Table 5.2. The theoretical range of φ̂ derived this way is −1.00 to +1.00, but the sign of φ̂ is arbitrary because it is determined by the particular arrangement of the cells. For this reason, some researchers report absolute values of φ̂.

However, keep in mind that effects in 2 × 2 tables are directional. For example, either treated or untreated cases will have a higher relapse rate (if there is a difference). The absolute value of φ̂ also equals the square root of χ²(1)/N, the ratio of the chi-square statistic with a single degree of freedom for the fourfold table over the sample size. The relation just described can be algebraically manipulated to express the contingency table chi-square statistic as a function of sample size and the standardized effect size measure φ̂ (see Table 2.9). The square of φ̂, φ̂², equals the proportion of variance in the dichotomous outcome explained by group membership. In fourfold tables where the row and column marginal totals are all equal (which implies equal group sizes), the absolute values of the risk difference and the phi coefficient are identical (|RD| = |φ̂|).

The correlation φ̂ can reach its maximum absolute value of 1.00 only if the marginal proportions for rows and columns in a fourfold table are equal. For example, given a balanced design, the theoretical maximum absolute value of φ̂ is 1.00 only if the marginal proportions for the outcome dichotomy are also .50. As the row and column marginal proportions diverge, the maximum absolute value of φ̂ approaches zero. This implies that the value of φ̂ will change if the cell frequencies in any row or column are multiplied by an arbitrary constant. Because of this characteristic, Darlington (1996) described φ̂ as a margin-bound measure of association; the point-biserial correlation r_pb for continuous outcomes is also margin bound because of the influence of relative group size (see chap. 4, this volume). The effect of marginal proportions on φ̂ also suggests that it may not be an appropriate effect size index when sampling is not random (Fleiss, 1994). The correlation φ̂ also treats the two dichotomous variables symmetrically; that is, its value is the same if the fourfold table is "flipped" so that the rows become columns and vice versa. There are other measures of association that differentiate between predictor and criterion variables, and thus treat the rows and columns asymmetrically; see Darlington (1996) for more information.
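
A short Python sketch, again with hypothetical cell counts, of φ̂ computed from the Table 5.2 equation and of its relation to the chi-square statistic, |φ̂| = √[χ²(1)/N]:

```python
import math

# Hypothetical cell counts in the Table 5.1 layout
A, B, C, D = 60, 40, 40, 60
N = A + B + C + D

# Sample phi coefficient (Table 5.2) and the fourfold-table chi-square
phi = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))
chi2 = N * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))

print(round(phi, 2))                   # .20
print(round(math.sqrt(chi2 / N), 2))   # .20, equal to |phi|
print(round(phi ** 2, 2))              # .04, proportion of explained variance
```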

Interval Estimation

Sample estimators of population proportions tend to have complex distributions, and most methods of interval estimation for them are approximate and based on central test statistics. Table 5.3 presents equations for the estimated (asymptotic) standard errors in large samples for each of the statistics described except the correlation φ̂. The equation for the asymptotic standard error of φ̂ is quite complicated; interested readers can find it in Fleiss (1994, p. 249). A 100(1 − α)% confidence interval based on any of the statistics listed in Table 5.3 is the statistic plus or minus the product of its asymptotic standard error and z_2-tail, α, the positive two-tailed critical value of z in a normal curve at the α level of statistical significance. Some examples follow.

Suppose the following results are from a study of the relapse rates among treated and untreated cases: n_C = n_T = 100, p_C = .60, and p_T = .40. The sample risk difference is RD = .20 in favor of the treatment group.

TABLE 5.3
Asymptotic Standard Errors for Sample Proportions

Statistic    Standard error

Proportions of undesirable outcome
  p_C        √[p_C(1 − p_C)/n_C]
  p_T        √[p_T(1 − p_T)/n_T]

Comparative risk
  RD         √[p_C(1 − p_C)/n_C + p_T(1 − p_T)/n_T]
  ln(RR)     √[(1 − p_C)/(n_C p_C) + (1 − p_T)/(n_T p_T)]
  ln(OR)     √[1/(n_C p_C(1 − p_C)) + 1/(n_T p_T(1 − p_T))]

Note. RD = risk difference; RR = risk ratio; OR = odds ratio; ln = natural log.

Using the third equation in Table 5.3, the asymptotic standard error of RD is estimated as follows:

s_RD = {[.60(1 − .60)]/100 + [.40(1 − .40)]/100}^(1/2) = .0693

The value of z_2-tail, .05 equals 1.96, so the 95% confidence interval for π_C − π_T is

.20 ± .0693(1.96), or .20 ± .14

which defines the interval .06–.34. Thus, RD = .20 is just as consistent with a population risk difference as low as .06 as it is with a difference as high as .34, with 95% confidence.
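
The same interval can be sketched in a few lines of Python; the numerical values are those of the worked example, and the rounded output matches the text.

```python
import math

# Approximate 95% CI for the population risk difference
n_C = n_T = 100
p_C, p_T = 0.60, 0.40
z = 1.96                      # two-tailed critical z at alpha = .05

RD = p_C - p_T                # .20
se_RD = math.sqrt(p_C * (1 - p_C) / n_C + p_T * (1 - p_T) / n_T)   # .0693

lower, upper = RD - z * se_RD, RD + z * se_RD
print(f"RD = {RD:.2f}, SE = {se_RD:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# RD = 0.20, SE = 0.0693, 95% CI = (0.06, 0.34)
```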

For the data reported earlier, RR = .60/.40 = 1.50 and OR = (.60/.40)/(.40/.60) = 2.25, both of which indicate higher risk for relapse in the control group. Distributions of RR and OR are not generally normal, but natural log transformations of both are approximately normal. Consequently, the method of confidence interval transformation (see chap. 2, this volume) can be used to construct approximate confidence intervals based on ln(RR) or ln(OR). The lower and upper bounds of these intervals in logarithm units are then converted back to their original metric by taking their antilogs.

Because the method is the same for both indexes, it is demonstrated below only for OR. The log transformation of the observed odds ratio is ln(2.25) = .8109. The estimated standard error of the log-transformed odds ratio calculated using the fifth equation in Table 5.3 equals

s_ln(OR) = {1/[100 × .60(1 − .60)] + 1/[100 × .40(1 − .40)]}^(1/2) = .2887

The approximate 95% confidence interval for ln(ω), the log population odds ratio, is

.8109 ± .2887(1.96), or .8109 ± .5659

which defines the interval .2450–1.3768 in log units. To convert the lower and upper bounds of this interval back to the original metric, we take their antilogs:

ln⁻¹(.2450) = e^.2450 = 1.2776 and ln⁻¹(1.3768) = e^1.3768 = 3.9622

The approximate 95% confidence interval for ω is thus 1.28–3.96 at two-decimal accuracy. We can say that the observed result OR = 2.25 is just as consistent with a population odds ratio as low as ω = 1.28 as it is with a population odds ratio as high as ω = 3.96, with 95% confidence.
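
The log-transformation method for the odds ratio can also be sketched in Python, reproducing the worked example up to rounding:

```python
import math

# Approximate 95% CI for the population odds ratio via the log transformation
n_C = n_T = 100
p_C, p_T = 0.60, 0.40
z = 1.96

OR = (p_C / (1 - p_C)) / (p_T / (1 - p_T))                  # 2.25
ln_OR = math.log(OR)                                        # .8109
se_ln_OR = math.sqrt(1 / (n_C * p_C * (1 - p_C)) +
                     1 / (n_T * p_T * (1 - p_T)))           # .2887

# Build the interval in log units, then back-transform with antilogs
lower = math.exp(ln_OR - z * se_ln_OR)
upper = math.exp(ln_OR + z * se_ln_OR)
print(f"OR = {OR:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")   # (1.28, 3.96)
```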

EFFECT SIZE ESTIMATION FOR LARGER TWO-WAY TABLES

If the categorical outcome variable has more than two levels or there are more than two groups, the contingency table is larger than 2 × 2. Measures of relative risk (RD, RR, OR) can be computed for such a table only if it is reduced to a 2 × 2 table by collapsing or excluding rows or columns. What is probably the best known measure of association for contingency tables with more than two rows or columns is Cramér's V, an extension of the φ̂ coefficient. Its equation is

V = √{χ²[(r − 1)(c − 1)] / [min(r − 1, c − 1) × N]}

where the numerator under the radical is the chi-square statistic with degrees of freedom equal to the product of the number of rows (r) minus one and the number of columns (c) minus one. The denominator under the radical is the product of the sample size and the smallest dimension of the table minus one. For example, if the table is 3 × 4 in size, then min(3 − 1, 4 − 1) = 2.
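
A minimal Python sketch of Cramér's V follows; the chi-square statistic is computed from scratch, and the 3 × 4 table of counts is hypothetical.

```python
import numpy as np

def cramers_v(table):
    """Cramer's V from an r x c table of observed frequencies (no correction)."""
    observed = np.asarray(table, dtype=float)
    N = observed.sum()
    # Expected frequencies under independence: (row total x column total) / N
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / N
    chi2 = ((observed - expected) ** 2 / expected).sum()
    r, c = observed.shape
    return np.sqrt(chi2 / (min(r - 1, c - 1) * N))

# Hypothetical 3 x 4 table (three groups by four outcome categories)
table = [[20, 10, 10, 10],
         [10, 20, 10, 10],
         [10, 10, 20, 10]]
print(round(cramers_v(table), 3))   # about .224
```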

For a 2 × 2 table, the equation for Cramér's V reduces to that for φ̂. For larger tables, however, V is not technically a correlation coefficient, although

