CONFIDENCE INTERVALS/HYPOTHESIS TESTING/EFFECT SIZES

Inferential statistics play a critical role in assessing whether training, tests, or other organizational interventions have an effect that can be reliably expected based on data collected from samples of organizational members. For example, the score from a selection test administered to a sample of 100 job applicants could be correlated with the test takers’ subsequent performance ratings to determine how well the test predicts job performance. If we find that the correlation (r) calculated from the sample’s test scores and ratings is .25, we are left with several questions: How does a sample-derived correlation of .25 compare with the correlation that could be obtained if we had test and ratings data on all cases in the population? How likely is it that we will get the same or approximately the same value for r with another sample of 100 cases? How good is an r of .25? Confidence intervals, significance testing, and effect sizes play pivotal roles in answering these questions.

CONFIDENCE INTERVALS

Using the foregoing example, assume that the correlation (ρ) between the test scores and ratings is .35 in the population. This ρ of .35 is called a parameter because it is based on a population, whereas the r of .25 is called a statistic because it is based on a sample.

If another sample of 100 cases were drawn randomly from the same population with an infinite number of cases, the r calculated using test and ratings data from the new sample would probably differ from both the ρ of .35 and the r of .25 calculated for the first sample.

We can continue sampling and calculating r an infinite number of times, and each r would be an estimate of ρ. Typically, the rs would be distributed around ρ, with some being smaller than .35 and others being larger. If the mean of all possible rs equals ρ, then each r is said to be an unbiased estimate of ρ. The distribution of r is called the sampling distribution of the sample correlation, and its standard deviation is called the standard error (SE). The SE of r can be calculated as

SE = (1 − ρ²)/√N    (1)

where N is the sample size. Notably, as N increases, SE decreases.
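
As a minimal illustration of Equation 1, the following Python sketch (assuming NumPy is available; all names are illustrative) draws many random samples of N = 100 from a bivariate normal population with ρ = .35, computes r for each sample, and compares the standard deviation of those rs with the SE given by Equation 1:

# Illustrative sketch: simulate the sampling distribution of r (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
rho, n, reps = 0.35, 100, 10_000
cov = [[1.0, rho], [rho, 1.0]]               # population correlation matrix

rs = np.empty(reps)
for i in range(reps):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    rs[i] = np.corrcoef(x, y)[0, 1]          # sample correlation r

print(rs.std())                              # empirical SE of r, roughly .088
print((1 - rho**2) / np.sqrt(n))             # Equation 1: (1 - rho^2)/sqrt(N) = .08775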

A confidence interval (CI) is the portion of the sampling distribution into which a statistic (e.g., r) will fall a prescribed percentage of the time. For example, a 95% CI means that a statistic will fall between the interval’s lower and upper limits 95% of the time. The limits for the 95% CI can be calculated as follows:

95% CI = ρ ± (1.96)SE    (2)

If a 99% CI were desired, the value 2.58 would be substituted for 1.96. These two levels of CI are those most commonly used in practice.

Let us compute the CI using ρ = .35, N = 100, the common assumption that the distribution of r is normal, and Equation 1: SE = (1 − .35²)/√100 = .088.

Using Equation 2, the lower limit is .35 − (1.96)(.088) = .178, and the upper limit is .35 + (1.96)(.088) = .522. This example CI can be interpreted as follows: If an infinite number of random samples of 100 each were drawn from the population of interest, 95% of the rs would fall between .178 and .522, and 5% would fall outside the CI.
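
A short Python sketch of Equations 1 and 2 (the function name is illustrative, and the 1.96 multiplier assumes the normal sampling distribution used in the text) reproduces these limits:

# Sketch of Equations 1 and 2 for rho = .35 and N = 100.
from math import sqrt

def ci_for_r(rho, n, z=1.96):
    """Limits of the CI for r when the population correlation rho is known or assumed."""
    se = (1 - rho**2) / sqrt(n)              # Equation 1
    return rho - z * se, rho + z * se        # Equation 2

lo, hi = ci_for_r(0.35, 100)
print(round(lo, 3), round(hi, 3))            # 0.178  0.522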

HYPOTHESIS TESTING

In the previous example, ρ was known, but this is rarely the case in practice. In such cases, the investigator may use previous research findings as a basis for assuming that ρ is .40. Because we now have the necessary information, we can compute the 95% CI as follows: SE = (1 − .40²)/√100 = .084. The CI lower limit is .40 − 1.96(.084) = .235, and the CI upper limit is .40 + 1.96(.084) = .565. According to the definition of a 95% CI, correlations between test scores and ratings, based on samples of 100 randomly drawn from a population with ρ = .40, would fall between .235 and .565 95% of the time and outside that range or interval 5% of the time. Because the r of .25 is inside the 95% CI, it is very likely that the sample at hand came from a population in which ρ is .40. That is, we are 95% confident that our sample came from a population with a ρ of .40, but there is a 5% chance that it did not.
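
The same calculation can be written directly for the hypothesized ρ of .40 (again a minimal sketch; the names are illustrative and the 1.96 multiplier assumes normality):

# Does the observed r = .25 fall inside the 95% CI built around a hypothesized rho = .40?
from math import sqrt

rho, n, r_obs = 0.40, 100, 0.25
se = (1 - rho**2) / sqrt(n)                  # .084
lo, hi = rho - 1.96 * se, rho + 1.96 * se
print(round(lo, 3), round(hi, 3))            # 0.235  0.565
print(lo <= r_obs <= hi)                     # True: r = .25 is consistent with rho = .40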

In practice, it is not easy to come up with a reasonable hypothesis about a population parameter. Yet a practitioner may want to say something of value about a statistic. In our example of r = .25, a practitioner may want to know whether it came from a population with ρ = .00. The rationale is that if it can be shown with a certain degree of confidence that the sample with r = .25 is unlikely to have come from a population with zero correlation, then that information would have practical utility; that is, the test is very likely to have nonzero validity in the population of interest. This is commonly referred to as null hypothesis testing.

If ρ = .00, then the SE for a sample of 100 cases is (1 − .00²)/√100 = .100, and the 95% CI is (−.196, +.196). Our r of .25 falls outside this interval, and therefore there is a 95% chance that our sample did not come from a population with ρ = .00. That is, the null hypothesis of ρ = .00 is unlikely given that the r is .25; some would say that the null hypothesis is rejected with 95% confidence. There is, however, a 5% chance that our sample came from a population with ρ = .00. Because this probability is small, we may assume that ρ is different from zero and conclude that the test is a valid predictor of performance ratings.
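
A minimal sketch of this null hypothesis test in Python (illustrative names; the 1.96 multiplier again assumes normality):

# Is r = .25 plausible under the null hypothesis rho = .00 with N = 100?
from math import sqrt

n, r_obs = 100, 0.25
se = (1 - 0.0**2) / sqrt(n)                  # SE under the null hypothesis rho = .00
lo, hi = -1.96 * se, 1.96 * se               # the (-.196, +.196) interval from the text
print(not (lo <= r_obs <= hi))               # True: r = .25 falls outside, so reject the null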

We did not say, however, that the null hypothesis was proved or disproved. According to Sir Ronald Fisher, the eminent statistician who popularized the null hypothesis, the null hypothesis is never proved or established, but it may be disproved as we obtain data from more samples. Although professionals debate whether null hypothesis testing is useful, most researchers agree that CIs and effect sizes probably provide more information than null hypothesis testing.

APPLICATION OF CONFIDENCE INTERVALS

The assumption underlying Equation 1 is that we know or are willing to hypothesize the value of ρ. Making an assumption about ρ leads us into hypothesis testing. One way to avoid this is to develop CIs by using r as our best estimate of ρ. When we use r as an estimate of ρ, Equation 1 becomes

SE = (1 − r²)/√(N − 1)    (3)

When r and N are known, Equation 3 can be used to compute SE and the CI. Given r = .25 and N = 100, SE = (1 − .25²)/√(100 − 1) = .094. The CI lower limit is .25 − (1.96)(.094) = .066, and the CI upper limit is .25 + (1.96)(.094) = .434. Therefore, the 95% CI is (.066, .434). We cannot say with 95% confidence that the unknown ρ falls within this interval, but we say that ρ is likely contained in this interval without attaching any probability to that statement. Is such a statement of any help? Yes, it is. If an investigator generated an infinite number of random samples from the same population, computed a 95% CI each time, and concluded that ρ falls within that CI, the investigator would be correct 95% of the time in the sense that 95% of the CIs will include ρ. For a given CI, either ρ is or is not in it. But if we consider an infinite number of CIs, 95% of them will include ρ.
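
A sketch of Equation 3 applied to r = .25 and N = 100 (illustrative names; the small differences from the text’s limits arise because the text rounds SE to .094 before multiplying):

# CI sketch using r itself as the estimate of rho (Equation 3).
from math import sqrt

r, n = 0.25, 100
se = (1 - r**2) / sqrt(n - 1)                # Equation 3: about .094
lo, hi = r - 1.96 * se, r + 1.96 * se
print(round(se, 3))                          # 0.094
print(round(lo, 3), round(hi, 3))            # roughly 0.065 and 0.435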

In the preceding sections, we only used correlations to describe and illustrate CIs and hypothesis testing. These techniques and interpretations are equally valid for means and other statistics.

EFFECT SIZES

Effect sizes (ESs) play an increasingly important role in the behavioral sciences, especially given their emphasis in meta-analysis. An ES, such as r or the standardized difference (d), reflects the degree to which a predictor measure is related to an outcome measure or the degree of difference between outcome scores for an intervention group and a comparison group.

Consider a situation in which an intervention group of randomly chosen employees received sales training and a comparison group did not. After one year, the trained and comparison groups were selling, on average, 345 units (ME = 345) and 275 units (MC = 275), respectively. The standard deviation (SD) of units sold equals 70 for both groups. The difference between the average number of units sold is 70 (ME − MC = 345 − 275 = 70). This means that the trained group did better than the comparison group. Suppose that the same marketing technique was used with a different product in a different organization in which the trained group sold an average of 15 units and the comparison group sold 12 units. Again, the SD of units sold is the same in both groups and equal to 2. The mean difference is 3 (15 − 12 = 3).

Even though the trained groups did better than the comparison groups, it is difficult to tell whether the training was more effective for one product than for the other because the difference measures are not comparable. In the first example, the units sold were automobiles, and in the second example, the units sold were mainframe computers. If we were to run a test of the null hypothesis, we would probably find that the two mean differences (70 and 3) were significantly different from zero. This still would not help a practitioner assess whether the training was more effective for one product than another. One currently popular method of doing so is to express the gain in sales in both scenarios on a common metric. A common ES metric is d:

d = (ME − MC)/SD    (4)

The d metric is like a z-score metric, and it is comparable across scenarios; this is generally not true for mean scores or mean score differences. In the scenario involving automobiles, the ES is 1 ([345 − 275]/70), which means that the trained group sold at a level 1 SD above the comparison group. Similarly, in the scenario involving mainframe computers, the ES is 1.5 ([15 − 12]/2), which indicates that the trained group performed 1.5 SDs above the comparison group. Because the ES metrics are comparable, we can conclude, all else being equal, that the training is more effective for selling mainframe computers than automobiles. The mean differences alone (70 versus 3) would not allow us to make that statement. Finally, we can calculate a pooled SD (the weighted average of the two SDs) and use it in Equation 4 when the separate SDs are different.
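
The two scenarios can be expressed as a minimal sketch of Equation 4; the pooled-SD helper below weights the two variances by their degrees of freedom, which is one common convention (the text itself describes the pooled SD only loosely as a weighted average of the two SDs):

# Effect size d (Equation 4) for the two sales-training scenarios.
def cohen_d(m_trained, m_comparison, sd):
    """Equation 4: standardized mean difference."""
    return (m_trained - m_comparison) / sd

print(cohen_d(345, 275, 70))                 # 1.0 SD: automobile scenario
print(cohen_d(15, 12, 2))                    # 1.5 SD: mainframe computer scenario

def pooled_sd(sd1, n1, sd2, n2):
    """One common pooled SD: group variances weighted by degrees of freedom."""
    return (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5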

Another ES metric is r. Because r varies between −1 and +1, values of r can be compared without transforming them to another common metric. Most statistics can be converted into ESs using formulas found in the works given in the Further Reading section. These references also provide guidelines for combining ESs from many studies into a meta-analysis and determining whether an ES is small, medium, or large.
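
As one example of such a conversion, the widely used formulas relating d and r can be sketched as follows (these assume roughly equal group sizes; the meta-analysis references under Further Reading give the general versions):

# Common d <-> r conversions, assuming roughly equal group sizes.
from math import sqrt

def d_to_r(d):
    return d / sqrt(d**2 + 4)

def r_to_d(r):
    return 2 * r / sqrt(1 - r**2)

print(round(d_to_r(1.0), 2))                 # about .45
print(round(d_to_r(1.5), 2))                 # 0.6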

We have only touched on confidence intervals, hypothesis testing, and effect sizes. Readers should consult the references given in Further Reading for additional guidance, especially in light of the current emphasis on confidence intervals and effect sizes and the controversy over null hypothesis testing.

—Nambury S. Raju, John C. Scott, and Jack E. Edwards

FURTHER READING

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.

Howell, D. C. (2002). Statistical methods for psychology (5th ed.). Pacific Grove, CA: Duxbury.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.

Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.

Rosenthal, R. (1991). Meta-analytic procedures for social research (Rev. ed.). Newbury Park, CA: Sage.

Wilkinson, L. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.