One way analysis of variance

In all of the regression models examined so far, both the target and predicting variables

have been continuous, or at least effectively continuous — with one exception. Our analysis

of the pooled / constant shift / full model hierarchy recognized that the existence of two

well–defined subgroups in the data could have predictive power for the target variable.

That is, a categorical predicting variable taking on the values 0 and 1 could be used to

address the effect of being in one or the other subgroup.

A natural question is whether this can be generalized to more than two groups.

For example, does knowing the educational level of a person (say High school, College,

or Postgraduate) have predictive power for their annual salary? Does knowing the

religion of a member of Congress (officially reported as Protestant, Catholic, Jewish, or

Other/unknown) say anything about how much money they accept from certain political

action committees (PACs)? Is the return on a stock related to the industry group of the

company? This is a regression question, but a special kind of regression question; in this

context, saying that group membership has predictive power for the target is the same as

saying that the average value of the target is different for different groups. That is, this is

a question of comparison of means.

Consider the simplest situation of one categorical predicting variable that takes on \( K \) values. The one–way analysis of variance (ANOVA) model is as follows:

\[ y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad i = 1, \ldots, K, \quad j = 1, \ldots, n_i \tag{1} \]

where \( y_{ij} \) is the value of \( y \) for the \( j \)th member of the \( i \)th group, \( \mu \) is an overall level (roughly corresponding to the overall mean), \( \alpha_i \) is the effect of being in the \( i \)th group, \( \varepsilon_{ij} \) is the error term, and \( n_i \) is the number of observations that fall in the \( i \)th group.

The \( \alpha \) terms represent the difference in \( E(y) \) that comes from being in any particular group. It is natural to say that \( \alpha_i = 0 \) for all \( i \) if there is no difference between groups, but we have to be careful. Say we have three groups corresponding to High school, College, and Postgraduate degree. If the average salary for each group was $30,000, that would be modeled in the natural way as

\[ \mu = 30{,}000; \quad \alpha_1 = \alpha_2 = \alpha_3 = 0, \]

but it could also be modeled as

\[ \mu = 20{,}000; \quad \alpha_1 = \alpha_2 = \alpha_3 = 10{,}000. \]

The latter set of parameters doesn’t reflect what we want. For this reason, an additional restriction is put on equation (1),

\[ \sum_{i=1}^{K} \alpha_i = 0. \]

With this additional constraint, it is guaranteed that a situation with no group effect will be modeled with \( \boldsymbol{\alpha} = \mathbf{0} \).
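As a small illustration of why the sum-to-zero constraint pins down the parameters, the sketch below recovers \( \mu \) and the \( \alpha_i \) from a set of group means. Under the constraint, \( \mu \) is the unweighted average of the group means and each \( \alpha_i \) is that group's deviation from it. The function name `effects_from_means` and the second set of means are hypothetical, chosen only for illustration.

```python
from statistics import mean

def effects_from_means(group_means):
    """Return (mu, alphas) such that sum(alphas) == 0 and mu + alpha_i == mean_i."""
    mu = mean(group_means)                      # unweighted average of group means
    alphas = [m - mu for m in group_means]      # deviations, which sum to zero
    return mu, alphas

# No group effect: every group averages 30,000 (the example in the text).
mu_equal, alphas_equal = effects_from_means([30000, 30000, 30000])

# Hypothetical unequal group means, for illustration only.
mu_diff, alphas_diff = effects_from_means([20000, 35000, 50000])

print(mu_equal, alphas_equal)
print(mu_diff, alphas_diff)
```

With the constraint in place, the "no group effect" case can only be represented by all-zero effects; the alternative parameterization with \( \mu = 20{,}000 \) is ruled out.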

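The model (1) can be written easily as a regression model:

\[ y = \beta_0 x_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \varepsilon, \tag{2} \]

where \( x_0 = 1 \) for every observation and, for \( i = 1, \ldots, K \), \( x_i \) is an indicator variable equal to 1 if the observation is in group \( i \) and 0 otherwise. The form (2) shows that the overall F–test of the equality of the slope coefficients to zero (\( \beta_1 = \cdots = \beta_K = 0 \)) is testing if \( E(y_{ij}) = \mu \) for all \( i \) (that is, no difference in expected target for the different groups). As would be expected, \( \hat{y}_{ij} = \bar{y}_i \) (that is, the fitted value for any observation in group \( i \) is the sample mean of \( y \) for the observations in that group).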

If you attempt to run a regression using \( x_1 \) through \( x_K \) as predictors you will get an error message, since \( \sum_{i=1}^{K} x_i = 1 \) for every observation, which makes the predictors perfectly collinear with the constant term. The model can instead be fit using regression by regressing on \( K - 1 \), rather than \( K \), predictors. There are several different ways to do this (it should be remembered that good statistical software usually includes code devoted to one–way ANOVA, so it generally isn’t necessary to fit the model explicitly as a regression).

(1) Drop any one indicator variable. If you do this, the group that corresponds to the omitted variable represents a reference group. The constant term \( \hat{\beta}_0 \) corresponds to the estimated \( y \) for that group, and each slope estimate \( \hat{\beta}_i \) corresponds to the difference in estimated \( y \) between group \( i \) and the reference group. The individual t–statistic for each variable can be used to test the significance of this difference. Thus, if one group is a natural reference group, this is a natural way to fit the model (for example, if \( y \) is the time until relapse of a medical condition, the groups represent different dosages of a drug, and one group corresponds to a zero dosage [control] group).

(2) If there is no natural reference group, a regression model where the coefficients don’t treat one group as special is desirable. It’s possible to do this using special variables called effect codings. Pick one group as a reference group (unlike for indicator variables, it doesn’t matter which one). Say it’s group \( K \). For \( i = 1, \ldots, K - 1 \), define a predictor as

\[ x_i = \begin{cases} 1 & \text{if observation is in group } i \\ -1 & \text{if observation is in group } K \\ 0 & \text{otherwise.} \end{cases} \]

Now the constant term \( \hat{\beta}_0 \) is an estimate of the overall level \( \mu \), and each slope estimate \( \hat{\beta}_i \) corresponds to the effect of being in group \( i \) (\( \alpha_i \)). Thus, this fit (rather than that using indicator variables) is consistent with the notation of equation (2). The effect of being in the reference group (\( \alpha_K \)) is simply \( -\sum_{i=1}^{K-1} \beta_i \), since the \( \alpha \)’s must sum to 0. The individual t–statistic for each variable can be used to test whether \( \alpha_i = 0 \). Effect codings also turn out to be useful in situations with more than one categorical (grouping) variable.
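The two coding schemes above can be compared on a toy data set. The sketch below is only an illustration under stated assumptions: the data values, the variable names, and the choice of the last group as reference are all made up, and the fit uses NumPy's general least-squares solver rather than any dedicated ANOVA routine.

```python
import numpy as np

# Hypothetical data: three groups of unequal size (values are made up).
y = np.array([10.0, 12.0, 20.0, 22.0, 24.0, 30.0, 32.0])
group = np.array([0, 0, 1, 1, 1, 2, 2])   # group labels 0..K-1
K = 3
ref = K - 1                               # treat the last group as the reference

def fit(X, y):
    """Ordinary least squares coefficients via NumPy's least-squares solver."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

intercept = np.ones(len(y))

# (1) Indicator coding: one 0/1 column per non-reference group.
indicators = np.column_stack([(group == i).astype(float) for i in range(K - 1)])
beta_ind = fit(np.column_stack([intercept, indicators]), y)

# (2) Effect coding: same columns, but rows from the reference group are set to -1.
effects = indicators.copy()
effects[group == ref, :] = -1.0
beta_eff = fit(np.column_stack([intercept, effects]), y)

group_means = np.array([y[group == i].mean() for i in range(K)])
alpha_ref = -beta_eff[1:].sum()           # effect of the reference group

print("group means:      ", group_means)
print("indicator coding: ", beta_ind)     # [mean_ref, mean_i - mean_ref, ...]
print("effect coding:    ", beta_eff)     # [mu, alpha_1, ..., alpha_{K-1}]
print("alpha for ref grp:", alpha_ref)
```

Both fits reproduce the group means as fitted values; only the interpretation of the coefficients changes. With effect coding the intercept is the unweighted average of the group means, even when the groups have different sizes.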

Whatever way the model is fit, it’s important to remember that these ANOVA models are, in fact, regression models. All of the usual assumptions on \( \varepsilon_{ij} \) still hold. A particularly important one to check is constant variance, since we know (by definition) that well–defined subgroups do exist in the data.

Say the overall F–test is significant; that is, there is a significant difference in the average target variable value between groups. Which groups are different from each other? This is a multiple comparisons question. We could look at all \( I = \binom{K}{2} = K(K-1)/2 \) pairs, and test each using an indicator variable fit with one of the groups as the reference. However, at a .05 level (what is termed a pairwise error rate), 5% of the tests would be expected to be significant by random chance alone! So, for example, if there are 7 groups (\( K = 7 \)), \( I = \binom{7}{2} = 21 \) tests would be made, implying that on average one pair would be assessed as statistically significantly different even when there is no difference between any of the groups (this approach is sometimes called the Fisher method, or the method of least significant difference).

Multiple comparisons procedures correct for this by controlling the experimentwise

error rate. An experimentwise rate of .05 says that in repeated sampling from a population

where there is no difference between groups, only 5% of the time would any pair of groups be considered significantly different from each other. There are many different approaches

to handling multiple comparisons, the most common of which are the Bonferroni and

Tukey methods. The Bonferroni method argues that if the experimentwise error rate is

desired to be α, each pairwise test should be done at an α/I level. So, for example, for

K = 7, each pairwise t–test would be done at a significance level of .05/21 = .00238. The

Bonferroni method is very general and very easy to apply, and usually does a good job of

controlling the experimentwise error rate. Its only drawback is that it can sometimes be

too conservative (that is, it does not reject the null when it should).
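The arithmetic behind the pairwise and Bonferroni levels is easy to reproduce. The sketch below computes the number of pairs, the expected number of falsely significant pairs at an uncorrected pairwise level, and the Bonferroni per-test level; the function name `pairwise_summary` is a hypothetical helper, not part of any package.

```python
from math import comb

def pairwise_summary(K, alpha=0.05):
    """Number of pairs, expected false positives at level alpha, Bonferroni level."""
    n_pairs = comb(K, 2)                  # I = K(K-1)/2 pairwise comparisons
    expected_false = n_pairs * alpha      # expected significant pairs if no group differs
    bonferroni_level = alpha / n_pairs    # per-test level for experimentwise rate alpha
    return n_pairs, expected_false, bonferroni_level

n_pairs, expected_false, level = pairwise_summary(7)
print(n_pairs, expected_false, round(level, 5))
```

For seven groups this reproduces the 21 tests, roughly one expected false positive, and the per-test level of about .00238 quoted above.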

The Tukey method is a multiple comparison method specifically derived for ANOVA

multiple comparisons. As such it is less general than the Bonferroni approach, but is

usually less conservative (particularly if the design is balanced).

An alternative approach to the multiple comparisons problem introduced in the last 10-15 years is based on controlling the false discovery rate, which is the expected proportion of falsely rejected hypotheses among all rejected hypotheses. If all of the null hypotheses are true (that is, in the ANOVA context all of the group means are equal to each other) this is the same as the experimentwise rate controlled by the Bonferroni and Tukey methods, but when some of the null hypotheses are false it is a less stringent criterion, making the test more sensitive and less conservative. Minitab does not provide this method at this time.
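The notes do not say which false discovery rate procedure they have in mind. As an assumption, the sketch below uses the Benjamini-Hochberg step-up procedure, one widely used choice, and compares it with Bonferroni on a made-up list of pairwise p-values. The function names and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns one reject flag per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:                  # compare p_(k) with (k/m) q
            k_max = rank                                   # keep the largest such rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Bonferroni: each test is carried out at level alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Hypothetical p-values from six pairwise comparisons.
p = [0.001, 0.008, 0.020, 0.030, 0.200, 0.600]
bh = benjamini_hochberg(p)
bonf = bonferroni(p)
print("BH rejects:        ", bh)
print("Bonferroni rejects:", bonf)
```

On these made-up values the false discovery rate approach rejects more hypotheses than Bonferroni does, which is the sense in which it is less conservative.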

As we noted earlier, the ANOVA situation (where there are by definition well-defined

subgroups in the data) is one where heteroscedasticity (nonconstant variance) is common,

with the errors for observations from different subgroups having different variances. This

is a clear violation of the assumptions of ordinary least squares, but fortunately, there is a

direct cure for the problem: weighted least squares.

The idea behind weighted least squares (WLS) is that least squares is still a good thing to do if the target and predicting variables are transformed to give a model with errors with constant variance. Say \( V(\varepsilon_i) = \sigma_i^2 \). To keep things simple, consider a simple regression model, although everything here carries over directly to multiple regression and ANOVA situations. The regression model is

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i. \]

If we divide both sides of this equation by \( \sigma_i \) we get

\[ \frac{y_i}{\sigma_i} = \beta_0 \left( \frac{1}{\sigma_i} \right) + \beta_1 \left( \frac{x_i}{\sigma_i} \right) + \frac{\varepsilon_i}{\sigma_i}. \]

This can be rewritten

\[ y_i^* = \beta_0 z_{1i} + \beta_1 z_{2i} + \delta_i, \]

where \( y_i^* \), \( z_{1i} \), \( z_{2i} \), and \( \delta_i \) are the obvious substitutions from the previous equation and \( V(\delta_i) = 1 \) for all \( i \). Thus, ordinary least squares (OLS) estimation (without an intercept term) of \( y^* \) on \( z_1 \) and \( z_2 \) gives fully efficient estimates of \( \beta_0 \) and \( \beta_1 \). Note that using a constant multiple of \( \sigma_i \) works just as well, since the only requirement is that \( V(\delta_i) \) be constant for all \( i \). Any good statistical package will include an option for providing a weight variable for WLS; while the standardization described here is going on in the background, it is completely invisible to the user.
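The equivalence described above can be checked numerically. The sketch below, on made-up data with error standard deviations assumed known, fits the transformed model by OLS without an intercept and compares the result with a direct solution of the weighted normal equations using weights \( 1/\sigma_i^2 \). It also checks that rescaling \( \sigma_i \) by a constant leaves the estimates unchanged. All data values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical data; sigma holds the (assumed known) error standard deviations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.5, 11.6])
sigma = np.array([0.5, 0.5, 1.0, 1.0, 2.0, 2.0])

# Transformed model: y* = beta0 * z1 + beta1 * z2 + delta, fit with no intercept.
y_star = y / sigma
Z = np.column_stack([1.0 / sigma, x / sigma])
beta_transformed, *_ = np.linalg.lstsq(Z, y_star, rcond=None)

# Direct WLS: solve (X'WX) b = X'Wy with W = diag(1 / sigma^2).
X = np.column_stack([np.ones_like(x), x])
w = 1.0 / sigma**2
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Using a constant multiple of sigma gives the same coefficient estimates.
scaled = 3.0 * sigma
Z_scaled = np.column_stack([1.0 / scaled, x / scaled])
beta_scaled, *_ = np.linalg.lstsq(Z_scaled, y / scaled, rcond=None)

print(beta_transformed, beta_wls, beta_scaled)
```

The three coefficient vectors agree up to rounding, which is why a package only needs the weight variable and can do the standardization behind the scenes.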

The value \( WT_i = 1/\sigma_i^2 \) is the \( i \)th value of the weighting variable. Ordinary least squares is a special case of WLS with \( WT_i = 1 \) for all \( i \) (and, in fact, most regression packages only include code for WLS, with OLS the default special case). The problem is that \( \sigma_i^2 \) is unknown, and must be estimated. Fortunately, this is easy to do in the ANOVA situation. Consider a situation where there is a predictor defining \( K \) subgroups in the data. The key is to assume that the errors for all of the observations that come from group \( j \) (say) have the same variance, \( \sigma_j^2 \) (note that these values are allowed to be different from one group to another). The weight for each of the observations from group \( j \) would then be \( 1/\hat{\sigma}_j^2 \), where \( \hat{\sigma}_j^2 \) is the estimate of the variance of the errors in the \( j \)th group. An estimate that is then easily available is to just separate the residuals by group membership, and then estimate \( \sigma_j^2 \) using the sample variance of the residuals for the members of group \( j \).
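A minimal sketch of this weight estimate follows, assuming the residuals come from an initial one–way fit in which each fitted value is the group mean. The salary figures (in thousands) and the group names used as dictionary keys are made up for illustration.

```python
from statistics import mean, variance

# Hypothetical salaries (in thousands) for three education groups.
data = {
    "High school": [28.0, 31.0, 30.0, 29.0],
    "College": [40.0, 52.0, 45.0, 35.0],
    "Postgraduate": [50.0, 90.0, 70.0, 62.0],
}

weights = {}
for name, ys in data.items():
    fitted = mean(ys)                          # one-way ANOVA fitted value = group mean
    residuals = [y - fitted for y in ys]       # residuals for this group only
    var_hat = variance(residuals)              # sample variance (n_j - 1 divisor)
    weights[name] = 1.0 / var_hat              # WT = 1 / estimated error variance

# One weight per observation, in the same order as the data.
obs_weights = [weights[name] for name, ys in data.items() for _ in ys]

for name, wt in weights.items():
    print(f"{name:12s} weight = {wt:.4f}")
```

Groups with more variable residuals receive smaller weights, so their observations count for less in the weighted fit.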

It is important to recognize the reasons behind the use of weighted least squares. The

goal is not to improve measures of fit like \( R^2 \) or \( F \); rather, the goal is to analyze the

data in an appropriate fashion. There are several advantages to addressing nonconstant

variance:

(1) The estimates of the regression coefficients are more efficient. That is, on average,

the WLS estimates should be closer to the true regression coefficients than the OLS

estimates are.

(2) More importantly, predictions are more sensible. If the underlying variability of a

certain type of observation is larger than that for another type of observation, the

prediction interval should reflect that. This is not done under OLS, but it is under

WLS. In particular, a rough prediction interval for the \( i \)th observation is no longer \( \pm 2s \) (using Minitab’s notation), but is rather \( \pm 2s/\sqrt{WT_i} \), since that corresponds to \( \pm 2\hat{\sigma}_i \).
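The rough interval in point (2) can be illustrated for the one–way case. The sketch below uses made-up data for two groups, takes the weights to be the reciprocal of each group's residual variance, computes the weighted residual standard error \( s \), and then forms the half-width \( 2s/\sqrt{WT} \) for each group. The data and names are hypothetical, and the formula for \( s \) (weighted residual sum of squares divided by \( n - K \)) is the usual WLS definition, assumed here rather than taken from the notes.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical data: group B is much more variable than group A.
groups = {"A": [9.0, 11.0, 10.0], "B": [15.0, 25.0, 20.0]}

# Residuals from the one-way fit (fitted value = group mean), then WT = 1 / variance.
residuals = {g: [y - mean(ys) for y in ys] for g, ys in groups.items()}
wt = {g: 1.0 / variance(res) for g, res in residuals.items()}

# Weighted residual standard error: s = sqrt(sum(WT * e^2) / (n - K)).
n = sum(len(ys) for ys in groups.values())
K = len(groups)
weighted_ss = sum(wt[g] * e**2 for g, res in residuals.items() for e in res)
s = sqrt(weighted_ss / (n - K))

# Rough prediction interval half-width for an observation in each group.
half_width = {g: 2 * s / sqrt(wt[g]) for g in groups}

print("s =", s)
print(half_width)
```

With weights built this way \( s \) comes out as 1, so each half-width is twice the estimated standard deviation of that group's errors: narrow for the stable group and wide for the variable one.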

Say we are in a situation where the categories have a natural ordering. We might

wonder if that ordering corresponds to a numerical scale. For example, say the target

variable is a person’s salary, and the grouping variable is the amount of schooling the

person has (High school, College, Postgraduate). Is the average change in salary when

going from High School education to College education roughly the same as when going

from College to Postgraduate? That is, is the relationship between salary and schooling

linear if schooling is on an equispaced scale of (say) 1, 2, 3? We can investigate this

question using a partial F–test.

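Define a predictor, Linear, that takes on equally spaced values corresponding to the ordering of the groups, such as 1, 2, 3. The question is whether an ordering in \( y \) is implied by the ordering of the groups. Consider the following two situations: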

Group           Average salary (first case)   Average salary (second case)
High school     $20,000                       $20,000
College         $35,000                       $35,000
Postgraduate    $50,000                       $65,000
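The two salary columns above can be checked for linearity directly. The sketch below scores the groups 1, 2, 3, fits a straight line to the three group averages in each column, and reports the residuals. This is only a check on the group averages, not the partial F–test itself, and the helper name `line_residuals` is hypothetical.

```python
from statistics import mean

def line_residuals(scores, means):
    """Residuals of the least-squares line of the group means on the scores."""
    xbar, ybar = mean(scores), mean(means)
    sxy = sum((x - xbar) * (m - ybar) for x, m in zip(scores, means))
    sxx = sum((x - xbar) ** 2 for x in scores)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return [m - (intercept + slope * x) for x, m in zip(scores, means)]

linear_scores = [1, 2, 3]                 # equispaced scores for the ordered groups
first_case = [20000, 35000, 50000]        # first salary column of the table
second_case = [20000, 35000, 65000]       # second salary column of the table

res_first = line_residuals(linear_scores, first_case)
res_second = line_residuals(linear_scores, second_case)
print(res_first)
print(res_second)
```

The residuals are all zero for the first column and nonzero for the second, so a single slope on the 1, 2, 3 scale describes only the first set of averages.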

In the first case, salary is linearly related to education level, since each increase in

education level is associated with a constant change in average salary. A regression model

on only Linear would fit these data well. That is a good thing, since the model on only

Linear is simpler than the full ANOVA model (it requires only two parameters, rather

than three). On the other hand, the second case is one where the average increase in salary

from College to Postgraduate is twice as large as that from High school to College. This

is not a linear relationship, and the model based on only Linear would not fit the data. A partial F–test comparing the full ANOVA model to the model on only Linear is a test

of whether the simpler model is adequate; if the test is not statistically significant, then