# Analysis of Variance (ANOVA)



ANOVA is a statistical technique that assesses potential differences in a scale-level dependent variable by a nominal-level variable having two or more categories. For example, an ANOVA can examine potential differences in IQ scores by country (US vs. Canada vs. Italy vs. Spain). ANOVA, developed by Ronald Fisher in 1918, extends the t-test and the z-test, which only allow the nominal-level variable to have two categories. The test is also called the Fisher analysis of variance.

ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVAs are useful for comparing (testing) three or more means (groups or variables).
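As a concrete illustration, the F ratio behind a one-way ANOVA can be computed directly from grouped data. This is a minimal sketch in pure Python; the three groups of scores below are made up for demonstration and are not data from this text.

```python
# Minimal one-way ANOVA F statistic (illustrative, made-up data).

def one_way_f(groups):
    """Return the F statistic for H0: all group means are equal."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: variability among the sample means
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: variability due to chance
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups for x in g)
    ms_between = ss_between / (k - 1)    # between-group mean square
    ms_within = ss_within / (n - k)      # within-group mean square
    return ms_between / ms_within

groups = [[98, 105, 110, 102, 99],
          [95, 100, 97, 103, 96],
          [110, 112, 108, 115, 109]]
print(one_way_f(groups))
```

In practice this F value is compared against the F distribution with (k − 1, n − k) degrees of freedom to obtain a p-value.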

There are three classes of models used in the analysis of variance, and these are outlined here.

Fixed-effects models

The fixed-effects model (class I) of analysis of variance applies to situations in which the experimenter applies one or more treatments to the subjects of the experiment to see whether the response-variable values change. This allows the experimenter to estimate the ranges of response-variable values that the treatment would generate in the population as a whole.

Random-effects models

The random-effects model (class II) is used when the treatments are not fixed. This occurs when the various factor levels are sampled from a larger population. Because the levels themselves are random variables, some assumptions and the method of contrasting the treatments (a multi-variable generalization of simple differences) differ from the fixed-effects model.

Mixed-effects models

A mixed-effects model (class III) contains experimental factors of both fixed and random-effects types, with appropriately different interpretations and analysis for the two types.

…randomly selected texts. The mixed-effects model would compare the (fixed) incumbent texts to randomly selected alternatives. Defining fixed and random effects has proven elusive, with competing definitions arguably leading toward a linguistic quagmire.

Characteristics of ANOVA

ANOVA is used in the analysis of comparative experiments, those in which only the difference in outcomes is of interest. The statistical significance of the experiment is determined by a ratio of two variances. This ratio is independent of several possible alterations to the experimental observations: adding a constant to all observations does not alter significance, and multiplying all observations by a constant does not alter significance. So the ANOVA statistical-significance result is independent of constant bias and scaling errors, as well as of the units used in expressing the observations. In the era of mechanical calculation it was common to subtract a constant from all observations (when equivalent to dropping leading digits) to simplify data entry. This is an example of data coding.
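This invariance is easy to check numerically. The sketch below uses made-up data and a hand-rolled F ratio: shifting every observation by a constant, or rescaling all of them, leaves the ratio unchanged.

```python
# Sketch: the one-way ANOVA F ratio is unchanged by adding a constant to all
# observations or by multiplying them all by a constant (made-up data).

def f_ratio(groups):
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

data = [[12.0, 15.0, 14.0], [10.0, 9.0, 11.0], [18.0, 17.0, 16.0]]
shifted = [[x - 10 for x in g] for g in data]   # e.g. dropping leading digits
scaled  = [[3 * x for x in g] for g in data]    # e.g. a change of units

# All three calls print (essentially) the same F value.
print(f_ratio(data), f_ratio(shifted), f_ratio(scaled))
```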

Example:

An experiment is performed at a company to compare three different types of food. Three types of food – Chinese, Italian, and Mexican – are tried, on four days selected randomly for each. The productivity of each day, measured by the number of items produced, is recorded, and the results are given in the table below:

[Table of daily productivity by food type; the values did not survive extraction.]

1) Do the mean numbers of produced items differ for at least two of the three types of food? Use α = .05.

2) Explain what the p-value found in part 1 means.

3) Which type(s) of food seem to be best?

4) Which type(s) of food seem to be worst?

Solution:

1) The parameters of interest are μ1, the mean number of items produced over all days when Chinese food is eaten; μ2, the same mean for Italian food; and μ3, the same mean for Mexican food.

H0 : μ1 = μ2 = μ3
Ha : at least two of the three means are unequal

The decision rule: reject H0 (accept Ha) if the calculated p-value is less than 0.05.

Test statistic, F = (variability among the sample means) / (variability due to chance)

The F-value = 10.66 and p-value = 0.0042, as computed in StatCrunch. Since the p-value < .05, H0 is rejected and Ha is accepted.

We interpret this as follows: at the 0.05 level of significance, the mean number of produced items differs for at least two of the three types of food.

2) If the mean productivity were the same for all three types of food – that is, if the null hypothesis held true – then the probability of observing three sample means as varied as, or more varied than, those obtained in this experiment would equal 0.0042. It is therefore very unlikely that the sample means would take such diverse values if all the population means were equal. This is the reason the alternative hypothesis was accepted.

3) The results of the multiple-comparison tests conducted on the mean numbers of items produced for each of the three types of food are shown in the table below. Interpretations are given in terms of which mean is largest, because "best", here, means largest.

[Table of pairwise comparisons and their t-values; the values did not survive extraction.]

From these analyses it is clear that Italian food is certainly not the best in terms of worker productivity. Mexican food may be best, but perhaps even Chinese food could be.

4) From the same analyses we can see that Italian food is the worst in terms of worker productivity.

Coefficient of Correlation

Correlation: The relationship between two or more variables is considered correlation. Correlation is a number that can be used to describe the relationship between two variables. Simple correlation is defined as related variation between any two variables; multiple correlation is related variation among three or more variables. Two variables are correlated when they vary in such a way that the higher and lower values of one variable correspond to the higher and lower values of the other variable. They may also be correlated when the higher values of one variable correspond with the lower values of the other.

Coefficient of Correlation

The coefficient of correlation, r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. It is also called the Pearson product-moment correlation coefficient. The algebraic method of measuring correlation is called the coefficient of correlation.

Types of correlation coefficients include:

- Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r: a measure of the strength and direction of the linear relationship between two variables, defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.

- Intraclass correlation: a descriptive statistic that can be used when quantitative measurements are made on units organized into groups; it describes how strongly units in the same group resemble each other.

- Rank correlation: the study of relationships between rankings of different variables, or different rankings of the same variable.

- Spearman's rank correlation coefficient: a measure of how well the relationship between two variables can be described by a monotonic function.

- Kendall tau rank correlation coefficient: a measure of the portion of ranks that match between two data sets.

- Goodman and Kruskal's gamma: a measure of the strength of association of cross-tabulated data when both variables are measured at the ordinal level.
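The difference between Pearson's r and a rank correlation can be seen on monotonic but nonlinear data. The sketch below uses illustrative values (y = x³) and computes Spearman's rho the simple way, as Pearson's r applied to the ranks; the data have no ties, so no tie correction is needed.

```python
# Pearson's r vs. Spearman's rho on a monotonic, nonlinear relationship.

def pearson(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5

def ranks(values):
    # Ranks 1..n (this example has no ties).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]              # monotonic, but not linear

print(pearson(x, y))                 # < 1: the relationship is not linear
print(pearson(ranks(x), ranks(y)))   # Spearman's rho = 1: perfectly monotonic
```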


Types of Correlation

1. Positive correlation

A positive correlation is a correlation in the same direction.

2. Negative correlation

A negative correlation is a correlation in the opposite direction.

3. Partial correlation

The correlation is partial if we study the relationship between two variables keeping all other variables constant.

Example:

The relationship between yield and rainfall at a constant temperature is a partial correlation.

4. Linear Correlation

When the change in one variable results in a constant change in the other variable, we say the correlation is linear. When there is a linear correlation, the points plotted will lie in a straight line.

Example:

Consider the variables with the following values.

X: 10 20 30 40 50
Y: 20 40 60 80 100

Here, there is a linear relationship between the variables: the ratio is 1:2 at all points, and if we plot them the points will lie in a straight line.
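A quick numerical check of this claim: applying the correlation formula to the X and Y values above gives r = 1, a perfect linear correlation.

```python
# Pearson's r for the perfectly linear data above (Y = 2X).
x = [10, 20, 30, 40, 50]
y = [20, 40, 60, 80, 100]

n = len(x)
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = ((n * sum(a * a for a in x) - sum(x) ** 2) *
       (n * sum(b * b for b in y) - sum(y) ** 2)) ** 0.5
print(num / den)  # 1.0
```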

Correlations are of three types:

- Positive correlation
- Negative correlation
- No correlation

In correlation, when the values of one variable increase with an increase in the other variable, it is a positive correlation. On the other hand, if the values of one variable decrease with a decrease in the other variable, it is a negative correlation. There may also be a case in which there is no change in one variable with any change in the other; this is defined as no correlation between the two.

Correlation Symbol

The symbol for correlation is r.

Correlation Formula

The formula for correlation is as follows:

r = [N∑XY − (∑X)(∑Y)] / √([N∑X² − (∑X)²][N∑Y² − (∑Y)²])

Where,

X and Y are the variables

b = the slope of the regression line, also called the regression coefficient

a = the intercept of the regression line on the y-axis

N = number of values or elements

X = first score

Y = second score

∑XY = sum of the products of first and second scores

∑X = sum of first scores

∑Y = sum of second scores

∑X² = sum of squared first scores

∑Y² = sum of squared second scores


Positive Correlation

A positive correlation is a relationship between two variables in which both variables move in the same direction. A positive correlation exists when, as one variable increases, the other variable also increases, and vice versa. When the values of two variables x and y move in the same direction, the correlation is said to be positive. That is, in positive correlation, when there is an increase in x there will be an increase in y; similarly, when there is a decrease in x there will be a decrease in y.

Positive Correlation Example


When price increases, supply also increases; when price decreases, supply decreases.

Positive Correlation Graph

[Scatter plots illustrating strong positive correlation and weak positive correlation; the images did not survive extraction.]

Negative Correlation

In a negative correlation, as the values of one variable increase, the values of the second variable decrease; or, as the value of one variable decreases, the value of the other variable increases. When the values of two variables x and y move in opposite directions, we say the correlation is negative. That is, in negative correlation, when there is an increase in x there will be a decrease in y; similarly, when there is a decrease in x there will be an increase in y.

Negative Correlation Example

[The example and graph did not survive extraction.]

Perfect Negative Correlation

The closer the correlation coefficient is to either −1 or +1, the stronger the relationship between the two variables. A perfect negative correlation of −1.0 indicates that for every member of the sample, a higher score on one variable is related to a lower score on the other variable.

Solved Example Question:

Determine the correlation value for the given set of X and Y values:

X Values  Y Values
21        2.5
23        3.1
37        4.2
19        5.6
24        6.4
33        8.4

Solution:

Count the number of values: N = 6.

Determine the values of XY, X², and Y²:

X     Y     X·Y     X²     Y²
21    2.5   52.5    441    6.25
23    3.1   71.3    529    9.61
37    4.2   155.4   1369   17.64
19    5.6   106.4   361    31.36
24    6.4   153.6   576    40.96
33    8.4   277.2   1089   70.56

Determine the values of ∑X, ∑Y, ∑XY, ∑X², and ∑Y²:

∑X = 157
∑Y = 30.2
∑XY = 816.4
∑X² = 4365
∑Y² = 176.38

Correlation (r) = [N∑XY − (∑X)(∑Y)] / √([N∑X² − (∑X)²][N∑Y² − (∑Y)²])
= [6(816.4) − (157)(30.2)] / √([6(4365) − 157²][6(176.38) − 30.2²])
= 157 / √(1541 × 146.24)

r ≈ 0.33
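The worked example above can be checked in a few lines of Python:

```python
# Recomputing the solved example to verify r ≈ 0.33.
x = [21, 23, 37, 19, 24, 33]
y = [2.5, 3.1, 4.2, 5.6, 6.4, 8.4]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
print(round(r, 2))  # 0.33
```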

## Regression Line

Definition: The regression line is the line that best fits the data, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest. In other words, a line used to minimize the squared deviations of predictions is called the regression line.

Regression is concerned with the study of relationships among variables. The aim of regression (or regression analysis) is to build models for prediction and for making other inferences. Two or more variables may be treated by regression.

The regression line is usually written as Ŷ = a + bX. The general properties of the regression line Ŷ = a + bX are given below:

o We know that Ȳ = a + bX̄. This shows that the line passes through the means X̄ and Ȳ.

o The sum of the errors is equal to zero. The regression equation is Ŷ = a + bX, and the sum of the deviations of the observed Y from the estimated Ŷ is

∑(Y − Ŷ) = ∑(Y − a − bX) = ∑Y − na − b∑X = 0

[since ∑Y = na + b∑X]

When ∑(Y − Ŷ) = 0, it follows that ∑Y = ∑Ŷ.

In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column shows statistics grades. The last two rows show the sums and mean scores that we will use to compute the slope b1.

[Table of student aptitude scores xi and statistics grades yi; the values did not survive extraction.]

Computations are shown below.

b1 = Σ[(xi − x̄)(yi − ȳ)] / Σ[(xi − x̄)²]
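Since the aptitude/grade table did not survive extraction, the sketch below fits a least-squares line to made-up (xi, yi) pairs and verifies the two properties stated earlier: the residuals sum to zero, and the line passes through the means.

```python
# Least-squares regression line on made-up (x, y) data, checking that the
# residuals sum to ~0 and the intercept makes the line pass through (x̄, ȳ).
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Slope: b1 = Σ[(xi − x̄)(yi − ȳ)] / Σ[(xi − x̄)²]
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar              # intercept: line passes through the means

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
print(b1, b0, sum(residuals))        # the residual sum is ~0
```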