
Mathematical Model

Equation, formula

$E = mc^2$

$V = IR$

Mathematical Model

(

log

θ

)

cos2 3 2 x y=

$PV = NRT$

P: pressure, V: volume, T: temperature, N: number of moles, R: universal gas constant

Assumptions: ideal gas, static and closed environment

Mathematical Model

Ideal gas law : PV=NRT

Q1 : Is this relationship true?

Q2 : What is the value of the constant R?

Answer these questions with a set of measurements $(P_i, V_i, T_i, N_i)$:

$R = \dfrac{P_i V_i}{N_i T_i}$

Errors due to unknown outside factors exist.

Statistical Model

Observed data: $p = P + \delta_p$, $v = V + \delta_v$, $t = T + \delta_t$, $n = N + \delta_n$

$\delta$: unobserved measurement errors (random)

Ideal gas law: $PV = NRT$

$(p - \delta_p)(v - \delta_v) = R\,(n - \delta_n)(t - \delta_t)$

$pv = nRt + v\delta_p + p\delta_v - \delta_p\delta_v - Rt\delta_n - Rn\delta_t + R\delta_n\delta_t$
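The effect of the measurement errors on the estimate of R can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data are synthetic (generated from the ideal gas law with R = 8.314 J/(mol K)), the errors are assumed normal with a relative standard deviation of 0.5%, and the names `simulate_measurement` and `estimate_R` are hypothetical.

```python
import random

R_TRUE = 8.314  # J/(mol K); used only to generate the synthetic data


def simulate_measurement(rng, rel_sd=0.005):
    """Return one observed (p, v, t, n): true value plus a random error."""
    N = rng.uniform(0.5, 2.0)       # true number of moles
    T = rng.uniform(280.0, 350.0)   # true temperature (K)
    V = rng.uniform(0.01, 0.05)     # true volume (m^3)
    P = N * R_TRUE * T / V          # true pressure (Pa) from PV = NRT

    def noisy(x):
        # observed = true + delta, with delta ~ N(0, (rel_sd * x)^2)
        return x + rng.gauss(0.0, rel_sd * x)

    return noisy(P), noisy(V), noisy(T), noisy(N)


def estimate_R(measurements):
    """Average the per-measurement ratios p*v / (n*t)."""
    ratios = [p * v / (n * t) for p, v, t, n in measurements]
    return sum(ratios) / len(ratios)


rng = random.Random(0)  # fixed seed so the run is reproducible
data = [simulate_measurement(rng) for _ in range(200)]
R_hat = estimate_R(data)
print(f"estimated R = {R_hat:.3f} (value used to simulate: {R_TRUE})")
```

Each single ratio differs from R because of the error terms in the expansion above; averaging many measurements reduces the random part.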

Statistical Model

Data = Systematic component + Random errors

Model parameter: unknown parameter in the systematic component, e.g. the universal gas constant R

Analysis of Variance Model (ANOVA)

One-way ANOVA Compare multiple populations

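Population 1: mean $\mu_1$, variance $\sigma_1^2$; sample $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$

Population 2: mean $\mu_2$, variance $\sigma_2^2$; sample $Y_{21}, Y_{22}, \ldots, Y_{2n_2}$

...

Population $a$: mean $\mu_a$, variance $\sigma_a^2$; sample $Y_{a1}, Y_{a2}, \ldots, Y_{an_a}$

Assumptions

1. Normal

2. Equal Variances

3. Independence

Under these assumptions the populations are $N(\mu_1, \sigma^2)$, $N(\mu_2, \sigma^2)$, ..., $N(\mu_a, \sigma^2)$.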

One-way ANOVA

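Total sample size: $N = \sum_{i=1}^{a} n_i$

Overall population mean (grand mean): $\mu = \dfrac{1}{N}\sum_{i=1}^{a} n_i \mu_i$

$i$th treatment effect: $\alpha_i = \mu_i - \mu$, so that $\sum_{i=1}^{a} n_i \alpha_i = 0$

Random errors: $\varepsilon_{ij} = Y_{ij} - \mu_i = Y_{ij} - \mu - \alpha_i$

ANOVA model

$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, $j = 1, 2, \ldots, n_i$, $i = 1, 2, \ldots, a$

$\sum_{i=1}^{a} n_i \alpha_i = 0$, $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$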

One-way ANOVA

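ANOVA model

$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, $j = 1, 2, \ldots, n_i$, $i = 1, 2, \ldots, a$

$\sum_{i=1}^{a} n_i \alpha_i = 0$, $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$

Between group: the group means $\mu + \alpha_1$, $\mu + \alpha_2$, ..., $\mu + \alpha_a$ are spread around $\mu$.

Within group (e.g. group 2): $Y_{21} = \mu + \alpha_2 + \varepsilon_{21}$, $Y_{22} = \mu + \alpha_2 + \varepsilon_{22}$, ..., $Y_{2n_2} = \mu + \alpha_2 + \varepsilon_{2n_2}$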

Test for Treatment Effects

$H_0$: Populations are the same. vs $H_1$: Populations are not all the same.

$H_0$: There is no treatment effect. vs $H_1$: There are treatment effects.

$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_a = 0$ vs $H_1$: some $\alpha_i \neq 0$

$i$th sample mean: $\bar Y_i = \dfrac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$

Overall sample mean: $\bar Y = \dfrac{1}{N}\sum_{i=1}^{a}\sum_{j=1}^{n_i} Y_{ij} = \dfrac{1}{N}\sum_{i=1}^{a} n_i \bar Y_i$

Total sum of squares: $SS_T = \sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y)^2$

Treatment sum of squares (between group variation): $SS_A = \sum_{i=1}^{a}\sum_{j=1}^{n_i} (\bar Y_i - \bar Y)^2 = \sum_{i=1}^{a} n_i (\bar Y_i - \bar Y)^2$

Error sum of squares (within group variation): $SS_E = \sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2$

Test for Treatment Effects

Breakdown of sum of squares

$Y_{ij} - \bar Y = (\bar Y_i - \bar Y) + (Y_{ij} - \bar Y_i)$

$\sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y)^2 = \sum_{i=1}^{a} n_i (\bar Y_i - \bar Y)^2 + \sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2$

$SS_T = SS_A + SS_E$

Treatment mean squares: $MS_A = \dfrac{SS_A}{a - 1} = \dfrac{1}{a - 1}\sum_{i=1}^{a} n_i (\bar Y_i - \bar Y)^2$

Error mean squares: $MS_E = \dfrac{SS_E}{N - a} = \dfrac{1}{N - a}\sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2$

If $H_0$ is not true, the $\mu_i$ (equivalently the $\alpha_i$) are not all the same, so the variation of the $\bar Y_i$ around $\bar Y$ is large and $MS_A$ tends to be large.

$MS_E$ is unaffected by the population means.

Test for Treatment Effects

Test statistic: $F = \dfrac{MS_A}{MS_E}$

Reject $H_0$ if $F$ is too large.

Reject $H_0$ if $F_{obs} > F(a-1, N-a, \alpha)$, where $F(a-1, N-a, \alpha)$ is obtained from the F distribution table.

Treatment mean squares: $MS_A = \dfrac{SS_A}{a - 1} = \dfrac{1}{a - 1}\sum_{i=1}^{a} n_i (\bar Y_i - \bar Y)^2$

Error mean squares: $MS_E = \dfrac{SS_E}{N - a} = \dfrac{1}{N - a}\sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2$

F Distribution

$X \sim F(r_1, r_2)$

$f(x) = \dfrac{\Gamma\left(\frac{r_1 + r_2}{2}\right)}{\Gamma\left(\frac{r_1}{2}\right)\Gamma\left(\frac{r_2}{2}\right)} \left(\dfrac{r_1}{r_2}\right)^{r_1/2} x^{r_1/2 - 1} \left(1 + \dfrac{r_1}{r_2}x\right)^{-(r_1 + r_2)/2}$, $x > 0$

$E(X) = \dfrac{r_2}{r_2 - 2}$

$Var(X) = \dfrac{2 r_2^2 (r_1 + r_2 - 2)}{r_1 (r_2 - 2)^2 (r_2 - 4)}$

[Figure: F densities for $(r_1, r_2)$ = (2, 4), (4, 6), (9, 9), (12, 12).]

F Distribution

[Figure: density of $F(r_1, r_2)$ with upper-tail area $\alpha$ to the right of the critical value $F(r_1, r_2, \alpha)$.]

F Distribution Table

$F(3, 4, 0.05) = ?$

ANOVA Table

$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_a = 0$ vs $H_1$: some $\alpha_i \neq 0$

Test statistic: $F = \dfrac{MS_A}{MS_E}$

Reject $H_0$ if $F_{obs} > F(a-1, N-a, \alpha)$.

Source       SS      d.f.     MS               F-ratio
Treatment    SS_A    a - 1    SS_A/(a - 1)     MS_A/MS_E
Error        SS_E    N - a    SS_E/(N - a)
Total        SS_T    N - 1

Computational Formulae

$i$th total: $T_i = \sum_{j=1}^{n_i} Y_{ij}$; overall total: $T_{..} = \sum_{i=1}^{a} T_i$

$SS_A = \sum_{i=1}^{a} n_i (\bar Y_i - \bar Y)^2 = \sum_{i=1}^{a} \dfrac{T_i^2}{n_i} - \dfrac{T_{..}^2}{N}$

$SS_E = \sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2 = \sum_{i=1}^{a}\sum_{j=1}^{n_i} Y_{ij}^2 - \sum_{i=1}^{a} \dfrac{T_i^2}{n_i}$

$SS_T = \sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y)^2 = \sum_{i=1}^{a}\sum_{j=1}^{n_i} Y_{ij}^2 - \dfrac{T_{..}^2}{N}$

One-way ANOVA

Example : Color brightness of films

Brand    Data                                                          T_i
Kodak    32, 34, 31, 30, 37, 28, 28, 27, 30, 32, 26, 29, 27, 30, 31    452
Fuji     43, 41, 44, 50, 47, 32, 32, 36, 35, 34, 32, 38, 38, 40, 36    578
Agfa     23, 24, 25, 21, 26, 25, 27, 26, 22, 25, 27, 30, 25, 25, 27    378

$H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$ vs $H_1$: not $H_0$, $\alpha = 0.05$

$n_1 = n_2 = n_3 = 15$, $a = 3$, $N = 45$

$T_1 = 452$, $T_2 = 578$, $T_3 = 378$, $T_{..} = 452 + 578 + 378 = 1408$

$\sum_{i=1}^{3}\sum_{j=1}^{15} Y_{ij}^2 = 46040$

$SS_A = \sum_{i=1}^{3} \dfrac{T_i^2}{n_i} - \dfrac{T_{..}^2}{N} = \dfrac{452^2}{15} + \dfrac{578^2}{15} + \dfrac{378^2}{15} - \dfrac{1408^2}{45} = 1363.38$

$SS_T = \sum_{i=1}^{a}\sum_{j=1}^{n_i} Y_{ij}^2 - \dfrac{T_{..}^2}{N} = 46040 - \dfrac{1408^2}{45} = 1985.24$

$SS_E = SS_T - SS_A = 1985.24 - 1363.38 = 621.86$

One-way ANOVA

Source       SS         d.f.    MS        F-ratio
Treatment    1363.38    2       681.69    46.03
Error        621.86     42      14.81
Total        1985.24    44

From the F distribution table, $F(2, 42, 0.05) \approx F(2, 40, 0.05) = 3.23$.

F-ratio $= 46.03 > 3.23$

Reject $H_0$ at $\alpha = 0.05$.

The color brightness of the three brands of film differs significantly.
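The film example can be checked with a few lines of Python. This is a minimal sketch using the computational formulae for $SS_A$, $SS_T$ and $SS_E$; the function name `one_way_anova` and the dictionary layout are illustrative choices, and the critical value 3.23 is taken from the table lookup quoted in the example rather than computed.

```python
# Color brightness data for the three film brands (15 observations each).
data = {
    "Kodak": [32, 34, 31, 30, 37, 28, 28, 27, 30, 32, 26, 29, 27, 30, 31],
    "Fuji": [43, 41, 44, 50, 47, 32, 32, 36, 35, 34, 32, 38, 38, 40, 36],
    "Agfa": [23, 24, 25, 21, 26, 25, 27, 26, 22, 25, 27, 30, 25, 25, 27],
}


def one_way_anova(groups):
    """One-way ANOVA table entries from the computational formulae."""
    a = len(groups)                                   # number of treatments
    N = sum(len(g) for g in groups)                   # total sample size
    T = sum(sum(g) for g in groups)                   # overall total T..
    sum_sq = sum(y * y for g in groups for y in g)    # sum of Y_ij^2
    correction = T ** 2 / N
    ss_a = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_t = sum_sq - correction
    ss_e = ss_t - ss_a
    ms_a = ss_a / (a - 1)
    ms_e = ss_e / (N - a)
    return {"ss_a": ss_a, "ss_e": ss_e, "ss_t": ss_t,
            "df_a": a - 1, "df_e": N - a,
            "ms_a": ms_a, "ms_e": ms_e, "f": ms_a / ms_e}


result = one_way_anova(list(data.values()))
F_CRIT = 3.23  # F(2, 40, 0.05) from the table, as used in the example

for key in ("ss_a", "ss_e", "ss_t", "ms_a", "ms_e", "f"):
    print(f"{key:5s} = {result[key]:.2f}")
print("reject H0" if result["f"] > F_CRIT else "do not reject H0")
```

Working from totals and the sum of squared observations mirrors the hand calculation and avoids computing each deviation separately.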

Estimation

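Treatment effect: $\alpha_i$

Point: $\bar Y_i - \bar Y$

Interval: $(\bar Y_i - \bar Y) \pm t_{N-a, \alpha/2}\sqrt{MS_E\left(\dfrac{1}{n_i} - \dfrac{1}{N}\right)}$

Difference in treatment effects: $\alpha_i - \alpha_j$

Point: $\bar Y_i - \bar Y_j$

Interval: $(\bar Y_i - \bar Y_j) \pm t_{N-a, \alpha/2}\sqrt{MS_E\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}$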

Estimation

Example : Color brightness of films

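$\bar Y_1 = \dfrac{452}{15} = 30.13$, $\bar Y_2 = \dfrac{578}{15} = 38.53$, $\bar Y_3 = \dfrac{378}{15} = 25.2$, $\bar Y = \dfrac{1408}{45} = 31.29$

95% C.I. for $\alpha_1$:

$(\bar Y_1 - \bar Y) \pm t_{42, 0.025}\sqrt{MS_E\left(\dfrac{1}{n_1} - \dfrac{1}{N}\right)} = (30.13 - 31.29) \pm (2.021)\sqrt{(14.81)\left(\dfrac{1}{15} - \dfrac{1}{45}\right)} = -1.16 \pm 1.64$

$[-2.80, 0.48]$

95% C.I. for $\alpha_2 - \alpha_3$:

$(\bar Y_2 - \bar Y_3) \pm t_{42, 0.025}\sqrt{MS_E\left(\dfrac{1}{n_2} + \dfrac{1}{n_3}\right)} = (38.53 - 25.2) \pm (2.021)\sqrt{(14.81)\left(\dfrac{1}{15} + \dfrac{1}{15}\right)} = 13.33 \pm 2.84$

$[10.49, 16.17]$

95% C.I. for $\alpha_1 - \alpha_2$: $[-11.24, -5.56]$

95% C.I. for $\alpha_1 - \alpha_3$: $[2.09, 7.77]$

$\alpha_2 > \alpha_3$, $\alpha_1 < \alpha_2$, $\alpha_1 > \alpha_3$, so $\alpha_3 < \alpha_1 < \alpha_2$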
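The interval formulas can be evaluated directly. The sketch below plugs in the quantities quoted in the example ($T_i$, $MS_E = 14.81$, and the tabulated $t$ value 2.021 used for $t_{42, 0.025}$); the helper names `effect_ci` and `difference_ci` are hypothetical, and a balanced design with 15 observations per brand is assumed.

```python
import math

totals = {"Kodak": 452, "Fuji": 578, "Agfa": 378}  # T_1, T_2, T_3
N_PER_GROUP = 15      # n_1 = n_2 = n_3
N_TOTAL = 45          # N
MS_E = 14.81          # error mean square from the ANOVA table
T_CRIT = 2.021        # tabulated value used in the example for t(42, 0.025)

means = {brand: t / N_PER_GROUP for brand, t in totals.items()}
grand_mean = sum(totals.values()) / N_TOTAL


def effect_ci(brand):
    """C.I. for the treatment effect alpha_i of one brand."""
    estimate = means[brand] - grand_mean
    half = T_CRIT * math.sqrt(MS_E * (1 / N_PER_GROUP - 1 / N_TOTAL))
    return estimate - half, estimate + half


def difference_ci(brand_i, brand_j):
    """C.I. for the difference alpha_i - alpha_j between two brands."""
    estimate = means[brand_i] - means[brand_j]
    half = T_CRIT * math.sqrt(MS_E * (1 / N_PER_GROUP + 1 / N_PER_GROUP))
    return estimate - half, estimate + half


print("alpha_1          :", [round(v, 2) for v in effect_ci("Kodak")])
print("alpha_2 - alpha_3:", [round(v, 2) for v in difference_ci("Fuji", "Agfa")])
print("alpha_1 - alpha_2:", [round(v, 2) for v in difference_ci("Kodak", "Fuji")])
print("alpha_1 - alpha_3:", [round(v, 2) for v in difference_ci("Kodak", "Agfa")])
```

An interval for a difference that excludes zero indicates that the two treatment effects differ at the 5% level, which is how the ordering of the three brands is read off.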

Two way ANOVA

Example : Brightness of synthetic fabric

                     Temperature
Time (cycles)    350°F         375°F         400°F
40               38, 32, 30    37, 35, 40    36, 39, 43
50               40, 45, 36    39, 42, 46    39, 48, 47

Two-way factorial ANOVA model:

$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$, $i = 1, 2$, $j = 1, 2, 3$, $k = 1, 2, 3$

$\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$

$\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$

Two way ANOVA

Example : Brightness of synthetic fabric

Two-way factorial ANOVA model:

$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$, $i = 1, 2$, $j = 1, 2, 3$, $k = 1, 2, 3$

$\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$

$\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$

MTB > print 'Bright' 'Time' 'Temp'

Data Display

Row  Bright  Time  Temp
  1      38    40   350
  2      32    40   350
  3      30    40   350
  4      37    40   375
  5      35    40   375
  6      40    40   375
  7      36    40   400
  8      39    40   400
  9      43    40   400
 10      40    50   350
 11      45    50   350
 12      36    50   350
………

MTB > ANOVA 'Bright' = Time Temp Time*Temp.

Analysis of Variance (Balanced Designs)

Factor  Type   Levels  Values
Time    fixed       2  40  50
Temp    fixed       3  350  375  400

Analysis of Variance for Bright

Source      DF      SS      MS     F      P
Time         1  150.22  150.22  9.69  0.009
Temp         2   80.78   40.39  2.61  0.115
Time*Temp    2    3.44    1.72  0.11  0.896
Error       12  186.00   15.50
Total       17  420.44
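The sums of squares in the Minitab output can be reproduced by hand. The following Python sketch assumes a balanced design (three replicates in every Time by Temperature cell) and uses the fabric brightness data from the example table; the function name `two_way_anova` is illustrative, and p-values are not computed.

```python
# Brightness by (time in cycles, temperature in °F); three replicates per cell.
cells = {
    (40, 350): [38, 32, 30],
    (40, 375): [37, 35, 40],
    (40, 400): [36, 39, 43],
    (50, 350): [40, 45, 36],
    (50, 375): [39, 42, 46],
    (50, 400): [39, 48, 47],
}


def two_way_anova(cells):
    """Balanced two-way factorial ANOVA: sums of squares, d.f. and F-ratios."""
    times = sorted({key[0] for key in cells})
    temps = sorted({key[1] for key in cells})
    r = len(next(iter(cells.values())))            # replicates per cell
    n = r * len(times) * len(temps)
    grand_total = sum(sum(v) for v in cells.values())
    correction = grand_total ** 2 / n

    ss_total = sum(y * y for v in cells.values() for y in v) - correction
    ss_time = sum(sum(sum(cells[(a, b)]) for b in temps) ** 2
                  for a in times) / (r * len(temps)) - correction
    ss_temp = sum(sum(sum(cells[(a, b)]) for a in times) ** 2
                  for b in temps) / (r * len(times)) - correction
    ss_cells = sum(sum(v) ** 2 for v in cells.values()) / r - correction
    ss_inter = ss_cells - ss_time - ss_temp        # interaction
    ss_error = ss_total - ss_cells

    df = {"time": len(times) - 1, "temp": len(temps) - 1}
    df["inter"] = df["time"] * df["temp"]
    df["error"] = n - len(times) * len(temps)
    ss = {"time": ss_time, "temp": ss_temp, "inter": ss_inter,
          "error": ss_error, "total": ss_total}
    ms_error = ss_error / df["error"]
    f = {k: (ss[k] / df[k]) / ms_error for k in ("time", "temp", "inter")}
    return ss, df, f


ss, df, f = two_way_anova(cells)
for source in ("time", "temp", "inter"):
    print(f"{source:6s} df={df[source]} SS={ss[source]:.2f} F={f[source]:.2f}")
print(f"error  df={df['error']} SS={ss['error']:.2f}")
```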

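Only the Time effect is significant (P = 0.009).

Interaction

[Figure: group mean against temperature (350, 375, 400) for Time = 40 and Time = 50, in two panels: Additive (parallel lines) and Non-additive (non-parallel lines).]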

Regression

Sir Francis Galton (1822 – 1911)

[Figure: height of son against height of father.]

The heights of the sons of fathers regressed towards the mean height of the population.

Regression

Regression: model the relationship between a dependent variable and independent variable(s)

Linear relationship: Linear Regression

One independent variable: Simple Linear Regression

Examples

Independent variable (X)      Dependent variable (Y)
Extent of training            Job performance
Risk of the stock             Return of a stock
A-Level Score                 Overall CGA
Tree age (by tree rings)      Tree age (by C14)

Simple Linear Regression

[Figure: scatterplot with the regression line.]

Simple Linear Regression Model

$Y_i = \alpha + \beta X_i + \varepsilon_i$, $i = 1, 2, \ldots, n$

Assumptions: $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$

Data: $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$

Simple Linear Regression Model

Example : Y = Height of son (in cm) X = Height of father (in cm)

Suppose the true relation is given by $Y = 0.9X + 15$.

Fathers with the same height would then have sons with the same height. Unrealistic!

More reasonable relationship: $E(Y) = 0.9X + 15$

X (Observed)    E(Y) = 0.9X + 15 (Unobserved)    ε, Random Error (Unobserved)    Y (Observed)
170             168                               1.3                            169.3
175             172.5                            -0.8                            171.7
180             177                              -2.4                            174.6
185             181.5                             0.7                            182.2

Observed data only:

X      Y
170    169.3
175    171.7
180    174.6
185    182.2

Estimate the regression line from these observed data: fit a regression line to the data.

Estimation of Model Parameters

Sample statistics

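$\bar X = \dfrac{1}{n}\sum_{i=1}^{n} X_i$, $\bar Y = \dfrac{1}{n}\sum_{i=1}^{n} Y_i$

$S_{xx} = \sum_{i=1}^{n} (X_i - \bar X)^2 = \sum_{i=1}^{n} X_i^2 - n\bar X^2$

$S_{yy} = \sum_{i=1}^{n} (Y_i - \bar Y)^2 = \sum_{i=1}^{n} Y_i^2 - n\bar Y^2$

$S_{xy} = \sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y) = \sum_{i=1}^{n} X_i Y_i - n\bar X\bar Y$

$\hat\beta = b = \dfrac{S_{xy}}{S_{xx}}$

$\hat\alpha = a = \bar Y - b\bar X$

Fitted regression line: $\hat Y = a + bX$

True regression line: $E(Y) = \alpha + \beta X$

Fitted regression line ≠ true regression line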

Fitting Regression Line

Example : Study of how wheat yield depends on fertilizer.

X = Fertilizer (in lb/acre), Y = Yield (in bu/acre)

X    100  200  300  400  500  600  700
Y     40   50   50   70   65   65   80

$\bar X = 400$, $\bar Y = 60$

$\sum_{i=1}^{7} X_i^2 = 1400000$, $\sum_{i=1}^{7} Y_i^2 = 26350$, $\sum_{i=1}^{7} X_i Y_i = 184500$

$S_{xx} = \sum_{i=1}^{n} X_i^2 - n\bar X^2 = 1400000 - (7)(400)^2 = 280000$

$S_{xy} = \sum_{i=1}^{n} X_i Y_i - n\bar X\bar Y = 184500 - (7)(400)(60) = 16500$

$S_{yy} = \sum_{i=1}^{n} Y_i^2 - n\bar Y^2 = 26350 - (7)(60)^2 = 1150$

$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{16500}{280000} = 0.059$

$a = \bar Y - b\bar X = 60 - (0.059)(400) = 36.43$ (computed with the unrounded $b = 0.0589\ldots$)

Fitted regression line: $\hat Y = 36.43 + 0.059X$
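The least-squares fit for the wheat yield data takes only a few lines. This is a minimal sketch of the formulas $b = S_{xy}/S_{xx}$ and $a = \bar Y - b\bar X$; the function name `fit_line` is an illustrative choice.

```python
# Wheat yield example: fertilizer (lb/acre) and yield (bu/acre).
x = [100, 200, 300, 400, 500, 600, 700]
y = [40, 50, 50, 70, 65, 65, 80]


def fit_line(x, y):
    """Least-squares intercept a and slope b, plus S_xx, S_xy and S_yy."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum(xi * xi for xi in x) - n * x_bar ** 2
    s_yy = sum(yi * yi for yi in y) - n * y_bar ** 2
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b, s_xx, s_xy, s_yy


a, b, s_xx, s_xy, s_yy = fit_line(x, y)
print(f"S_xx = {s_xx:.0f}, S_xy = {s_xy:.0f}, S_yy = {s_yy:.0f}")
print(f"fitted line: Y = {a:.2f} + {b:.3f} X")


def predict(x0):
    """Point prediction on the fitted line."""
    return a + b * x0


print(f"prediction at X0 = 400: {predict(400):.2f}")
```

Because the code keeps the unrounded slope, predictions can differ in the second decimal place from hand calculations that use $b = 0.059$.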

Prediction

At $X_0 = 400$: $\hat Y_0 = 36.43 + (0.059)(400) = 60.03$

At $X_0 = 650$: $\hat Y_0 = 74.78$

At $X_0 = 0$: $\hat Y_0 = 36.43$

Danger of Extrapolation

[Figure: SARS Trend. Number of patients in hospital against date (28-Feb to 8-Jun), shown with vertical axes from -500 to 2000 and from 0 to 1200.]

Nonlinear Relationships

Association

≠

Causation

Example : Price and Demand for gas

Fitted regression line: Demand = 139.24 - 1.11 Price

Year    Price    Demand
1960    30       134
1961    31       112
1962    37       136
1963    42       109
1964    43       105
1965    45       87
1966    50       56
1967    54       43
1968    54       77
1969    57       35
1970    58       65
1971    58       56
1972    60       58
1973    73       55
1974    88       49
1975    89       39
1976    92       36
1977    97       46
1978    100      40
1979    102      42

Low demand is due to high price??

Simpson’s Paradox

[Figure: the data grouped by year: 1960-1965, 1966-1973, 1974-1979.]

Test For Regression Effect

Test $H_0: \beta = 0$ vs $H_1: \beta \neq 0$

Fitted values: $\hat Y_i = a + bX_i$

Residuals: $r_i = Y_i - \hat Y_i$

Decomposition of Variation

$Y_i - \bar Y = (\hat Y_i - \bar Y) + (Y_i - \hat Y_i)$

Variation of Y = Explained variation + Unexplained variation

The residual is not the random error: $r_i \neq \varepsilon_i = Y_i - \alpha - \beta X_i$

Test For Regression Effect

Decomposition of Variation

$Y_i - \bar Y = (\hat Y_i - \bar Y) + (Y_i - \hat Y_i)$

Breakdown of sum of squares

$\sum_{i=1}^{n} (Y_i - \bar Y)^2 = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2 + \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$

$SS_T = SS_R + SS_E$

Total sum of squares: $SS_T = S_{yy}$

Regression sum of squares:

$SS_R = \sum_{i=1}^{n} (a + bX_i - \bar Y)^2 = \sum_{i=1}^{n} (\bar Y - b\bar X + bX_i - \bar Y)^2 = b^2\sum_{i=1}^{n} (X_i - \bar X)^2 = b^2 S_{xx} = \dfrac{S_{xy}^2}{S_{xx}}$

Error sum of squares: $SS_E = SS_T - SS_R = S_{yy} - b^2 S_{xx} = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}$

Test For Regression Effect

ANOVA table

Source        SS      d.f.     MS               F-ratio
Regression    SS_R    1        SS_R             MS_R/MS_E
Error         SS_E    n - 2    SS_E/(n - 2)
Total         SS_T    n - 1

$H_0: \beta = 0$ vs $H_1: \beta \neq 0$

Test statistic: $F = \dfrac{MS_R}{MS_E}$, where $MS_R = \dfrac{SS_R}{1} = SS_R$ and $MS_E = \dfrac{SS_E}{n - 2}$

Reject $H_0$ if $F_{obs} > F(1, n-2, \alpha)$.

Test For Regression Effect

Example : Wheat yield example

$S_{xx} = 280000$, $S_{yy} = 1150$, $S_{xy} = 16500$

Regression line: $\hat Y = 36.43 + 0.059X$

$SS_R = b^2 S_{xx} = (0.059)^2(280000) = 974.68$

$SS_T = S_{yy} = 1150$

$SS_E = SS_T - SS_R = 1150 - 974.68 = 175.32$

Source        SS        d.f.    MS        F-ratio
Regression    974.68    1       974.68    27.805
Error         175.32    5       35.064
Total         1150      6

$F(1, 5, 0.05) = 6.61 < 27.805$, so $H_0: \beta = 0$ is rejected at $\alpha = 0.05$.
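The regression ANOVA for the wheat yield data can be recomputed from $S_{xx}$, $S_{yy}$ and $S_{xy}$. The sketch below uses $SS_R = S_{xy}^2/S_{xx}$ with the unrounded slope, so its figures differ slightly from the table above, which uses $b$ rounded to 0.059. The critical value 6.61 is the tabulated $F(1, 5, 0.05)$ quoted in the example, and the function name `regression_anova` is hypothetical.

```python
# Summary statistics for the wheat yield example.
S_XX = 280000.0
S_YY = 1150.0
S_XY = 16500.0
N_OBS = 7
F_CRIT = 6.61  # F(1, 5, 0.05) from the table, as quoted in the example


def regression_anova(s_xx, s_yy, s_xy, n):
    """ANOVA quantities for simple linear regression."""
    ss_t = s_yy                    # total sum of squares
    ss_r = s_xy ** 2 / s_xx        # regression sum of squares (= b^2 S_xx)
    ss_e = ss_t - ss_r             # error sum of squares
    ms_r = ss_r / 1                # 1 degree of freedom
    ms_e = ss_e / (n - 2)
    return {"ss_r": ss_r, "ss_e": ss_e, "ss_t": ss_t,
            "ms_e": ms_e, "f": ms_r / ms_e, "r2": ss_r / ss_t}


table = regression_anova(S_XX, S_YY, S_XY, N_OBS)
print(f"SS_R = {table['ss_r']:.2f}, SS_E = {table['ss_e']:.2f}")
print(f"MS_E = {table['ms_e']:.3f}, F = {table['f']:.2f}")
print(f"R^2  = {table['r2']:.3f}")
print("reject H0" if table["f"] > F_CRIT else "do not reject H0")
```

Either way the F-ratio is far above the critical value, so the conclusion of the test does not depend on the rounding of $b$.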

Coefficient of Determination

Strong relationship: high prediction power

$R^2 = \dfrac{SS_R}{SS_T} = \dfrac{\text{Explained variation}}{\text{Total variation}}$

$0 \le R^2 \le 1$

$R^2 = 0$: no linear relationship; $R^2 = 1$: perfect linear relationship

Example: $R^2 = \dfrac{974.68}{1150} = 84.8\%$

C.I. For Regression Parameters

100(1 - α)% C.I. for β: $b \pm t_{n-2, \alpha/2}\sqrt{\dfrac{MS_E}{S_{xx}}}$

100(1 - α)% C.I. for α: $a \pm t_{n-2, \alpha/2}\sqrt{MS_E\left(\dfrac{1}{n} + \dfrac{\bar X^2}{S_{xx}}\right)}$

Large $S_{xx}$: more accurate estimates

Demonstration

C.I. For Regression Parameters

Source        SS        d.f.    MS        F-ratio
Regression    974.68    1       974.68    27.805
Error         175.32    5       35.064
Total         1150      6

95% C.I. for β:

$b \pm t_{5, 0.025}\sqrt{\dfrac{MS_E}{S_{xx}}} = 0.059 \pm (2.57)\sqrt{\dfrac{35.064}{280000}} = 0.059 \pm 0.0288$

$[0.0302, 0.0878]$

95% C.I. for α:

$a \pm t_{5, 0.025}\sqrt{MS_E\left(\dfrac{1}{n} + \dfrac{\bar X^2}{S_{xx}}\right)} = 36.43 \pm (2.57)\sqrt{(35.064)\left(\dfrac{1}{7} + \dfrac{400^2}{280000}\right)} = 36.43 \pm 12.86$

$[23.57, 49.29]$

Prediction

Predict the value of $Y_0$ at a fixed value of $X = X_0$

Point prediction: $\hat Y_0 = a + bX_0$

100(1 - α)% prediction interval (P.I.):

$\hat Y_0 \pm t_{n-2, \alpha/2}\sqrt{MS_E\left(1 + \dfrac{1}{n} + \dfrac{(X_0 - \bar X)^2}{S_{xx}}\right)}$

Prediction

Source        SS        d.f.    MS        F-ratio
Regression    974.68    1       974.68    27.805
Error         175.32    5       35.064
Total         1150      6

At $X_0 = 450$: $\hat Y_0 = 36.43 + (0.059)(450) = 62.98$

90% prediction interval:

$\hat Y_0 \pm t_{5, 0.05}\sqrt{MS_E\left(1 + \dfrac{1}{n} + \dfrac{(X_0 - \bar X)^2}{S_{xx}}\right)} = 62.98 \pm (2.02)\sqrt{(35.064)\left(1 + \dfrac{1}{7} + \dfrac{(450 - 400)^2}{280000}\right)} = 62.98 \pm 12.837$
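The confidence and prediction intervals for the wheat yield example can be evaluated as follows. This is a sketch under stated assumptions: the $t$ values 2.57 and 2.02 are the tabulated values quoted in the examples (not computed), the slope and $MS_E$ are kept unrounded so the half-widths differ slightly from the hand calculations, and the helper names are hypothetical.

```python
import math

# Summary statistics for the wheat yield example.
N_OBS = 7
X_BAR, Y_BAR = 400.0, 60.0
S_XX, S_YY, S_XY = 280000.0, 1150.0, 16500.0

b = S_XY / S_XX                            # slope estimate
a = Y_BAR - b * X_BAR                      # intercept estimate
ms_e = (S_YY - S_XY ** 2 / S_XX) / (N_OBS - 2)

T_975 = 2.57   # tabulated t(5, 0.025), used for the 95% intervals
T_95 = 2.02    # tabulated t(5, 0.05), used for the 90% interval


def slope_ci(t_crit):
    """C.I. for beta."""
    half = t_crit * math.sqrt(ms_e / S_XX)
    return b - half, b + half


def intercept_ci(t_crit):
    """C.I. for alpha."""
    half = t_crit * math.sqrt(ms_e * (1 / N_OBS + X_BAR ** 2 / S_XX))
    return a - half, a + half


def prediction_interval(x0, t_crit):
    """P.I. for a new observation Y_0 at X = x0."""
    y0 = a + b * x0
    half = t_crit * math.sqrt(
        ms_e * (1 + 1 / N_OBS + (x0 - X_BAR) ** 2 / S_XX))
    return y0 - half, y0 + half


print("95% C.I. for beta :", slope_ci(T_975))
print("95% C.I. for alpha:", intercept_ci(T_975))
print("90% P.I. at X0=450:", prediction_interval(450, T_95))
```

The prediction interval is wider than a confidence interval for the mean response because of the extra "1 +" term, which accounts for the random error of the new observation itself.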

Multiple Linear Regression

Example : Fuel consumption data

Data Display

Row  State    POP    TAX  NLIC    INC    ROAD  FUELC     DLIC
  1  ME      1029   9.00   540  3.571   1.976    557  52.4781
  2  NH       771   9.00   441  4.092   1.250    404  57.1984
  3  VT       462   9.00   268  3.865   1.586    259  58.0087
  4  MA      5787   7.50  3060  4.870   2.351   2396  52.8771
  5  RI       968   8.00   527  4.399   0.431    397  54.4422
  6  CN      3082  10.00  1760  5.342   1.333   1408  57.1058
  7  NY     18366   8.00  8278  5.319  11.868   6312  45.0724
  8  NJ      7367   8.00  4074  5.126   2.138   3439  55.3007
  9  PA     11926   8.00  6312  4.447   8.577   5528  52.9264
 10  OH     10783   7.00  5948  4.512   8.507   5375  55.1609
 11  IN      5291   8.00  2804  4.391   5.939   3068  52.9957
 12  IL     11251   7.50  5903  5.126  14.186   5301  52.4664

………..

$FUEL = \beta_0 + \beta_1 TAX + \beta_2 DLIC + \beta_3 INC + \beta_4 ROAD + \varepsilon$

Multiple Linear Regression

Example : Fuel consumption data

Regression Analysis

The regression equation is

FUEL = 37.7 - 3.48 TAX + 1.34 DLIC - 6.65 INC - 0.242 ROAD

Predictor Coef Stdev t-ratio p

Constant 37.68 18.57 2.03 0.049

TAX -3.478 1.298 -2.68 0.010

DLIC 1.3366 0.1924 6.95 0.000

INC -6.651 1.723 -3.86 0.000

ROAD -0.2417 0.3391 -0.71 0.480

s = 6.633 R-sq = 67.8% R-sq(adj) = 64.9%

Analysis of Variance

SOURCE DF SS MS F p

Regression 4 3991.92 997.98 22.68 0.000

Error 43 1892.05 44.00

Total 47 5883.96

Unusual Observations

Obs.   TAX    FUEL     Fit  Stdev.Fit  Residual  St.Resid
 37    5.0  63.963  64.758      3.723    -0.795    -0.14 X
 40    7.0  96.812  73.371      2.102    23.441     3.73R

R denotes an obs. with a large st. resid.