Session 11. Multiple Regression and Correlation Methods

(1)

Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health

Biostatistics I: 2017-18

Lecture 11

Regression and Correlation Methods

(2)

Learning Objectives

1. Describe the Linear Regression Model

2. State the Regression Modeling Steps

3. Explain Ordinary Least Squares

4. Compute Regression Coefficients

5. Understand and check model assumptions

(3)

Purpose of Regression

Estimation

– Estimate the association between outcome and exposure, adjusted for other covariates

Prediction

– Use an estimated model to predict the outcome given covariates in a new dataset

(4)

Adjusting for Confounders

• Do not adjust
  – Cofactor is a collider
  – Cofactor is on the causal path

• May or may not adjust
  – Cofactor has missing values
  – Cofactor is measured with error

[Figure: true value vs. unadjusted estimate]

(5)

Workflow

• Scatterplots

• Bivariate analysis

• Regression
  – Model fitting
    • Cofactors in/out
    • Interactions
  – Test of assumptions
    • Independent errors
    • Linear effects
    • Constant error variance
  – Influence (robustness)
  – Interaction testing

(6)

Correlation vs Regression

Deterministic vs. Statistical Relationship

(7)

Deterministic vs. Statistical Relationship

– Body Mass Index (BMI)

– Income (millions $) vs bank’s assets (billions $)

(8)

BMI and Height

$\text{BMI} = \dfrac{\text{body mass (kg)}}{(\text{height (m)})^2}$

Fix body mass = 80 kg.

Height from 1.5 to 2.0 m.

Deterministic relationship: mass, height → BMI.

[Figure: BMI plotted against height (m), for heights from 1.5 to 2.0 m]
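The deterministic relationship can be reproduced exactly from the formula. A minimal sketch, assuming the fixed body mass of 80 kg and the height range from the slide; the function name `bmi` and the 0.1 m grid of heights are illustrative choices:

```python
def bmi(mass_kg, height_m):
    """Body mass index: mass in kg divided by squared height in m."""
    return mass_kg / height_m ** 2

# Fixed body mass of 80 kg, heights from 1.5 m to 2.0 m (as on the slide).
heights = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
bmi_values = [round(bmi(80, h), 1) for h in heights]

for h, value in zip(heights, bmi_values):
    print(f"height {h:.1f} m -> BMI {value}")
```

Every height maps to exactly one BMI value, with no scatter around the curve, which is what distinguishes this from a statistical relationship.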

(9)

Income vs. Assets

$\text{Income} = a + b \cdot \text{Assets}$

Assets: 3.4 to 49 billion $.

Income changes, even for banks with the same assets!

Statistical relationship.

(10)

Description of Relationships

A deterministic relationship is easy to describe:

– by a formula

It allows for a perfect prediction:

– body mass and height known → exact BMI

Perfect prediction of quantities subject to a statistical relationship is not possible:

– known assets → varying income

But:

(11)

Statistical Relationships: Examples

[Figure: example scatterplots; axis label: Physical Health Score]

(12)

Strength and Direction of a Linear Association

How well does a straight line fit the points on a two-dimensional scatterplot?

Pearson’s correlation coefficient (often simply called a correlation): $r$.

A measure of a linear association: the stronger the association, the larger the absolute value of $r$.

Gives the “direction” of the relationship:

• positive $r$ → positive association: large values of one variable → large values of the other variable

• negative $r$ → negative association

(13)

Pearson’s Correlation Coefficient

(14)

Blood Glucose and Vcf

23 patients with type I diabetes.

Velocity of circumferential shortening of the left ventricle (Vcf) seems to (linearly) increase with blood glucose.

How to describe the relation?

(15)

Blood Glucose and Vcf: Correlation

Subject   Glucose   Vcf
1         15.3      1.76

Mean glucose: 10.37; mean Vcf: 1.32

$(15.3-10.37)^2 + \dots + (9.5-10.37)^2 = 429.7$

$(1.76-1.32)^2 + \dots + (1.70-1.32)^2 = 1.19$

$(15.3-10.37)(1.76-1.32) + \dots + (9.5-10.37)(1.70-1.32) = 9.43$
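The three sums above are all that is needed for Pearson's coefficient. A minimal sketch, assuming the standard definition of $r$ (sum of cross-products divided by the square root of the product of the two sums of squares); the variable names `s_xx`, `s_yy`, `s_xy` are illustrative:

```python
import math

# Sums of squares and cross-products reported on the slide.
s_xx = 429.7   # sum of squared deviations of glucose from its mean
s_yy = 1.19    # sum of squared deviations of Vcf from its mean
s_xy = 9.43    # sum of cross-products of the deviations

# Pearson's r = S_xy / sqrt(S_xx * S_yy)
r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.3f}")
```

With these rounded sums the coefficient comes out at roughly 0.42, a moderate positive linear association.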

(16)

Correlation Coefficient: Special Values

Perfect positive association when $r = +1$.

Perfect negative association when $r = -1$.

No linear association (the association can be non-linear!), or a linear association with a horizontal line, when $r = 0$.

(17)

Correlation Coefficients

[Figure: scatterplot with r = -0.9, n = 100]

(18)

Significance Test for Pearson’s Correlation Coefficient

The computed value of $r$ will usually be different from 0 due to sampling variability.

One may want to test the null hypothesis that the true value of the coefficient is 0.
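A sketch of how such a test can be carried out, assuming the standard test statistic $t = r\sqrt{(n-2)/(1-r^2)}$, which has Student's $t$ distribution with $n-2$ degrees of freedom under the null hypothesis. The sums and $n = 23$ come from the blood glucose example; the critical value 2.08 is the 97.5th percentile of $t_{21}$:

```python
import math

n = 23                                  # number of patients
s_xx, s_yy, s_xy = 429.7, 1.19, 9.43    # sums from the correlation slide

r = s_xy / math.sqrt(s_xx * s_yy)

# Standard t statistic for H0: true correlation = 0 (n - 2 degrees of freedom).
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

t_crit_95 = 2.08                        # 97.5th percentile of t with 21 df
reject_h0 = abs(t_stat) > t_crit_95

print(f"t = {t_stat:.2f}, reject H0 at the 5% level: {reject_h0}")
```

The statistic lands just above the critical value, so the null hypothesis is rejected at the 5% level, but only narrowly.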

(19)

Blood Glucose and Vcf: The Test

Subject   Glucose   Vcf
1         15.3      1.76

We can reject the null hypothesis that the true value of the correlation coefficient is 0.

(20)

Further Remarks on Pearson’s Correlation Coefficient

Reminder: the coefficient describes only a linear association.

It is sensitive to outliers (i.e., observations that lie far from the main bulk of the data).

– Outliers are often due to recording errors, but may be genuine values.

A non-parametric version, Spearman’s rank correlation coefficient, exists.
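To illustrate the sensitivity to outliers, the sketch below compares Pearson's coefficient with Spearman's rank correlation, computed here as Pearson's coefficient applied to the ranks. The data are hypothetical (a monotone series with one extreme last value), and the helper names `pearson`, `ranks`, and `spearman` are illustrative:

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks(values):
    """Ranks 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: monotone increasing, last y value is an outlier.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 30.0]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because the ranks ignore how far the outlier lies from the rest, Spearman's coefficient stays at 1 for this monotone series while Pearson's coefficient drops noticeably.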

(21)

A SIMPLE LINEAR REGRESSION


(22)

Relationship Between Blood Glucose and Vcf

Individual observations of Vcf vary quite a bit, even for very similar levels of blood glucose.

It seems, however, that a higher blood glucose level is associated with a higher average Vcf.

How can we make this description more formal?

(23)

Simple Linear Regression: Blood Glucose & Vcf (1)

Assume that Vcf is normally distributed with $N(\mu, \sigma^2)$.

Assume a linear regression model: the mean (average) value of Vcf, $\mu$, changes linearly with the level of blood glucose:

$\mu = \alpha + \beta \cdot (\text{glucose level})$

(24)

Linear Regression: Terminology (1)

The dependent variable $Y$ and the covariate (independent, explanatory variable) $X$.

In our example, Vcf is $Y$, blood glucose level is $X$.

We assume that $Y$ is normally distributed with $N(\mu_Y, \sigma^2)$.

We further postulate that, for $X = x$,

$\mu_Y = \mu_Y(x) = \alpha + \beta \cdot x$

$\alpha$ and $\beta$ are the coefficients of the model.

$\alpha$ is called the intercept.

(25)

Simple Linear Regression

The straight line describes the increase in the mean of the dependent variable as a function of the covariate level.

Individual observations of the dependent variable vary around the regression line, according to a normal distribution with mean 0 and a constant variance.

(26)

Linear Regression: Terminology (2)

For an individual observation of $Y$ we can write

$Y = \alpha + \beta \cdot x + \varepsilon$,

where $\varepsilon$ is normally distributed with $N(0, \sigma^2)$.

Interpretation: an individual observation of $Y$ can randomly deviate from the mean, which is a linear function of $x$.

$\varepsilon$ is called the residual random error (measurement error).

Note that $\sigma^2$ is assumed constant for all $x$.

(27)

Linear Regression: The Intercept

$\mu_Y(x) = \alpha + \beta \cdot x$

For $x = 0$, $\mu_Y(0) = \alpha + \beta \cdot 0 = \alpha$.

$\alpha$ is the mean value of the dependent variable when $x = 0$.

But blood glucose level = 0 makes little sense...

Use a “centered” covariate:

$\mu_Y(x) = \alpha + \beta \cdot (x - x_0)$

Usually, one takes $x_0$ = sample mean of the observed $x$ values.

For $x = x_0$, $\mu_Y(x_0) = \alpha + \beta \cdot (x_0 - x_0) = \alpha + \beta \cdot 0 = \alpha$.

$\alpha$ is then the mean value when $x = x_0$.

– Easier to interpret.

– Can help in estimating the model.

(28)

Linear Regression: The Slope

$\mu_Y(x) = \alpha + \beta \cdot x$

Consider two values of the covariate: $x$ and $x+1$.

For $x$: $\mu_Y(x) = \alpha + \beta \cdot x$

For $(x+1)$: $\mu_Y(x+1) = \alpha + \beta \cdot (x+1) = \alpha + \beta \cdot x + \beta = \mu_Y(x) + \beta$

$\beta$ is the change in the mean value of the dependent variable corresponding to a unit change in the covariate.

$\beta > 0$: positive relationship ($x$ increases, the mean increases).

$\beta < 0$: negative relationship ($x$ increases, the mean decreases).

(29)

Linear Regression: Estimation

$\mu_Y(x) = \alpha + \beta \cdot x$

The equation describes a theoretical relationship.

In practice, we know neither $\alpha$ nor $\beta$.

We have to estimate them from the observed data.

This is often called fitting a model to data.

The estimated coefficients will be denoted by $a$ and $b$.

How to estimate $\alpha$ and $\beta$?

(30)

Estimation of the Coefficients of a Linear Regression Model

Least squares method: select the line which minimizes the sum of squares of the differences between the observed values and the values predicted by the model (line).

Result:
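A minimal sketch of the least squares solution, assuming the standard closed-form estimates $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b \cdot \bar{x}$. The function name `least_squares` is illustrative; the summary values are those given for the blood glucose and Vcf data:

```python
def least_squares(x, y):
    """Return (a, b) minimizing the sum of squared differences y - (a + b*x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = s_xy / s_xx              # slope
    a = mean_y - b * mean_x      # intercept
    return a, b

# The same formulas applied to the summary values from the slides
# (mean glucose 10.37, mean Vcf 1.32, S_xx = 429.7, S_xy = 9.43).
b = 9.43 / 429.7
a = 1.32 - b * 10.37
print(f"a = {a:.2f}, b = {b:.3f}")
```

Applied to the rounded summary values, this gives a slope of about 0.022 and an intercept of about 1.09, close to the fitted model reported for these data; the small difference in the intercept is presumably due to rounding of the summaries.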

(31)

Linear Regression for Vcf & Blood Glucose

Estimated model:

$\mu_{\text{Vcf}}(x) = 1.10 + 0.022 \cdot x$

Interpretation: if the blood glucose level increases by 1 mmol/l, the mean value of Vcf increases by 0.022 %/s.

– Positive association.

Note that the estimate $b$ of the slope is close to 0. Perhaps it differs from 0 only by chance…

We need a CI for $\beta$.

(32)

Confidence Interval for the Slope

CI for $\beta$: $b \pm t_{n-2,\,1-\alpha/2} \cdot SE(b)$

($t_{n-2,\,1-\alpha/2}$ is a percentile of Student’s $t_{n-2}$ distribution; here $\alpha$ denotes the significance level, not the intercept).

In our case, $n = 23$ and $SE(b) = 0.0105$.

95% CI for $\beta$: $[0.022 \pm 2.08 \cdot 0.0105] = [0.0002, 0.0438]$

99% CI for $\beta$: $[0.022 \pm 2.83 \cdot 0.0105] = [-0.0077, 0.0517]$

• For large $n$ ($\geq 100$), the standard normal distribution can be used.

95% CI does not include 0 → we can reject $H_0$: $\beta = 0$.

(33)

Test of Significance for the Slope

Alternatively, we could conduct a formal test.

$H_0: \beta = 0$ vs. $H_A: \beta \neq 0$

Under the null hypothesis, $T = b / SE(b)$ should have Student’s $t$ distribution with $n-2$ degrees of freedom.

For Vcf data, $T = 0.022/0.0105 = 2.09$.

$p = P(|t_{21}| \geq 2.09) = 0.049$

$p < 0.05$ → we can reject $H_0$ at the 5% significance level.

But not at the 1% level.
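The p-value can be reproduced without statistical tables. The sketch below integrates the Student $t$ density numerically with Simpson's rule, using only the standard library; the function names `t_pdf` and `two_sided_p` and the number of integration steps are illustrative choices, and a statistics package would normally be used instead:

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_obs, df, steps=2000):
    """P(|T| >= |t_obs|), via Simpson's rule on [0, |t_obs|] (steps must be even)."""
    upper = abs(t_obs)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # P(0 <= T <= |t_obs|)
    return 1 - 2 * central

p_value = two_sided_p(2.09, df=21)
print(f"p = {p_value:.3f}")
```

The result is just below 0.05 and well above 0.01, in line with rejecting $H_0$ at the 5% level but not at the 1% level.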

(34)

Prediction of the Mean Value Based on a Linear Regression Model

The prediction would be of interest, e.g., for a group of subjects with a particular value of $x$.

Example:

Estimated model: $\mu_{\text{Vcf}}(x) = 1.10 + 0.022 \cdot x$

Take $x = 10$: $\mu_{\text{Vcf}}(10) = 1.10 + 0.022 \cdot 10 = 1.32$

This point prediction is subject to an error, due to the estimation of the coefficients of the model.

(35)

Prediction Limits for the Mean Value

The prediction limits get wider the further we are from the “center” of the scatterplot.

I.e., the precision of the prediction decreases as we move further away from the mean of $x$.

(36)

Prediction of an Individual Observation

One can also try to make a prediction for an individual observation of the dependent variable.

The prediction would be of interest for, e.g., an individual patient.

The problem here is that the individual observation will randomly deviate from the mean.

A point prediction on its own thus makes little sense.

(37)

Prediction Limits for an Individual Observation

The prediction limits are wider than those for the mean value.

The prediction error now contains two components:

– the error due to the prediction of the mean value;

– the error due to the variability ($\sigma^2$) around the mean value.
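Both kinds of limits can be sketched from the summary values of the blood glucose example. The code below assumes the standard textbook standard errors, $s\sqrt{1/n + (x_0-\bar{x})^2/S_{xx}}$ for the mean and $s\sqrt{1 + 1/n + (x_0-\bar{x})^2/S_{xx}}$ for an individual observation; these formulas and the function names are not taken from the slides:

```python
import math

n = 23
mean_x = 10.37
s_xx, s_yy, s_xy = 429.7, 1.19, 9.43   # sums from the correlation slide
t_crit = 2.08                          # 97.5th percentile of t with 21 df

b = s_xy / s_xx
residual_ss = s_yy - b * s_xy          # residual sum of squares
s = math.sqrt(residual_ss / (n - 2))   # residual standard deviation

def se_mean(x0):
    """Standard error of the predicted mean at covariate value x0."""
    return s * math.sqrt(1 / n + (x0 - mean_x) ** 2 / s_xx)

def se_individual(x0):
    """Standard error for predicting a single new observation at x0."""
    return s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / s_xx)

for x0 in (10, 15, 20):
    print(x0, round(t_crit * se_mean(x0), 3), round(t_crit * se_individual(x0), 3))
```

The half-widths for the mean grow as $x_0$ moves away from the mean glucose level, and the limits for an individual observation are wider at every $x_0$ because of the extra variance term.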

(38)

(39)

ASSUMPTIONS AND HOW TO CHECK THEM

(40)

Linear Regression Model: Assumptions

The model is developed assuming that:

– the observations of $Y$ are collected independently;

– the mean value of the dependent variable $Y$ is a linear function of the covariate $X$;

– for each value of $\alpha + \beta \cdot X$, the dependent variable is normally distributed with constant variance $\sigma^2$.

These are assumptions: they need to be checked.

If not fulfilled, you may need to consider

– using another form of the covariate;

(41)

Checking the Assumptions

Recall that, according to the model,

$Y = \alpha + \beta \cdot x + \varepsilon$,

where $\varepsilon$ is normally distributed with $N(0, \sigma^2)$.

We can estimate $\varepsilon$ by $e = y - (a + b \cdot x)$.

These estimates are called residuals.

$\sum e^2 / (n-2)$ will give an estimate of $\sigma^2$.

If the assumptions are correct, the residuals should approximately have a normal distribution with mean 0.
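A short sketch of computing residuals and the residual variance. The five data points are hypothetical, chosen only to keep the arithmetic easy to follow, and the divisor $n-2$ assumes a simple linear regression with two estimated coefficients:

```python
# Hypothetical data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least squares fit.
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = mean_y - b * mean_x

# Residuals e = y - (a + b * x); they sum to zero for a least squares fit.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Estimate of sigma^2: residual sum of squares over n - 2 degrees of freedom.
sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)

print([round(e, 2) for e in residuals])
print(round(sigma2_hat, 4))
```

Plotting `residuals` against `x` would then give the residual plot used to check linearity and constant variance.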

(42)

Analysis of Residuals (1)



Plot the residuals against the observed

covariate values.



If the assumptions are met, the plot should be

(43)

Analysis of Residuals (2)

The plot of the residuals may reveal non-constant variance (heteroscedasticity).

It can also point towards a non-linear (w.r.t. the covariate values) relationship.

(44)

Blood Glucose & Vcf: Residuals



The plot looks

(45)

Blood Glucose and Vcf

23 patients with type I diabetes.

Vcf seems to (linearly) increase with blood glucose.

How to describe the relation?

– It is not deterministic.


(50)

Checking Normality of Residuals

To this aim, the normal probability plot is used.

Standardized residuals (residual / st. error) are ordered and plotted against the values expected from the standard normal distribution.

The graph should look approximately linear.

One might have doubts in our example…

[Figure: normal probability plot of the residuals]
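The coordinates of such a plot can be computed directly. The sketch below follows the axis labels that survive from the plot for the log-Vcf model (normal cumulative probability of the standardized residual against the empirical probability $i/(N+1)$), which is one common variant of the normal probability plot; the residual values are hypothetical and the function name is illustrative:

```python
from statistics import NormalDist, mean, stdev

def normal_probability_points(residuals):
    """Pairs (empirical i/(N+1), normal CDF of standardized residual), in order."""
    m, s = mean(residuals), stdev(residuals)
    ordered = sorted(residuals)
    n = len(ordered)
    standard_normal = NormalDist()   # mean 0, standard deviation 1
    return [
        (i / (n + 1), standard_normal.cdf((e - m) / s))
        for i, e in enumerate(ordered, start=1)
    ]

# Hypothetical residuals (not the lecture data).
example_residuals = [0.06, -0.13, 0.18, -0.21, 0.10]
points = normal_probability_points(example_residuals)

for empirical, theoretical in points:
    print(f"{empirical:.3f}  {theoretical:.3f}")
```

If the residuals are approximately normal, the points lie close to the diagonal from (0, 0) to (1, 1).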

(51)

Linear Regression for Log-Vcf

Let us use ln(Vcf) as the dependent variable.

The model changes to

$\mu_{\ln(\text{Vcf})} = \alpha + \beta \cdot (\text{glucose level})$

The estimated model is

$\mu_{\ln(\text{Vcf})} = 0.115 + 0.015 \cdot (\text{glucose level})$

(52)

Model for Log-Vcf: Residuals (1)

No major problems in the residual plot.

[Figure: residual plot; vertical axis: Residuals, from -0.4 to 0.4]

(53)

Model for Log-Vcf: Residuals

One might argue that the normal probability plot for the residuals looks better than for untransformed Vcf.

[Figure: normal probability plot; vertical axis: Normal F[(lresid-m)/s]; horizontal axis: Empirical P[i] = i/(N+1); both from 0.00 to 1.00]

(54)

Interpretation of the Model for Log-Vcf

The model implies that

$\mu_{\ln(\text{Vcf})} = 0.115 + 0.015 \cdot (\text{glucose level})$

It follows that, if blood glucose increases by 1 unit, then the mean value of ln(Vcf) increases by 0.015.

Upon taking $\mu_{\text{Vcf}} \approx \exp(\mu_{\ln(\text{Vcf})})$,

$\mu_{\text{Vcf}} = e^{0.115} \cdot e^{0.015 \cdot (\text{glucose level})} = e^{0.115} \cdot (1.015)^{(\text{glucose level})}$

We could conclude that the mean value of Vcf is multiplied by about 1.015, i.e., increases by about 1.5%, for each unit increase in blood glucose.
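The back-transformation is easy to verify numerically. A minimal sketch using the estimated coefficients 0.115 and 0.015 from the fitted log model; the function name `approx_mean_vcf` is illustrative, and the result relies on the same approximation as the slide (exponentiating the mean of ln(Vcf)):

```python
import math

alpha_hat, beta_hat = 0.115, 0.015    # estimated coefficients of the log model

def approx_mean_vcf(glucose):
    """Approximate mean Vcf obtained by exponentiating the mean of ln(Vcf)."""
    return math.exp(alpha_hat + beta_hat * glucose)

# Multiplicative effect of a one-unit increase in blood glucose.
factor = math.exp(beta_hat)
percent_increase = (factor - 1) * 100

# The ratio between two glucose levels one unit apart equals that factor.
ratio = approx_mean_vcf(11) / approx_mean_vcf(10)

print(f"factor = {factor:.4f}, increase = {percent_increase:.2f}%")
```

A coefficient on the log scale therefore acts multiplicatively on the original scale, here as an increase of about 1.5% per unit of glucose regardless of the starting level.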

(55)

Choice of the Transformation

Consider power transformations $x^s$ or $y^s$ ($s = \dots, -3, -2, -1, -\tfrac{1}{2}, 0\ (= \ln), \tfrac{1}{2}, \dots$).

The circle of powers.

Choose the quadrant which most closely resembles the pattern of the data.

Increase or decrease the power of $x$ or $y$ (relative to 1) according to the indications.

• Example: for Quadrant II, take $s < 1$ for $x$ or $s > 1$ for $y$.

(56)

Choice of the Transformation: Example

Data resemble the pattern of Quadrant III.

We might want to use $s < 1$ for $x$ or $y$.

Let’s use $\ln(x)$ (i.e., $s = 0$).

The transformed scatterplot looks more linear.

(57)

COEFFICIENT OF DETERMINATION $R^2$

(58)

Different Sources of Variation

[Figure: scatterplot of Y against X showing, for a point at $X_i$, the total variation, the “explained” variation, and the residual variation]

Add squares of a particular component for all points.

(59)

Coefficient of Determination $R^2$

“Total” variation (sum of squares) = “Explained” + “Residual”

The larger the “explained” variation, the better the model “explains” the data.

$R^2$ = (“explained” sum of squares) / (“total” sum of squares)

Measures the proportion of variation that is explained by the independent variable $X$ in the regression model.

The closer to 1, the better the model “explains” the data.
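As a worked illustration, the decomposition can be computed for the linear model of untransformed Vcf on blood glucose, using the sums reported for the correlation example. The sketch assumes the standard identity that the “explained” sum of squares in simple linear regression equals $b \cdot S_{xy}$; variable names are illustrative:

```python
import math

s_xx, s_yy, s_xy = 429.7, 1.19, 9.43   # sums from the correlation slide

b = s_xy / s_xx                         # least squares slope

total_ss = s_yy                         # "total" sum of squares
explained_ss = b * s_xy                 # "explained" sum of squares
residual_ss = total_ss - explained_ss   # "residual" sum of squares

r_squared = explained_ss / total_ss

# In simple linear regression, R^2 equals the squared Pearson correlation.
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```

Only about 17% of the variation in Vcf is explained by glucose in this model, which agrees with the squared correlation coefficient.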

(60)



Coefficient of Determination $R^2$

“Total” and “explained” sums of squares are considerably different.

– Low $R^2$.

“Total” and “explained” sums of squares are not very different.

(61)

Coefficient of Determination ($R^2$) and Correlation ($r$)

$R^2 = r^2$

$R^2 = 1$ → $r = \pm 1$

[Figure: two scatterplots of Y against X]

(62)

Coefficient of Determination ($R^2$) and Correlation ($r$)

$R^2 = 0.8$, $r = \pm 0.9$

[Figure: scatterplot of Y against X]

(63)

Coefficient of Determination ($R^2$) and Correlation ($r$)

$R^2 = 0$, $r = 0$

[Figure: scatterplot of Y against X]

(64)

Coefficient of Determination $R^2$

$R^2$ high → model fits well?

WRONG!

– $R^2$ can be high for badly-fitting models.

– $R^2$ can be similar for good and bad models.

Fit of the model should be assessed from residuals.

Model fits well → $R^2$ meaningful!

(65)

Linear Regression for Log-Vcf

The estimated model:

$\mu_{\ln(\text{Vcf})} = 0.115 + 0.015 \cdot (\text{glucose level})$

Acceptable fit.

$R^2 = 0.16$

– Only 16% of the total variation in ln(Vcf) is explained by glucose level.

[Figure: scatterplot with fitted line; horizontal axis: glucose, from 5 to 20; vertical axis from 0 to 0.8]

(66)

Simple Linear Regression Model: Summary

Plot the data and check the relationship.

– if not linear, transform

Fit the model & check the assumptions (residual plots).

– linear relationship; homoscedasticity; normality

If not fulfilled, consider
• a transformation of the dependent variable;
• another form of the covariate;
• extra covariates (multiple regression).

(67)

Multiple Regression Models

More Than One Covariate
Adjusted $R^2$
Dummy Variables, Interactions
Polynomial Regression

(68)

Simple Linear Regression

For an individual observation of $Y$ we assume that

$Y = \alpha + \beta \cdot x + \varepsilon$,

where $\varepsilon$ is normally distributed with $N(0, \sigma^2)$.

Thus, we assume that the mean of $Y$ is a linear function of $X$:

$\mu_Y(x) = \alpha + \beta \cdot x$

(69)

Multiple Linear Regression

Simple linear regression involves one covariate.

The model can be extended to the case of more than one covariate.

It is then called a multiple linear regression model.

(70)

Multiple Linear Regression

Assume we have two covariates: $X$ and $Z$.

Let the dependent variable be normally distributed with $N(\mu(X, Z), \sigma^2)$, where the mean is given by

$\mu(X, Z) = \alpha + \beta_X \cdot X + \beta_Z \cdot Z$

Consider $z$ and $z+1$. We get

$\mu(X, z+1) = \alpha + \beta_X \cdot X + \beta_Z \cdot (z+1) = \alpha + \beta_X \cdot X + \beta_Z \cdot z + \beta_Z = \mu(X, z) + \beta_Z$

Interpretation: the mean of $Y$ changes linearly with $X$ & $Z$. For a fixed value of $X$, a unit increase in $Z$ changes the mean by $\beta_Z$.

(71)

Multiple Linear Regression: Example (1)

Length vs. gestational age for a sample of 100 low birth weight infants born in Boston, Massachusetts.

A linear increase of mean length with the age may be postulated.

Not much influence of mother’s age.

Note a possible outlier.

(72)

Multiple Linear Regression: Example (2)

The model using gestational age and mother’s age as covariates:

$\mu_{\text{length}} = 9.0909 + 0.9361 \cdot (\text{gest. age}) + 0.0247 \cdot (\text{mother's age})$

(SE = 0.1093 for gest. age; SE = 0.0463 for mother’s age)

For two mothers of the same age whose children differ in gestational age by 1 week, the child with the higher gestational age is, on average, 0.9361 cm longer.

Inference for the coefficients:

Gestational age, 95% CI: $[0.936 \pm 1.985 \cdot 0.109] = [0.720, 1.152]$
• $T = 0.9361/0.1093 \approx 8.56$, $p < 0.0005$
• Statistically significant (at the 5% significance level) effect of the age.

Mother’s age, 95% CI: $[0.025 \pm 1.985 \cdot 0.046] = [-0.066, 0.116]$

(73)

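Adjusted Coefficient of Determination $R^2$

The more covariates, the better the model will “explain” the data.

– $R^2$ will increase

Adjust it for the no. of covariates.

For instance, $R^2_{adj} = 1 - (1 - R^2) \cdot \dfrac{n-1}{n-q}$, where $q$ is the number of estimated coefficients (including the intercept).

Note that $(n-1)/(n-q) > 1$ when $q > 1$, so the adjusted $R^2$ is smaller than $R^2$.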
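A sketch of one common adjustment, assuming the form $R^2_{adj} = 1 - (1-R^2)(n-1)/(n-q)$, which uses the ratio $(n-1)/(n-q)$ mentioned on the slide with $q$ taken as the number of estimated coefficients including the intercept. The values $n = 100$ and $R^2 = 0.4575$ come from the birth length example; $q = 3$ (intercept plus two covariates) is an assumption:

```python
def adjusted_r2(r2, n, q):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - q)."""
    return 1 - (1 - r2) * (n - 1) / (n - q)

# Model with gestational age and mother's age; q = 3 is assumed here.
r2 = 0.4575
adj = adjusted_r2(r2, n=100, q=3)
print(f"R^2 = {r2}, adjusted R^2 = {adj:.4f}")
```

With 100 observations and only three coefficients the penalty is small, so the adjusted value is only slightly below the unadjusted one.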

(74)

Multiple Linear Regression: Example (3)

The model using gestational age and mother’s age as covariates:

$\mu_{\text{length}} = 9.0909 + 0.9361 \cdot (\text{gest. age}) + 0.0247 \cdot (\text{mother's age})$

(SE = 0.1093 for gest. age; SE = 0.0463 for mother’s age)

$R^2 = 0.4575$

(75)

STATA OUTPUT


(76)

(77)

Categorical Covariates

Often, a categorical variable is available as a covariate.

– Ordinal: non-, light, medium, or heavy smoker.

– Nominal: treatment group A, B or C.

To use it in a model, we would need to assign numerical scores to its levels.

– Problematic for a nominal variable.

– Should we assign equidistant scores (e.g., non-smoker=1; light=2; medium=3; heavy smoker=4)?

In such a case, dummy variables can be used.

(78)

Dummy Variables

For each of the $G$ levels of the categorical covariate, define a binary indicator equal to 1 if the level is observed, and 0 otherwise.

We can then form the model with any $G-1$ dummy variables, e.g.,

$\mu_Y(x) = \alpha + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \dots + \beta_G \cdot x_G$

Note: $x_1$ is not in the model; the first category becomes the reference.

$\beta_i$ will be the change of the mean of the dependent variable due to being in category $i$ rather than in the reference category.

(79)

Dummy Variables: Example

For “smoking”, we can construct the following dummy variables (exemplary patient data):

Original x     x_1   x_2   x_3   x_4
non-smoker     1     0     0     0
light          0     1     0     0
medium         0     0     1     0
heavy          0     0     0     1

We can then form the model

$\mu_Y(x) = \alpha + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \beta_4 \cdot x_4$

Note that the term $\beta_1 \cdot x_1$ is left out (i.e., set to 0) in all models.

non-smokers: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0 = \alpha$

light: $\mu_Y(x) = \alpha + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 0 = \alpha + \beta_2$
• $\beta_2$ is the change in $\mu_Y$ for light smokers, as compared to non-smokers.

medium: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 0 = \alpha + \beta_3$
• $\beta_3$ is the change in $\mu_Y$ for medium smokers, as compared to non-smokers.

heavy: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 1 = \alpha + \beta_4$
• $\beta_4$ is the change in $\mu_Y$ for heavy smokers, as compared to non-smokers.

$\beta_3 - \beta_2$ is the change in $\mu_Y$ for medium smokers, compared to light smokers.
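The coding scheme can be sketched in a few lines. The smoking levels follow the example, with the first level as the reference; the coefficient values and the function names `dummy_code` and `mean_outcome` are hypothetical and serve only to show how the group means follow from the model:

```python
LEVELS = ["non-smoker", "light", "medium", "heavy"]   # first level = reference

def dummy_code(value, levels=LEVELS):
    """Return the G-1 indicators (x_2, ..., x_G); the reference gets all zeros."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]

def mean_outcome(value, alpha, betas):
    """Model mean: alpha + beta_2*x_2 + ... + beta_G*x_G."""
    return alpha + sum(b * x for b, x in zip(betas, dummy_code(value)))

# Hypothetical coefficients, for illustration only.
alpha = 10.0
betas = [1.0, 2.5, 4.0]     # beta_2 (light), beta_3 (medium), beta_4 (heavy)

for level in LEVELS:
    print(level, dummy_code(level), mean_outcome(level, alpha, betas))
```

The difference between the medium and light smokers' means equals $\beta_3 - \beta_2$, so contrasts between non-reference groups can be read off without refitting the model.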

(80)

Multiple Linear Regression: Example (4)

Adjusting the model for mother’s diagnosis of toxemia.

– Using a dummy variable equal to 1 for toxemia, and 0 for non-toxemia.

The fitted model:

$\mu_{\text{length}} = 6.2843 + 1.0699 \cdot (\text{gest. age}) - 1.7774 \cdot (\text{toxemia})$

(SE = 0.1121 for gest. age; SE = 0.6940 for toxemia)

Inference for the coefficients:

Gestational age, 95% CI: $[1.070 \pm 1.985 \cdot 0.112] = [0.848, 1.292]$
• $T = 1.0699/0.1121 = 9.544$, $p < 0.0005$
• Statistically significant (at the 5% significance level) effect of the age.

Toxemia, 95% CI: $[-1.777 \pm 1.985 \cdot 0.694] = [-3.155, -0.399]$
• $T = -1.7774/0.6940 = -2.561$, $p = 0.012$

(81)

Multiple Linear Regression: Example (5)

Mother without toxemia: $\mu_{\text{length}} = 6.2843 + 1.0699 \cdot (\text{age})$

Mother with toxemia: $\mu_{\text{length}} = 6.2843 + 1.0699 \cdot (\text{age}) - 1.7774$

The intercept is changed; the slope for gestational age remains the same.

– The fitted regression lines are parallel.

(82)

Multiple Linear Regression With Interaction (1)

Interaction means that the effect of $X$ on the mean value of $Y$ depends on the level of $Z$.

A model allowing for an interaction:

$\mu(X, Z) = \alpha + \beta_X \cdot X + \beta_Z \cdot Z + \beta_{XZ} \cdot X \cdot Z$

(83)

Multiple Linear Regression With Interaction (2)

For $X = x$ we get

$\mu(x, Z) = \alpha + \beta_X \cdot x + \beta_Z \cdot Z + \beta_{XZ} \cdot x \cdot Z$

For $X = x+1$ we have

$\mu(x+1, Z) = \alpha + \beta_X \cdot (x+1) + \beta_Z \cdot Z + \beta_{XZ} \cdot (x+1) \cdot Z$

$= \alpha + \beta_X \cdot x + \beta_X + \beta_Z \cdot Z + \beta_{XZ} \cdot x \cdot Z + \beta_{XZ} \cdot Z$

$= \mu(x, Z) + \beta_X + \beta_{XZ} \cdot Z$

Conclusion: the change due to a unit increase in $X$ depends on $Z$.

(84)

Multiple Linear Regression: Example (6)

Model with the interaction between toxemia and gest. age.

$\mu_{\text{length}} = 6.2843 + 1.0584 \cdot (\text{age}) - 3.4771 \cdot (\text{tox}) + 0.0559 \cdot (\text{age}) \cdot (\text{tox})$

(SE = 0.1263 for age; SE = 8.5198 for tox; SE = 0.2795 for the interaction)

Mother without toxemia: $\mu_{\text{length}} = 6.2843 + 1.0584 \cdot (\text{age})$

Mother with toxemia: $\mu_{\text{length}} = (6.2843 - 3.4771) + (1.0584 + 0.0559) \cdot (\text{age})$

Interaction effect, 95% CI: $[0.056 \pm 1.985 \cdot 0.279] = [-0.498, 0.610]$

$T = 0.0559/0.2795 = 0.200$, $p = 0.842$

Not significant at the 0.05 level of significance.
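The group-specific lines follow mechanically from the fitted coefficients. A minimal sketch using the estimates quoted above; the function names `mean_length` and `group_line` are illustrative:

```python
# Fitted coefficients of the interaction model, as quoted in the example.
INTERCEPT = 6.2843
B_AGE = 1.0584
B_TOX = -3.4771
B_AGE_TOX = 0.0559

def mean_length(age, tox):
    """Model mean of length for gestational age `age` and toxemia indicator `tox`."""
    return INTERCEPT + B_AGE * age + B_TOX * tox + B_AGE_TOX * age * tox

def group_line(tox):
    """(intercept, slope) of the regression line within one toxemia group."""
    return INTERCEPT + B_TOX * tox, B_AGE + B_AGE_TOX * tox

for tox in (0, 1):
    intercept, slope = group_line(tox)
    print(f"tox={tox}: intercept {intercept:.4f}, slope {slope:.4f}")
```

With an interaction term both the intercept and the slope differ between the groups, so the two fitted lines are no longer parallel, although here the difference in slopes is small and not statistically significant.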

(85)

Multiple Linear Regression: Assumptions

The assumptions are similar to those for simple linear regression:

– the mean is a linear function of the covariate(s);

– residual errors are normally distributed with constant variance.

The assumptions should be checked.

– Residual analysis as for simple linear regression.

(86)

Multiple Linear Regression: Example (7)

Length of low birth weight infants.

The residual plot for the model with gestational age and toxemia.

– Residual against fitted value (i.e., the estimated linear combination of covariates).

Homoscedasticity looks plausible.

(87)

Polynomial Regression

Linear regression means that we consider a linear combination of covariates.

The covariates themselves can be non-linear functions.

For instance, consider the quadratic model

$\mu_Y(X) = \alpha + \beta_1 \cdot X + \beta_2 \cdot X^2$

Interpretation: the mean of $Y$ changes as a quadratic function of $X$.

(88)

Quadratic Regression: Average Birth Weight

Non-linear dependence of the average birth weight on gestational age.

Quadratic function gives a very good result.



The fitted model

weight = -22.693

+1.2122

· age

(89)

Generalized Linear Models

Linear regression is a member of the family of generalized linear models.

In these models, we assume that the dependent variable $Y$ has a (not necessarily normal) distribution with mean $\mu_Y$.

It is further postulated that, for a particular function $g$,

$g(\mu_Y) = \alpha + \beta_1 \cdot X_1 + \dots + \beta_k \cdot X_k$

In linear regression, we specify that $Y$ is normally distributed, and $g(\mu) = \mu$.

Other examples are logistic regression, Poisson regression, etc.

(90)

Ockham’s Razor

William of Ockham, 14th century:
''Pluralitas non est ponenda sine necessitate''
''Entities should not be multiplied unnecessarily''

Implies the use of parsimonious models.

– Containing as few parameters as possible.

(91)

''All models are wrong; some models are useful''

George Box (apparently)

(92)

Alternative Summary: Statistics for Various Types of Outcome Data

For each outcome variable: are the observations independent or correlated?

• Continuous outcome
  – Correlated observations: mixed models / GEE modeling
  – Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship

• Binary or categorical outcome (e.g. fracture yes/no)
  – Independent observations: difference in proportions, relative risks, chi-square test, logistic regression
  – Correlated observations: McNemar’s test, conditional logistic regression, GEE modeling
  – Assumptions: chi-square test assumes sufficient numbers in each cell (≥ 5)

• Time-to-event outcome (e.g. time to fracture)
  – Independent observations: Kaplan-Meier statistics, Cox regression
  – Correlated observations: n/a
  – Assumptions: Cox regression