Session 11. Multiple Regression and Correlation Methods


Academic year: 2018


(1)

sawilopo@yahoo.com Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health

Biostatistics I: 2017-18

Lecture 11

Regression and Correlation Methods

(2)

Learning Objectives

1. Describe the Linear Regression Model
2. State the Regression Modeling Steps
3. Explain Ordinary Least Squares
4. Compute Regression Coefficients
5. Understand and check model assumptions

(3)

Purpose of regression

Estimation: estimate the association between outcome and exposure, adjusted for other covariates.
Prediction: use an estimated model to predict the outcome given covariates in a new dataset.

(4)

Adjusting for confounders

Do not adjust: the cofactor is a collider; the cofactor is in the causal path.
May or may not adjust: the cofactor has missing values; the cofactor is measured with error.

[Figure: true value compared with the unadjusted estimate]

(5)

Workflow

Scatterplots
Bivariate analysis
Regression
Model fitting: cofactors in/out; interactions
Test of assumptions: independent errors; linear effects; constant error variance; influence (robustness)
Interaction testing

(6)

Correlation vs Regression

Deterministic vs. Statistical Relationship

(7)

Deterministic vs. Statistical Relationship

Body Mass Index (BMI)
Income (millions $) vs bank’s assets (billions $)

(8)

BMI and Height

$BMI = (\text{body mass in kg}) / (\text{height in m})^2$
Fix body mass = 80 kg.
Height from 1.5 to 2.0 m.
Deterministic relationship: mass, height → BMI

[Figure: BMI plotted against height (m), for heights from 1.5 to 2 m]
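A deterministic relationship can be evaluated exactly. The following is a minimal sketch of the slide's setting (body mass fixed at 80 kg, height from 1.5 to 2.0 m); the function name `bmi` and the 0.1 m grid of heights are assumptions made for illustration.

```python
def bmi(mass_kg, height_m):
    """Body mass index: mass in kg divided by squared height in m."""
    return mass_kg / height_m ** 2

# Heights from 1.5 m to 2.0 m in 0.1 m steps (grid chosen for illustration).
heights = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
table = [(h, round(bmi(80.0, h), 1)) for h in heights]

for h, value in table:
    print(f"height {h:.1f} m -> BMI {value}")
```

Every height maps to exactly one BMI value; there is no scatter around the curve.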

(9)

Income vs. Assets

Income = a + b · Assets
Assets: 3.4 - 49 billion $
Income changes, even for banks with the same assets!
Statistical relationship

(10)

Description of Relationships

A deterministic relationship is easy to describe: by a formula.
It allows for a perfect prediction: body mass and height known → exact BMI.
Perfect prediction of quantities subject to a statistical relationship is not possible: known assets → varying income.
But:

(11)

Statistical Relationships: Examples

[Figure: scatterplot; axis label: Physical Health Score]

(12)

Strength and Direction of a Linear Association

How well does a straight line fit the points on a two-dimensional scatterplot?
Pearson’s correlation coefficient (often simply called a correlation): r.
A measure of linear association: the stronger the association, the larger the absolute value of r.
Gives the “direction” of the relationship:
positive r → positive association: large values of one variable go with large values of the other variable;
negative r → negative association.

(13)

Pearson’s Correlation Coefficient

(14)

Blood Glucose and Vcf

23 patients with type I diabetes.
Velocity of circumferential shortening of the left ventricle (Vcf) seems to (linearly) increase with blood glucose.
How to describe the relation?

(15)

Blood Glucose and Vcf: Correlation

Subject   Glucose   Vcf
1         15.3      1.76

mean glucose: 10.37; mean Vcf: 1.32
$(15.3-10.37)^2 + \dots + (9.5-10.37)^2 = 429.7$
$(1.76-1.32)^2 + \dots + (1.70-1.32)^2 = 1.19$
$(15.3-10.37)(1.76-1.32) + \dots + (9.5-10.37)(1.70-1.32) = 9.43$
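The three sums above are what Pearson's r needs: the cross-product sum divided by the square root of the product of the two sums of squares. A small sketch, using only the numbers quoted on the slide (the variable names `sxx`, `syy`, `sxy` are illustrative):

```python
import math

# Sums quoted on the slide (n = 23 patients).
sxx = 429.7  # sum of squared deviations of glucose from its mean
syy = 1.19   # sum of squared deviations of Vcf from its mean
sxy = 9.43   # sum of cross-products of the two deviations

# Pearson's correlation coefficient.
r = sxy / math.sqrt(sxx * syy)
print(f"r = {r:.3f}")
```

The result, about 0.42, points to a moderate positive linear association.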

(16)

Correlation Coefficient: Special Values

Perfect positive association when r = +1.
Perfect negative association when r = -1.
No linear association (the association can still be non-linear!), or a linear association with a horizontal line, when r = 0.

(17)

Correlation Coefficients

[Figure: scatterplot with r = -0.9, n = 100]

(18)

Significance Test for Pearson’s Correlation Coefficient

The computed value of r will usually be different from 0 due to sampling variability.
One may want to test the null hypothesis that the true value of the coefficient is 0.

(19)

Blood Glucose and Vcf: The Test

Subject   Glucose   Vcf
1         15.3      1.76

We can reject the null hypothesis that the true value of the correlation coefficient is 0.
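The test statistic itself did not survive on the slide. The sketch below assumes the usual statistic for this test, T = r · sqrt((n - 2) / (1 - r²)), compared with Student's t distribution with n - 2 degrees of freedom; the critical value 2.08 is the one the lecture quotes for 21 degrees of freedom.

```python
import math

n = 23
# Correlation computed from the sums quoted earlier in the lecture.
r = 9.43 / math.sqrt(429.7 * 1.19)

# Assumed (standard) test statistic for H0: true correlation = 0.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

t_crit = 2.08  # 97.5th percentile of t with 21 df, as quoted in the lecture
reject = abs(t_stat) > t_crit

print(f"T = {t_stat:.2f}, reject H0 at the 5% level: {reject}")
```

The statistic lies just above the critical value, so the rejection is a narrow one.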

(20)

Further Remarks on Pearson’s Correlation Coefficient

Reminder: the coefficient describes only a linear association.
It is sensitive to outliers (i.e., observations which lie far from the main bulk of the data).
Outliers are often due to recording errors, but may be genuine values.
A non-parametric version, Spearman’s rank correlation coefficient, exists.

(21)

A SIMPLE LINEAR REGRESSION

(22)

Relationship Between Blood Glucose and Vcf

Individual observations on Vcf vary quite a bit even for very similar levels of blood glucose.
It seems, however, that a higher blood glucose level leads to a higher average Vcf.
How can we make this description more formal?

(23)

Simple Linear Regression: Blood Glucose & Vcf (1)

Assume that Vcf is normally distributed with $N(\mu, \sigma^2)$.
Assume a linear regression model: the mean (average) value of Vcf changes linearly with the level of blood glucose:
$\mu = \alpha + \beta \cdot (\text{glucose level})$

(24)

Linear Regression: Terminology (1)

The dependent variable Y and the covariate (independent, explanatory variable) X.
In our example, Vcf is Y, blood glucose level is X.
We assume that Y is normally distributed with $N(\mu_Y, \sigma^2)$.
We further postulate that, for X = x,
$\mu_Y = \mu_Y(x) = \alpha + \beta \cdot x$
α and β are the coefficients of the model.
α is called the intercept.

(25)

Simple Linear Regression

The straight line describes the increase in the mean of the dependent variable as a function of the covariate level.
Individual observations for the dependent variable vary around the regression line, according to a normal distribution with mean 0 and a constant variance.

(26)

Linear Regression: Terminology (2)

For an individual observation of Y we can write that
$Y = \alpha + \beta \cdot x + \varepsilon$,
where ε is normally distributed with $N(0, \sigma^2)$.
Interpretation: an individual observation of Y can randomly deviate from the mean, which is a linear function of x.
ε is called the residual random error (measurement error).
Note that $\sigma^2$ is assumed constant for all x.

(27)

Linear Regression: The Intercept

$\mu_Y(x) = \alpha + \beta \cdot x$
For x = 0, $\mu_Y(0) = \alpha + \beta \cdot 0 = \alpha$
α is the mean value of the dependent variable when x = 0.
But blood glucose level = 0 makes little sense...
Use a “centered” covariate: $\mu_Y(x) = \alpha + \beta \cdot (x - x_0)$
Usually, one takes $x_0$ = sample mean of the observed x values.
For $x = x_0$, $\mu_Y(x_0) = \alpha + \beta \cdot (x_0 - x_0) = \alpha + \beta \cdot 0 = \alpha$
α is then the mean value when $x = x_0$.
Easier to interpret.
Can help in estimating the model.
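To see what centering does in practice, here is a small sketch that fits a least-squares line to raw and to centered covariate values. The five data points and the helper `fit_line` are hypothetical, made up purely for illustration; they are not the lecture's data.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single covariate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical glucose / Vcf values, for illustration only.
glucose = [5.0, 8.0, 10.0, 12.0, 15.0]
vcf = [1.1, 1.3, 1.2, 1.5, 1.4]

x0 = sum(glucose) / len(glucose)        # sample mean of x
centered = [g - x0 for g in glucose]    # centered covariate

a_raw, b_raw = fit_line(glucose, vcf)
a_cen, b_cen = fit_line(centered, vcf)

print(f"raw:      intercept {a_raw:.3f}, slope {b_raw:.4f}")
print(f"centered: intercept {a_cen:.3f}, slope {b_cen:.4f}")
```

The slope is unchanged by centering, while the intercept becomes the mean of the dependent variable at the average covariate value.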

(28)

Linear Regression: The Slope

$\mu_Y(x) = \alpha + \beta \cdot x$
Consider two values of the covariate: x and x+1.
For x: $\mu_Y(x) = \alpha + \beta \cdot x$
For (x+1): $\mu_Y(x+1) = \alpha + \beta \cdot (x+1) = \alpha + \beta \cdot x + \beta = \mu_Y(x) + \beta$
β is the change in the mean value of the dependent variable corresponding to a unit change in the covariate.
β > 0: positive relationship (x increases, the mean increases).
β < 0: negative relationship (x increases, the mean decreases).

(29)

Linear Regression: Estimation

$\mu_Y(x) = \alpha + \beta \cdot x$
The equation describes a theoretical relationship.
In practice, we know neither α nor β.
We have to estimate them from the observed data.
This is often called fitting a model to data.
The estimated coefficients will be denoted by a and b.
How to estimate α and β?

(30)

Estimation of the Coefficients of a Linear Regression Model

Least squares method: select the line which minimizes the sum of squares of the differences between the observed values and the values predicted by the model (line).
Result:
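For a single covariate, the least-squares solution has a well-known closed form: the slope is the cross-product sum divided by the sum of squares of x, and the intercept makes the line pass through the point of means. The sketch below applies it to the summary numbers quoted earlier in the lecture; because those inputs are rounded, the intercept comes out near 1.09 rather than exactly the 1.10 reported for the fitted model.

```python
# Summary statistics quoted earlier in the lecture (rounded).
x_bar = 10.37  # mean glucose
y_bar = 1.32   # mean Vcf
sxx = 429.7    # sum of squared deviations of glucose
sxy = 9.43     # sum of cross-products of deviations

b = sxy / sxx           # least-squares slope
a = y_bar - b * x_bar   # least-squares intercept

print(f"a = {a:.2f}, b = {b:.3f}")
```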

(31)

Linear Regression for Vcf & Blood Glucose

Estimated model: $\mu_{Vcf}(x) = 1.10 + 0.022 \cdot x$
Interpretation: if the blood glucose level increases by 1 mmol/l, the mean value of Vcf increases by 0.022 %/s.
Positive association.
Note that the estimate b of the slope is close to 0.
Perhaps it differs from 0 only by chance…
We need a CI for β.

(32)

Confidence Interval for the Slope

CI for β: $b \pm t_{n-2,\,1-\alpha/2} \cdot SE(b)$
($t_{n-2,\,1-\alpha/2}$ is a percentile from Student’s $t_{n-2}$ distribution; α here is the significance level, not the intercept).
In our case, n = 23 and SE(b) = 0.0105
95% CI for β: [0.022 ± 2.08 · 0.0105] = [0.0002, 0.0438]
99% CI for β: [0.022 ± 2.83 · 0.0105] = [-0.0077, 0.0517]
For large n (≥ 100), the standard normal distribution can be used.
95% CI does not include 0 → we can reject $H_0$: β = 0.
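The two intervals can be checked with a few lines of arithmetic. This is a minimal sketch using the estimate, standard error, and t percentiles quoted on the slide; the helper name `slope_ci` is illustrative.

```python
def slope_ci(b, se, t_quantile):
    """Confidence interval b +/- t * SE(b)."""
    half_width = t_quantile * se
    return (b - half_width, b + half_width)

b, se = 0.022, 0.0105

ci95 = slope_ci(b, se, 2.08)  # t percentile for 21 df, 95% CI
ci99 = slope_ci(b, se, 2.83)  # t percentile for 21 df, 99% CI

print(f"95% CI: [{ci95[0]:.4f}, {ci95[1]:.4f}]")
print(f"99% CI: [{ci99[0]:.4f}, {ci99[1]:.4f}]")
```

The lower limit of the 95% interval is only barely above 0, while the 99% interval includes 0.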

(33)

Test of Significance for the Slope

Alternatively, we could conduct a formal test.
$H_0$: β = 0
$H_A$: β ≠ 0
Under the null hypothesis, T = b / SE(b) should have Student’s t distribution with n-2 degrees of freedom.
For Vcf data, T = 0.022/0.0105 = 2.09.
$p = P(|t_{21}| \ge 2.09) = 0.049$
p < 0.05 → we can reject $H_0$ at the 5% significance level.
But not at the 1% level.
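The p-value can be approximated without statistical tables by integrating the Student's t density numerically. The sketch below uses Simpson's rule with the standard library only; the function names and the number of integration steps are choices made for illustration, and a statistics package would normally be used instead.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value, using Simpson's rule on [0, |t_stat|] (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(|T| <= upper), by symmetry
    return 1 - central_mass

t_stat = 0.022 / 0.0105
p = two_sided_p(t_stat, df=21)
print(f"T = {t_stat:.2f}, two-sided p = {p:.3f}")
```

The p-value falls just below 0.05, which agrees with the 95% confidence interval narrowly excluding 0.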

(34)

Prediction of the Mean Value Based on a Linear Regression Model

The prediction would be of interest, e.g., for a group of subjects with a particular value of x.
Example:
Estimated model: $\mu_{Vcf}(x) = 1.10 + 0.022 \cdot x$
Take x = 10: $\mu_{Vcf}(10) = 1.10 + 0.022 \cdot 10 = 1.32$
This point prediction is subject to an error, due to the estimation of the coefficients of the model.

(35)

Prediction Limits for the Mean Value

The prediction limits get wider the further we are from the “center” of the scatterplot.
I.e., precision of the prediction decreases if we move further away from the mean of x.

(36)

Prediction of an Individual Observation

One can also try to make a prediction for an individual observation of the dependent variable.
The prediction would be of interest for, e.g., an individual patient.
The problem here is that the individual observation will randomly deviate from the mean.
A point prediction on its own thus makes little sense; prediction limits are needed.

(37)

Prediction Limits for an Individual Observation

The prediction limits are wider than those for the mean value.
The prediction error contains two components now:
the error due to the prediction of the mean value;
the error due to the variability ($\sigma^2$) around the mean value.
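Both kinds of limits can be sketched from the summary numbers quoted in the lecture. The formulas used here are the standard textbook ones and are an assumption, since the slides show the limits only as plots: the variance of the predicted mean at x0 is s²·(1/n + (x0 - x̄)²/Sxx), and for an individual observation s² is added once more.

```python
import math

n = 23
x_bar = 10.37
sxx, syy, sxy = 429.7, 1.19, 9.43  # sums quoted in the lecture
a, b = 1.10, 0.022                 # fitted coefficients from the lecture
t_crit = 2.08                      # 97.5th percentile of t with 21 df

# Residual variance estimate: residual sum of squares / (n - 2).
s2 = (syy - sxy ** 2 / sxx) / (n - 2)

def limits(x0, individual=False):
    """95% limits for the mean (default) or for an individual observation at x0."""
    fit = a + b * x0
    var = s2 * (1 / n + (x0 - x_bar) ** 2 / sxx)
    if individual:
        var += s2  # extra variability of a single observation around the mean
    half = t_crit * math.sqrt(var)
    return (fit - half, fit + half)

def width(interval):
    return interval[1] - interval[0]

mean_10 = limits(10)
mean_20 = limits(20)
indiv_10 = limits(10, individual=True)

print("mean, x=10:      ", [round(v, 2) for v in mean_10])
print("mean, x=20:      ", [round(v, 2) for v in mean_20])
print("individual, x=10:", [round(v, 2) for v in indiv_10])
```

The limits for the mean widen away from the mean of x, and the limits for an individual observation are wider than those for the mean at the same x.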

(38)
(39)

ASSUMPTIONS AND HOW TO CHECK THEM

(40)

Linear Regression Model: Assumptions

The model is developed assuming that:
the observations of Y are collected independently;
the mean value of the dependent variable Y is a linear function of the covariate X;
for each value of α + β · X, the dependent variable is normally distributed with constant variance $\sigma^2$.
These are assumptions: they need to be checked.
If not fulfilled, you may need to consider
using another form of the covariate;

(41)

Checking the Assumptions

Recall, according to the model,
$Y = \alpha + \beta \cdot x + \varepsilon$,
where ε is normally distributed with $N(0, \sigma^2)$.
We can estimate ε by
$e = y - (a + b \cdot x)$
These estimates are called residuals.
$\sum e^2 / (n-2)$ will give an estimate of $\sigma^2$ (the divisor is n-2 because two coefficients, a and b, are estimated).
If the assumptions are correct, the residuals should approximately have a normal distribution with mean 0.
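As a small illustration of how residuals are obtained, the sketch below fits a line to five hypothetical data points (invented for this example, not the lecture's data), computes e = y - (a + b·x), and estimates the residual variance with the divisor n - 2.

```python
# Hypothetical data, for illustration only.
x = [5.0, 8.0, 10.0, 12.0, 15.0]
y = [1.1, 1.3, 1.2, 1.5, 1.4]
n = len(x)

# Least-squares fit for a single covariate.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Residuals: observed value minus value predicted by the fitted line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Estimate of the residual variance (two coefficients were estimated).
sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)

print("residuals:", [round(e, 3) for e in residuals])
print("estimated residual variance:", round(sigma2_hat, 4))
```

With an intercept in the model, least-squares residuals always sum to zero, so only their spread and pattern carry information about the assumptions.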

(42)

Analysis of Residuals (1)

Plot the residuals against the observed covariate values.
If the assumptions are met, the plot should be

(43)

Analysis of Residuals (2)

The plot of the residuals may reveal non-constant variance (heteroscedasticity).
It can also point towards a non-linear (w.r.t. the covariate values) relationship.

(44)

Blood Glucose & Vcf: Residuals

The plot looks

(45)

Blood Glucose and Vcf

23 patients with type I diabetes.
Vcf seems to (linearly) increase with blood glucose.
How to describe the relation?
It is not deterministic.

(46)


(47)

Relationship Between Blood Glucose and Vcf

Individual observations on Vcf vary quite a bit even for very similar levels of blood glucose.
It seems, however, that a higher blood glucose level leads to a higher average Vcf.
How can we make this description more formal?

(48)


(49)


(50)

Checking Normality of Residuals

To this aim, the normal probability plot is used.
Standardized residuals (residual/st. error) are ordered and plotted against the values expected from the standard normal distribution.
The graph should look approximately linear.
One might have doubts in our example…

[Figure: normal probability plot of the standardized residuals]

(51)

Linear Regression for Log-Vcf

Let us use ln(Vcf) as the dependent variable.
The model changes to
$\mu_{\ln(Vcf)} = \alpha + \beta \cdot (\text{glucose level})$
The estimated model is
$\mu_{\ln(Vcf)} = 0.115 + 0.015 \cdot (\text{glucose level})$

(52)

Model for Log-Vcf: Residuals (1)

No major problems in the residual plot.

[Figure: residual plot; y-axis: Residuals, from -0.4 to 0.4]

(53)

Model for Log-Vcf: Residuals

One might argue that the normal probability plot for the residuals looks better than for untransformed Vcf.

[Figure: normal probability plot; axes: Normal F[(lresid-m)/s] and Empirical P[i] = i/(N+1), both from 0 to 1]

(54)

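Interpretation of the Model for Log-Vcf

The model implies that
$\mu_{\ln(Vcf)} = 0.115 + 0.015 \cdot (\text{glucose level})$
It follows that, if blood glucose increases by 1 unit, then the mean value of ln(Vcf) increases by 0.015.
Upon taking $\mu_{Vcf} \approx \exp(\mu_{\ln(Vcf)})$,
$\mu_{Vcf} = e^{0.115} \cdot e^{0.015 \cdot (\text{glucose level})} = e^{0.115} \cdot (1.015)^{(\text{glucose level})}$
We could conclude that the mean value of Vcf is multiplied by about 1.015 (i.e., increases by about 1.5%) for each unit increase in blood glucose.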
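The multiplicative reading of the log-scale slope can be checked numerically. In this sketch the coefficients are the ones quoted on the slide; the function name and the glucose values 10 and 11 are arbitrary choices for illustration.

```python
import math

alpha_hat, beta_hat = 0.115, 0.015  # estimated coefficients on the ln(Vcf) scale

def back_transformed_vcf(glucose):
    """exp of the fitted mean of ln(Vcf) at a given glucose level."""
    return math.exp(alpha_hat + beta_hat * glucose)

# Ratio of back-transformed values for a one-unit increase in glucose.
ratio = back_transformed_vcf(11) / back_transformed_vcf(10)
print(f"multiplicative change per unit of glucose: {ratio:.4f}")
```

The ratio equals exp(0.015) whatever starting glucose level is chosen, which is why the effect is described as a constant percentage change.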

(55)

Choice of the Transformation

Consider power transformations $x^s$ or $y^s$ (s = ..., -3, -2, -1, -½, 0 (= ln), ½, ...)
The circle of powers.
Choose the quadrant which most closely resembles the pattern of the data.
Increase or decrease the power of x or y (relative to 1) according to the indications.
Example: for Quadrant II, take s < 1 for x or s > 1 for y.

(56)

Choice of the Transformation: Example

Data resemble the pattern of Quadrant III.
We might want to use s < 1 for x or y.
Let’s use ln(x) (i.e., s = 0).
The transformed scatterplot looks more linear.

(57)

COEFFICIENT OF DETERMINATION $R^2$

(58)

Different Sources of Variation

[Figure: for a point $(X_i, Y_i)$, the total variation is split into “explained” variation and residual variation]

Add squares of a particular component for all points.

(59)

Coefficient of Determination $R^2$

“Total” variation (sum of squares) = “Explained” + “Residual”
The larger the “explained” variation, the better the model “explains” the data.
$R^2$ = (“explained” sum of squares) / (“total” sum of squares)
Measures the proportion of variation that is explained by the independent variable X in the regression model.
The closer to 1, the better the model “explains” the data.
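For simple linear regression, the "explained" sum of squares equals Sxy²/Sxx, so R² can be computed directly from the sums quoted earlier for the untransformed Vcf data. A minimal sketch (variable names are illustrative):

```python
# Sums quoted earlier in the lecture for glucose (x) and untransformed Vcf (y).
sxx, syy, sxy = 429.7, 1.19, 9.43

explained = sxy ** 2 / sxx      # "explained" sum of squares of the fitted line
total = syy                     # "total" sum of squares
residual = total - explained    # "residual" sum of squares

r_squared = explained / total
r = sxy / (sxx * syy) ** 0.5    # Pearson's correlation for the same data

print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```

With a single covariate, R² coincides with the square of Pearson's r; here roughly 17% of the variation in Vcf is "explained" by glucose.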

(60)

Coefficient of Determination $R^2$

“Total” and “explained” sums of squares are considerably different: low $R^2$.
“Total” and “explained” sums of squares are not very different.

(61)

Coefficient of Determination ($R^2$) and Correlation (r)

$R^2 = r^2$

[Figure: plots of Y against X with $R^2 = 1$, $r = \pm 1$]

(62)

Coefficient of Determination ($R^2$) and Correlation (r)

[Figure: plot of Y against X with $R^2 = 0.8$, $r = \pm 0.9$]

(63)

Coefficient of Determination ($R^2$) and Correlation (r)

[Figure: plot of Y against X with $R^2 = 0$, $r = 0$]

(64)

Coefficient of Determination $R^2$

$R^2$ high → model fits well? WRONG!
$R^2$ can be high for badly-fitting models.
$R^2$ can be similar for good and bad models.
Fit of the model should be assessed from residuals.
Model fits well → $R^2$ meaningful!

(65)

Linear Regression for Log-Vcf

The estimated model:
$\mu_{\ln(Vcf)} = 0.115 + 0.015 \cdot (\text{glucose level})$
Acceptable fit.
$R^2 = 0.16$
Only 16% of total variation in ln(Vcf) explained by glucose level.

[Figure: fitted model plotted against glucose; x-axis: glucose, from 5 to 20]

(66)

Simple Linear Regression Model: Summary

Plot the data and check the relationship.
If not linear, transform.
Fit the model & check the assumptions (residual plots): linear relationship; homoscedasticity; normality.
If not fulfilled, consider
a transformation of the dependent variable;
another form of the covariate;
extra covariates (multiple regression).

(67)

Multiple Regression Models

More Than One Covariate
Adjusted $R^2$
Dummy Variables, Interactions
Polynomial Regression

(68)

Simple Linear Regression

For an individual observation of Y we assume that
$Y = \alpha + \beta \cdot x + \varepsilon$,
where ε is normally distributed with $N(0, \sigma^2)$.
Thus, we assume that the mean of Y is a linear function of X:

(69)

Multiple Linear Regression

Simple linear regression involves one covariate.
The model can be extended to the case of more than one covariate.
It is then called a multiple linear regression model.

(70)

Multiple Linear Regression

Assume we have two covariates: X and Z.
Let the dependent variable be normally distributed with $N(\mu(X, Z), \sigma^2)$, where the mean is given by
$\mu(X, Z) = \alpha + \beta_X \cdot X + \beta_Z \cdot Z$
Consider z and z+1. We get
$\mu(X, z+1) = \alpha + \beta_X \cdot X + \beta_Z \cdot (z+1) = \alpha + \beta_X \cdot X + \beta_Z \cdot z + \beta_Z = \mu(X, z) + \beta_Z$
Interpretation: the mean of Y changes linearly with X & Z. For a

(71)

Multiple Linear Regression: Example (1)

Length vs. gestational age for a sample of 100 low birth weight infants born in Boston, Massachusetts.
A linear increase of mean length with the age may be postulated.
Not much influence of mother’s age.
Note a possible outlier.

(72)

Multiple Linear Regression: Example (2)

The model using gestational age and mother’s age as covariates:
length = 9.0909 + 0.9361 · (gest. age) + 0.0247 · (mother’s age)
SE = 0.1093 for gestational age; SE = 0.0463 for mother’s age.
For two mothers of the same age, if children are born 1 week apart, then the child born later is, on average, 0.9361 cm longer.
Inference for the coefficients:
Gestational age, 95% CI: [0.936 ± 1.985 · 0.109] = [0.720, 1.152]
T = 0.9361/0.1093 ≈ 8.56, p < 0.0005
Statistically significant (at the 5% significance level) effect of the age.
Mother’s age, 95% CI: [0.025 ± 1.985 · 0.046] = [-0.066, 0.116]

(73)

Adjusted Coefficient of Determination $R^2$

The more covariates, the better the model will “explain” the data: $R^2$ will increase.
Adjust it for the number of covariates.
For instance, $R^2_{adj} = 1 - \frac{n-1}{n-q} (1 - R^2)$, where q is the number of estimated coefficients.
Note that (n - 1) / (n - q) > 1 for q > 1, so the adjusted $R^2$ is smaller than $R^2$.

(74)

Multiple Linear Regression: Example (3)

The model using gestational age and mother’s age as covariates:
length = 9.0909 + 0.9361 · (gest. age) + 0.0247 · (mother’s age)
SE = 0.1093 for gestational age; SE = 0.0463 for mother’s age.
$R^2 = 0.4575$
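As a rough illustration of the adjustment, the sketch below applies one common definition of the adjusted R², 1 - (n - 1)/(n - q) · (1 - R²), where q counts the estimated coefficients including the intercept. This definition is an assumption, because the formula on the slide did not survive; n = 100 and q = 3 correspond to the example with two covariates.

```python
def adjusted_r2(r2, n, q):
    """Adjusted R^2; q is the number of estimated coefficients incl. the intercept.

    Assumed definition: 1 - (n - 1) / (n - q) * (1 - r2).
    """
    return 1 - (n - 1) / (n - q) * (1 - r2)

r2 = 0.4575
adj = adjusted_r2(r2, n=100, q=3)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
```

With 100 observations and only two covariates the penalty is small, so the adjusted value stays close to the unadjusted one.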

(75)

STATA OUTPUT

(76)
(77)

Categorical Covariates

Often, a categorical variable is available as a covariate.
Ordinal: non-, light, medium, or heavy smoker.
Nominal: treatment group A, B or C.
To use it in a model, we would need to assign numerical scores to its levels.
Problematic for a nominal variable.
Should we assign equidistant scores (e.g., non-smoker=1; light=2; medium=3; heavy smoker=4)?
In such a case, dummy variables can be used.

(78)

Dummy Variables

For each of the G levels of the categorical covariate, define a binary indicator equal to 1 if the level is observed, and 0 otherwise.
We can then form the model with any G-1 dummy variables, e.g.,
$\mu_Y(x) = \alpha + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \dots + \beta_G \cdot x_G$
Note: $x_1$ is not in the model; the first category becomes the reference.
$\beta_i$ will be the change of the mean of the dependent variable due to

(79)

Dummy Variables: Example

For “smoking”, we can construct the following dummy variables (exemplary patient data):

Original x    x1   x2   x3   x4
non-smoker    1    0    0    0
light         0    1    0    0
medium        0    0    1    0
heavy         0    0    0    1

We can then form the model
$\mu_Y(x) = \alpha + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \beta_4 \cdot x_4$
Note that $x_1$ is left out of the model (equivalently, $\beta_1 \cdot x_1 = 0$ in every case), so non-smokers are the reference.
non-smokers: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0 = \alpha$
light: $\mu_Y(x) = \alpha + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 0 = \alpha + \beta_2$
$\beta_2$ is the change in Y for light smokers, as compared to non-smokers.
medium: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 0 = \alpha + \beta_3$
$\beta_3$ is the change in Y for medium smokers, as compared to non-smokers.
heavy: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 1 = \alpha + \beta_4$
$\beta_4$ is the change in Y for heavy smokers, as compared to non-smokers.
$\beta_3 - \beta_2$ is the change in Y for medium smokers, compared to light smokers.
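The coding scheme can be sketched in a few lines. The helper names `dummy_code` and `mean_response`, and the coefficient values used in the usage lines, are hypothetical and chosen only to make the arithmetic easy to follow.

```python
LEVELS = ["non-smoker", "light", "medium", "heavy"]  # first level is the reference

def dummy_code(value, levels=LEVELS):
    """Indicators for every level except the first (the reference category)."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]

def mean_response(value, alpha, betas):
    """Model mean: alpha plus the coefficient of the observed level, if any."""
    return alpha + sum(b * x for b, x in zip(betas, dummy_code(value)))

# Hypothetical coefficients (alpha, then beta_2, beta_3, beta_4), for illustration.
alpha, betas = 10.0, [1.0, 2.0, 4.0]

for level in LEVELS:
    print(level, dummy_code(level), mean_response(level, alpha, betas))
```

Each coefficient is a difference from the reference group, and differences between two non-reference groups are obtained by subtraction, such as beta_3 - beta_2 for medium versus light smokers.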

(80)

Multiple Linear Regression: Example (4)

Adjusting the model for mother’s diagnosis of toxemia.
Using a dummy variable equal to 1 for toxemia, and 0 for non-toxemia.
The fitted model:
length = 6.2843 + 1.0699 · (gest. age) - 1.7774 · (toxemia)
SE = 0.1121 for gestational age; SE = 0.6940 for toxemia.
Inference for the coefficients:
Gestational age, 95% CI: [1.070 ± 1.985 · 0.112] = [0.848, 1.292]
T = 1.0699/0.1121 = 9.544, p < 0.0005
Statistically significant (at the 5% significance level) effect of the age.
Toxemia, 95% CI: [-1.777 ± 1.985 · 0.694] = [-3.154, -0.399]
T = -1.7774/0.6940 = -2.561, p = 0.012

(81)

Multiple Linear Regression: Example (5)

Mother without toxemia: length = 6.2843 + 1.0699 · (age)
Mother with toxemia: length = 6.2843 + 1.0699 · (age) - 1.7774
The intercept is changed; the slope for gestational age remains the same.
The fitted regression lines are parallel.

(82)

Multiple Linear Regression With Interaction (1)

Interaction means that the effect of X on the mean value of Y depends on the level of Z.
A model allowing for an interaction:

(83)

Multiple Linear Regression With Interaction (2)

For X = x we get
$\mu(x, Z) = \alpha + \beta_X \cdot x + \beta_Z \cdot Z + \beta_{XZ} \cdot x \cdot Z$
For X = x+1 we have
$\mu(x+1, Z) = \alpha + \beta_X \cdot (x+1) + \beta_Z \cdot Z + \beta_{XZ} \cdot (x+1) \cdot Z$
$= \alpha + \beta_X \cdot x + \beta_X + \beta_Z \cdot Z + \beta_{XZ} \cdot x \cdot Z + \beta_{XZ} \cdot Z$
$= \mu(x, Z) + \beta_X + \beta_{XZ} \cdot Z$
Conclusion: the change due to a unit increase for X depends on Z.

(84)

Multiple Linear Regression: Example (6)

Model with the interaction between toxemia and gest. age.
length = 6.2843 + 1.0584 · (age) - 3.4771 · (tox) + 0.0559 · (age) · (tox)
SE = 0.1263 for age; SE = 8.5198 for toxemia; SE = 0.2795 for the interaction.
Mother without toxemia: length = 6.2843 + 1.0584 · (age)
Mother with toxemia: length = (6.2843 - 3.4771) + (1.0584 + 0.0559) · (age)
Interaction effect, 95% CI: [0.056 ± 1.985 · 0.279] = [-0.498, 0.610]
T = 0.0559/0.2795 = 0.200, p = 0.842
Not significant at the 0.05 level of significance.
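The effect of the interaction term on the slope can be illustrated by evaluating the fitted model one week apart in each group. The coefficients are those quoted on the slide; the function name and the ages 30 and 31 weeks are arbitrary choices for illustration.

```python
def mean_length(age, tox):
    """Fitted mean length from the interaction model (tox is 0 or 1)."""
    return 6.2843 + 1.0584 * age - 3.4771 * tox + 0.0559 * age * tox

# Change in fitted mean length for a one-week increase in gestational age.
slope_no_tox = mean_length(31, 0) - mean_length(30, 0)
slope_tox = mean_length(31, 1) - mean_length(30, 1)

print(f"slope without toxemia: {slope_no_tox:.4f}")
print(f"slope with toxemia:    {slope_tox:.4f}")
```

The two slopes differ by exactly the interaction coefficient, so the fitted lines are no longer parallel, although in this example the difference is small and not statistically significant.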

(85)

Multiple Linear Regression: Assumptions

The assumptions are similar to those for simple linear regression:
a linear function of the covariate(s);
residual errors normally distributed with constant variance.
The assumptions should be checked.
Residual analysis as for simple linear regression.

(86)

Multiple Linear Regression: Example (7)

Length of low birth weight infants.
The residual plot for the model with gestational age and toxemia.
Residual against fitted value (i.e., the estimated linear combination of covariates).
Homoscedasticity looks plausible.

(87)

Polynomial Regression

Linear regression means that we consider a linear combination of covariates.
The covariates themselves can be non-linear functions.
For instance, consider the quadratic model
$\mu_Y(X) = \alpha + \beta_1 \cdot X + \beta_2 \cdot X^2$
Interpretation: the mean of Y changes as a quadratic function of X.

(88)

Quadratic Regression: Average Birth Weight

Non-linear dependence of the average birth weight on gestational age.
Quadratic function gives a very good result.
The fitted model
weight = -22.693 + 1.2122 · age

(89)

Generalized Linear Models

Linear regression is a member of the family of generalized linear models.
In these models, we assume that the dependent variable Y has a (not necessarily normal) distribution with mean $\mu_Y$.
It is further postulated that, for a particular function g,
$g(\mu_Y) = \alpha + \beta_1 \cdot X_1 + \dots + \beta_k \cdot X_k$
In linear regression, we specify that Y is normally distributed, and $g(\mu) = \mu$.
Other examples are logistic regression, Poisson regression, etc.

(90)

Ockham’s Razor

William of Ockham, XIVth century:
''Pluralitas non est ponenda sine necessitate''
''Entities should not be multiplied unnecessarily''
Implies the use of parsimonious models, i.e., models containing as few parameters as possible.

(91)

''All models are wrong; some models are useful''
George Box (apparently)

(92)

Alternative summary: statistics for various types of outcome data

Outcome variable; are the observations independent or correlated?

Mixed models / GEE modeling.
Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship.

Binary or categorical (e.g. fracture yes/no).
Independent: difference in proportions; relative risks; chi-square test; logistic regression.
Correlated: McNemar’s test; conditional logistic regression; GEE modeling.
Assumptions: chi-square test assumes sufficient numbers in each cell (>= 5).

Time-to-event (e.g. time to fracture).
Independent: Kaplan-Meier statistics; Cox regression.
Correlated: n/a.
Assumptions: Cox regression
