sawilopo@yahoo.com Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health
Biostatistics I: 2017-18
Lecture 11
Regression and Correlation Methods
Learning Objectives
1. Describe the Linear Regression Model
2. State the Regression Modeling Steps
3. Explain Ordinary Least Squares
4. Compute Regression Coefficients
5. Understand and check model assumptions
Purpose of regression
• Estimation
– Estimate the association between outcome and exposure, adjusted for other covariates
• Prediction
– Use an estimated model to predict the outcome given covariates in a new dataset
Adjusting for confounders
• Do not adjust
– Cofactor is a collider
– Cofactor is in the causal path
• May or may not adjust
– Cofactor has missing values
– Cofactor is measured with error
[Figure: true value vs. unadjusted estimate]
Workflow
• Scatterplots
• Bivariate analysis
• Regression
– Model fitting
  • Cofactors in/out
  • Interactions
– Test of assumptions
  • Independent errors
  • Linear effects
  • Constant error variance
– Influence (robustness)
– Interaction testing
Correlation vs Regression
Deterministic vs. Statistical Relationship
Deterministic vs. Statistical Relationship
• Body Mass Index (BMI)
• Income (millions $) vs bank’s assets (billions $)
BMI and Height
$\mathrm{BMI} = \dfrac{\text{body mass (kg)}}{(\text{height (m)})^2}$
Fix body mass = 80 kg.
Height from 1.5 to 2.0 m.
Deterministic relationship: mass, height → BMI
[Figure: BMI plotted against height (m), for heights from 1.5 to 2.0 m]
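The deterministic BMI relationship can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `bmi` and the grid of heights are choices made here, not part of the lecture.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index: body mass in kg divided by squared height in m."""
    return mass_kg / height_m ** 2


# Fixed body mass of 80 kg, heights from 1.5 m to 2.0 m (grid chosen for illustration).
heights = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
bmi_values = [bmi(80.0, h) for h in heights]

for h, value in zip(heights, bmi_values):
    print(f"height {h:.1f} m -> BMI {value:.1f}")
```

Once mass and height are known, the BMI is known exactly; no statistical model is needed.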
Income vs. Assets
$\text{Income} = a + b \cdot \text{Assets}$
Assets: 3.4 to 49 billion $
Income varies, even for banks with the same assets!
Statistical relationship
Description of Relationships
A deterministic relationship is easy to describe: by a formula.
It allows for a perfect prediction: body mass and height known → exact BMI
Perfect prediction of quantities subject to a statistical relationship is not possible: known assets → varying income
But:
Statistical Relationships: Examples
[Figure: example scatterplots; one axis is labeled Physical Health Score]
Strength and Direction of a Linear Association
How well does a straight line fit the points on a two-dimensional scatterplot?
Pearson’s correlation coefficient (often simply called a correlation): r.
A measure of linear association: the stronger the association, the larger the absolute value of r.
Gives the “direction” of the relationship:
• positive r → positive association: large values of one variable → large values of the other variable
• negative r → negative association: large values of one variable → small values of the other variable
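Pearson’s Correlation Coefficient
$r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}}$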
Blood Glucose and Vcf
23 patients with type I diabetes.
Velocity of circumferential shortening of the left ventricle (Vcf) seems to increase (linearly) with blood glucose.
How to describe the relation?
It is not deterministic.
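Blood Glucose and Vcf: Correlation
[Table: glucose and Vcf for each subject; first row: subject 1, glucose 15.3, Vcf 1.76]
mean glucose: 10.37; mean Vcf: 1.32
With x = glucose and y = Vcf:
$\sum (x_i - \bar{x})^2 = (15.3-10.37)^2 + \dots + (9.5-10.37)^2 = 429.7$
$\sum (y_i - \bar{y})^2 = (1.76-1.32)^2 + \dots + (1.70-1.32)^2 = 1.19$
$\sum (x_i - \bar{x})(y_i - \bar{y}) = (15.3-10.37)(1.76-1.32) + \dots + (9.5-10.37)(1.70-1.32) = 9.43$
$r = 9.43 / \sqrt{429.7 \cdot 1.19} \approx 0.42$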
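The correlation for the Vcf data can be recomputed from the three sums given on the slide. The sketch below is illustrative only: the function names are chosen here, and the three-point data set at the end is a made-up check, not lecture data.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient computed from raw paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def pearson_r_from_sums(sxx, syy, sxy):
    """Same coefficient, starting from the sums of squares and cross-products."""
    return sxy / math.sqrt(sxx * syy)


# Sums quoted on the slide for glucose (x) and Vcf (y).
r_vcf = pearson_r_from_sums(sxx=429.7, syy=1.19, sxy=9.43)
print(f"r for the Vcf data: {r_vcf:.3f}")

# Made-up data lying exactly on a rising straight line: r should be +1.
r_line = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```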
Correlation Coefficient: Special Values
Perfect positive association when r = +1.
Perfect negative association when r = -1.
No linear association (the association can still be non-linear!), or a linear association along a horizontal line, when r = 0.
Correlation Coefficients
[Figure: example scatterplot with r = -0.9, n = 100]
Significance Test for Pearson’s Correlation Coefficient
The computed value of r will usually be different from 0 due to sampling variability.
One may want to test the null hypothesis that the true value of the coefficient is 0.
Under the null hypothesis, $T = r \cdot \sqrt{(n-2)/(1-r^2)}$ has Student’s t distribution with n-2 degrees of freedom.
Blood Glucose and Vcf: The Test
[Table: glucose and Vcf for each subject; first row: subject 1, glucose 15.3, Vcf 1.76]
We can reject the null hypothesis that the true value of the correlation coefficient is 0 (at the 5% significance level).
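As a rough check of this conclusion, the usual t statistic for a correlation coefficient, T = r·√(n−2)/√(1−r²), can be evaluated for the Vcf data. This is a sketch under stated assumptions: the formula is the standard one for this test, and the critical value 2.08 is the 97.5th percentile of the t distribution with 21 degrees of freedom that the lecture uses for the slope confidence interval.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: true correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


# r recomputed from the sums given for the Vcf data; n = 23 patients.
r_vcf = 9.43 / math.sqrt(429.7 * 1.19)
t_stat = correlation_t_statistic(r_vcf, n=23)

T_CRIT_95 = 2.08  # 97.5th percentile of t with 21 degrees of freedom
reject_at_5_percent = abs(t_stat) >= T_CRIT_95

print(f"T = {t_stat:.2f}, reject H0 at the 5% level: {reject_at_5_percent}")
```

The statistic is numerically very close to the one obtained for the slope of the regression line, because in simple linear regression the two tests are equivalent.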
Further Remarks on Pearson’s Correlation Coefficient
Reminder: the coefficient describes only a linear association.
It is sensitive to outliers (i.e., observations that lie far from the main bulk of the data).
Outliers are often due to recording errors, but may be genuine values.
A non-parametric version, Spearman’s rank correlation coefficient, exists.
A SIMPLE LINEAR REGRESSION
Relationship Between Blood Glucose and Vcf
Individual observations on Vcf vary quite a bit even for very similar levels of blood glucose.
It seems, however, that a higher blood glucose level is associated with a higher average Vcf.
How can we make this description more formal?
Simple Linear Regression: Blood Glucose & Vcf (1)
Assume that Vcf is normally distributed with $N(\mu, \sigma^2)$.
Assume a linear regression model: the mean (average) value of Vcf changes linearly with the level of blood glucose:
$\mu = \alpha + \beta \cdot (\text{glucose level})$
Linear Regression: Terminology (1)
The dependent variable Y and the covariate (independent, explanatory variable) X.
In our example, Vcf is Y, blood glucose level is X.
We assume that Y is normally distributed with $N(\mu_Y, \sigma^2)$.
We further postulate that, for X = x,
$\mu_Y = \mu_Y(x) = \alpha + \beta \cdot x$
α and β are the coefficients of the model.
α is called the intercept.
Simple Linear Regression
The straight line describes the increase in the mean of the dependent variable as a function of the covariate level.
Individual observations for the dependent variable vary around the regression line; the deviations follow a normal distribution with mean 0 and a constant variance.
Linear Regression: Terminology (2)
For an individual observation of Y we can write that
$Y = \alpha + \beta \cdot x + \varepsilon$,
where ε is normally distributed with $N(0, \sigma^2)$.
Interpretation: an individual observation of Y can randomly deviate from the mean, which is a linear function of x.
ε is called the residual random error (measurement error).
Note that $\sigma^2$ is assumed constant for all x.
Linear Regression: The Intercept
$\mu_Y(x) = \alpha + \beta \cdot x$
For x = 0, $\mu_Y(0) = \alpha + \beta \cdot 0 = \alpha$
α is the mean value of the dependent variable when x = 0.
But blood glucose level = 0 makes little sense...
Use a “centered” covariate:
$\mu_Y(x) = \alpha + \beta \cdot (x - x_0)$
Usually, one takes $x_0$ = sample mean of the observed x values.
For $x = x_0$, $\mu_Y(x_0) = \alpha + \beta \cdot (x_0 - x_0) = \alpha + \beta \cdot 0 = \alpha$
α is then the mean value when $x = x_0$.
• Easier to interpret.
• Can help in estimating the model.
Linear Regression: The Slope
$\mu_Y(x) = \alpha + \beta \cdot x$
Consider two values of the covariate: x and x+1.
For x: $\mu_Y(x) = \alpha + \beta \cdot x$
For (x+1): $\mu_Y(x+1) = \alpha + \beta \cdot (x+1) = \alpha + \beta \cdot x + \beta = \mu_Y(x) + \beta$
β is the change in the mean value of the dependent variable corresponding to a unit change in the covariate.
β > 0: positive relationship (x increases, the mean increases).
β < 0: negative relationship (x increases, the mean decreases).
Linear Regression: Estimation
$\mu_Y(x) = \alpha + \beta \cdot x$
The equation describes a theoretical relationship.
In practice, we know neither α nor β.
We have to estimate them from the observed data.
This is often called fitting a model to data.
The estimated coefficients will be denoted by a and b.
How to estimate α and β?
Estimation of the Coefficients of a Linear Regression Model
Least squares method: select the line which minimizes the sum of squares of the differences between the observed values and the values predicted by the model (line).
Result:
$b = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad a = \bar{y} - b \cdot \bar{x}$
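The least squares estimates for a single covariate have a closed form: the slope is the sum of cross-products divided by the sum of squares of x, and the intercept follows from the two means. The sketch below illustrates this; the function name and the toy data are made up here, while the summary values at the end are the ones quoted earlier for the Vcf data.

```python
def least_squares_fit(xs, ys):
    """Return (a, b): intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Hypothetical toy data lying exactly on y = 1 + 2x.
a_toy, b_toy = least_squares_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Same formulas applied to the summary values quoted for glucose (x) and Vcf (y).
b_vcf = 9.43 / 429.7
a_vcf = 1.32 - b_vcf * 10.37
print(f"Vcf data: a = {a_vcf:.2f}, b = {b_vcf:.3f}")
```

With the rounded summary values the intercept comes out near 1.09 rather than the reported 1.10; the small difference is presumably due to rounding of the means and sums.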
Linear Regression for Vcf & Blood Glucose
Estimated model:
$\widehat{\mathrm{Vcf}}(x) = 1.10 + 0.022 \cdot x$
Interpretation: if the blood glucose level increases by 1 mmol/l, the mean value of Vcf increases by 0.022 %/s.
Positive association.
Note that the estimate b of the slope is close to 0.
Perhaps it differs from 0 only by chance…
We need a CI for β.
Confidence Interval for the Slope
CI for β: $b \pm t_{n-2,\,1-\alpha/2} \cdot SE(b)$
($t_{n-2,\,1-\alpha/2}$ is a percentile of Student’s $t_{n-2}$ distribution; α here denotes the significance level, not the intercept).
In our case, n = 23 and SE(b) = 0.0105
95% CI for β: [0.022 ± 2.08 · 0.0105] = [0.0002, 0.0438]
99% CI for β: [0.022 ± 2.83 · 0.0105] = [-0.0077, 0.0517]
• For large n (≥ 100), the standard normal distribution can be used.
95% CI does not include 0 → we can reject $H_0: \beta = 0$.
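The two intervals can be reproduced directly from the estimate, its standard error, and the t percentiles quoted on the slide. A minimal sketch; the helper names are chosen here for illustration.

```python
def slope_confidence_interval(b: float, se_b: float, t_quantile: float):
    """Confidence interval b ± t · SE(b) for the slope."""
    half_width = t_quantile * se_b
    return b - half_width, b + half_width


def excludes_zero(interval) -> bool:
    """True if the interval lies entirely above or entirely below 0."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Values from the slide: b = 0.022, SE(b) = 0.0105, t percentiles for 21 d.f.
ci95 = slope_confidence_interval(0.022, 0.0105, t_quantile=2.08)
ci99 = slope_confidence_interval(0.022, 0.0105, t_quantile=2.83)

print(f"95% CI: [{ci95[0]:.4f}, {ci95[1]:.4f}], excludes 0: {excludes_zero(ci95)}")
print(f"99% CI: [{ci99[0]:.4f}, {ci99[1]:.4f}], excludes 0: {excludes_zero(ci99)}")
```

The 95% interval only just excludes 0, while the 99% interval contains it.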
Test of Significance for the Slope
Alternatively, we could conduct a formal test.
$H_0: \beta = 0$
$H_A: \beta \neq 0$
Under the null hypothesis, T = b / SE(b) should have Student’s t distribution with n-2 degrees of freedom.
For Vcf data, T = 0.022/0.0105 = 2.09.
$p = P(|t_{21}| \geq 2.09) = 0.049$
p < 0.05 → we can reject $H_0$ at the 5% significance level.
But not at the 1% level.
Prediction of the Mean Value Based on a Linear Regression Model
The prediction would be of interest, e.g., for a group of subjects with a particular value of x.
Example:
Estimated model: $\widehat{\mathrm{Vcf}}(x) = 1.10 + 0.022 \cdot x$
Take x = 10: $\widehat{\mathrm{Vcf}}(10) = 1.10 + 0.022 \cdot 10 = 1.32$
This point prediction is subject to an error, due to the estimation of the coefficients of the model.
Prediction Limits for the Mean Value
The prediction limits get wider the further we are from the “center” of the scatterplot.
I.e., the precision of the prediction decreases as we move further away from the mean of x.
Prediction of an Individual Observation
One can also try to make a prediction for an individual observation of the dependent variable.
The prediction would be of interest for, e.g., an individual patient.
The problem here is that the individual observation will randomly deviate from the mean.
A point prediction alone thus makes little sense.
Prediction Limits for an Individual Observation
The prediction limits are wider than those for the mean value.
The prediction error now contains two components:
• the error due to the prediction of the mean value;
• the error due to the variability ($\sigma^2$) around the mean value.
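Both kinds of prediction limits can be sketched with the standard textbook formulas for simple linear regression: the standard error of the predicted mean at x is s·√(1/n + (x − x̄)²/Sxx), and for an individual observation it is s·√(1 + 1/n + (x − x̄)²/Sxx). Note the assumptions: the lecture does not quote the residual standard deviation s, so it is back-calculated here from SE(b) = s/√Sxx, and all constant and function names are chosen for illustration.

```python
import math

# Quantities quoted in the lecture for the Vcf data.
N = 23            # number of patients
X_BAR = 10.37     # mean glucose
SXX = 429.7       # sum of squares of glucose around its mean
A, B = 1.10, 0.022
SE_B = 0.0105
T_975 = 2.08      # 97.5th percentile of t with 21 degrees of freedom

# Assumption: residual SD back-calculated from SE(b) = s / sqrt(Sxx).
S = SE_B * math.sqrt(SXX)


def predict_mean(x: float) -> float:
    """Point prediction of the mean Vcf at glucose level x."""
    return A + B * x


def prediction_limits(x: float, individual: bool = False):
    """95% limits for the mean (default) or for an individual observation."""
    variance_factor = 1.0 / N + (x - X_BAR) ** 2 / SXX
    if individual:
        variance_factor += 1.0  # extra variability around the mean
    half_width = T_975 * S * math.sqrt(variance_factor)
    centre = predict_mean(x)
    return centre - half_width, centre + half_width


for x in (X_BAR, 20.0):
    lo_m, hi_m = prediction_limits(x)
    lo_i, hi_i = prediction_limits(x, individual=True)
    print(f"x = {x:5.2f}: mean [{lo_m:.2f}, {hi_m:.2f}], individual [{lo_i:.2f}, {hi_i:.2f}]")
```

The output shows both effects described on the slides: limits widen away from the mean of x, and limits for an individual observation are wider than those for the mean.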
ASSUMPTIONS AND HOW TO CHECK THEM
Linear Regression Model: Assumptions
The model is developed assuming that:
• the observations of Y are collected independently;
• the mean value of the dependent variable Y is a linear function of the covariate X;
• for each value of α + β·X, the dependent variable is normally distributed with constant variance $\sigma^2$.
These are assumptions: they need to be checked.
If not fulfilled, you may need to consider
• using another form of the covariate;
Checking the Assumptions
Recall, according to the model,
$Y = \alpha + \beta \cdot x + \varepsilon$,
where ε is normally distributed with $N(0, \sigma^2)$.
We can estimate ε by
$e = y - (a + b \cdot x)$
These estimates are called residuals.
$\sum e^2 / (n-2)$ will give an estimate of $\sigma^2$ (the divisor is n-2 because two coefficients, a and b, are estimated).
If the assumptions are correct, the residuals should approximately have a normal distribution with mean 0.
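The residuals and the variance estimate can be computed as in the following sketch. It divides the residual sum of squares by n − 2, the usual choice when both the intercept and the slope are estimated from the data. The five data points are hypothetical, used only to make the example runnable.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residuals_and_variance(xs, ys):
    """Residuals e = y - (a + b x) and the estimate of sigma^2 with divisor n - 2."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sigma2_hat = sum(e ** 2 for e in residuals) / (len(xs) - 2)
    return residuals, sigma2_hat


# Hypothetical toy data (not from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
residuals, sigma2_hat = residuals_and_variance(xs, ys)

print("residuals:", [round(e, 2) for e in residuals])
print(f"estimated residual variance: {sigma2_hat:.3f}")
```

Least squares residuals from a model with an intercept always sum to 0, so their mean is 0 by construction; what needs checking is their spread and shape.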
Analysis of Residuals (1)
Plot the residuals against the observed
covariate values.
If the assumptions are met, the plot should be
Analysis of Residuals (2)
The plot of the residuals may reveal non-constant variance (heteroscedasticity).
It can also point towards a non-linear (w.r.t. the covariate values) relationship.
Blood Glucose & Vcf: Residuals
The plot looks
Checking Normality of Residuals
To this aim, the normal probability plot is used.
Standardized residuals (residual/st.error) are ordered and plotted against the values expected from the standard normal distribution.
The graph should look approximately linear.
One might have doubts in our example…
[Figure: normal probability plot of the residuals]
Linear Regression for Log-Vcf
Let us use ln(Vcf) as the dependent variable.
The model changes to
$\mu_{\ln(\mathrm{Vcf})} = \alpha + \beta \cdot (\text{glucose level})$
The estimated model is
$\widehat{\ln(\mathrm{Vcf})} = 0.115 + 0.015 \cdot (\text{glucose level})$
Model for Log-Vcf: Residuals (1)
No major problems in the residual plot.
[Figure: residual plot for the ln(Vcf) model; vertical axis: Residuals, from -0.4 to 0.4]
Model for Log-Vcf: Residuals
One might argue that the normal probability plot for the residuals looks better than for untransformed Vcf.
[Figure: normal probability plot; axes: Normal F[(lresid-m)/s] and Empirical P[i] = i/(N+1), both from 0.00 to 1.00]
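Interpretation of the Model for Log-Vcf
The model implies that
$\widehat{\ln(\mathrm{Vcf})} = 0.115 + 0.015 \cdot (\text{glucose level})$
It follows that, if blood glucose increases by 1 unit, then the mean value of ln(Vcf) increases by 0.015.
Upon taking $\mathrm{Vcf} \approx \exp(\ln(\mathrm{Vcf}))$,
$\mathrm{Vcf} = e^{0.115} \cdot e^{0.015 \cdot (\text{glucose level})} = e^{0.115} \cdot (1.015)^{(\text{glucose level})}$
We could conclude that the mean value of Vcf is multiplied by about 1.015, i.e., increases by about 1.5%, for each unit increase of blood glucose.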
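The back-transformation can be checked numerically. The sketch below uses the two coefficients of the fitted ln(Vcf) model; the constant and function names are chosen here for illustration.

```python
import math

A_LOG = 0.115   # intercept of the fitted ln(Vcf) model
B_LOG = 0.015   # slope per unit of blood glucose


def predicted_vcf(glucose: float) -> float:
    """Back-transformed prediction: exp of the fitted value on the log scale."""
    return math.exp(A_LOG + B_LOG * glucose)


# Multiplicative change in predicted Vcf per unit increase of glucose.
ratio_per_unit = math.exp(B_LOG)
percent_increase = 100.0 * (ratio_per_unit - 1.0)

print(f"ratio per unit of glucose: {ratio_per_unit:.4f}")
print(f"approximate increase: {percent_increase:.1f}%")
print(f"predicted Vcf at glucose 10 vs 11: {predicted_vcf(10):.3f} vs {predicted_vcf(11):.3f}")
```

On the original scale the effect of glucose is multiplicative: the ratio of predictions one unit apart is the same at every glucose level.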
Choice of the Transformation
Consider power transformations $x^s$ or $y^s$ (s = ..., -3, -2, -1, -½, 0 (= ln), ½, ...)
The circle of powers.
Choose the quadrant which most closely resembles the pattern of the data.
Increase or decrease the power of x or y (relative to 1) according to the indications.
• Example: for Quadrant II, take s < 1 for x or s > 1 for y.
Choice of the Transformation: Example
Data resemble the pattern of Quadrant III.
We might want to use s < 1 for x or y.
Let’s use ln(x) (i.e., s = 0).
The transformed scatterplot looks more linear.
COEFFICIENT OF DETERMINATION $R^2$
Different Sources of Variation
[Figure: “explained”, residual, and total variation shown for one point of a scatterplot of Y against X]
Add the squares of a particular component over all points.
Coefficient of Determination $R^2$
“Total” variation (sum of squares) = “Explained” + “Residual”
The larger the “explained” variation, the better the model “explains” the data.
$R^2$ = (“explained” sum of squares) / (“total” sum of squares)
Measures the proportion of variation that is explained by the independent variable X in the regression model.
The closer to 1, the better the model “explains” the data.
“Total” and “explained” sums of squares are considerably different: low $R^2$.
Coefficient of Determination $R^2$
“Total” and “explained” sums of squares are not very different.
Coefficient of Determination ($R^2$) and Correlation (r)
$R^2 = r^2$
$R^2 = 1 \rightarrow r = \pm 1$
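The identity between the coefficient of determination and the squared correlation can be verified on the Vcf summary values (untransformed Vcf against glucose). The sketch relies on the standard result that, for the least squares line, the “explained” sum of squares equals b² times the sum of squares of x; the variable names are chosen here.

```python
import math

# Summary values quoted for glucose (x) and untransformed Vcf (y).
SXX, SYY, SXY = 429.7, 1.19, 9.43

b = SXY / SXX                      # least squares slope
explained_ss = b ** 2 * SXX        # sum of squares of fitted values around the mean
total_ss = SYY                     # sum of squares of y around its mean
residual_ss = total_ss - explained_ss

r_squared = explained_ss / total_ss
r = SXY / math.sqrt(SXX * SYY)     # Pearson's correlation coefficient

print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```

For these data about 17% of the variation in Vcf is “explained” by glucose.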
[Figure: scatterplots of Y against X with all points on a straight line]
Coefficient of Determination ($R^2$) and Correlation (r)
$R^2 = 0.8$, $r = \pm 0.9$
[Figure: scatterplots of Y against X]
Coefficient of Determination ($R^2$) and Correlation (r)
$R^2 = 0$, $r = 0$
[Figure: scatterplot of Y against X]
Coefficient of Determination $R^2$
$R^2$ high → model fits well? WRONG!
• $R^2$ can be high for badly-fitting models.
• $R^2$ can be similar for good and bad models.
Fit of the model should be assessed from residuals.
Model fits well → $R^2$ meaningful!
Linear Regression for Log-Vcf
The estimated model:
$\widehat{\ln(\mathrm{Vcf})} = 0.115 + 0.015 \cdot (\text{glucose level})$
Acceptable fit.
$R^2 = 0.16$
Only 16% of the total variation in ln(Vcf) is explained by glucose level.
[Figure: scatterplot against glucose (horizontal axis from 5 to 20)]
Simple Linear Regression Model: Summary
Plot the data and check the relationship.
– if not linear, transform
Fit the model & check the assumptions (residual plots).
– linear relationship; homoscedasticity; normality
If not fulfilled, consider
• a transformation of the dependent variable;
• another form of the covariate;
• extra covariates (multiple regression).
Multiple Regression Models
More Than One Covariate
Adjusted $R^2$
Dummy Variables, Interactions
Polynomial Regression
Simple Linear Regression
For an individual observation of Y we assume that
$Y = \alpha + \beta \cdot x + \varepsilon$,
where ε is normally distributed with $N(0, \sigma^2)$.
Thus, we assume that the mean of Y is a linear function of X:
$\mu_Y(x) = \alpha + \beta \cdot x$
Multiple Linear Regression
Simple linear regression involves one covariate.
The model can be extended to the case of more than one covariate.
It is then called a multiple linear regression model.
Multiple Linear Regression
Assume we have two covariates: X and Z.
Let the dependent variable be normally distributed with $N(\mu(X, Z), \sigma^2)$, where the mean is given by
$\mu(X, Z) = \alpha + \beta_X \cdot X + \beta_Z \cdot Z$
Consider z and z+1. We get
$\mu(X, z+1) = \alpha + \beta_X \cdot X + \beta_Z \cdot (z+1) = \alpha + \beta_X \cdot X + \beta_Z \cdot z + \beta_Z = \mu(X, z) + \beta_Z$
Interpretation: the mean of Y changes linearly with X & Z. For a fixed value of X, a unit increase in Z changes the mean of Y by $\beta_Z$.
Multiple Linear Regression: Example (1)
Length vs. gestational age for a sample of 100 low birth weight infants born in Boston, Massachusetts.
A linear increase of mean length with gestational age may be postulated.
Not much influence of mother’s age.
Note a possible outlier.
Multiple Linear Regression: Example (2)
The model using gestational age and mother’s age as covariates:
$\widehat{\text{length}} = 9.0909 + 0.9361 \cdot (\text{gest. age}) + 0.0247 \cdot (\text{mother’s age})$
(SE = 0.1093 for gest. age; SE = 0.0463 for mother’s age)
For two mothers of the same age, if children are born 1 week apart, then the child born later is, on average, 0.9361 cm longer.
Inference for the coefficients:
Gestational age, 95% CI: [0.936 ± 1.985 · 0.109] = [0.720, 1.152]
• T = 0.936/0.109 = 8.562, p < 0.0005
• Statistically significant (at the 5% significance level) effect of the gestational age.
Mother’s age, 95% CI: [0.025 ± 1.985 · 0.046] = [-0.066, 0.116]
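Adjusted Coefficient of Determination $R^2$
The more covariates, the better the model will “explain” the data.
$R^2$ will increase.
Adjust it for the no. of covariates.
For instance
$R^2_{adj} = 1 - \dfrac{n-1}{n-q} \cdot (1 - R^2)$
with q taken here as the number of estimated coefficients (including the intercept).
Note that (n – 1) / (n – q) > 1 for q > 1, so $R^2_{adj} < R^2$.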
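A commonly used form of the adjustment is R²adj = 1 − (n − 1)/(n − q)·(1 − R²). The sketch below applies it to the birth length example, under the assumption (not confirmed by the slide) that q counts all estimated coefficients including the intercept, so q = 3 for a model with two covariates.

```python
def adjusted_r_squared(r2: float, n: int, q: int) -> float:
    """Adjusted R^2; q is assumed to be the number of estimated coefficients."""
    return 1.0 - (n - 1) / (n - q) * (1.0 - r2)


# Birth length model: R^2 = 0.4575, n = 100 infants,
# q = 3 (intercept, gestational age, mother's age) under the assumption above.
adj = adjusted_r_squared(0.4575, n=100, q=3)
print(f"R^2 = 0.4575, adjusted R^2 = {adj:.4f}")

# With only an intercept (q = 1) the factor (n - 1)/(n - q) is 1: no adjustment.
unchanged = adjusted_r_squared(0.5, n=10, q=1)
```

Because the factor (n − 1)/(n − q) is at least 1, the adjusted value never exceeds the unadjusted one, and the penalty grows with the number of coefficients.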
Multiple Linear Regression: Example (3)
The model using gestational age and mother’s age as covariates:
$\widehat{\text{length}} = 9.0909 + 0.9361 \cdot (\text{gest. age}) + 0.0247 \cdot (\text{mother’s age})$
(SE = 0.1093 for gest. age; SE = 0.0463 for mother’s age)
$R^2 = 0.4575$
STATA OUTPUT
Categorical Covariates
Often, a categorical variable is available as a covariate.
• Ordinal: non-, light, medium, or heavy smoker.
• Nominal: treatment group A, B or C.
To use it in a model, we would need to assign numerical scores to its levels.
• Problematic for a nominal variable.
• Should we assign equidistant scores (e.g., non-smoker=1; light=2; medium=3; heavy smoker=4)?
In such a case, dummy variables can be used.
Dummy Variables
For each of the G levels of the categorical covariate, define a binary indicator equal to 1 if the level is observed, and 0 otherwise.
We can then form the model with any G-1 dummy variables, e.g.,
$\mu_Y(x) = \alpha + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \dots + \beta_G \cdot x_G$
Note: $x_1$ is not in the model; the first category becomes the reference.
$\beta_i$ will be the change of the mean of the dependent variable due to being in category i rather than in the reference category.
Dummy Variables: Example
For “smoking”, we can construct the following dummy variables (exemplary patient data):
Original x     Dummy variables
               x1   x2   x3   x4
non-smoker     1    0    0    0
light          0    1    0    0
medium         0    0    1    0
heavy          0    0    0    1
We can then form the model
$\mu_Y(x) = \alpha + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \beta_4 \cdot x_4$
Note that $x_1$ is left out of the model (equivalently, $\beta_1 = 0$): non-smokers are the reference category.
non-smokers: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0 = \alpha$
light: $\mu_Y(x) = \alpha + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 0 = \alpha + \beta_2$
• $\beta_2$ is the change in the mean of Y for light smokers, as compared to non-smokers.
medium: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 0 = \alpha + \beta_3$
• $\beta_3$ is the change in the mean of Y for medium smokers, as compared to non-smokers.
heavy: $\mu_Y(x) = \alpha + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 1 = \alpha + \beta_4$
• $\beta_4$ is the change in the mean of Y for heavy smokers, as compared to non-smokers.
$\beta_3 - \beta_2$ is the change in the mean of Y for medium smokers, compared to light smokers.
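Dummy coding with a reference category can be sketched as follows. The level names follow the smoking example; the function names and the coefficient values are hypothetical, chosen only to make the arithmetic visible.

```python
LEVELS = ["non-smoker", "light", "medium", "heavy"]  # first level is the reference


def dummy_code(value: str, levels=LEVELS):
    """Return the G-1 dummy variables (x2, ..., xG); the reference level gives all zeros."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


def mean_outcome(value: str, alpha: float, betas) -> float:
    """Model mean: alpha + beta2*x2 + beta3*x3 + beta4*x4."""
    return alpha + sum(beta * x for beta, x in zip(betas, dummy_code(value)))


# Hypothetical coefficients, for illustration only.
ALPHA = 10.0
BETAS = [1.0, 2.5, 4.0]  # beta2 (light), beta3 (medium), beta4 (heavy)

for level in LEVELS:
    print(level, dummy_code(level), mean_outcome(level, ALPHA, BETAS))

# Medium vs. light smokers: difference of the means equals beta3 - beta2.
medium_vs_light = mean_outcome("medium", ALPHA, BETAS) - mean_outcome("light", ALPHA, BETAS)
```

Choosing a different reference level changes the individual coefficients but not the fitted means of the groups.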
Multiple Linear Regression: Example (4)
Adjusting the model for mother’s diagnosis of toxemia.
Using a dummy variable equal to 1 for toxemia, and 0 for non-toxemia.
The fitted model:
$\widehat{\text{length}} = 6.2843 + 1.0699 \cdot (\text{gest. age}) - 1.7774 \cdot (\text{toxemia})$
(SE = 0.1121 for gest. age; SE = 0.6940 for toxemia)
Inference for the coefficients:
Gestational age, 95% CI: [1.070 ± 1.985 · 0.112] = [0.848, 1.292]
• T = 1.0699/0.1121 = 9.544, p < 0.0005
• Statistically significant (at the 5% significance level) effect of the gestational age.
Toxemia, 95% CI: [-1.777 ± 1.985 · 0.694] = [-3.154, -0.399]
• T = -1.7774/0.6940 = -2.561, p = 0.012
Multiple Linear Regression: Example (5)
Mother without toxemia: $\widehat{\text{length}} = 6.2843 + 1.0699 \cdot (\text{gest. age})$
Mother with toxemia: $\widehat{\text{length}} = 6.2843 + 1.0699 \cdot (\text{gest. age}) - 1.7774$
The intercept is changed; the slope for gestational age remains the same.
The fitted regression lines are parallel.
Multiple Linear Regression With Interaction (1)
Interaction means that the effect of X on the mean value of Y depends on the level of Z.
A model allowing for an interaction:
$\mu(X, Z) = \alpha + \beta_X \cdot X + \beta_Z \cdot Z + \beta_{XZ} \cdot X \cdot Z$
Multiple Linear Regression With Interaction (2)
For X = x we get
$\mu(x, Z) = \alpha + \beta_X \cdot x + \beta_Z \cdot Z + \beta_{XZ} \cdot x \cdot Z$
For X = x+1 we have
$\mu(x+1, Z) = \alpha + \beta_X \cdot (x+1) + \beta_Z \cdot Z + \beta_{XZ} \cdot (x+1) \cdot Z$
$= \alpha + \beta_X \cdot x + \beta_X + \beta_Z \cdot Z + \beta_{XZ} \cdot x \cdot Z + \beta_{XZ} \cdot Z$
$= \mu(x, Z) + \beta_X + \beta_{XZ} \cdot Z$
Conclusion: the change due to a unit increase in X depends on Z.
Multiple Linear Regression: Example (6)
Model with the interaction between toxemia and gest. age.
$\widehat{\text{length}} = 6.2843 + 1.0584 \cdot (\text{age}) - 3.4771 \cdot (\text{tox}) + 0.0559 \cdot (\text{age}) \cdot (\text{tox})$
(SE = 0.1263 for age; SE = 8.5198 for tox; SE = 0.2795 for the interaction)
Mother without toxemia: $\widehat{\text{length}} = 6.2843 + 1.0584 \cdot (\text{age})$
Mother with toxemia: $\widehat{\text{length}} = (6.2843 - 3.4771) + (1.0584 + 0.0559) \cdot (\text{age})$
Interaction effect, 95% CI: [0.056 ± 1.985 · 0.279] = [-0.498, 0.610]
T = 0.0559/0.2795 = 0.200, p = 0.842
Not significant at the 0.05 level of significance.
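The effect of the interaction term on the slope can be made concrete with the fitted coefficients above. In this sketch the constant and function names are chosen for illustration, and gestational ages of 30 and 31 weeks are arbitrary example values.

```python
# Coefficients of the fitted interaction model quoted above.
A = 6.2843        # intercept
B_AGE = 1.0584    # gestational age
B_TOX = -3.4771   # toxemia (1 = yes, 0 = no)
B_INT = 0.0559    # age x toxemia interaction


def predicted_length(gest_age: float, toxemia: int) -> float:
    """Fitted mean length for a given gestational age and toxemia status."""
    return A + B_AGE * gest_age + B_TOX * toxemia + B_INT * gest_age * toxemia


def slope_for_age(toxemia: int) -> float:
    """Change in mean length per extra week of gestational age."""
    return B_AGE + B_INT * toxemia


for tox in (0, 1):
    one_week_gain = predicted_length(31, tox) - predicted_length(30, tox)
    print(f"toxemia = {tox}: slope {slope_for_age(tox):.4f}, gain from week 30 to 31: {one_week_gain:.4f}")
```

With an interaction the two fitted lines are no longer parallel, although here the difference between the slopes is small and not statistically significant.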
Multiple Linear Regression: Assumptions
The assumptions are similar to those for simple linear regression:
• the mean of the dependent variable is a linear function of the covariate(s);
• residual errors are normally distributed with constant variance.
The assumptions should be checked.
• Residual analysis as for simple linear regression.
Multiple Linear Regression: Example (7)
Length of low birth weight infants.
The residual plot for the model with gestational age and toxemia.
Residual against fitted value (i.e., the estimated linear combination of covariates).
Homoscedasticity looks plausible.
Polynomial Regression
Linear regression means that we consider a linear combination of covariates.
The covariates themselves can be non-linear functions.
For instance, consider the quadratic model
$\mu_Y(X) = \alpha + \beta_1 \cdot X + \beta_2 \cdot X^2$
Interpretation: the mean of Y changes as a quadratic function of X.
Quadratic Regression: Average Birth Weight
Non-linear dependence of the average birth weight on gestational age.
Quadratic function gives a very good result.
The fitted model
weight = -22.693
+1.2122
·
age
Generalized Linear Models
Linear regression is a member of the family of generalized linear models.
In these models, we assume that the dependent variable Y has a (not necessarily normal) distribution with mean $\mu_Y$.
It is further postulated that, for a particular function g,
$g(\mu_Y) = \alpha + \beta_1 \cdot X_1 + \dots + \beta_k \cdot X_k$
In linear regression, we specify that Y is normally distributed, and $g(\mu) = \mu$.
Other examples are logistic regression, Poisson regression, etc.
Ockham’s Razor
William of Ockham, 14th century:
“Pluralitas non est ponenda sine necessitate”
“Entities should not be multiplied unnecessarily”
Implies the use of parsimonious models.
• Containing as few parameters as possible.