Logistic regression presentation

(1)

AN INTRODUCTION TO LOGISTIC REGRESSION

ENI SUMARMININGSIH, SSI, MM
STATISTICS STUDY PROGRAM, DEPARTMENT OF MATHEMATICS

(2)

OUTLINE



• Introduction and Description

• Some Potential

(3)

INTRODUCTION AND DESCRIPTION



• Why use logistic regression?

• Estimation by maximum likelihood

• Interpreting coefficients

• Hypothesis testing

• Evaluating the performance of the model

(4)

WHY USE LOGISTIC REGRESSION?



• There are many important research topics for which the dependent variable is "limited."

• For example, voting, morbidity or mortality, and participation data are not continuous or normally distributed.

• Binary logistic regression is a type of regression analysis where the dependent variable is a dummy

(5)

THE LINEAR PROBABILITY MODEL

In the OLS regression:

Y = α + βX + e, where Y is 0 or 1

• The error terms are heteroskedastic

• e is not normally distributed because Y takes on only two values

• The predicted probabilities can be

(6)

You are a researcher interested in understanding the effect of smoking and weight on resting pulse rate. Because you have categorized the response, pulse rate, into low and high, a binary logistic regression analysis is appropriate for investigating the effects of smoking and weight on pulse rate.

(7)

THE DATA

RestingPulse  Smokes  Weight
Low           No      140
Low           No      145
Low           Yes     160
Low           Yes     190
Low           No      155
Low           No      165
High          No      150
Low           No      190
Low           No      195
⁞             ⁞       ⁞
Low           No      110
High          No      150

(8)

OLS RESULTS

Regression Analysis: Tekanan Darah versus Weight, Merokok

(Variable names are in Indonesian: "Tekanan Darah" means "blood pressure" and "Merokok" means "smokes".)

The regression equation is

Tekanan Darah = 0.745 - 0.00392 Weight + 0.210 Merokok

Predictor  Coef       SE Coef   T      P
Constant   0.7449     0.2715    2.74   0.007
Weight     -0.003925  0.001876  -2.09  0.039
Merokok    0.20989    0.09626   2.18   0.032

(9)

PROBLEMS:

Predicted values outside the [0, 1] range

Descriptive Statistics: FITS1

Variable  N  N*  Mean  StDev  Minimum  Q1  Median  Q3  Maximum
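
The out-of-range problem can be checked directly from the fitted OLS equation reported on the previous slide. The sketch below plugs a few weights into that equation. It assumes Merokok is coded 1 for smokers and 0 for non-smokers, which the output does not state, and the function name is only illustrative.

```python
def lpm_fitted(weight, smokes):
    """Fitted value from the reported OLS equation.

    Assumption: smokes is coded 1 = smoker, 0 = non-smoker.
    """
    return 0.745 - 0.00392 * weight + 0.210 * smokes

# Weights taken from rows visible in the data listing (all non-smokers).
for weight in (110, 150, 195):
    fit = lpm_fitted(weight, smokes=0)
    status = "inside [0, 1]" if 0.0 <= fit <= 1.0 else "outside [0, 1]"
    print(f"Weight {weight}: fitted value {fit:.3f} ({status})")
```

With these coefficients, a non-smoker heavier than about 190 pounds gets a negative fitted value, which cannot be read as a probability.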

(10)

HETEROSKEDASTICITY

[Figure: residuals (RESI1) plotted against Weight]

(11)

THE LOGISTIC REGRESSION MODEL

The "logit" model solves these problems:

ln[p/(1-p)] = α + βX + e

• p is the probability that the event Y occurs, p(Y=1)

• p/(1-p) is the odds

• ln[p/(1-p)] is the log odds, or "logit"

(12)

More:



• The logistic distribution constrains the estimated probabilities to lie between 0 and 1.

• The estimated probability is:

p = 1/[1 + exp(-α - βX)]

• If α + βX = 0, then p = .50

• As α + βX gets very large, p approaches 1

• As α + βX gets very negative, p approaches 0
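
The limiting behaviour listed on this slide can be seen numerically. The following is a minimal sketch in which `z` stands for the whole linear predictor (intercept plus slope times X); the function name and the sample values of `z` are chosen here for illustration only.

```python
import math

def logistic(z):
    """Return p = 1 / (1 + exp(-z)), where z is the linear predictor."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form exp(z) / (1 + exp(z)),
    # which avoids overflow when z is very negative.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-10, -2, 0, 2, 10):
    print(f"z = {z:>3}: p = {logistic(z):.4f}")
```

Whatever value the linear predictor takes, the result stays between 0 and 1, which is the property the linear probability model lacks.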

(13)

(14)

COMPARING LP AND LOGIT MODELS

[Figure: LP model versus logit model, with the vertical axis marked at 0 and 1]

(15)

MAXIMUM LIKELIHOOD ESTIMATION (MLE)



MLE is a statistical method for
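
As a rough illustration of maximum likelihood for the logit model with one predictor, the sketch below maximizes the log-likelihood with Newton-Raphson steps. Newton-Raphson is one common choice and is an assumption here, since the slides do not say which algorithm is used. The data are hypothetical toy values, not the pulse data, and `alpha` and `beta` are names assumed for the intercept and slope.

```python
import math

# Hypothetical toy data (not from the slides): predictor x, binary outcome y.
XS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
YS = [0, 0, 1, 0, 1, 1]

def prob(alpha, beta, x):
    """Estimated probability p = 1 / (1 + exp(-(alpha + beta * x)))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def log_likelihood(alpha, beta, xs, ys):
    """Sum over observations of y*ln(p) + (1 - y)*ln(1 - p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob(alpha, beta, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logit(xs, ys, iterations=25):
    """Newton-Raphson estimate of (alpha, beta) for a one-predictor logit."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = prob(alpha, beta, x)
            w = p * (1.0 - p)
            g0 += y - p            # score with respect to alpha
            g1 += (y - p) * x      # score with respect to beta
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score vector.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

alpha_hat, beta_hat = fit_logit(XS, YS)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
print(f"log-likelihood = {log_likelihood(alpha_hat, beta_hat, XS, YS):.3f}")
```

The fixed iteration count keeps the sketch short; a fuller version would stop when the change in the log-likelihood falls below a tolerance.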

(16)

INTERPRETING COEFFICIENTS



Since:

ln[p/(1-p)] = α + βX + e

The slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes.

(17)



An interpretation of the logit coefficient which is usually more intuitive is the "odds ratio"

Since:

[p/(1-p)] = exp(α + βX)

exp(β) is the effect of the

(18)

FROM MINITAB OUTPUT:

Although there is evidence that the estimated coefficient for Weight is not zero, the odds ratio is very close to one (1.03), indicating that a one-pound increase in weight minimally affects a person's resting pulse rate.

Given that subjects have the same weight, the odds ratio can be interpreted as the odds of smokers in the sample having a low pulse being 30% of the odds of non-smokers having a low pulse.
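
To make the two odds ratios quoted above more concrete, the sketch below uses only the reported values 1.03 (Weight) and 0.30 (smokers versus non-smokers). The 10-pound change and the baseline probability of 0.80 are hypothetical inputs chosen for illustration, and the helper names are not from the slides.

```python
import math

WEIGHT_ODDS_RATIO = 1.03   # reported odds ratio for a one-pound increase
SMOKES_ODDS_RATIO = 0.30   # reported odds ratio, smokers versus non-smokers

def odds_multiplier(odds_ratio, units):
    """Factor by which the odds change when the predictor rises by `units`."""
    return odds_ratio ** units

def shift_probability(p, odds_ratio):
    """Apply an odds ratio to a probability p and return the new probability."""
    new_odds = odds_ratio * p / (1.0 - p)
    return new_odds / (1.0 + new_odds)

# The logit coefficient implied by an odds ratio is its natural log.
print(f"Implied Weight coefficient: {math.log(WEIGHT_ODDS_RATIO):.4f}")

# Hypothetical 10-pound difference: the odds are multiplied by 1.03 ** 10.
print(f"Odds multiplier for +10 lb: {odds_multiplier(WEIGHT_ODDS_RATIO, 10):.2f}")

# Hypothetical baseline: a non-smoker with a 0.80 probability of a low pulse.
print(f"Smoker, same weight: {shift_probability(0.80, SMOKES_ODDS_RATIO):.3f}")
```

An odds ratio close to one per pound can still add up over a larger weight difference, which is why the unit of the predictor matters when reading the table.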

Logistic Regression Table

                                             Odds   95% CI
Predictor  Coef      SE Coef  Z      P       Ratio  Lower  Upper
Constant   -1.98717  1.67930  -1.18  0.237
Smokes

(19)

HYPOTHESIS TESTING

• The Wald statistic for the β coefficient is:

Wald = Z² = [β / s.e.(β)]²

which is distributed chi-square with 1 degree of freedom.

• The last Log-Likelihood from the maximum likelihood iterations is displayed along with the statistic G. This statistic tests the null hypothesis that all the coefficients associated with predictors equal zero versus these coefficients not all being equal to zero.
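
The Wald calculation can be reproduced from a coefficient and its standard error. The sketch below uses the Constant row of the Minitab table (Coef -1.98717, SE Coef 1.67930); the function name is illustrative. It relies on the fact that the upper tail of a chi-square with 1 degree of freedom at Z² equals the two-sided normal tail at |Z|.

```python
import math

def wald_test(coef, se):
    """Return (Z, Wald, p-value) for the null hypothesis that coef is zero."""
    z = coef / se
    wald = z ** 2
    # Chi-square(1) upper tail at z**2, written via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, wald, p_value

z, wald, p_value = wald_test(-1.98717, 1.67930)
print(f"Z = {z:.2f}, Wald = {wald:.2f}, p = {p_value:.3f}")
```

Rounded to the precision of the table, the Z and p-value should agree with the Z and P columns shown for the Constant row.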

(20)

EVALUATING THE PERFORMANCE OF THE MODEL

The Goodness-of-Fit Tests table displays Pearson, deviance, and Hosmer-Lemeshow goodness-of-fit tests. If the p-value is less than your accepted α-level, the test rejects the null hypothesis of an adequate fit.

(21)

MULTICOLLINEARITY



• The presence of multicollinearity will not lead to biased coefficients.

• But the standard errors of the coefficients will be inflated.

• If a variable which you think should be statistically significant is not, consult the correlation coefficients.

• If two variables are correlated at a rate greater than .6, .7, .8, etc. then try dropping the least theoretically
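
A minimal sketch of the correlation check described on this slide follows. It computes the Pearson correlation between two predictors using only the rows visible in the data listing, so the result describes that excerpt and not the full data set. Coding Smokes as 1 for "Yes" and 0 for "No" is an assumption, and 0.6 is simply the lowest of the cut-offs the slide mentions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Only the rows shown in the data listing; the elided rows are not available.
weight = [140, 145, 160, 190, 155, 165, 150, 190, 195, 110, 150]
smokes = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # assumed coding: Yes = 1, No = 0

r = pearson_r(weight, smokes)
flag = "check for multicollinearity" if abs(r) > 0.6 else "below the 0.6 cut-off"
print(f"r(Weight, Smokes) = {r:.2f} ({flag})")
```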