3.8: Statistical tools for data analysis
3.8.5 Estimation Method
In this section, we introduce the Maximum Likelihood Estimation (MLE) technique used to estimate parameters in the survey logistic regression model. Because of complex sampling design properties such as unequal probability of selection, clustering and stratification, standard maximum likelihood estimation may not work properly unless these features are taken into account.
Parameter estimation
In this sub-subsection we derive expressions for the maximum likelihood estimators in a typical survey logistic regression. Assume that the outcome variable $y_{ijh}$ follows a Bernoulli distribution with density function

$$g(Y_{ijh}=y_{ijh}) = \pi_{ijh}^{y_{ijh}}(1-\pi_{ijh})^{1-y_{ijh}} \qquad (3.78)$$

The mean and variance of $y_{ijh}$ are, respectively,

$$\mu_{ijh} = \frac{\exp[x'_{ijh}\beta]}{1+\exp[x'_{ijh}\beta]} \qquad (3.79)$$

and

$$\sigma^2_{ijh} = \mu_{ijh}(1-\mu_{ijh}) \qquad (3.80)$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)'$ denotes the vector of parameters. The log-likelihood function that forms the basis for maximum likelihood estimation is given by
$$\ell = \sum_{h=1}^{H}\sum_{j=1}^{n_h}\sum_{i=1}^{m_{hj}} w_{ijh}\vartheta_{ijh}\left[y_{ijh}\log(\mu_{ijh}) + (1-y_{ijh})\log(1-\mu_{ijh})\right] \qquad (3.81)$$
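The quantities in (3.79)–(3.81) can be sketched numerically. The following is a minimal illustration, assuming the product of the sampling weights $w_{ijh}\vartheta_{ijh}$ is collapsed into a single array; the array names and data are assumptions for illustration, not part of the text:

```python
import numpy as np

def weighted_loglik(X, y, beta, w):
    """Weighted Bernoulli log-likelihood of eq. (3.81), with the product of
    the sampling weights w_ijh * vartheta_ijh collapsed into one array w."""
    eta = X @ beta                           # linear predictor x'_ijh beta
    mu = np.exp(eta) / (1.0 + np.exp(eta))   # mean, eq. (3.79)
    sigma2 = mu * (1.0 - mu)                 # variance, eq. (3.80)
    ell = np.sum(w * (y * np.log(mu) + (1 - y) * np.log(1 - mu)))
    return ell, mu, sigma2

# Tiny assumed example: intercept plus one covariate, four observations,
# unequal sampling weights.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
w = np.array([1.5, 0.8, 1.2, 1.0])
ell, mu, sigma2 = weighted_loglik(X, y, np.array([0.1, -0.2]), w)
```

Each observation contributes its Bernoulli log-density scaled by its weight, so the sum reduces to the unweighted log-likelihood when all weights equal one.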
Substituting the mean $\mu_{ijh}$ into this expression, we obtain

$$\ell = \sum_{h=1}^{H}\sum_{j=1}^{n_h}\sum_{i=1}^{m_{hj}} w_{ijh}\vartheta_{ijh}\left[y_{ijh}\log\left(\frac{\exp[x'_{ijh}\beta]}{1+\exp[x'_{ijh}\beta]}\right) + (1-y_{ijh})\log\left(1-\frac{\exp[x'_{ijh}\beta]}{1+\exp[x'_{ijh}\beta]}\right)\right] \qquad (3.82)$$

To obtain the unknown parameters we differentiate the log-likelihood with respect to $\beta$, which gives the following equation
$$\frac{\partial\ell}{\partial\beta} = \sum_{h=1}^{H}\sum_{j=1}^{n_h}\sum_{i=1}^{m_{hj}} w_{ijh}\vartheta_{ijh}\,\frac{\exp[x'_{ijh}\beta]}{\left(1+\exp[x'_{ijh}\beta]\right)^2}\left[\frac{y_{ijh}}{\mu_{ijh}} - \frac{1-y_{ijh}}{1-\mu_{ijh}}\right]x_{ijh} \qquad (3.83)$$

$$= \sum_{h=1}^{H}\sum_{j=1}^{n_h}\sum_{i=1}^{m_{hj}} w_{ijh}\vartheta_{ijh}\,A'_{ijh}\left[\sigma^2(y_{ijh})\right]^{-1}\left(y_{ijh}-\mu_{ijh}\right) \qquad (3.84)$$

where $A_{ijh} = \mu_{ijh}(1-\mu_{ijh})\,x'_{ijh}$.
The Fisher information matrix for the parameters of the Bernoulli model follows as

$$\omega = -E\left[\frac{\partial^2\ell}{\partial\beta\,\partial\beta'}\right] \qquad (3.85)$$

$$= \sum_{h=1}^{H}\sum_{j=1}^{n_h}\sum_{i=1}^{m_{hj}} D\,\frac{b}{c^2}\,x_{ijh}x'_{ijh} \qquad (3.86)$$

where $D = w_{ijh}\vartheta_{ijh}$, $b = \exp[x'_{ijh}\beta]$ and $c = 1+\exp[x'_{ijh}\beta]$.
Simplifying the previous equation gives the following expression, which is referred to as the Fisher information
$$\omega = \sum_{h=1}^{H}\sum_{j=1}^{n_h}\sum_{i=1}^{m_{hj}} w_{ijh}\vartheta_{ijh}\,A'_{ijh}\left[\sigma^2(y_{ijh})\right]^{-1}A_{ijh} \qquad (3.87)$$
Tests for goodness of fit

The Likelihood Ratio Test
The likelihood ratio (LR) test evaluates the significance of the joint effect of all the variables in the survey logistic regression procedure. The likelihood ratio compares the model with multiple parameters to the intercept-only model. Suppose the model contains $s$ explanatory effects. For the $i$th observation, let $\hat{\pi}_i$ be the estimated probability of the observed response. The $-2\log$-likelihood statistic is given by:
$$-2\log L = -2\sum_{i} w_i f_i \log(\hat{\pi}_i) \qquad (3.88)$$
where $w_i$ and $f_i$ are the weight and frequency values, respectively, of the $i$th observation. For binary response models that use the events/trials syntax, this is equivalent to
$$-2\log L = -2\sum_{i} w_i f_i\left[r_i\log(\hat{\pi}_i) + (n_i-r_i)\log(1-\hat{\pi}_i)\right] \qquad (3.89)$$
where $r_i$ is the number of events, $n_i$ is the number of trials, and $\hat{\pi}_i$ is the estimated event probability. The likelihood ratio compares the log-likelihoods of the two models; if the difference is statistically significant, the full model fits better than the restricted (intercept-only) model. A significant likelihood ratio test therefore means that the variables in the full model jointly improve on the intercept-only model. The likelihood ratio test has the following null and alternative hypotheses:
$$H_0: \beta_i = 0 \quad \text{for } i = 1, 2, \ldots, p$$
$$H_1: \text{not all } \beta_i = 0$$
The test statistic of the likelihood ratio test follows a chi-square distribution with $p$ degrees of freedom (Prempeh, 2009). The log-likelihood ratio is defined as the difference between the deviance of the null model and the deviance of the model with explanatory variable(s).
$$\text{Log-likelihood ratio} = Dv_{null} - Dv_{p-1} \qquad (3.90)$$

where $Dv_{null}$ is the deviance of the model with just the constant and $Dv_{p-1}$ is the deviance of the model with $p-1$ parameters.
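Equation (3.90) can be sketched directly from the fitted log-likelihoods, since the deviance drop equals twice the log-likelihood gain. A minimal sketch, assuming the two log-likelihood values come from already-fitted models (the numerical values below are hypothetical, for illustration only):

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_null, loglik_full, p):
    """LR statistic of eq. (3.90): the drop in deviance between the
    intercept-only model and a model with p extra parameters, referred
    to a chi-square distribution with p degrees of freedom."""
    lr = 2.0 * (loglik_full - loglik_null)   # = Dv_null - Dv_{p-1}
    p_value = chi2.sf(lr, df=p)              # upper-tail chi-square prob.
    return lr, p_value

# Hypothetical fitted log-likelihoods, for illustration only.
lr, p_value = likelihood_ratio_test(loglik_null=-350.0, loglik_full=-340.0, p=3)
```

A small p-value leads to rejecting the global null, i.e. the explanatory effects jointly improve on the intercept-only model.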
Wald Test
The Wald test is an additional test that can be used to evaluate the significance of individual parameters in the survey logistic regression procedure. The expression for calculating the Wald statistic is given by
$$Wald = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} \qquad (3.91)$$
where $\hat{\beta}_i$ is the regression coefficient estimate of an explanatory variable and $SE(\hat{\beta}_i)$ is the standard error of the corresponding regression coefficient estimate.
According to Rana et al. (2010), the squared value of the Wald statistic is chi-square distributed with one degree of freedom:

$$(Wald)^2 = \left(\frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}\right)^2 \qquad (3.92)$$
The Wald statistic test has the following null and alternative hypotheses:
$$H_0: \beta_i = 0$$
$$H_1: \beta_i \neq 0$$
When testing one parameter at a time, the Wald statistic follows the $\chi^2$ distribution with 1 degree of freedom. The null hypothesis is rejected if the p-value $< \alpha = 0.05$, where $\alpha$ is the level of significance. A regression coefficient estimate whose Wald p-value is less than 0.05 indicates that the variable is important in the current model.
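The computation in (3.91)–(3.92) is a one-liner; the sketch below makes the p-value step explicit (the coefficient and standard error values are hypothetical, chosen only to illustrate the mechanics):

```python
from scipy.stats import chi2

def wald_test(beta_hat, se_beta_hat):
    """Wald statistic of eq. (3.91) for a single coefficient; its square,
    eq. (3.92), is referred to a chi-square with 1 degree of freedom."""
    wald = beta_hat / se_beta_hat
    p_value = chi2.sf(wald**2, df=1)
    return wald, p_value

# Hypothetical estimate and standard error, for illustration only.
wald, p_value = wald_test(beta_hat=0.8, se_beta_hat=0.25)
```

Here a p-value below 0.05 would lead to rejecting $H_0: \beta_i = 0$ for that coefficient.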
Akaike's Information Criterion (AIC)
Akaike's Information Criterion (AIC) measures the relative quality of each model fitted in the survey logistic regression and can be used to select the best model. The AIC is not informative for a single model in isolation; it is useful only for comparing different models fitted to the same data. The formula for calculating AIC is:
$$AIC = -2L(\hat{\beta}) + 2p \qquad (3.93)$$

where $L(\hat{\beta})$ is the maximized value of the log-likelihood function and $p$ is the number of parameters in the model. For the generalized logit model, $p = k(m+1)$, where $k$ is the total number of response levels minus one and $m$ is the number of explanatory effects. The model with the smallest AIC value is the most preferable model to use for analysis.
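The model-selection rule in (3.93) can be sketched as follows; the candidate models and their log-likelihood values below are assumptions for illustration, not fitted results from the text:

```python
def aic(loglik, p):
    """AIC of eq. (3.93): -2 * maximized log-likelihood + 2 * number of
    parameters. Smaller is better when comparing models on the same data."""
    return -2.0 * loglik + 2.0 * p

# Hypothetical candidate models (log-likelihoods assumed, for illustration).
models = {
    "intercept only": aic(-350.0, 1),
    "full model": aic(-340.0, 4),
}
best = min(models, key=models.get)  # model with the smallest AIC
```

The 2p term penalizes extra parameters, so a richer model is preferred only when its likelihood gain outweighs the penalty.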