
3.5: Non-Bayesian data augmentation techniques

3.5.1 Maximum likelihood method

Given the current value θ(t) of the original parameter θ at iteration t, the data augmentation algorithm simulates new values using the following two steps:

Imputation (I) Step:

Simulate multiple values of the missing data as independent samples x_mv^1, x_mv^2, ..., x_mv^k from the predictive distribution P(x_mv | x_ov), using the current i-th approximation P_i(θ | x_ov) to the posterior.

Posterior (P) Step:

The (i+1)-th approximation to the posterior distribution P(θ | x_ov) is obtained approximately as the average of P(θ | x_ov, x_mv) over the missing data that were imputed in the imputation step:

P_{i+1}(θ | x_ov) = (1/k) ∑_{j=1}^{k} P(θ | x_ov, x_mv^j)    (3.14)

Because the missing data x_mv^1, x_mv^2, ..., x_mv^k are generated multiple times, the procedure is often called multiple imputation (D. B. Rubin, 1987b). This iterative procedure can be shown to eventually converge to a draw from the joint distribution P(x_mv, θ | x_ov) as i → ∞.

The value of k need not be very large; in fact, with k = 1 the DA algorithm reduces to a special case of the Gibbs sampler.
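To make the two steps concrete, the following minimal Python sketch iterates the I-step and the P-step for an illustrative toy model, x_i ~ N(θ, 1) with a conjugate N(0, 100) prior (the model, prior, and all names are assumptions made for this example, not taken from the text). Drawing θ from the mixture in (3.14) amounts to picking one imputed data set at random and then drawing from its complete-data posterior.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setting: observed values x_ov and n_mis missing values,
    # with model x_i ~ N(theta, 1) and prior theta ~ N(0, 100).
    x_ov = np.array([1.2, 0.7, 1.9, 1.1])
    n_mis = 3
    k = 20        # imputations per I-step

    def draw_theta(x_complete):
        # Complete-data posterior P(theta | x_ov, x_mv) under the
        # conjugate normal-normal model.
        n = len(x_complete)
        post_var = 1.0 / (n + 1.0 / 100.0)
        post_mean = post_var * x_complete.sum()
        return rng.normal(post_mean, np.sqrt(post_var))

    theta = x_ov.mean()                       # starting value
    for _ in range(500):
        # I-step: k independent draws of the missing data from
        # P(x_mv | x_ov, theta) at the current theta.
        imputations = rng.normal(theta, 1.0, size=(k, n_mis))
        # P-step: draw theta from the mixture average in (3.14) by
        # selecting one imputed data set uniformly at random.
        j = rng.integers(k)
        theta = draw_theta(np.concatenate([x_ov, imputations[j]]))

    print("posterior draw of theta:", theta)

With k = 1 the loop collapses to the two-block Gibbs sampler mentioned above.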

The maximum likelihood (ML) estimation procedure is commonly available in statistical software packages such as SPSS, SAS, S-Plus, R, AMOS and EQS. The advantage of the ML procedure is that it handles missing data adequately and simply: when the sample is large enough, it is efficient (small standard errors), and under repeated sampling the parameter estimates approximately follow a normal distribution (these estimates can be used for computing confidence intervals and p-values). However, this holds only when the missing data mechanism is ignorable (i.e. MAR), in which case ML gives unbiased estimates (Allison, 2002; Arbuckle, 1996).

The complete data model for covariates

Let x be a full covariate data matrix of dimension n × p whose rows are independent and identically distributed, and let x_i denote the i-th row of x, i = 1, 2, 3, ..., n. Since x_1, x_2, ..., x_n are independent random samples, assume that the probability distribution of each row depends on the parameter of interest (the unknown parameter) θ. Given the independently observed values x_1, x_2, ..., x_n and the parameter, the joint probability density function is f(x_1, x_2, ..., x_n | θ), and the probability density of the complete covariate data can be written as

P(x | θ) = ∏_{i=1}^{n} f(x_i | θ)    (3.15)

where the x_i are independent and identically distributed and f is the density function of a single row. Maximizing (3.15) is equivalent to maximizing the log-likelihood, since the logarithm is an increasing function:

L(θ) = ∑_{i=1}^{n} log f(x_i | θ)    (3.16)
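To illustrate maximizing (3.16), the sketch below assumes (purely for the example) a univariate normal density f with parameters μ and σ, simulates a complete covariate column, and maximizes the log-likelihood numerically with scipy:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    x = rng.normal(2.0, 1.5, size=200)   # complete covariate data (p = 1)

    def neg_log_lik(params):
        # Negative of (3.16) for f(x_i | theta) = N(mu, sigma^2);
        # optimizing log(sigma) keeps sigma positive.
        mu, log_sigma = params
        return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

    res = minimize(neg_log_lik, x0=[0.0, 0.0])
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    print(mu_hat, sigma_hat)             # close to (2.0, 1.5)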

The distribution f is classified into three classes:

1. the multivariate normal distribution, see Schafer (1997);

2. the multinomial model for categorical variables, which also includes log-linear models;

3. a class of models for mixed normal and categorical variables; see the review in (Little & Schluchter, 1985; W. J. Krzanowski, 1980; W. Krzanowski, 1982).

The ML technique is restricted to certain assumptions; however, if the assumptions mentioned above hold, the procedure for missing data generates parameter estimates with the following properties associated with maximum likelihood: consistency, asymptotic efficiency and asymptotic normality (Allison, 2002). Consistency means that the larger the sample, the more likely the parameter estimates are to be unbiased. Asymptotic efficiency means that the computed parameter estimates have minimal standard errors. Asymptotic normality is a very useful property of maximum likelihood, since it allows the normal distribution to approximate the sampling distribution of the estimates in order to obtain confidence intervals and p-values for the statistical analysis. Handling missing data requires the data to be normally distributed so that the maximum likelihood method can produce adequate estimates of standard errors that take into account all data observations, including unobserved data.
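Continuing the sketch above (and reusing its res object), asymptotic normality yields approximate standard errors and confidence intervals from the inverse of the observed information; as a rough stand-in for the information matrix, this example uses the inverse-Hessian approximation returned by scipy's BFGS minimizer, which is an assumption made for illustration only:

    # Approximate standard errors from the BFGS inverse-Hessian estimate.
    se = np.sqrt(np.diag(res.hess_inv))
    # Approximate 95% confidence interval for mu via asymptotic normality.
    ci_mu = (res.x[0] - 1.96 * se[0], res.x[0] + 1.96 * se[0])
    print("approximate 95% CI for mu:", ci_mu)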

Maximum likelihood function with missing data

Maximum likelihood with missing data always assumes that the missing data mechanism is ignorable, i.e. an MAR process (Allison, 2002). Under this assumption the ML method can easily be described. When some data are missing from the original data, the likelihood is obtained by summing the complete-data likelihood over all possible values of the missing observations. The maximum likelihood (ML) technique for missing values can be described mathematically following the literature in Graham (2009). Let x, y be two discrete variables and f(x, y | θ) their joint distribution for n independent observations, where θ is the vector of unknown parameters that governs the distribution of the given discrete variables. If the data are complete (no missing data) and the observations are independent and identically distributed, the likelihood function is given by

L(θ) = ∏_{i=1}^{n} f(x_i, y_i | θ)    (3.17)

To obtain good maximum likelihood estimates, we must find the values of the unknown parameter vector θ that make the likelihood function as large as possible. Now suppose that missing values arise in the original data and are handled by maximum likelihood: the data are missing at random, with only x measured for the first k observations (y is missing) and only y measured for the remaining (n − k) observations (Allison, 2002).

The marginal distribution of x, obtained by summing over y, is denoted by

g(x | θ) = ∑_y f(x, y | θ)    (3.18)

and the marginal distribution of y, obtained by summing over x, is denoted by

h(y | θ) = ∑_x f(x, y | θ)    (3.19)
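A small numerical check of (3.18) and (3.19): for a hypothetical joint probability table f(x, y | θ) with x ∈ {0, 1} and y ∈ {0, 1, 2}, the marginals g and h are simply the row and column sums.

    import numpy as np

    # Hypothetical joint pmf f(x, y | theta): rows index x in {0, 1},
    # columns index y in {0, 1, 2}; entries sum to 1.
    f = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.25, 0.20]])

    g = f.sum(axis=1)   # g(x | theta) = sum over y, as in (3.18)
    h = f.sum(axis=0)   # h(y | theta) = sum over x, as in (3.19)
    print(g, h)         # [0.4 0.6] and [0.25 0.45 0.3]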

If the variables with missing values are continuous, the summation signs are replaced with integral signs. The marginal distribution of x is then obtained by integrating over y:

g(x | θ) = ∫ f(x, y | θ) dy    (3.20)

and the marginal distribution of y is obtained by integrating over x:

h(y | θ) = ∫ f(x, y | θ) dx    (3.21)
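The integrals in (3.20) and (3.21) can be verified numerically. The sketch below assumes, for illustration only, a bivariate normal joint density with correlation 0.5 and integrates it over y at a fixed x, recovering the known standard normal marginal:

    import numpy as np
    from scipy import integrate
    from scipy.stats import multivariate_normal, norm

    cov = np.array([[1.0, 0.5],
                    [0.5, 1.0]])

    def f(x, y):
        # Illustrative joint density f(x, y | theta).
        return multivariate_normal.pdf([x, y], mean=[0.0, 0.0], cov=cov)

    # g(x | theta) = integral of f(x, y | theta) over y, as in (3.20).
    g_at_1, _ = integrate.quad(lambda y: f(1.0, y), -np.inf, np.inf)
    print(g_at_1, norm.pdf(1.0))   # both approximately 0.2420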

The likelihood function for the full data set becomes:

L(θ) = ∏_{i=1}^{k} g(x_i | θ) · ∏_{i=k+1}^{n} h(y_i | θ)    (3.22)

Several methods are available to solve this optimization problem, compute the parameter estimates of the model, and make inferences from the analysis.

Maximum likelihood estimation finds the values of the unknown parameter vector θ that make the likelihood function as large as possible, just as in (3.17). The likelihood (3.22) is separated into two parts that correspond to the two different missing data designs; the likelihood for each design is found by summing the joint probability distribution over all possible values of the variable with missing data, and where the variables are continuous the summation signs are replaced with integral signs, as in the sketch below.
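As a concrete instance of maximizing (3.22), the sketch below assumes (for the example only) a bivariate normal model with known unit variances, so that the marginals are g(x | θ) = N(μ_x, 1) and h(y | θ) = N(μ_y, 1); the simulated data and all names are illustrative:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    x_only = rng.normal(1.0, 1.0, size=60)    # first k cases: only x observed
    y_only = rng.normal(-0.5, 1.0, size=40)   # remaining n - k: only y observed

    def neg_log_lik(theta):
        # Negative log of (3.22): marginal g for the x-only cases
        # plus marginal h for the y-only cases.
        mu_x, mu_y = theta
        return -(norm.logpdf(x_only, mu_x, 1.0).sum()
                 + norm.logpdf(y_only, mu_y, 1.0).sum())

    res = minimize(neg_log_lik, x0=[0.0, 0.0])
    print(res.x)   # close to (1.0, -0.5)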

To use maximum likelihood for missing data, the joint distribution must be known for all variables included in the model; otherwise parameter estimation is not possible. The characteristics mentioned above are therefore needed in order to implement maximum likelihood for handling missing data. For categorical variables in the model, two models are commonly used: the unrestricted multinomial model or a log-linear model, each with limitations that depend on the given data (McKnight et al., 2007). For continuous variables in the model, it is typical to assume that the model is multivariate normal (Allison, 2000). This implies that each variable in the model is normally distributed and can be expressed as a linear function of the other variables, with homoscedastic error terms of mean zero. These assumptions commonly form the basis for multivariate analysis and linear structural equation modelling in statistics. Under the multivariate normal assumption, the maximum likelihood function can be maximized using the expectation-maximization (EM) algorithm or direct maximum likelihood (McKnight et al., 2007), as sketched below.
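The following minimal EM sketch works under the multivariate normal assumption, for a bivariate normal with a covariance matrix assumed known and about 30% of y missing (all numbers and names are illustrative). The E-step replaces each missing y with its conditional expectation given x; the M-step re-estimates the mean vector from the completed data.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    Sigma = np.array([[1.0, 0.5],
                      [0.5, 1.0]])            # covariance assumed known
    data = rng.multivariate_normal([1.0, 2.0], Sigma, size=n)
    x, y = data[:, 0], data[:, 1].copy()
    miss = rng.random(n) < 0.3                # ~30% of y missing
    y[miss] = np.nan

    mu = np.array([0.0, 0.0])                 # starting value for the mean
    for _ in range(100):
        # E-step: E[y | x, mu] = mu_y + (sigma_xy / sigma_xx)(x - mu_x).
        y_fill = np.where(miss, mu[1] + 0.5 * (x - mu[0]), y)
        # M-step: the expected complete-data log-likelihood is maximized
        # by the sample mean of the completed data.
        mu = np.array([x.mean(), y_fill.mean()])

    print(mu)   # close to (1.0, 2.0)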