3.11 Imputation Analysis
3.11.1 Missing values imputation for mixed data
assumptions of ignorability; that is, missing observations occur at random (MAR).
A description of this model can be found in J. L. Schafer (1997). In this study, we applied the unrestricted general location model because our variables include both categorical and continuous variables.
To apply the unrestricted general location model to the incomplete mixed data described above, the software first uses the EM algorithm to compute maximum likelihood estimates of the cell probabilities, the cell means, and the covariances. These EM estimates may then be used as starting values for the simulation steps of the algorithm.
The software then applies the iterative simulation technique in a loop to produce one or more iterations of a single Markov chain. Each iteration consists of two steps, the I-step and the P-step. In the I-step, random imputations for the missing categorical and continuous data are drawn from the predicted multinomial distribution and multivariate normal distribution, respectively, given the current estimate of the model parameters. In the P-step, new parameter values are drawn from their posterior distribution given the completed data.
The unrestricted general location model is most useful when n is large compared to the number of cells in the given data set. It has $(C-1) + Cp + p(p+1)/2$ free parameters, and it becomes difficult to estimate the $C \times p$ cell means as the number of categorical variables, and hence the number of cells $C$, increases. The Bayesian Iterative Proportional Fitting (BIPF) algorithm is used with the restricted general location model to reduce the number of parameters that need to be estimated (J. L. Schafer, 1997).
In application, the package mix in R is used; it was developed by J. L. Schafer (1997) and can be downloaded from CRAN. In the mix library, data augmentation for the unrestricted general location model uses the function da.mix, and the BIPF algorithm for the restricted general location model uses the function dabipf.mix.
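The growth in the number of free parameters can be illustrated with a short calculation. The following is only a sketch in Python (not the R code used in the study), and it assumes that C is the number of cells of the contingency table formed by the categorical variables and p is the number of continuous variables, which is how the formula above reads. The function name is hypothetical.

```python
def unrestricted_glom_free_parameters(num_cells: int, num_continuous: int) -> int:
    """Count free parameters as (C - 1) + C*p + p*(p + 1)/2.

    Assumption: num_cells is C (cells of the contingency table) and
    num_continuous is p (number of continuous variables).
    """
    if num_cells < 1 or num_continuous < 0:
        raise ValueError("need at least one cell and a non-negative p")
    cell_probabilities = num_cells - 1              # probabilities sum to one
    cell_means = num_cells * num_continuous         # one mean vector per cell
    covariances = num_continuous * (num_continuous + 1) // 2  # shared covariance matrix
    return cell_probabilities + cell_means + covariances


# Illustration with k binary categorical variables (C = 2**k) and p = 2.
for k in (1, 3, 6):
    cells = 2 ** k
    print(k, cells, unrestricted_glom_free_parameters(cells, 2))
# prints: 1 2 8, then 3 8 26, then 6 64 194
```

With only two continuous variables, moving from one to six binary categorical variables raises the count from 8 to 194, which is why a restricted model becomes attractive when n is not large relative to the number of cells.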
when data are missing at random, missing completely at random or missing not at random on the covariates in the regression models. The MCMC data augmentation algorithm simulates pseudo-random values for the missing data and augments the observed data with them, so that data with missing values become easier to analyze. Data augmentation can thus be used to simulate the missing values so that the full posterior distribution of θ can be computed (Tanner & Wong, 1987). Data augmentation analysis can be carried out in many statistical software packages, such as R, SPSS, STATA, and SAS. The unrestricted general location model for incomplete mixed data can be applied to the cancer medication data since it consists of categorical and continuous variables. The software package mix in R then applies the iterative simulation method to produce one or more iterations of a single Markov chain. Each iteration is made up of two steps, the I-step and the P-step (J. L. Schafer, 1997). In the I-step, random imputations for the missing categorical and continuous data are drawn from the predicted multinomial distribution and multivariate normal distribution, respectively, using the current parameter estimate in the model.
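The alternation between the I-step and the P-step can be made concrete with a deliberately simplified sketch. It is not the general location model and not the mix package: it assumes a single continuous variable that is normally distributed, a noninformative prior proportional to 1/σ² on the mean and variance, and ignorable missingness. The function name and the toy data are hypothetical.

```python
import random
import statistics


def data_augmentation(values, iterations=200, seed=1):
    """Data augmentation for one normal variable; None marks a missing value.

    Assumptions (not from the study): univariate normal data, prior
    proportional to 1/sigma^2, and at least two observed values.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    missing_idx = [i for i, v in enumerate(values) if v is None]
    n = len(values)

    # Starting values from the observed data (EM estimates play this role
    # in the full mixed-data model).
    mu = statistics.fmean(observed)
    sigma2 = statistics.variance(observed)

    completed = list(values)
    for _ in range(iterations):
        # I-step: impute each missing value from N(mu, sigma2).
        for i in missing_idx:
            completed[i] = rng.gauss(mu, sigma2 ** 0.5)

        # P-step: draw the parameters from the complete-data posterior.
        ybar = statistics.fmean(completed)
        ss = sum((y - ybar) ** 2 for y in completed)
        # sigma2 | data ~ ss / chi-square(n - 1); chi-square(k) = Gamma(k/2, scale 2)
        sigma2 = ss / rng.gammavariate((n - 1) / 2, 2.0)
        # mu | sigma2, data ~ N(ybar, sigma2 / n)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)

    return completed, mu, sigma2


data = [4.1, None, 5.3, 6.0, None, 4.8, 5.5]
completed, mu, sigma2 = data_augmentation(data)
print(completed, mu, sigma2)
```

Each pass through the loop is one iteration of the Markov chain; the completed data set after the final iteration is one imputation, and the observed values are never altered.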
The data augmentation model for mixed data included the analysis model covariates and the dependent variables of interest. The variables associated with missingness on the explanatory variables were to be imputed, and the dependent variables were not allowed to have any missing values (the dependent variable must be complete). Binary regression models with no missing data were fitted, and the results were compared with the models estimated using the data sets with imputed missing data and the completed (observed + imputed) data sets obtained with the data augmentation method.
To impute the explanatory variables with missing data using the data augmentation algorithm, the Tanner & Wong (1987) method for mixed data sets was utilized. That is, the dependent variable must be complete, and the categorical variables must be coded with consecutive positive integers starting with 1. For example, a binary variable must be coded as 1, 2 rather than 0, 1 before being imputed. Thus, the levels of the variable Working for a wage (Yes, No) were included in the imputation model, treating No as the reference category. Letting WW denote the levels of Working for a wage to be imputed, the following logistic regression model was estimated for each dichotomised working-for-a-wage category:
$$WW = \beta_0 + \beta_1 D + \beta_2 E + \beta_3 G + \varepsilon \qquad (3.136)$$
where $\beta_i$ denote the survey logistic regression estimates and $D$, $E$, $G$ and $\varepsilon$ denote gender, living area, age and the error term, respectively. WW can thus be imputed including the effects of gender, living area and age. The variable Age, a continuous explanatory variable, was also included in the imputation model. Letting AG denote the Age to be imputed, the following regression model was estimated to determine the parameter estimates:
$$AG = \beta_0 + \beta_1 H + \beta_2 E + \varepsilon \qquad (3.137)$$
where $\beta_i$ denote the regression estimates and $H$, $E$ and $\varepsilon$ denote gender, living area and the error term, respectively. Under the MAR assumption, the missing values were created such that missingness depended on whether or not a cancer patient was using a cancer medication. Variables that are excluded from the analysis model but are related to missingness, such as Gender, Living Area, and Age, are included in the data augmentation model as auxiliary variables. When the data under the data augmentation method were defined as MCAR, all these variables are related to the dependent variable (cancer medication intake) that was used in the model.
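The two preparation steps described above, recoding a binary variable from 0, 1 to 1, 2 and creating missing values whose probability depends on cancer medication use, can be sketched as follows. This is an illustration in Python on simulated data, not the study's procedure: the sample size, the missingness probabilities (0.10 and 0.40) and all function names are hypothetical.

```python
import random


def recode_binary(values):
    """Map 0/1 codes to 1/2 (consecutive positive integers starting at 1)."""
    return [None if v is None else v + 1 for v in values]


def impose_mar(target, driver, p_missing_by_driver, seed=0):
    """Delete entries of `target` with a probability that depends only on
    the fully observed `driver` variable (a MAR mechanism)."""
    rng = random.Random(seed)
    return [None if rng.random() < p_missing_by_driver[d] else t
            for t, d in zip(target, driver)]


def missing_rate(values, driver, level):
    """Share of missing entries among units whose driver equals `level`."""
    group = [v for v, d in zip(values, driver) if d == level]
    return sum(v is None for v in group) / len(group)


rng = random.Random(42)
n = 2000
medication = [rng.randint(0, 1) for _ in range(n)]  # fully observed; 1 = uses medication
wage = [rng.randint(0, 1) for _ in range(n)]        # 0 = No, 1 = Yes

wage_coded = recode_binary(wage)                    # now 1 = No, 2 = Yes
# Hypothetical mechanism: medication users are more likely to have wage missing.
wage_mar = impose_mar(wage_coded, medication, {0: 0.10, 1: 0.40})

rate_nonusers = missing_rate(wage_mar, medication, 0)
rate_users = missing_rate(wage_mar, medication, 1)
print(rate_nonusers, rate_users)
```

Because the deletion probability depends only on the fully observed medication indicator and not on the wage value itself, the resulting missingness is MAR; setting both probabilities equal would give MCAR instead.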
As mentioned above, this study used a South African data set (GHS2015), a survey with a complex sampling design and weighting procedure that must be taken into consideration during the analysis, since it affects the results. Generally, in survey sampling, not all units have the same probability of being included in the sample. With complex survey data sets, these probabilities are computed and used to calculate the sample weights. As an example, consider a population for which the estimator of the total V is as follows:
$$\hat{v}_w = \sum_{i=1}^{n} w_i v_i \qquad (3.138)$$
where $v_i$ represents the observed values of $V$ and $w_i$ is the weight that depends on the probability that unit $i$ will be included in the sample. The weight, in this case, is the inverse of the selection probability, and it represents the number of individuals in the target population represented by sample unit $i$ (Levy and Lemeshow, 2013).
As noted by Levy and Lemeshow (2013), ignoring the sample weights during the analysis yields estimators that are more biased than the weighted estimators.
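The weighted estimator of the total can be illustrated with a small numerical sketch. The values and inclusion probabilities below are invented for illustration only and are not taken from the GHS2015 data; the function name is hypothetical.

```python
def weighted_total(values, inclusion_probs):
    """Estimate a population total as the sum of w_i * v_i, with w_i = 1 / pi_i."""
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    weights = [1.0 / p for p in inclusion_probs]  # inverse selection probabilities
    return sum(w * v for w, v in zip(weights, values))


# Three sampled units with unequal (hypothetical) inclusion probabilities.
values = [10.0, 20.0, 30.0]
inclusion_probs = [0.5, 0.25, 0.125]   # weights 2, 4 and 8

total_weighted = weighted_total(values, inclusion_probs)  # 2*10 + 4*20 + 8*30
total_unweighted = sum(values)                            # ignores the design
print(total_weighted, total_unweighted)  # 340.0 60.0
```

The unit with inclusion probability 0.125 stands in for eight population units, so leaving out the weights would treat it as a single unit and understate its contribution to the total.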
The sample contains only a small portion of the population of interest in a survey. Weighting the sample data improves the estimators of the model, bringing them closer to the target population values. Several studies show that when survey data sets contain weight variables, weighted results are mostly preferred since they produce less biased estimates than unweighted results (Korn & Graubard, 1995). The outputs of the survey logistic regression model in this study were based on the weighted data sets, to give model parameters that account for the unequal probabilities of selection of the sample units in the cancer medication data. The data augmentation algorithm was performed in R, using the package mix for mixed data sets to impute the missing data.
Other software, namely SAS 9.4 and R, was used for data preparation, such as computing the proportions of missingness, and for the imputation analysis. Model fitting was done in SAS 9.4, and graphics were produced in Power BI, R and SAS 9.4.