Bayesian data augmentation using MCMC: application to missing values imputation on cancer medication data.

The data augmentation technique can be used to impute missing data in Bayesian and classical statistics. Bayesian data augmentation performs very well in improving cancer drug modeling affected by missing data.

Background

A missing data mechanism is classified as one of three types: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Data augmentation is an MCMC procedure that improves the quality of the data by imputing the missing values within the model.

Statement of the problem

The performance of Bayesian imputation is evaluated when values are missing completely at random, missing at random, or missing not at random, under different percentages of missing data for the variables of interest. Imputation performance is evaluated by comparing the baseline dataset with the imputed datasets across the different percentages of missing data and missing data mechanisms.

Aim and objective

To investigate whether the percentage of missing data in the cancer drug data affects the performance of the MCMC data augmentation algorithm. To investigate the performance of data augmentation techniques under different missing data mechanisms (MCAR, MNAR and MAR) with different percentages of missing data.

Significance of study

Limitations of study

Research questions

Hypotheses

Thesis outline

Theoretical Foundation for the study of missing data

The distribution of the missing data indicators R can be modeled conditionally on x as $P(R \mid x, \phi)$, where $\phi$ is a vector of unknown parameters. The missing data mechanism is said to be ignorable when both MAR and distinctness of the parameters hold (Little & Rubin, 1987).
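As an illustration of how the three mechanisms differ, the sketch below deletes values from a simulated Age variable. It is a minimal sketch and not the procedure used in the study: the function name `make_missing`, the simulated `ages` and `wage` vectors, and the 1.5 and 0.5 multipliers are all assumptions made for the example. Under MCAR the deletion probability is constant, under MAR it depends on an observed covariate, and under MNAR it depends on the value being deleted.

```python
import random
import statistics


def make_missing(ages, wage, mechanism, rate, seed=0):
    """Return a copy of `ages` with some values replaced by None.

    mechanism: "MCAR", "MAR" or "MNAR"; rate: target share of missing values.
    The 1.5 / 0.5 multipliers are illustrative assumptions.
    """
    rng = random.Random(seed)
    median_age = statistics.median(ages)
    result = []
    for age, w in zip(ages, wage):
        if mechanism == "MCAR":
            p = rate                                   # same for every unit
        elif mechanism == "MAR":
            p = 1.5 * rate if w == 1 else 0.5 * rate   # depends on observed wage
        elif mechanism == "MNAR":
            p = 1.5 * rate if age > median_age else 0.5 * rate  # depends on age itself
        else:
            raise ValueError("unknown mechanism: " + mechanism)
        result.append(None if rng.random() < min(p, 1.0) else age)
    return result


# Hypothetical data: 5000 ages and a binary work-for-wage indicator.
rng = random.Random(1)
ages = [rng.uniform(18, 90) for _ in range(5000)]
wage = [rng.randint(0, 1) for _ in range(5000)]
for mech in ("MCAR", "MAR", "MNAR"):
    holes = make_missing(ages, wage, mech, rate=0.05)
    print(mech, sum(v is None for v in holes) / len(holes))
```

Under MNAR the observed ages are a biased sample of the full data, which is why this mechanism is the hardest one for any imputation method.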

Figure 2.1 – Monotone missing-data pattern

General Review for existing Simple missing data methods

Researchers have found that this method can introduce bias into the model parameter estimates. Single imputation is another approach to filling in the missing values of an incomplete data set.

Review for existing advanced missing data techniques

According to Rubin (2002), the expectation maximization (EM) algorithm is a common iterative method for maximum likelihood estimation in missing data problems. The basic idea of EM is to resolve the complications that missing data cause for maximum likelihood estimation.

Related work

Description of the General Household Survey 2015 Data

The data contain a total of N = 74450 observations from people who responded to the survey in South Africa. The survey does not include collective housing such as student residences and retirement homes.

Study Population and Sampling Procedure

The cleaning process

Data from the South African General Household Survey (2015) did not arrive in a clean, ready-to-use format for analysis. The data contain N = 74450 observations, one for each person who responded to the survey in South Africa.

Selection of dependent and independent variables

Bayesian methodology

Bayesian methods

Posterior distribution

Posterior predictive distributions

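$$P(x_{mv} \mid x_{ov}) = \int P(x_{mv} \mid x_{ov}, \theta)\, P(\theta \mid x_{ov})\, d\theta \qquad (3.5)$$ where $x_{ov}$ are the observed data, $x_{mv}$ are the unobserved values, and $\theta$ are the unknown model parameters.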
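The integral in (3.5) can be approximated by simulation: draw θ from the posterior, then draw a missing value given that θ. The sketch below does this for a deliberately simple case that is an assumption of this example, not the model of the study: normal data with known standard deviation `sigma` and a normal prior on the mean, so the posterior of θ is available in closed form. The function and argument names are hypothetical.

```python
import math
import random


def posterior_predictive_draws(x_ov, sigma, prior_mean, prior_sd, n_draws, seed=0):
    """Monte Carlo draws from P(x_mv | x_ov) for a normal mean with known sigma."""
    rng = random.Random(seed)
    n = len(x_ov)
    xbar = sum(x_ov) / n
    # Conjugate normal posterior for theta given the observed data.
    post_prec = 1.0 / prior_sd ** 2 + n / sigma ** 2
    post_mean = (prior_mean / prior_sd ** 2 + n * xbar / sigma ** 2) / post_prec
    post_sd = math.sqrt(1.0 / post_prec)
    draws = []
    for _ in range(n_draws):
        theta = rng.gauss(post_mean, post_sd)   # theta ~ P(theta | x_ov)
        draws.append(rng.gauss(theta, sigma))   # x_mv ~ P(x_mv | x_ov, theta)
    return draws, post_mean, post_sd
```

Averaging over the θ draws is what makes the predictive variance larger than `sigma ** 2`: it adds the posterior uncertainty about θ to the sampling variability.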

The Basic Principles of Data augmentation

The Bayesian Data Augmentation algorithm

The iterative steps of data augmentation are repeated until the algorithm converges. $$P_{i+1}(\theta \mid x_{ov}) = \int A(\theta, \phi)\, P_i(\phi \mid x_{ov})\, d\phi, \quad i \in \mathbb{N} \qquad (3.12)$$ The method of successive substitution can be applied to equation (3.12) to approximate the solution. Under mild conditions, the sequence $P_i$ converges to the posterior distribution $P(\theta \mid x_{ov})$.

General steps of data augmentation algorithm

Equation (3.11) can be solved by the method of successive substitution, which suggests a way to determine $P(\theta \mid x_{ov})$. In their seminal paper, Tanner & Wong (1987) use the Monte Carlo method to evaluate the integral in (3.9).
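The alternation between imputing the missing values and redrawing the parameters can be sketched for a single normal variable. This is a minimal sketch under stated assumptions, not the multivariate or general location model fitted in the study: the data are univariate normal, the missingness is ignorable, and the noninformative prior p(μ, σ²) ∝ 1/σ² is used. The function name `data_augmentation` is hypothetical.

```python
import math
import random


def data_augmentation(x, n_iter=2000, burn_in=200, seed=0):
    """Data augmentation for a normal sample `x` in which None marks a missing value."""
    rng = random.Random(seed)
    n = len(x)
    observed = [v for v in x if v is not None]
    # Starting values taken from the observed cases.
    mu = sum(observed) / len(observed)
    var = sum((v - mu) ** 2 for v in observed) / (len(observed) - 1)
    mu_draws = []
    completed = list(x)
    for it in range(n_iter):
        # I-step: impute each missing value from N(mu, var).
        completed = [rng.gauss(mu, math.sqrt(var)) if v is None else v for v in x]
        # P-step: draw (var, mu) from the complete-data posterior.
        xbar = sum(completed) / n
        ss = sum((v - xbar) ** 2 for v in completed)
        var = ss / rng.gammavariate((n - 1) / 2.0, 2.0)  # ss / chi-square(n - 1)
        mu = rng.gauss(xbar, math.sqrt(var / n))
        if it >= burn_in:
            mu_draws.append(mu)
    return mu_draws, completed
```

The retained draws of `mu` approximate its posterior given the observed data only, and `completed` is one imputed data set; several imputations would be taken from well-separated iterations of the chain.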

Non-Bayesian data augmentation techniques

Maximum likelihood method
Direct Maximum likelihood method
Expectation Maximization (EM) method
The ECM algorithm
Weighting methods

The EM algorithm is an iterative technique commonly used to obtain maximum likelihood estimates of model parameters when the data are incomplete (Dempster et al., 1977). Its basic idea is to augment the observed data with the missing data, replacing the missing quantities by their expectations under the current parameter estimates. The E step computes the expected complete-data log-likelihood given the observed data and the current parameter estimates.
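The two steps can be made concrete for a bivariate normal sample in which x is fully observed and y is missing for some units. This is an illustrative sketch with a hypothetical function name, assuming ignorable missingness. The E step replaces the sufficient statistics involving a missing y by their conditional expectations given x, and the M step recomputes the moments from those expected statistics.

```python
def em_bivariate_normal(x, y, n_iter=200, tol=1e-10):
    """EM estimates (mu_x, mu_y, s_xx, s_yy, s_xy); None in `y` marks a missing value."""
    n = len(x)
    obs = [v for v in y if v is not None]
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    # Starting values for the y moments from the observed cases.
    mu_y = sum(obs) / len(obs)
    s_yy = sum((v - mu_y) ** 2 for v in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        beta = s_xy / s_xx                    # slope of y on x
        resid_var = s_yy - s_xy ** 2 / s_xx   # conditional variance of y given x
        sum_y = sum_yy = sum_xy = 0.0
        # E step: expected sufficient statistics given x and current estimates.
        for xi, yi in zip(x, y):
            if yi is None:
                ey = mu_y + beta * (xi - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = yi, yi * yi
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M step: maximum likelihood moments from the expected statistics.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy) + abs(new_s_xy - s_xy)
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return mu_x, mu_y, s_xx, s_yy, s_xy
```

Note that the E step adds `resid_var` to the expected square of a missing y; filling in only the conditional mean would understate the variance.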

Bayesian data augmentation technique

Introduction
Markov Chain Monte Carlo (MCMC)
Gibbs Sampling
Markov Chain Monte Carlo method in the presence of missing data
Disadvantages of data augmentation

According to Schafer (1997), Bayesian data augmentation (DA) is closely related to the Gibbs sampling method. In Bayesian analysis, the posterior distribution is used for parameter estimation so that proper inferences can be made. Gibbs sampling is the most commonly used way of applying MCMC, as it is widely available in statistical software.
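Gibbs sampling draws each unknown in turn from its conditional distribution given the current values of the others. The sketch below shows this for a textbook case chosen only for illustration: a standard bivariate normal with correlation `rho`, where both full conditionals are normal. The function name is hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    x = y = 0.0
    samples = []
    for it in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_sd)   # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_sd)   # y | x ~ N(rho * x, 1 - rho^2)
        if it >= burn_in:
            samples.append((x, y))
    return samples
```

Data augmentation has the same structure, with the missing values playing the role of one block and the model parameters the other.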

Simulation of missing data procedure

Preparation of data for data augmentation algorithm

Diagnostics can then be performed to check whether the data augmentation method improves the data analysis, by comparing the means and standard deviations of the imputed data on the dependent variable Age for different percentages of missing information. The computation time of the mix package depends on the size of the data and the amount of missing information.

Figure 3.2 – Diagram of the data augmentation process flow/procedure

Data preparation principles

Statistical tools for data analysis

Descriptive Statistics
Inferential Statistics
Multivariable analysis
Odds ratio
Estimation Method

The Wilcoxon signed rank test is based on the signs and the ranked magnitudes of the differences between paired sample values. The likelihood ratio (LR) test evaluates the significance of the joint effect of all the variables in the Survey logistic regression. The Wald test is an additional test that can be used to assess the significance of the individual parameters in the Survey logistic regression procedure.
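The signed rank statistic can be computed directly from the paired differences. The sketch below uses a hypothetical function name and returns the sums of positive and negative ranks together with a normal approximation z score. Dropping zero differences, using average ranks for ties, and omitting a tie correction in the variance are simplifying assumptions of this example.

```python
import math


def wilcoxon_signed_rank(original, imputed):
    """Return (W_plus, W_minus, z) for paired samples."""
    diffs = [b - a for a, b in zip(original, imputed) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Normal approximation without tie correction.
    z = (w_plus - n * (n + 1) / 4.0) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return w_plus, w_minus, z
```

A z score close to zero indicates that positive and negative differences balance out, which is the pattern expected when imputed values track the original ones.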

Table 3.2 Kappa Test for Agreement Between Two Raters
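For reference, the kappa statistic for such an agreement table compares the observed proportion of agreement with the proportion expected by chance. The sketch below is illustrative only: the function name and the example counts are hypothetical, with rows standing for rater A (for example the original values) and columns for rater B (the imputed values).

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table given as a list of rows."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected if the two raters were independent.
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical counts, not taken from the study.
example = [[20, 5], [10, 15]]
print(round(cohen_kappa(example), 3))
```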

Statistical Computation Packages in R

The function da.norm (multivariate normal models), da.cat (saturated multinomial models), dabipf.cat (constrained loglinear models), da.mix (unrestricted general location models), or dabipf.mix (restricted general location models) is called to run one or more iterations of a single Markov chain under the normal inverted-Wishart model. The parameter estimates generated in this step are used in step (4) below to simulate missing values for the imputations. The function imp.norm (multivariate normal models), imp.cat (saturated multinomial and constrained loglinear models), or imp.mix (general location models, both restricted and unrestricted) is then called to impute the missing values of the data matrix x, using the parameter estimates obtained in the previous step (3).

Mixed data methods of data augmentation

Introduction

The general location model

Let Ovi(k) represent the subset of observed data corresponding to the categorical variables, and let Mvi(k) represent the corresponding subset of missing values. Consider a matrix of sufficient statistics Z1, then the (j, l) elements in the matrix Z1 are given by Pn. In the first step, we draw dt+1i from its predictive distribution given the vectors of observed data, where the observed parts of the categorical variables are an element of Ki(k) in the model.

The Restricted general location model

The restricted general location model is very useful when n is large compared to the number of cells in the given data set. The unrestricted general location model has $(C - 1) + Cp + p(p+1)/2$ free parameters, and estimating the $C \times p$ cell means becomes difficult as the number of categorical variables in the model increases. The Bayesian Iterative Proportional Fitting (BIPF) algorithm is used to reduce the number of parameters that need to be estimated in the general location model (J. L. Schafer, 1997).

Imputation Analysis

Missing values imputation for mixed data

Thus, the levels of the variable Work for a wage (Yes, No) are included in the imputation model, treating No as the reference category. The variable Age is also included in the imputation model as a continuous explanatory variable. In survey sampling, not all units generally have the same probability of being included in the sample.

Measurement of model performance

In addition, the mean squared error and bias of the parameter estimates are considered for each missing percentage. The first section of this chapter (Section 4.1) provides an overview of the results. The second section (Section 4.2) presents the results of the analysis of the complete data (without missing data), such as descriptive statistics, the fitted logistic regression model, data visualizations and some tests (normality test, Chi-square test).
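The performance measures mentioned here have simple definitions, sketched below. The function name is hypothetical, and treating the complete-data estimate as the reference value is an assumption of this example: bias is the average deviation of the estimates from the reference value, the mean squared error is the average squared deviation, and the root MSE is its square root.

```python
import math


def performance_measures(estimates, reference):
    """Return (bias, mse, rmse) of `estimates` relative to a reference value."""
    n = len(estimates)
    bias = sum(e - reference for e in estimates) / n
    mse = sum((e - reference) ** 2 for e in estimates) / n
    return bias, mse, math.sqrt(mse)
```

Because the MSE equals the squared bias plus the variance of the estimates around their own mean, an imputation method can have a small bias and still a large MSE if its estimates are unstable.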

Analysis of complete-data

Descriptive statistics for complete-data
Bivariate analysis of cancer-medication intake
Multivariable statistics for complete-data
Assessment of normality in the variable (Age)

The results of the binary survey logistic regression model fitted to the original data, estimating cancer medication intake in South Africa, are presented in Table 4.5. The Survey logistic regression model was used to model the likelihood that a patient would take cancer medication. The effect of the living area on cancer medication intake is not statistically significant (p-value = 0.8178 > 0.05).
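A logistic regression coefficient is usually reported as an odds ratio with a Wald confidence interval and p-value. The sketch below shows the conversion under the usual large-sample normal approximation. The function name is hypothetical, and the coefficient and standard error passed to it are assumed to come from a fitted model (for a survey model, the design-based standard error).

```python
import math


def odds_ratio_summary(beta, se, z_crit=1.959964):
    """Return (odds_ratio, ci_lower, ci_upper, wald_p_value) for one coefficient."""
    odds_ratio = math.exp(beta)
    ci_lower = math.exp(beta - z_crit * se)
    ci_upper = math.exp(beta + z_crit * se)
    z = abs(beta / se)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(z / math.sqrt(2.0))
    return odds_ratio, ci_lower, ci_upper, p_value
```

A confidence interval that contains 1 corresponds to a p-value above 0.05, which is the situation described for the living area variable.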

Table 4.1 Descriptive statistics and marginal distribution for complete-data

Analysis of missing data imputation

Assessment of normality of the variable (Age)

The distribution of the variable Age under 1% MNAR imputation is quite symmetrical, as shown by the histogram, and the Q-Q plot suggests that the data points follow a straight line except in the tails. Under MCAR, the p-value = 0.0225 is less than the 0.05 level of significance, so we can conclude that the Age variable is not normally distributed. Under MNAR, the p-value = 4.245e-16 is less than the 0.05 level of significance, so we can again conclude that the Age variable is not normally distributed.

Figure 4.4 – 1% MNAR Imputation: The Q-Q Plot and Histogram for variable Age

Hypothesis Testing

Therefore, we can conclude that the paired observations are not significantly different and that the imputed data mean is close to the original mean. The same conclusion holds under the MAR assumption.

Table 4.10 Test Statistics: Difference in means of Original Age and imputed Age

Comparing Estimates and Standard errors of the Survey Logistic

The root MSE and the coefficient of determination were closest to those of the original data under the MCAR mechanism, as can be seen from Table 4.13. With 5% missing data, the root MSE under the MCAR mechanism was closest to the original, while MNAR and MAR overestimated it. The parameter estimates and standard errors for sex, work for wage, living area and age were either larger or smaller than the original ones, depending on the missing data mechanism and proportion.

Table 4.12 Regression Table: Estimates of Covariates (and standard errors) using Complete case model and 1% proportions for MCAR, MNAR and MAR.

Assessment of adequacy of the models

The likelihood ratio test was used to evaluate the significance of the joint effect of all variables in the Survey logistic regression procedure. To assess model adequacy, we compare the likelihood ratio and Wald tests for the imputed data sets with those from the analysis of the full data set. The model summary shows that the -2 log-likelihood statistic for MCAR 1% and MCAR 5% is slightly closer to the -2 log-likelihood statistic of the full data set.

Table 4.20 Survey Logistic Regression Model Summary

Kappa Test for agreement in wage

We can conclude that there is substantial agreement between the observed and imputed wage variable under the MAR missing mechanism. Two other takeaways from this plot are that agreement is lower for Yes (Work for a wage) and Partially Yes (Work for a wage) than for patients not working for a wage, and that the distribution is skewed, with a large share of patients not working for a wage. Furthermore, potential advantages and shortcomings of the study and directions for future research are also given in this chapter.

Figure 4.15 – The plot of both the agreement and distribution of working for wage under MAR

Summary

Limitations and Recommendations for Future Research

The Bayesian data augmentation methods were applied to deal with missing data problems using different models, such as the multivariate normal model.