The data augmentation technique can be used to impute missing data in Bayesian and classical statistics. Bayesian data augmentation performs very well in improving cancer drug modeling affected by missing data.
Background
Every missing data mechanism falls into one of three classes: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Data augmentation is an MCMC procedure that improves the quality of the data by imputing the missing values within the model.
Statement of the problem
The performance of Bayesian imputation is evaluated when values are missing completely at random, missing at random, or missing not at random, under different percentages of missing data for the variables of interest. Imputation performance is evaluated by comparing the baseline dataset with the imputed datasets across the different percentages of missing data and missing data mechanisms.
Aim and objective
To investigate whether the percentage of missing data in the cancer drug data affects the performance of the MCMC data augmentation algorithm. To investigate the performance of data augmentation techniques under different missing data mechanisms (MCAR, MAR and MNAR) with different percentages of missing data.
Significance of study
Limitations of study
Research questions
Hypotheses
Thesis outline
Theoretical Foundation for the study of missing data
The distribution of the missing data indicators R can be modeled conditionally on x as P(R | x, ϕ), where ϕ is a vector of unknown parameters. The missing data mechanism is said to be ignorable when both MAR and distinctness of the parameters hold (Little & Rubin, 1987).
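For reference, the three mechanisms can be written in terms of the indicator model P(R | x, ϕ). The split of x into an observed part x_ov and a missing part x_mv is an assumption borrowed from the notation used later in this document; the block is a standard textbook formulation, not a quotation from the source.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid x_{ov}, x_{mv}, \phi) = P(R \mid \phi) \\
\text{MAR:}\quad  & P(R \mid x_{ov}, x_{mv}, \phi) = P(R \mid x_{ov}, \phi) \\
\text{MNAR:}\quad & P(R \mid x_{ov}, x_{mv}, \phi) \ \text{depends on } x_{mv} \text{ even after conditioning on } x_{ov}
\end{align*}
```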
General Review for existing Simple missing data methods
The researchers find that using this method can bias the estimates of the model parameters. Single imputation is another method of filling in missing values in a data set with a missing data problem.
Review for existing advanced missing data techniques
According to Rubin (2002), the expectation maximization (EM) algorithm is a common iterative method for maximum likelihood estimation in missing data problems. The basic idea of EM is to handle the missing data in a way that resolves the complications they cause for maximum likelihood estimation.
Related work
Description of the General Household Survey 2015 Data
The data contain N = 74450 observations, one for each person who responded to the survey in South Africa. The survey does not cover collective living quarters such as student residences and retirement homes.
Study Population and Sampling Procedure
The cleaning process
Data from the South African General Household Survey (2015) did not arrive in a clean, ready-to-use format for analysis. The data contain N = 74450 observations, one for each person who responded to the survey in South Africa.
Selection of dependent and independent variables
Bayesian methodology
Bayesian methods
Posterior distribution
Posterior predictive distributions
$$P(x_{mv} \mid x_{ov}) = \int P(x_{mv} \mid x_{ov}, \theta)\, P(\theta \mid x_{ov})\, d\theta \tag{3.5}$$
where $x_{ov}$ are the observed data, $x_{mv}$ are the unobserved values, and $\theta$ are the unknown model parameters.
The Basic Principles of Data augmentation
The Bayesian Data Augmentation algorithm
The iterative steps of data augmentation are repeated until the algorithm converges.
$$P_{i+1}(\theta \mid x_{ov}) = \int A(\theta, \phi)\, P_i(\phi \mid x_{ov})\, d\phi, \quad i \in \mathbb{N} \tag{3.12}$$
Equation (3.12) can be solved approximately by the method of successive substitution. Under mild conditions, the sequence $P_i$ converges to the posterior distribution $P(\theta \mid x_{ov})$.
General steps of data augmentation algorithm
Equation (3.11) can be solved by the method of successive substitution, which suggests a way to determine P(θ | xov). In their seminal paper, Tanner & Wong (1987) use the Monte Carlo method to evaluate the integral in (3.9).
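The alternation between imputing the missing values and redrawing the parameter can be sketched on a toy model. This is a minimal illustration, not the thesis implementation: it assumes x_i ~ N(θ, 1) with known unit variance and a flat prior on θ, and the function name, data and settings are hypothetical.

```python
import random
import statistics

def data_augmentation(observed, n_missing, n_iter=5000, burn_in=500, seed=1):
    """Two-step data augmentation for x_i ~ N(theta, 1) with a flat prior (assumed toy model)."""
    rng = random.Random(seed)
    n = len(observed) + n_missing
    theta = statistics.fmean(observed)  # starting value
    draws = []
    for it in range(n_iter):
        # I-step: impute the missing values from P(x_mv | x_ov, theta)
        imputed = [rng.gauss(theta, 1.0) for _ in range(n_missing)]
        # P-step: draw theta from the complete-data posterior N(mean, 1/n)
        complete_mean = (sum(observed) + sum(imputed)) / n
        theta = rng.gauss(complete_mean, (1.0 / n) ** 0.5)
        if it >= burn_in:
            draws.append(theta)
    return draws

data_rng = random.Random(0)
x_ov = [data_rng.gauss(50.0, 1.0) for _ in range(30)]  # hypothetical observed values
theta_draws = data_augmentation(x_ov, n_missing=10)
print(statistics.fmean(theta_draws), statistics.fmean(x_ov))
```

In this toy model the retained draws of θ should centre on the mean of the observed values, since the imputed values carry no information beyond x_ov.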
Non-Bayesian data augmentation techniques
- Maximum likelihood method
- Direct Maximum likelihood method
- Expectation Maximization (EM) method
- The ECM algorithm
- Weighting methods
The EM algorithm is an iterative technique commonly used to obtain maximum likelihood estimates of model parameters when the data are incomplete (Dempster et al., 1977). The basic idea of the EM algorithm is to augment the observed data with the missing data and to estimate the model parameters from the resulting complete-data likelihood. The E step computes the expected complete-data log-likelihood given the observed data and the current parameter estimates.
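The E and M steps can be sketched for the simplest possible case. The example assumes a single normally distributed variable with some values missing under an ignorable mechanism; the function name and the ages are hypothetical, and in this setting EM simply converges to the mean and maximum likelihood variance of the observed values.

```python
def em_normal(observed, n_missing, n_iter=200):
    """EM for the mean and variance of a normal sample with n_missing unobserved values."""
    n = len(observed) + n_missing
    s1 = sum(observed)                 # observed part of sum(x)
    s2 = sum(v * v for v in observed)  # observed part of sum(x^2)
    mu, sigma2 = 0.0, 1.0              # arbitrary starting values
    for _ in range(n_iter):
        # E-step: expected complete-data sufficient statistics given (mu, sigma2)
        e_s1 = s1 + n_missing * mu
        e_s2 = s2 + n_missing * (mu * mu + sigma2)
        # M-step: maximise the expected complete-data log-likelihood
        mu = e_s1 / n
        sigma2 = e_s2 / n - mu * mu
    return mu, sigma2

ages = [23.0, 35.0, 41.0, 52.0, 29.0, 60.0]  # hypothetical observed ages
mu_hat, sigma2_hat = em_normal(ages, n_missing=2)
print(mu_hat, sigma2_hat)
```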
Bayesian data augmentation technique
- Introduction
- Markov Chain Monte Carlo (MCMC)
- Gibbs Sampling
- Markov Chain Monte Carlo method in the presence of missing data
- Disadvantages of data augmentation
According to Schafer (1997), Bayesian data augmentation (DA) is closely related to the Gibbs sampling method, which is explained further in the literature. In Bayesian analysis, the posterior distribution is used for parameter estimation so that proper inferences can be made. Gibbs sampling is the most commonly used way of applying MCMC because it is the most widely available in statistical software.
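Gibbs sampling draws each unknown in turn from its conditional distribution given the others. The sketch below illustrates this on a standard bivariate normal with correlation rho, a target chosen only because its conditionals are known in closed form; the function names and settings are hypothetical.

```python
import random

def gibbs_bivariate_normal(rho, n_iter=20000, burn_in=1000, seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5  # conditional standard deviation
    x, y = 0.0, 0.0
    samples = []
    for it in range(n_iter):
        x = rng.gauss(rho * y, sd)  # draw x from P(x | y)
        y = rng.gauss(rho * x, sd)  # draw y from P(y | x)
        if it >= burn_in:
            samples.append((x, y))
    return samples

def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / (sxx * syy) ** 0.5

samples = gibbs_bivariate_normal(rho=0.6)
print(sample_correlation(samples))
```

Data augmentation has the same structure, with the missing values playing the role of one block and the parameters the other.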
Simulation of missing data procedure
Preparation of data for data augmentation algorithm
Reasonable diagnostics can then be performed to determine whether the data augmentation method improves the data analysis, by comparing the means and standard deviations of the imputed data on the dependent variable Age for different percentages of missing information. The computation time of the mix package depends on the size of the data and the amount of missing information.
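One way to simulate the three mechanisms on a variable such as Age is to delete a fixed fraction of values with deletion weights that are constant (MCAR), depend on another observed variable (MAR), or depend on Age itself (MNAR). The sketch below is an assumption about how such a simulation could look, not the procedure used in the study; the data, weights and helper names are hypothetical.

```python
import random

def delete_values(values, weights, fraction, rng):
    """Set a fixed fraction of values to None; cases with larger weight are more likely to be chosen."""
    k = round(fraction * len(values))
    # Weighted sampling without replacement: larger weight gives a larger key on average
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    drop = {i for _, i in sorted(keys, reverse=True)[:k]}
    return [None if i in drop else v for i, v in enumerate(values)]

def observed_mean(values):
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)

rng = random.Random(2015)
n = 5000
age = [rng.randint(18, 90) for _ in range(n)]  # hypothetical ages
wage = [rng.random() < 0.4 for _ in range(n)]  # hypothetical work-for-wage indicator

age_mcar = delete_values(age, [1.0] * n, 0.05, rng)                          # ignores all data
age_mar = delete_values(age, [3.0 if w else 1.0 for w in wage], 0.05, rng)   # depends on observed wage
age_mnar = delete_values(age, [float(a) for a in age], 0.05, rng)            # depends on age itself

print(observed_mean(age_mcar), observed_mean(age_mar), observed_mean(age_mnar))
```

Because the MNAR weights grow with age, older respondents are removed more often and the mean of the remaining ages is pulled downward, which is the kind of distortion an imputation method has to cope with.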
Statistical tools for data analysis
- Descriptive Statistics
- Inferential Statistics
- Multivariable analysis
- Odds ratio
- Estimation Method
The Wilcoxon signed rank test is based on the magnitude of the differences between the pairs of sample values. The likelihood ratio (LR) test evaluates the significance of the joint effect of all the variables in the Survey logistic regression. The Wald test is an additional test that can be used to assess the significance of the individual parameters in the Survey logistic regression procedure.
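The signed rank statistic can be computed directly from the paired differences. The sketch below returns the sums of positive and negative ranks only (no p-value); it assumes the usual conventions of dropping zero differences and giving tied absolute differences their average rank, and the paired values are hypothetical.

```python
def wilcoxon_signed_rank(original, imputed):
    """Return (W_plus, W_minus) for paired samples."""
    diffs = [a - b for a, b in zip(original, imputed) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the block while the next absolute difference is tied
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # average of the 1-based ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

original = [30, 42, 55, 61, 47]  # hypothetical original ages
imputed = [29, 44, 52, 65, 42]   # hypothetical imputed ages
print(wilcoxon_signed_rank(original, imputed))
```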
Statistical Computation Packages in R
Call the function da.norm for multivariate normal models, da.cat for saturated multinomial models, dabipf.cat for constrained loglinear models, da.mix for unrestricted general location models, or dabipf.mix for restricted general location models, to run one or more iterations of a single Markov chain; for da.norm the chain is run under a normal inverted-Wishart prior. The parameter estimates generated in this step are used in step (4) below to simulate the missing values for the imputations. Call the function imp.norm for multivariate normal models, imp.cat for saturated multinomial and constrained loglinear models, or imp.mix for general location models (restricted and unrestricted), to impute the missing entries of the data matrix x using the parameter estimates obtained in the previous step (3).
Mixed data methods of data augmentation
Introduction
The general location model
Let Ovi(k) represent the subset of observed data corresponding to the categorical variables, and let Mvi(k) represent the corresponding subset of missing values. Consider a matrix of sufficient statistics Z1; the (j, l) elements of Z1 are given by Pn. In the first step, we draw dt+1i from the predictive distribution given the vectors of observed values, which is an element of the observed parts of the categorical variables in the Ki(k) model.
The Restricted general location model
The restricted general location model is very useful when n is large compared to the number of cells in the given data set. The unrestricted general location model has (C−1) + Cp + p(p+1)/2 free parameters, and its C × p matrix of cell means becomes difficult to estimate as the number of categorical variables in the model increases. The Bayesian Iterative Proportional Fitting (BIPF) algorithm is used to reduce the number of parameters that need to be estimated in the general location model (J. L. Schafer, 1997).
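The parameter count quoted above can be checked with a short calculation. In this sketch C is the number of cells of the contingency table and p the number of continuous variables; the function name is hypothetical, and the example assumes all categorical variables are binary so that C = 2^k for k categorical variables.

```python
def general_location_free_parameters(C, p):
    """Free parameters of the unrestricted general location model."""
    cell_probabilities = C - 1    # probabilities sum to one
    cell_means = C * p            # one mean vector of length p per cell
    covariance = p * (p + 1) // 2 # one common covariance matrix
    return cell_probabilities + cell_means + covariance

# Growth with the number k of binary categorical variables, for p = 2 continuous variables
for k in range(1, 6):
    print(k, general_location_free_parameters(2 ** k, 2))
```

The count roughly doubles with every added binary variable, which is why restricting the model becomes attractive as the table grows.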
Imputation Analysis
Missing values imputation for mixed data
Thus, the levels of the variable Work for a wage (Yes, No) are included in the imputation model, with No treated as the reference category. The explanatory variable Age is included in the imputation model as a continuous variable. In survey sampling, units generally do not all have the same probability of being included in the sample.
Measurement of model performance
In addition, the mean squared error and bias of the parameter estimates are reported for each percentage of missing data. The first section of this chapter (Section 4.1) provides an overview of the results of this chapter. The second section (Section 4.2) presents the results of the analysis of the complete data (without missing data), such as descriptive statistics, the fitted logistic regression model, data visuals and some tests (normality test, chi-square test).
Analysis of complete-data
- Descriptive statistics for complete-data
- Bivariate analysis of cancer-medication intake
- Multivariable statistics for complete-data
- Assessment of normality in the variable (Age)
The results of the binary Survey logistic regression model for the original data on cancer medication intake in South Africa are presented below in Table 4.5. The Survey logistic regression model was used to model the likelihood that a patient would take cancer medication. The effect of the living area on cancer medication intake is not statistically significant (p-value = 0.8178 > 0.05).
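Logistic regression coefficients are usually read as odds ratios. The sketch below converts a coefficient and its standard error into an odds ratio with an approximate 95% interval, assuming a normal approximation on the log-odds scale; the function name and the numbers are hypothetical and are not taken from Table 4.5.

```python
import math

def odds_ratio(coef, std_err, z=1.96):
    """Odds ratio and approximate confidence limits from a logistic regression coefficient."""
    return (
        math.exp(coef),
        math.exp(coef - z * std_err),  # lower limit
        math.exp(coef + z * std_err),  # upper limit
    )

estimate, lower, upper = odds_ratio(coef=0.5, std_err=0.2)  # hypothetical values
print(estimate, lower, upper)
```

An interval that contains 1 corresponds to a coefficient that is not significant at the matching level.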
Analysis of missing data imputation
Assessment of normality of the variable (Age)
The distribution of the variable Age under 1% MNAR imputation is fairly symmetric, as shown by the histogram above, and the Q-Q plot suggests that the data points follow a straight line except in the tails. Under MCAR, the p-value of 0.0225 is below the 0.05 significance level, so we can conclude that the Age variable is not normally distributed. Under MNAR, the p-value of 4.245e-16 is below the 0.05 significance level, so we can again conclude that the Age variable is not normally distributed.
Hypothesis Testing
Therefore, we can conclude that the paired observations are not significantly different and that the imputed data mean is close to the original mean. The same conclusion holds under the MAR assumption.
Comparing Estimates and Standard errors of the Survey Logistic
Root MSE and the coefficient of determination were closest to the original under the MCAR mechanism, as can be seen from Table 4.13. With 5% missing data, the root MSE under the MCAR mechanism was closest to the original, while MNAR and MAR overestimated it. Parameter estimates and standard errors for sex, work for wage, living area and age were either larger or smaller than the original, depending on the missing data mechanism and the proportion of missing data.
Assessment of adequacy of the models
The likelihood ratio test was used to evaluate the significance of the joint effect of all variables in the Survey logistic regression procedure. To assess model adequacy, we compare the likelihood ratio and Wald tests for the imputed data sets with the same tests for the full data set. The model summary shows that the -2 log-likelihood statistics for MCAR 1% and MCAR 5% are slightly closer to the -2 log-likelihood statistic of the full data set.
Kappa Test for agreement in wage
We can conclude that there is substantial agreement between observed and imputed wage under the MAR missing mechanism. Two other takeaways from this plot are that agreement is lower for Yes (Work for a wage) and Partially Yes (Work for a wage) than for patients not working for a wage, and that the distribution is skewed, with a large number of patients not working for a wage. Furthermore, potential advantages and shortcomings of the study and directions for future research are also given in this chapter.
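The agreement between observed and imputed wage categories reported above is the kind of quantity measured by Cohen's kappa, which corrects the raw proportion of matches for agreement expected by chance. The sketch below is a minimal unweighted version; the function name and labels are hypothetical, and it assumes chance agreement is below 1 so that the denominator is non-zero.

```python
from collections import Counter

def cohen_kappa(observed, imputed):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    n = len(observed)
    p_o = sum(a == b for a, b in zip(observed, imputed)) / n  # observed agreement
    obs_counts, imp_counts = Counter(observed), Counter(imputed)
    # chance agreement from the two sets of marginal proportions
    p_e = sum(obs_counts[c] * imp_counts[c] for c in obs_counts) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)

observed = ["Yes", "Yes", "No", "No"]  # hypothetical observed wage status
imputed = ["Yes", "No", "No", "No"]    # hypothetical imputed wage status
print(cohen_kappa(observed, imputed))
```

By a widely used convention, values between 0.61 and 0.80 are described as substantial agreement.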
Summary
Limitations and Recommendations for Future Research
The Bayesian data augmentation methods were applied to deal with missing data problems using different models, such as the multivariate normal model.