Model Fitting Details for Bayesian Probit Regression with Spatial Variability


eAppendix

e1. Model Fitting Details

We use Markov chain Monte Carlo sampling techniques to draw samples from the posterior distribution of the model parameters. These techniques include a combination of Gibbs and Metropolis sampling algorithms.¹⁻² The probit regression model is most naturally presented in the Bayesian setting through the introduction of a latent variable for each observed response such that where

and It is straightforward to show that a latent variable is connected with the corresponding probability of interest such that where is the probability that event occurs. For the majority of model parameters, introduction of these latent variables results in a conjugate form for their respective full conditional distributions.

The latent responses can be written more concisely using an indicator variable such that where is the indicator function equal to one if the input statement is true and zero otherwise.
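For reference, the standard latent-variable representation of a probit model can be written as below. The notation is an assumption made here for illustration ($Y_i$ for the observed binary response, $Y_i^*$ for its latent variable, $\mu_i$ for the latent mean, and $p_i$ for the event probability) and may differ from the notation of the main text.

```latex
\begin{align*}
Y_i^* \mid \mu_i &\sim \mathrm{N}(\mu_i, 1), \\
Y_i &= 1\left(Y_i^* > 0\right), \\
p_i = P(Y_i = 1) &= P(Y_i^* > 0) = \Phi(\mu_i),
\end{align*}
```

where $\Phi(\cdot)$ is the standard normal cumulative distribution function.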

These individual latent variables can then be combined into a single vector and described as where is an by matrix where the row entry is given by is the number of included covariates (length of ), is an by matrix where the row entry is given by


and the column location of the potentially non-zero entries depends on the ZCTA for individual , and . We use the notation and to emphasize the dependency of the matrices on the spatially varying change points.

We begin the posterior sampling by updating the individual latent variables. The full conditional distribution for a single latent variable has a known form useful for Gibbs sampling such that where “rest” indicates all remaining model parameters and observed data and represents the truncated normal distribution with original mean and original variance . The truncation point is determined by the observed binary outcome: the distribution is truncated below by zero if the observed outcome equals one and truncated above by zero if it equals zero.
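The latent-variable update can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_latent` and its arguments are hypothetical, a unit latent variance is assumed (the usual probit convention), and `mu` stands for whatever mean the model assigns to each latent variable.

```python
import numpy as np
from scipy.stats import truncnorm


def sample_latent(y, mu, rng):
    """Draw each latent variable from its truncated normal full conditional.

    y   : array of observed 0/1 outcomes
    mu  : array of latent means (unit variance is an assumption here)
    rng : numpy.random.Generator
    """
    y = np.asarray(y)
    mu = np.asarray(mu, dtype=float)
    # scipy's truncnorm expects bounds on the standardized scale,
    # i.e. (bound - loc) / scale, so a bound at zero becomes -mu.
    lower = np.where(y == 1, -mu, -np.inf)  # y = 1: truncated below by zero
    upper = np.where(y == 1, np.inf, -mu)   # y = 0: truncated above by zero
    return truncnorm.rvs(lower, upper, loc=mu, scale=1.0,
                         size=mu.shape, random_state=rng)
```

Every draw is positive where the outcome is one and negative where it is zero, which is what lets the remaining updates treat the latent vector as Gaussian data.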

Next, we update the vector of non-spatial regression coefficients, . The full conditional distribution for has a multivariate normal form, also useful for Gibbs sampling, with covariance matrix equal to and mean vector equal to
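A conjugate update of this kind can be sketched as below. The prior is an assumption made only for illustration (independent zero-mean normal priors with common variance `prior_var`), as is the unit latent variance; `z_resid` is a hypothetical name for the latent vector after subtracting the spatially varying part of the linear predictor.

```python
import numpy as np


def sample_beta(X, z_resid, prior_var, rng):
    """Gibbs draw of regression coefficients given the latent responses.

    X         : (n, p) design matrix of covariates
    z_resid   : (n,) latent responses minus the spatial contribution
    prior_var : variance of the assumed N(0, prior_var * I) prior
    rng       : numpy.random.Generator
    """
    p = X.shape[1]
    # Posterior precision = data precision + prior precision.
    precision = X.T @ X + np.eye(p) / prior_var
    cov = np.linalg.inv(precision)
    cov = (cov + cov.T) / 2.0  # remove tiny asymmetries from inversion
    mean = cov @ (X.T @ z_resid)
    return rng.multivariate_normal(mean, cov)
```

With a diffuse prior and many observations, a single draw lands close to the least-squares fit of the latent responses on the covariates.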

The non-spatial component of the spatially varying change points, , is updated next. A known form for the full conditional distribution of is unavailable, so we rely on the Metropolis algorithm to update the parameter. The log of the full conditional distribution needed for Metropolis sampling is proportional to
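A single Metropolis update for a scalar parameter can be sketched as follows. The Gaussian random-walk proposal and its standard deviation `step_sd` are assumptions (the proposal distribution is not stated here), and `log_target` is a hypothetical callable that returns the log full conditional up to an additive constant.

```python
import numpy as np


def metropolis_step(current, log_target, step_sd, rng):
    """One random-walk Metropolis update.

    Returns the new state and a flag indicating acceptance.
    """
    # Symmetric proposal, so the acceptance ratio needs no correction term.
    proposal = current + rng.normal(0.0, step_sd)
    log_ratio = log_target(proposal) - log_target(current)
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True
    return current, False
```

Because only the difference of log densities enters the acceptance step, the normalizing constant of the full conditional is never needed.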


The vectors of ZCTA-specific spatial deviations, , are also updated using the Metropolis algorithm. The log of the full conditional distribution is proportional to

We impose a sum-to-zero constraint for each entry in the vector to ensure identifiability for the MCAR specification.³
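One common way to impose such a constraint is to re-center the deviations after each update. The sketch below assumes this "centering on the fly" approach, which may differ from how the constraint was actually enforced; `phi` is a hypothetical array with one row per ZCTA and one column per spatially varying quantity (change point, intercept, slope).

```python
import numpy as np


def center_deviations(phi):
    """Subtract the column means so each spatial component sums to zero."""
    phi = np.asarray(phi, dtype=float)
    return phi - phi.mean(axis=0, keepdims=True)
```

Centering leaves all differences between regions unchanged; it only removes the overall level, which is not identified separately from the non-spatial parameters.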

Finally, we update the inverse of the cross-covariance matrix, . The full conditional distribution has a known form useful for Gibbs sampling such that

where is the entry of , is a matrix with entry equal to , is a diagonal matrix with the entry equal to , and is the number of spatially isolated regions or “islands” in the spatial domain.⁴ There is only a single spatial region when working with the ZCTA map in CT, so this number equals one. We iterate through these sampling steps, updating each parameter in turn, to obtain a collection of samples from the posterior distribution of interest.
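The conjugate update for the inverse cross-covariance matrix can be sketched as below, assuming a Wishart prior. The hyperparameters `prior_df` and `prior_scale_inv`, the function name, and the Wishart parameterization (mean equal to degrees of freedom times scale) are all assumptions for illustration; `W` is the binary adjacency matrix and `phi` holds the spatial deviations with one row per region.

```python
import numpy as np
from scipy.stats import wishart


def sample_precision(phi, W, prior_df, prior_scale_inv, n_islands, rng):
    """Gibbs draw of the inverse cross-covariance matrix.

    phi             : (n, q) spatial deviations, one row per region
    W               : (n, n) binary adjacency matrix
    prior_df        : degrees of freedom of the assumed Wishart prior
    prior_scale_inv : (q, q) inverse scale matrix of the assumed prior
    n_islands       : number of spatially isolated groups of regions
    """
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))       # diagonal matrix of neighbor counts
    S = phi.T @ (D - W) @ phi        # spatial sum-of-squares matrix
    df = prior_df + n - n_islands    # each island removes one degree of freedom
    scale = np.linalg.inv(prior_scale_inv + S)
    scale = (scale + scale.T) / 2.0  # guard against numerical asymmetry
    return wishart.rvs(df=df, scale=scale, random_state=rng)
```

For a connected map such as the one described above, `n_islands` would be set to one.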


e2. Simulation Study

We begin by generating data from four different designs, each with a unique setting for the true change point parameter(s), and then apply competing methods to each generated dataset. Multiple pieces of information are collected during each model fit that allow us to explore the performance of the competing methods under the different designs. The four considered designs for data generation include

• Design 1: The standard change point model which includes a single change point common to all spatial locations,

• Design 2: The spatially independent change points model which allows for varying change points across spatial locations but assumes independence between the locations,

• Design 3: The spatially correlated change points model that allows for a moderate level of spatial correlation between change points from different locations, and

• Design 4: The spatially correlated change points model that allows for strong spatial correlation between change points from different locations.

The data generation process for each design is selected based on our CT dataset, ensuring that the created scenarios are realistic and useful for drawing conclusions about our application.

Each dataset is simulated from a modified version of equation (1) (main text) where the intercepts and slopes are assumed to be constant across space. We also remove the covariates when simulating the datasets in order to focus on estimation of the change points alone.

Therefore, the data generating model for each design has the form Bernoulli , where

(e1)
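The data generation step can be sketched as follows. The functional form used here is an assumption, chosen to match the description of a constant intercept and a post change point slope: a probit link with linear predictor `beta0 + beta1 * max(t - theta, 0)`. All names and numerical values are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def simulate_outcomes(t, region, theta, beta0, beta1, rng):
    """Simulate binary outcomes from an assumed change point probit model.

    t      : observation times in months
    region : integer region index of each observation
    theta  : change point (in months) for each region
    """
    t = np.asarray(t, dtype=float)
    region = np.asarray(region)
    theta = np.asarray(theta, dtype=float)
    # The slope only contributes after the region-specific change point.
    eta = beta0 + beta1 * np.maximum(t - theta[region], 0.0)
    p = norm.cdf(eta)
    return rng.binomial(1, p), p
```

Before a region's change point the event probability stays at `norm.cdf(beta0)`; afterwards it moves with the slope.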


Our interests in this simulation study are in the inference from, and performance of, the spatially varying change points model with respect to competing methods. Spatially varying regression coefficient models have been well studied previously.⁵ Each generated dataset has the same sample size as our data application ( ), uses the same ZCTAs, and has similar settings for the intercept ( ) and post change point slope ( ) in order to observe a similar proportion of vaccine-targeted IPD cases at each time period.

In Design 1, we set months (early 2001; see Fig. 1 from the main text) for all , resulting in a single change point in the model. In Design 2, we simulate the transformed change points independently such that . We back-transform to create for all , resulting in independently varying change points centered around 39 months and ranging from 33 to 45 months. Similar findings of variability in change points are estimated in the application of the full model to our CT dataset (see Fig. 2A from the main text).

In Designs 3-4, we simulate the transformed change points from the proper version of the conditional autoregressive model (PCAR) centered at -2.00 with a variance parameter of 0.61.⁶ The PCAR model represents a proper prior distribution and therefore allows us to simulate the spatially correlated areal data needed to specify the change points. It introduces an additional parameter, , which when equal to zero results in spatial independence between regions and when equal to one is equivalent to the intrinsic conditional autoregressive model (ICAR).
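Simulating from a PCAR model can be sketched as below. The parameterization is an assumption: the precision matrix is taken to be `(D - rho * W) / tau2`, with `W` the binary adjacency matrix and `D` the diagonal matrix of neighbor counts, which requires every region to have at least one neighbor and `rho` strictly below one. The function name and arguments are hypothetical.

```python
import numpy as np


def simulate_pcar(W, mean, tau2, rho, rng):
    """Draw one realization from a proper CAR model.

    W    : (n, n) binary adjacency matrix
    mean : common mean of the transformed change points
    tau2 : variance parameter
    rho  : spatial correlation parameter (0 gives independence)
    """
    D = np.diag(W.sum(axis=1))
    precision = (D - rho * W) / tau2
    L = np.linalg.cholesky(precision)  # precision = L @ L.T
    z = rng.standard_normal(W.shape[0])
    # Solving L.T x = z gives x with covariance equal to the inverse precision.
    x = np.linalg.solve(L.T, z)
    return mean + x
```

For example, `simulate_pcar(W, -2.00, 0.61, 0.5, rng)` uses the mean and variance parameter quoted above together with a purely illustrative value of `rho`.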

In Design 3, we set , indicating moderate spatial correlation. The resulting change points are centered around 39 months and range from 30 to 65 months. In Design 4, we set . This allows us to closely approximate the ICAR distribution while still maintaining propriety of the distribution and leads to strong spatial correlation between change points from neighboring spatial regions. The resulting change points are centered around 39 months and range from 26 to 73 months.

In total, we simulate 200 datasets from each design and apply competing methods to each dataset. The considered methods for comparison include

• Method 1: The newly developed spatial probit regression model in equation (e1) with spatially varying change points and non-spatial intercepts/slopes,

• Method 2: A probit regression model with a single non-spatial change point,

• Method 3: A probit regression model with time included using cubic regression splines with five degrees of freedom, and

• Method 4: A probit regression model with a linear time trend.

The methods become progressively simpler so that we can determine at what stage the additional sophistication is no longer necessary to obtain similar model fit and inference. Method 4 represents the most basic model form, where we assume a linear trend in IPD proportions over time. This method is included as a baseline so that we can determine whether a more flexible relationship over time represents an improvement.

Method 3 relaxes the linearity assumption of Method 4 by incorporating regression splines. This model should be more flexible than Method 4 and we expect gains in model fit as a result.

Method 2 also relaxes the linearity assumption by including a non-spatial change point. We expect similar results from Methods 2 and 3 given the flexibility in time that both offer. In Method 1, we now include spatially varying change points in order to relax the assumption that each spatial region has the same change point. We expect this model to outperform the competing methods when there is true spatial variation in the simulated change points.


After applying a method to a simulated dataset, we collect multiple pieces of information. These include an estimate of the average mean squared error (MSE) of the change point estimators, , where is the posterior mean estimate of the change point for region ; the average coverage proportion of the 95% credible intervals (CIs) for the change points, ; and DIC and the posterior predictive loss criterion.
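The two inferential metrics can be computed from posterior draws as sketched below. The function and argument names are hypothetical, and equal-tailed 95% intervals are an assumption (the type of credible interval is not stated here). `samples` holds one row per posterior draw and one column per region.

```python
import numpy as np


def change_point_metrics(samples, truth):
    """Average MSE of posterior means and average 95% interval coverage.

    samples : (n_draws, n_regions) posterior draws of the change points
    truth   : (n_regions,) true change points used in the simulation
    """
    samples = np.asarray(samples, dtype=float)
    truth = np.asarray(truth, dtype=float)
    post_mean = samples.mean(axis=0)
    mse = np.mean((post_mean - truth) ** 2)
    # Equal-tailed interval from the 2.5% and 97.5% posterior quantiles.
    lower, upper = np.quantile(samples, [0.025, 0.975], axis=0)
    coverage = np.mean((lower <= truth) & (truth <= upper))
    return mse, coverage
```

Averaging these two quantities over the simulated datasets of a design gives one summary value per method and design.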

The average MSE and average CI coverage proportion metrics allow us to determine how the inference for the change points obtained using the different methods compares and changes under different data simulation designs. DIC and the posterior predictive loss criterion allow us to understand what differences in model comparison metrics we are likely to see under the different designs. This will allow us to determine which design our actual CT dataset most resembles and, therefore, which model is most appropriate in our application.
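For a Bernoulli likelihood, DIC can be computed from posterior draws of the event probabilities as sketched below. This uses the usual definition (posterior mean deviance plus the effective number of parameters); the function names are hypothetical, and `p_at_post_mean` is assumed to hold the probabilities evaluated at the posterior means of the model parameters.

```python
import numpy as np


def bernoulli_deviance(y, p):
    """Deviance, -2 times the Bernoulli log-likelihood."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0 - 1e-12)
    return -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def dic(y, p_draws, p_at_post_mean):
    """Return (DIC, effective number of parameters).

    p_draws        : (n_draws, n_obs) event probabilities per posterior draw
    p_at_post_mean : (n_obs,) probabilities at the posterior mean parameters
    """
    dev_draws = np.array([bernoulli_deviance(y, p) for p in p_draws])
    d_bar = dev_draws.mean()                          # posterior mean deviance
    p_d = d_bar - bernoulli_deviance(y, p_at_post_mean)
    return d_bar + p_d, p_d
```

Smaller values indicate a better trade-off between fit and complexity, which is how the DIC columns of a comparison table are read.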

In eTable 1, we display the simulation study results. Recall that Methods 3 and 4 do not directly estimate a change point and, therefore, we cannot calculate average MSE and CI coverage for these models. Under Design 1, Methods 1 and 2 display similar average CI coverage, both near the advertised level of 0.95. However, Method 2 has a smaller average MSE estimate, indicating that it is able to more efficiently estimate the true change point parameter. The model comparison metrics suggest that Method 2 is preferred by DIC while Method 1 is preferred by the posterior predictive loss criterion. The average CI coverage, average MSE, and DIC results are expected since Method 2 matches Design 1. The posterior predictive loss criterion appears to prefer the increased flexibility allowed by Method 1 since prediction is the main focus of the metric.

For Design 2, the average CI coverage for Method 2 decreases significantly below the advertised level. This is because a single change point parameter is not adequate to describe the variability in the change point parameters under this design. Method 1 is still able to produce CIs with adequate coverage. The average MSE estimates from Methods 1 and 2 are now more comparable, with Method 2 producing slightly smaller values. DIC still favors Method 2, while the gap between the posterior predictive loss values has increased in favor of Method 1. Overall, these results are expected because, although we have variability in the change point parameters in Design 2, it is not due to spatial proximity.

Under the moderate spatial correlation between change points of Design 3, Method 1 now begins to outperform the competing methods in terms of DIC. Method 2 has poor coverage for the CIs while Method 1 is still performing as desired. The average MSE results are nearly identical, while the difference in posterior predictive loss has increased even more in favor of Method 1.

Finally, under the strong spatial correlation of Design 4, Method 1 is clearly preferred over the competitors in terms of average CI coverage, average MSE, DIC, and the posterior predictive loss criterion. These results are intuitive since Design 4 nearly matches Method 1. Method 2 is unable to efficiently estimate the true change point parameters due to the increased variability under this design. Overall, Method 1 consistently provides adequate coverage of the change points under any design and, with any form of spatial correlation between the change point parameters, provides an improved fit to the data. The predictive performance of Method 1 is also superior under all designs, as indicated by the posterior predictive loss criterion.

When we apply Method 1 to our actual dataset, we see a difference in DIC of 3.44 in favor of Method 1 when compared to Method 2 (see Table 3 of the main text). In Design 3, this average difference is 0.92, and in Design 4 it is 14.88, as shown in eTable 1. This suggests that there may be true spatial variability between the change points in the CT dataset, somewhere between moderate (Design 3) and strong (Design 4). It also suggests that a model accounting for the spatial variability in the change points is needed in order to efficiently estimate the change points across CT.

eTable 1. Simulation study results. Average coverage of the 95% credible intervals (coverage), average mean squared error (MSE), deviance information criterion (DIC), and posterior predictive loss criterion are displayed. The averages of the standard errors for the presented coverage, MSE, DIC, and posterior predictive loss estimates are 0.008, 0.008, 3.9, and 3.4, respectively. The lowest DIC and posterior predictive loss values for each design are marked with an asterisk (*).

Design  Method  Coverage  MSE (x100)  DIC        Posterior predictive loss
1       1       0.98      0.13        3274.79    2354.36*
1       2       0.92      0.06        3270.36*   2363.17
1       3       ---       ---         3273.91    2362.08
1       4       ---       ---         3384.53    2438.92
2       1       0.97      0.16        3268.34    2352.76*
2       2       0.78      0.12        3265.94*   2362.89
2       3       ---       ---         3269.15    2362.22
2       4       ---       ---         3374.74    2433.89
3       1       0.93      0.32        3290.13*   2364.60*
3       2       0.59      0.31        3291.05    2379.85
3       3       ---       ---         3294.45    2379.53
3       4       ---       ---         3404.34    2455.10
4       1       0.92      0.41        3308.45*   2375.01*
4       2       0.46      0.60        3323.33    2404.57
4       3       ---       ---         3326.31    2404.71
4       4       ---       ---         3433.73    2478.38


e3. Additional Tables and Figures

eTable 2. Posterior summaries for the multivariate conditional autoregressive model cross-covariance parameters ( ).

                                                         Quantiles
Parameter                                Mean    SD      0.025   0.975
Variance: Change Points                  0.39    0.34    0.09    1.31
Variance: Intercepts                     0.14    0.05    0.07    0.26
Variance: Slopes                         0.25    0.13    0.09    0.60
Correlation: Change Points, Intercepts   -0.12   0.29    -0.65   0.45
Correlation: Change Points, Slopes       -0.03   0.37    -0.70   0.64
Correlation: Intercepts, Slopes          0.02    0.27    -0.51   0.52

Results suggest that none of the parameters display significant cross-correlation.

However, the estimated posterior mean correlation between the change point and intercept parameters is slightly negative (-0.12), suggesting that when a ZCTA has a larger baseline rate of vaccine-targeted IPD cases it also has a change point which occurs earlier in time.


eFigure 1. Study sample sizes. Number of invasive pneumococcal disease cases by ZIP Code Tabulation Area in Connecticut, 1998-2009.


eFigure 2. Posterior standard deviations of spatial parameters. Posterior standard deviation estimates of the spatially varying (A) change point, (B) intercept, and (C) slope parameters.


eFigure 3. Posterior standard deviations for predictions over time. Posterior standard deviation estimates for predictions from (A) January 1998, (B) March 2001, (C) January 2005, and (D) December 2009.


eFigure 4. Temporal variability comparisons. Comparison of each model's ability to explain temporal variability in IPD proportions. Yearly observed vs. predicted proportions of IPD cases caused by a vaccine-targeted serotype, averaged across all ZCTAs, are displayed. Each point represents a year in the analysis (1998-2009). CP: change point.


eFigure 5. Spatial variability comparisons. Comparison of each model's ability to explain spatial variability in IPD proportions. ZCTA-specific observed vs. predicted proportions of IPD cases caused by a vaccine-targeted serotype, averaged across all years, are displayed. Each point represents a unique ZCTA. Proportions based on ≤ 15 observations are omitted to ensure stability in the estimates. CP: change point.


eFigure 6. Sources of variability comparison. Comparison of different sources of variability in the model predictions. Predictions from three different ZCTAs are shown in each panel. In panel (A) the ZCTAs differ only by their estimated spatial parameters. In panel (B) the ZCTAs differ only by their vaccine uptake information. The posterior standard deviations for the predictions ranged from 0.01 to 0.13 with an average value of 0.07 for (A) and ranged from 0.02 to 0.12 with an average value of 0.06 for (B).

eFigure 6A displays predictions at each month for three ZCTAs. These three ZCTAs are identified as having the largest (high risk), the smallest (low risk), and an average combination of estimated spatial parameters as indicated by the combined ranking metric in Figure 2D. Each of the regions has the same, average proportion of children receiving three and four doses of PCV7 at each time point. eFigure 6B displays monthly predictions for a different set of ZCTAs. These three ZCTAs have the largest, the smallest, and an average combination of three and four dose proportions among children at each time point. Each of the regions has the same, average setting of spatial parameters. The differences in predictions seen in eFigure 6 should be examined at individual months due to the changing proportion of children receiving three and four doses in each ZCTA across time. The results indicate that a majority of the variability in predictions between ZCTAs is due to unexplained spatial variability.


eFigure 7. Sensitivity analysis results. Comparison of estimated posterior means from all model parameters between (i) the model that specifies , (y-axis) and (ii) the model that specifies (x-axis). Panel (A) displays all model parameters while panel (B) shifts focus to the bulk of the distribution seen in panel (A). A line through the origin is included for reference.


References:

1. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association. 1990;85:398-409.

2. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. The Journal of Chemical Physics. 1953;21:1087-1092.

3. Gelfand AE, Vounatsou P. Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics. 2003;4:11-25.

4. Lavine ML, Hodges JS. On rigorous specification of ICAR models. The American Statistician. 2012;66:42-49.

5. Gelfand AE, Kim HJ, Sirmans CF, Banerjee S. Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association. 2003;98:387-396.

6. Cressie N. Statistics for Spatial Data. New York: John Wiley & Sons; 1993.