Corollary 1. Under the conditions of Theorem 1, further require the consistency of Intact-VAE. Then, in the limit of infinite data, we have µt(X) = ft(ht(X)), where f, h are the optimal parameters learned by the VAE.
FIGURE 4.2: √ϵpehe on synthetic datasets. Each error bar is over 10 random DGPs. Left panel: non-overlap (ω) on the x-axis; right panel: dim(W) on the x-axis; PEHE on the y-axis. Curves: CFR and β ∈ {1.0, 1.5, 2.0, 2.5, 3.0}.

FIGURE 4.3: Plots of recovered vs. true latent. Blue: T = 0; Orange: T = 1.
The level of non-overlap is controlled by ω, which multiplies the logit value. See Sec. 4.5.1 for details and more results on synthetic datasets.
With the same (dim(W), ω), we evaluate our method and CFR on 10 random DGPs, with different sets of functions f, g, h, k, l in (5.4). For each DGP, we sample 1500 data points and split them into 3 equal sets for training, validation, and testing. We show our results for different hyperparameters β. For CFR, we try different balancing parameters and present the best results (see Sec. 4.5.1 for details).
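To make this protocol concrete, below is a minimal, hedged sketch. The toy DGP is not the family in (5.4); it only reproduces the one mechanism stated above, ω scaling the propensity logit (so larger ω means less overlap), together with the 3-way equal split. All function names and forms here are illustrative assumptions.

```python
import numpy as np

def sample_toy_dgp(dim_w, omega, n=1500, seed=0):
    """Toy stand-in for one DGP; only the overlap mechanism is faithful:
    omega scales the propensity logit, shrinking overlap as it grows."""
    rng = np.random.RandomState(seed)
    W = rng.randn(n, dim_w)                       # latent score W
    X = np.tanh(W @ rng.randn(dim_w, 5))          # observed covariates (toy h)
    logit = omega * W.sum(axis=1) / np.sqrt(dim_w)
    T = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    tau = np.sin(W.sum(axis=1))                   # toy CATE
    Y = W.sum(axis=1) + T * tau + 0.1 * rng.randn(n)
    return X, T, Y, tau

X, T, Y, tau = sample_toy_dgp(dim_w=1, omega=6)
idx = np.random.RandomState(0).permutation(len(Y))
train, val, test = np.split(idx, [500, 1000])     # 3 equal sets of 500
```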
In each panel of Figure 4.2, we adjust one of ω and dim(W), with the other fixed to its lowest value. As implied by our theory, our method, with only 1-dimensional Z, performs much better in the left panel (where dim(W) = 1 satisfies (G2)) than in the right panel (where dim(W) > 1). Although CFR uses a 200-dimensional representation, our method performs much better than CFR in the left panel; moreover, in the right panel CFR is not much better than ours. Further, our method is much more robust to different DGPs than CFR (see the error bars). Thus, the results indicate the power of identification and recovery of scores (see also Figure 4.3).
Under the lowest overlap level (ω = 22), large β (= 2.5, 3) shows the best results, which accords with the intuition and bounds in Sec. 4.3. When dim(W) > 1, ft in (4.12) is non-injective and learning of a prognostic score is necessary; thus, larger β has a negative effect. In fact, β = 1 is significantly better than β = 3 when dim(W) > 2. We note that our method, with a higher-dimensional Z, outperforms or matches CFR also under dim(W) > 1 (see Figure 4.7). Thus, the performance gap under dim(W) > 1 in Figure 4.2 should be due to the capacity of the NNs in β-Intact-VAE.
In Figure 4.9, for ATE error, CFR's performance drops as overlap decreases. This is evidence that CFR and its unconditional balance overly focus on PEHE (see Sec. 4.4.2 for a more explicit comparison).
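For reference, the two error metrics being contrasted can be sketched as follows (standard definitions of ϵate and √ϵpehe, not code from the thesis). Per-unit errors of opposite sign cancel in the ATE but not in PEHE, which is why a method can do well on one metric while degrading on the other:

```python
import numpy as np

def sqrt_pehe(tau_hat, tau_true):
    # Precision in Estimation of Heterogeneous Effects (rooted):
    # RMSE of the individual-level treatment-effect estimates.
    return np.sqrt(np.mean((tau_hat - tau_true) ** 2))

def ate_error(tau_hat, tau_true):
    # Absolute error of the average treatment effect; per-unit errors
    # of opposite sign cancel here, unlike in PEHE.
    return np.abs(np.mean(tau_hat) - np.mean(tau_true))
```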
Experiments for the score recovery. When dim(W) = 1, there are no better prognostic scores than W, because ft is invertible and no information can be dropped from W. Thus, our method stably learns Z as an approximate affine transformation of the true W, showing identification. An example is shown in Figure 4.3, and more plots are in Figure A.1. For comparison, we run CEVAE, which is also based on a VAE but lacks identification; CEVAE shows much lower quality of recovery. As expected, both recovery and estimation are better with the balanced prior pλ(z|x), and we can see examples of bad recovery using pλ(z|x, t) in Figure A.7.
To show quantitative evidence for the score recovery, we first fit a simple linear regression W = aZ between the standardized true and learned scores. Then we examine the linear regression in two ways: goodness of fit, through the coefficient of determination R²; and model specification, through the Ramsey regression equation specification error test (RESET) (Ramsey, 1969). Specifically, R² = 1 − Σᵢ(wᵢ − ŵᵢ)² / Σᵢ(wᵢ − w̄)² measures how much of the variation of W is explained by Z in the regression; the nearer R² is to 1, the tighter the linear fit. Moreover, the Ramsey RESET tests the null hypothesis of linearity by examining whether the powers Ŵ², ..., Ŵᵏ, where Ŵ = âZ, help explain the response variable W (we set k = 5). Linearity is rejected if the p-value of the test is lower than a significance level α.
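A minimal sketch of these two diagnostics is given below, assuming statsmodels (≥ 0.11), whose linear_reset implements the RESET test on powers of the fitted values. The function name check_recovery and the pass/fail thresholding are illustrative, not the thesis's code:

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

def check_recovery(w_true, z_learned, k=5, alpha=0.01):
    """R^2 and Ramsey RESET p-value for the regression W = aZ
    between standardized true and learned scores."""
    w = (w_true - w_true.mean()) / w_true.std()
    z = (z_learned - z_learned.mean()) / z_learned.std()
    # Intercept is ~0 after standardization; kept for a well-posed test.
    res = sm.OLS(w, sm.add_constant(z)).fit()
    r2 = res.rsquared                       # goodness of fit
    # RESET: do powers 2..k of the fitted values help explain W?
    pval = linear_reset(res, power=k, test_type="fitted", use_f=True).pvalue
    return r2, pval, (r2 > 0.75) and (pval > alpha)
```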
The Ramsey RESET test can catch cases where R² is near 1 but a small portion of the data causes notable non-linearity; see Figure 4.4, left, for an example. In fact, we observe that the RESET test is too sensitive to non-linearity when R² is high. An example is shown in Figure 4.4, right, where the non-linearity is barely notable and is possibly due to the several outliers on both sides. However, the RESET test gives a p-value as low as 0.004. Thus, we decide that α = 0.01 is reasonable for our purpose.
FIGURE 4.4: Examples of low p-values of RESET. Left: a notable non-linearity, and the p-value is practically 0. Right: tiny to no non-linearity, but the p-value is very low.
The histograms of the R² values and the RESET p-values on the 100 synthetic datasets are shown in Figure 4.5. We see that linear regression often gives good fits and is not misspecified. Specifically, R² is higher than 0.75 on 83 datasets and higher than 0.8 on 76 datasets, and the RESET p-value is higher than α = 0.01 on 82 datasets and higher than 0.05 on 72 datasets. Finally, taking the two criteria together, there are 66 datasets where R² is higher than 0.75 and the RESET p-value is higher than α = 0.01, an impressive result because the two conditions tend to be mutually exclusive and many cases like those in Figure 4.4 are excluded. Thus, we conclude that the experiment quantitatively confirms the theoretical result that Intact-VAE recovers the true score up to an affine transformation.
FIGURE 4.5: The histograms of R² (left) and RESET p-values (right) for linear regressions between the true and learned scores.
4.4.2 IHDP Benchmark Dataset
This experiment shows that our conditional BRL matches state-of-the-art BRL methods and does not overly focus on PEHE. IHDP (Hill, 2011) is a widely used benchmark dataset; though it is less well known, its covariates have limited overlap, and thus it is used in Johansson et al. (2020), which considers limited overlap. The dataset is based on an RCT, but Race is artificially introduced as a confounder by removing all treated babies with nonwhite mothers from the data. Thus, Race has highly limited overlap, and other covariates highly correlated with Race, e.g., Birth weight (Kelly et al., 2009), also have limited overlap. See Sec. 4.5.2 for details and more results.
There is a linear balanced prognostic score (a linear combination of the covariates). However, most of the covariates are binary, so the support of the balanced prognostic score often lies on small and separated intervals. Thus, the Gaussian latent Z in our model is misspecified. We use a higher-dimensional Z to address this, similarly to Louizos et al. (2017). Specifically, we set dim(Z) = 50, together with NNs of 50 ∗ 2 hidden units in the prior and encoder. We set β = 1 since it works well on synthetic datasets with limited overlap.
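A minimal PyTorch sketch of this configuration follows. The activation choice, the reading of "50 ∗ 2" as two hidden layers of 50 units, the encoder conditioning on (x, t, y), and the Gaussian output heads are assumptions for illustration, not the thesis's exact implementation:

```python
import torch.nn as nn

DIM_Z, HIDDEN = 50, 50   # dim(Z) = 50; "50 * 2" read as 2 hidden layers of 50
DIM_X = 25               # IHDP has 25 covariates

def gaussian_mlp(d_in, d_out):
    # Outputs mean and log-variance of a diagonal Gaussian over d_out dims.
    return nn.Sequential(
        nn.Linear(d_in, HIDDEN), nn.ELU(),     # activation is an assumption
        nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
        nn.Linear(HIDDEN, 2 * d_out),
    )

prior_net = gaussian_mlp(DIM_X, DIM_Z)         # balanced prior p(z|x), t-free
encoder_net = gaussian_mlp(DIM_X + 2, DIM_Z)   # encoder q(z|x,t,y), assumed
beta = 1.0                                     # as chosen from synthetic runs
```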
As shown in Table 4.1, β-Intact-VAE outperforms or matches the state-of-the-art methods; it has the best performance measured by both ϵate and ϵpehe, matching CF and CFR respectively. Also notably, our method outperforms the other generative models (CEVAE and GANITE) by large margins.
To show that our conditional balance is preferable, we also modify our method by adding two components for unconditional balance from CFR (see Sec. 4.5.1), which is based on bounding PEHE and is controlled by another hyperparameter γ. In the modified version, the over-focus on PEHE of the unconditional balance is seen clearly: varying γ significantly affects PEHE but barely affects the ATE error. In fact, the unconditional balance, with larger γ, only worsens the performance. See also Figure 4.9, where CFR gives larger ATE errors with less overlap.
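For concreteness, a sketch of one such unconditional balance term follows: an RBF-kernel MMD between treated and control representations, one of the IPMs used in CFR, added to the objective with weight γ. Only this penalty is sketched (the actual modification adds two components, see Sec. 4.5.1); the names and the simple biased estimator are illustrative:

```python
import torch

def rbf_mmd2(phi0, phi1, sigma=1.0):
    """Squared RBF-kernel MMD between representations of control (phi0)
    and treated (phi1) units, each a 2-D tensor of shape (n_group, dim)."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2            # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2)).mean()
    return k(phi0, phi0) + k(phi1, phi1) - 2 * k(phi0, phi1)

# Hypothetical total loss of the modified model: the usual objective
# plus the unconditional balance term weighted by gamma, e.g.
# loss = vae_loss + gamma * rbf_mmd2(z_mean[t == 0], z_mean[t == 1])
```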
TABLE 4.1: Errors on IHDP over 1000 random DGPs. "Mod. *" indicates the modified version with unconditional balance of strength γ = *. Italic indicates where the modified version is significantly worse than the original. Bold indicates the method(s) significantly better than the others. The results of other methods are taken from Shalit, Johansson, and Sontag (2017), except for GANITE and CEVAE, whose results are taken from the original works.
Method       ϵate           √ϵpehe
TMLE         .30 ± .01      5.0 ± .2
BNN          .37 ± .03      2.2 ± .1
CFR          .25 ± .01      .71 ± .02
CF           .18 ± .01      3.8 ± .2
CEVAE        .34 ± .01      2.7 ± .1
GANITE       .43 ± .05      1.9 ± .4
Ours         .180 ± .007    .709 ± .024
Mod. 1       .185 ± .008    1.175 ± .046
Mod. 0.2     .185 ± .008    .797 ± .030
Mod. 0.1     .186 ± .009    .748 ± .028
Mod. 0.05    .183 ± .008    .732 ± .028
Mod. 0.01    .181 ± .008    .719 ± .027