We compare this method with other recent methods on artificial and real-world benchmark datasets, and our method shows state-of-the-art performance. Note that in the last two cases, reversing the arrow between X and Z does not change any of the independence relationships, and the causal interpretations of the graphs remain the same.
Causality and Machine Learning
This can be seen in the name "Journal of Causal Inference" and in the title of Peters, Janzing and Schölkopf, 2017, which mainly discusses (bivariate) causal discovery. In what follows, we will use the term "causal inference" in the former sense, especially when we want to include the study of causal effect identification, but we avoid the latter use because it could confuse some readers.
Research Problems
Treatment Effect Estimation
The fundamental difficulty of causal inference is that we never observe the counterfactual outcomes that would have occurred had we made a different decision (treatment or control). Identification is even more important for causal inference because, unlike the usual (non-causal) misspecification of models, causal assumptions are often not verifiable from observable data (White and Chalak, 2013).
Bivariate Causal Discovery
VAEs (Kingma, Welling, et al., 2019) are suitable for causal estimation thanks to their probabilistic nature. Thirdly, there are a few methods – e.g. CGNN (Goudet et al., 2018) – which use more flexible models and achieve better performance, but without theoretical justification.
Contributions
Treatment Effect Estimation
The goal of our method is to recover the prognostic score (Hansen, 2008), adjusted to take both properties into account, as in Definition 2. Comparing the definitions of balancing score and prognostic score, we can say that the balancing score is sufficient for the treatment T (T ⊥ V | b(V)), while the prognostic score (Pt-score as in Section 5.1.2) is sufficient for the potential outcomes Y(t) (Y(t) ⊥ V | p_t(V)).
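For reference, the two sufficiency conditions can be displayed side by side (a restatement of the conditions just given; b and p_t denote the balancing and prognostic scores, respectively):

```latex
\begin{align*}
  T    &\perp\!\!\!\perp V \mid b(V)    && \text{(balancing score)} \\
  Y(t) &\perp\!\!\!\perp V \mid p_t(V)  && \text{(prognostic score)}
\end{align*}
```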
Bivariate Causal Discovery
They complement each other: the conditioning in each deconfounds the potential outcomes from the treatment, with the former focusing on the treatment side and the latter on the outcome side.
Nonlinear ICA
- VAE from the Viewpoint of Nonlinear ICA
- Nonlinear ICA and Causal Discovery
- Detailed Comparisons
- Injectivity, Invertibility, Monotonicity, and Overlap
Machine learning studies on this topic have focused on finding overlapping regions (Oberst et al., 2020). The results are exploited in causal discovery (Wu and Fukumizu, 2020a) and out-of-distribution (OOD) generalization (Sun et al., 2020).
Causal Discovery
Causal Structure Learning
Bivariate Causal Discovery
Stegle et al., 2010 do not restrict the class of causal models; in particular, the noise need not be additive. Hajage et al., 2017 show that prognostic score methods are generally more efficient than, or as good as, propensity score methods.
Identification under Generative Prognostic Model
Model, Architecture, and Identifiability
(M1) i) f_t is injective, and ii) f_t is differentiable. (D1) λ_t(X) is non-degenerate, i.e., the linear span of its support is 2n-dimensional. The essence of the result is that f′_t = f_t ∘ A_t; that is, f_t can be identified (learned) up to an affine transformation A_t.
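Spelling this out (a restatement under the stated conditions; that h_t denotes the latent map paired with f_t is an assumption of notation here):

```latex
% If two parameter sets induce the same outcome distribution, so that
% f'_t(h'_t(x)) = f_t(h_t(x)) for all x, then f'_t = f_t \circ A_t
% together with the injectivity of f_t (M1) gives
h'_t(x) = A_t^{-1}\bigl(h_t(x)\bigr),
```

i.e., the latent representation itself is recovered up to the affine transformation A_t.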
Details and Explanations on Intact-VAE
In our case, the generative model is built as a way to learn the scores through the correspondence with (5.3). The violation of causal faithfulness is also not caused by the generative model (shown in Figure 4.1), because the representation is learned by the encoder, and Z ⊥ T | X is enforced by β.
Identifications under Limited-overlapping Covariate
However, the existence of a low-dimensional balanced prognostic score is uncertain in practice when our knowledge of the DGP is limited. Thus, we rely on Proposition 1, based on model identifiability, to work under a prognostic score that typically exists.
Estimation by β-Intact-VAE
Prior as balanced prognostic score, posterior as prognostic score
With β we control the trade-off between the first and second terms: the first is the deviation of the posterior from the balanced prior, and the second is the reconstruction of the outcome. Under Gaussian models, the consistency of the posterior estimate can be shown, as in Bonhomme and Weidner (2021).
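As a minimal sketch of this β-weighted trade-off (the Gaussian parameterization, tensor shapes, and function name are illustrative assumptions, not the thesis' exact implementation):

```python
import torch

def beta_objective(y, y_mean, y_logvar,
                   q_mean, q_logvar, p_mean, p_logvar, beta=1.0):
    """Negative beta-weighted ELBO with Gaussian posterior q(z|x,y,t),
    conditional prior p(z|x,t), and Gaussian outcome model.
    All tensors are assumed to have shape (batch, dim)."""
    # second term: reconstruction of the outcome y
    # (negative log-likelihood, up to an additive constant)
    recon = 0.5 * (y_logvar + (y - y_mean) ** 2 / y_logvar.exp()).sum(-1)
    # first term: KL divergence of the posterior from the balanced prior
    kl = 0.5 * (p_logvar - q_logvar
                + (q_logvar.exp() + (q_mean - p_mean) ** 2) / p_logvar.exp()
                - 1.0).sum(-1)
    # beta controls the trade-off between balancing and reconstruction
    return (recon + beta * kl).mean()
```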
Conditionally Balanced Representation Learning
The balancing does not apply to the encoder, which learns from the factual X, Y, T (see also the explanation of ε_CF,t in Section 4.3.2). The general steps of the algorithm are i) train the VAE using (4.7) and ii) infer the CATE τ̂(x). Accordingly, the KL term in the ELBO (4.7) is symmetric in t and balances q_t(z|x), encouraging Z ⊥ T | X for the posterior. V_Y(x) reflects the intrinsic variance in the DGP and cannot be controlled.
Consistency of VAE and Prior Estimation
We do not need two hyperparameters, since G is implicitly governed by the third term, a norm constraint, in the ELBO. Then, in the limit of infinite data, we have µ_t(X) = f_t(h_t(X)), where f, h are the optimal parameters learned by the VAE.
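A minimal sketch of step ii), CATE inference, built on this consistency statement; `prior_mean` and `decoder` are hypothetical handles to the trained h_t and f_t:

```python
def cate(x, prior_mean, decoder):
    """Estimate tau(x) = mu_1(x) - mu_0(x), using mu_t(x) = f_t(h_t(x))."""
    mu = {t: decoder(prior_mean(x, t), t) for t in (0, 1)}
    return mu[1] - mu[0]
```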
Experiments
Synthetic Dataset
Thus, the performance gap under dim(W) > 1 in Figure 4.2 should be due to the NN performance in β-Intact-VAE.

FIGURE 4.5: Histograms of R² (left) and RESET p-values (right) for linear regressions between true and learned scores.
IHDP Benchmark Dataset
In the modified version, the strong effect of unconditional balancing on the PEHE is clearly seen: with different γ, it significantly affects the PEHE but barely affects the ATE error. Results of other methods are taken from Shalit, Johansson and Sontag, 2017, except GANITE and CEVAE.
Details and Additions of Experiments
Synthetic Data
Note that the above observations about dim(Z) are not caused by fixing g_t(W) = 1 (compare Figure 4.7 with Figure 5.3 below). Compared to Figure 4.2 in the main text, where g_t(W) in the DGPs is not fixed, our method works worse here, especially for large β, because now noise modeling (g, k in the ELBO) just adds unnecessary complexity.
IHDP
Thus, this experiment also shows the importance of the VAE, even when there is an apparently balanced prognostic score. Compared to ours, CFR, with its unconditional balancing, does not improve the ATE error; it can improve the PEHE with a fine-tuned parameter, but possibly at the cost of a worse ATE error.
Empirical Validation of the Bounds in Sec. 4.3.2
Proofs
(D2') limits the discrepancy between λ_0, λ_1 on 2n+1 values of X, and is therefore relatively easy to satisfy with high-dimensional X. (D2') is general despite (or because of) its abstract formulation. The above proof also shows that (D2') is a necessary and sufficient condition for A_0 = A_1, given the other conditions of Theorem 1.
Detailed Explanations and Discussions
- List of Assumptions
- Discussions and Examples of (G2)
- Complementarity between the two Identifications
- Ideas and Connections behind the ELBO (4.7)
- Additional Notes on Novelties of the Bounds in Sec. 4.3.2
On the other hand, it has been shown that birth weight increases slightly (by about 100 g) in the same age group in a studied population (Wang et al., 2020). Both are bad because Bonhomme and Weidner (2021) show that the posterior only helps under limited (small) misspecification, and the posterior estimator has higher variance than the prior estimator (see below for an extreme case).
Unobserved Confounding
Identification
The covariate(s) X need not have subgroups in every category in the graph. ε_Y is unobserved exogenous noise on Y. However, if X is not an observed confounder, the naive regression E[Y|X = x, T = t] based on observable variables is not equal to µ_t(x).
Prognostic Score with U
Here, X_c, X_iv, X_pa, X_pd, X_y are covariates that are: (observed) confounder, IV, antecedent proxy (that is, an antecedent of Z), descendant proxy, and antecedent of Y, respectively. In Section 5.4 we discuss how our model relates to, and could learn, relaxations of the Pt-score when they are not observed.
Experiments
Synthetic Dataset
Figure 5.3 shows that our method significantly outperforms CEVAE in all cases. Both methods work best under ignorability (“ig”), as expected. Moreover, as shown in Figure 5.4, our method learns the representation as an approximate affine transformation of the true latent value, as a result of the identifiability of our model.
Pokec Social Network Dataset
Here, the true latent Z is a PS, and there is no better candidate PS than Z, because f_t is invertible and no information can be dropped from Z. Under the IV setting, while treatment effects are estimated as well as in the proxy setting, the relationship with the true latent is significantly obscured, because the true latent is correlated with the IV X only given T, while we model it by p(Z|X).
VAEs for Treatment Effect Estimation: a Critical Examination
To extract information from the network structure, we use the Graph Convolutional Network (GCN) (Kipf and Welling, 2017) in Intact-VAE's pre-encoder. We show in the following subsections how our model and its identifiability inspire theoretical developments in treatment effect identification.
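A minimal sketch of such a GCN pre-encoder (using PyTorch Geometric; the class name and layer sizes are illustrative assumptions, not the thesis' exact architecture):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNPreEncoder(torch.nn.Module):
    """Aggregates network structure into node features that are then
    fed to Intact-VAE's encoder."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, out_dim)

    def forward(self, x, edge_index):
        # x: node covariates; edge_index: graph connectivity (2, num_edges)
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)
```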
Theoretical Ideas under Unobserved Confounding
As mentioned in the introduction, we encounter a great variety of causal relationships in nature. We must keep in mind that systems that appear to have different mechanisms may share the same mechanism.
Learning the Shared Mechanism by TCL
In practice, under the assumption that direct causal effects exist, we can only compare the values of an independence measure, as we will explain in Section 6.3. In Section 6.4, we use an ensemble method to exploit the imperfect TCLs trained on the loosely satisfying sets mentioned in the previous paragraph.
Theoretical Results
- Separation of Training and Testing
- Inference Methods and Identifiability
- Choice of Independence Test
- Structural MLP
The first rule of inference assumes that we know the causal direction for each of the training pairs, so that they can be trivially matched. This can be easily implemented as shown in Figure 6.2 (right): we build an MLP with a single output node for each of g1 and g2, then combine the outputs together.
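A minimal sketch of this construction (hidden sizes and the class name are assumptions; Figure 6.2 defines the actual layout):

```python
import torch
import torch.nn as nn

class StructuralMLP(nn.Module):
    """Two sub-networks g1 and g2, each ending in a single output node,
    whose outputs are combined (here by summation)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.g1 = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))
        self.g2 = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, x, y):
        # combine the single-node outputs of the two sub-networks
        return self.g1(x) + self.g2(y)
```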
Assembling Causal Mosaic
- Preparing Materials
- Choosing Tesserae
- From Tesserae to Causal Mosaic
- Alternative Ensemble Scorings
This requires knowledge of the causal directions of the training pairs, so it can only be used with the first rule. First, we can use each TCL to infer the causal directions of its own training pairs (Algorithm 3, lines 2–3) and select the TCLs that achieve accuracy higher than the threshold ThreT.
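A minimal sketch of this selection step (the helpers `infer_direction` and `true_direction` are hypothetical stand-ins for Algorithm 3's subroutines):

```python
def select_tcls(tcls, training_pairs, thre_t):
    """Keep the TCLs whose accuracy on the causal directions of their own
    training pairs exceeds the threshold ThreT."""
    selected = []
    for tcl, pairs in zip(tcls, training_pairs):
        correct = sum(infer_direction(tcl, p) == true_direction(p)
                      for p in pairs)
        if correct / len(pairs) > thre_t:
            selected.append(tcl)
    return selected
```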
Experiments
Artificial Data
Assuming Direct Causal Effect. Our method and NonSENS formally require direct causal effects to exist between pairs, and this is our main experimental setting. As shown in Figure 4, in the multi-environment setting, our method outperforms NonSENS, especially when the number of environments is large.
Real World Dataset
For NCC, we infer each pair by training the method on the rest of the pairs. The performance of ANM is worse than reported in Mooij et al., 2016, possibly due to a different implementation of the independence test.
Details and Notes for Artificial Experiments
The slight drop in performance in the multi-pair setting likely comes from the doubled input dimensionality required.
Proofs
Discussions
Combining Graphical Search Methods
Invertibility Requirement in Definition 4
However, before turning to larger perspectives, we first provide summaries of the two lines of our work. Our method outperforms or matches state-of-the-art methods under various conditions, including unobserved confounding.
Future Work
We believe that this line of work will also pave the way to principled causal effect estimation with other deep architectures, given the rapid advances in deep identifiable models. For example, Khemakhem et al., 2020a provide identifiability for deep energy-based models, and Roeder, Metz, and Kingma, 2020 extend the result to a wide class of modern deep discriminative models.
On Causal Mosaic
Future Work on Hidden Confounding
For example, it would be interesting to see if Bayesian learning (Law et al., 2018) could bring something new, e.g. the integration of prior knowledge. Using KME, we may be able to combine a distance between distributions (such as MMD) with a complexity metric (as in Chen et al., 2014).
Prospects at the Intersection of Causality and Machine Learning
Finally, we take a step back and briefly look at the intersection of causality and machine learning as a whole. First, in the above overview and outlook, the "causality for machine learning" side is basically left out.
Empirical Validation of the Error Bound of Intact-VAE
This means that given T = t, the latent representation can be identified up to an invertible element-wise affine transformation. This prediction can be arbitrarily far from the truth Y(1) = f_1(z_0), due to the difference between A_1 and A_0.
Balancing Covariate and its Two Special Cases
A technical detail is that z, z′ may not always be connected by A, because we used the joint support of Y, Y′ in the test.