
on the true generating process, and our current assumptions in Theorem 1 can be relaxed to a large extent. Similarly to the current f, we may have identification for a general class of noises.

Also, our causal theory does not in principle require continuous latent distributions, though in Theorem 1 the differentiability of f is inherited from iVAE. Given that all current nonlinear-ICA-based identifiability results require a differentiable mapping between the latent and observed variables, theoretical extensions to discrete latent variables built directly on them would be challenging. However, what is essential for CATE identification is the same transformation between the true and recovered score distributions for both t; the transformation need not be affine, and possibly not even injective. This opens directions for future extensions that are not necessarily based on nonlinear ICA.

7.2.1 Future Work on Hidden Confounding

Tell Exactly Where the Correlations Come From

Generally, the relationship between two variables falls into one of four cases: 1) purely causal (no confounder between them), 2) purely confounded (neither causes the other), 3) both a causal relation and a confounder exist, 4) neither causal nor confounded. The existence of statistical dependence eliminates the last case. Before we can determine the causal direction, the question naturally arises: which case are we facing? However, to the author's knowledge, no work has addressed this question explicitly. Most research only asks whether the relation is purely causal or not and, consequently, cannot distinguish between 2) and 3).
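The elimination of case 4) can be illustrated with a minimal permutation test of dependence. This is only a sketch: the function name is illustrative, and the Pearson statistic detects linear dependence only, so a kernel statistic such as HSIC would be needed to also capture nonlinear dependence.

```python
import numpy as np

def perm_independence_test(x, y, n_perm=1000, seed=0):
    """Permutation test of (linear) dependence between x and y.

    Returns a p-value for H0: x and y are independent, using the
    absolute Pearson correlation as the test statistic.  A small
    p-value rules out case 4) above (no dependence at all).
    """
    rng = np.random.default_rng(seed)
    stat = abs(np.corrcoef(x, y)[0, 1])
    null = np.empty(n_perm)
    for i in range(n_perm):
        # break any dependence by randomly permuting y
        null[i] = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
    # fraction of permuted statistics at least as extreme as the observed one
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

# Example: a dependent pair vs. an independent pair
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y_dep = 2 * x + rng.normal(size=500)   # dependent on x
y_ind = rng.normal(size=500)           # independent of x
```

For the dependent pair the p-value is tiny, so dependence is detected and case 4) is excluded; for the independent pair the test has no evidence against independence.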

For example, as mentioned before, Zhang, Zhang, and Schölkopf, 2015 infer the existence of a confounder if exogeneity holds in neither direction. While this is reasonable, exogeneity might be invalidated by the confounder while a causal relation exists at the same time. On the other hand, it is noteworthy that some, though much less, work assumes the dependence is purely due to confounders, and derives necessary conditions (Chaves et al., 2014) or infers the latent causal structure (Kela et al., 2019). Under similar lines of reasoning, such work would conflate the above cases 1) and 3). Therefore, a possible solution would be to combine the two approaches: we may conclude it is the mixed case 3) if the test for a purely causal relation and the test for pure confounding both fail.
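The combined strategy can be sketched schematically. This is a toy sketch only: it assumes linear additive-noise models and uses a crude residual-independence proxy, and the helpers `residual_indep_pval` and `classify_pair` are hypothetical names, not procedures from the works cited above.

```python
import numpy as np

def residual_indep_pval(cause, effect, n_perm=500, seed=0):
    """Fit a linear additive-noise model effect = a*cause + b + noise,
    then test independence of the residual from the putative cause with
    a permutation test on the correlation of squared values (a crude
    nonlinear-dependence proxy; HSIC would be a stronger choice)."""
    a, b = np.polyfit(cause, effect, 1)
    resid = effect - (a * cause + b)
    rng = np.random.default_rng(seed)
    stat = abs(np.corrcoef(cause**2, resid**2)[0, 1])
    null = [abs(np.corrcoef(cause**2, rng.permutation(resid**2))[0, 1])
            for _ in range(n_perm)]
    return (1 + sum(s >= stat for s in null)) / (1 + n_perm)

def classify_pair(x, y, alpha=0.05):
    """Schematic decision rule for the combined approach: if exogeneity
    (residual independence) holds in some direction, call the pair purely
    causal; if it fails in both directions, suspect confounding."""
    if (residual_indep_pval(x, y) > alpha
            or residual_indep_pval(y, x) > alpha):
        return "purely causal (in some direction)"
    return "confounded (possibly mixed with a causal link)"
```

A full version would add a dedicated test for pure confounding, so that failing both tests flags the mixed case 3).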

Extend FCMs to Confounded Case

Perhaps this is the most obvious approach pointed out by current research. Ideally, it would be a remarkable contribution to make ANMs work under confounders.

However, over the years, there is still only LiNGAM that can handle confounders.

This fact possibly suggests that we should take an entirely different path from FCMs, which would be a great endeavor. Other types of constraints that work under confounders (see Peters, Janzing, and Schölkopf, 2017, Chapter 9) could be explored and possibly exploited. A more achievable goal might be to work mainly under linear SEMs. First, we could relax the assumptions on noise. Some special cases of Gaussian noise could be considered (see e.g. Peters and Bühlmann, 2014, but without confounders). And we might also consider non-additive noise. Second, we might extend the functional class to some extent, such as allowing GLMs with deliberately defined basis functions.
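To see why hidden confounding is what makes this extension hard, consider a small simulation of a linear SEM with a latent confounder h (a toy example; the coefficients are chosen arbitrarily for illustration). Naive regression recovers a mixture of the causal path and the backdoor path through h, not the causal coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Linear SEM with a hidden confounder h:
#   h -> x, h -> y, plus a true causal effect x -> y of 1.0
h = rng.normal(size=n)
x = 1.5 * h + rng.normal(size=n)
y = 1.0 * x + 2.0 * h + rng.normal(size=n)

# Naive regression of y on x mixes both paths.  Population slope:
#   Cov(x, y) / Var(x) = 1 + 2 * Cov(h, x) / Var(x)
#                      = 1 + 2 * 1.5 / 3.25 ≈ 1.92, not the causal 1.0
slope = np.polyfit(x, y, 1)[0]
```

Any FCM-based method extended to the confounded case must, implicitly or explicitly, separate these two paths.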

Follow the Path of Distribution Classification

The main difficulty is how to extend such methods (e.g., Lopez-Paz et al., 2015) to the multivariate case, since the number of classes grows super-exponentially w.r.t. the number of variables. A possible approach is to embed the graph into an RKHS (e.g. using a graph kernel (Ghosh et al., 2018)) and then exploit distribution regression methods (Szabó et al., 2016). Training data is another problem, since human labelling of causal structures involving hundreds of variables would be too expensive, if not impossible. To address this, we could resort to data synthesis methods. Another, perhaps more practical, research direction might be to introduce recent advances in distribution learning into causal discovery to improve accuracy, efficiency, and scalability. For example, it would be interesting to see whether Bayesian learning (Law et al., 2018) could bring up something new, e.g. the integration of prior knowledge.

Leverage Implicit Generative Models

Confounders could be treated as hidden variables from which the observed distribution is generated. In Goudet et al., 2018, we have already seen that 1) the loss does not really penalize anti-causal learning, and 2) the hill-climbing-like procedure is separated into artificial phases and has no guarantee of reaching the global optimum.

For the former, we ask: how do we design a loss from first principles regarding causality? For example, can we define a discrepancy metric that also takes into account the complexity of conditional distributions? Using KMEs, we might combine a distance between distributions (such as MMD) with a complexity metric (as in Chen et al., 2014). Another possible way is to explicitly penalize anti-causal learning by integrating the result of causal detection, as in 'Causal Regularization' (Janzing, 2019; Bahadori et al., 2017). For the latter, a research question is how to design a coherent training procedure that drives the discovery of the underlying causal structure. Here we may try hierarchical implicit models (Tran and Blei, 2018; Tran, Ranganath, and Blei, 2017), which are more powerful than deep generative models in that they are more scalable, can place priors on parameters, and can quantify the uncertainty of causal relations. Combining graph structure learning with generative models is also a possible solution.
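The MMD component of such a loss can be sketched as a stand-alone estimator. This is a minimal 1-D version with a Gaussian kernel; bandwidth selection and the complexity term discussed above are deliberately left out.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel between 1-D samples a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of the squared MMD between 1-D samples x and y."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    # drop the diagonal terms for the unbiased within-sample averages
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=500), rng.normal(size=500))
diff = mmd2_unbiased(rng.normal(size=500), rng.normal(2.0, 1.0, size=500))
```

For identically distributed samples the estimate is close to zero, while for shifted distributions it is clearly positive; a causal loss along the lines above would add a conditional-complexity penalty on top of such a distance.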

7.3 Prospects at the Intersection of Causality and Machine