
2.2 Nonlinear ICA

2.2.2 Nonlinear ICA and Causal Discovery

The following definition formally states the connection between SCM and nonlinear ICA:

Definition 4. An SCM (2.2) is analyzable if there exists a differentiable and invertible² function $f: \mathbb{R}^n \to \mathbb{R}^n$ such that $X = f(E)$.

Obviously, an analyzable SCM is a special case of nonlinear ICA's generative model, with a particular structure among the variables. For example, in the bivariate SCM (2.3), let $f_3(E_1, E_2) = f_2(f_1(E_1), E_2)$ and $f = (f_1, f_3)$; the SCM can then be written as $(X_1, X_2) = f(E_1, E_2)$. Now, if $f$ is differentiable and invertible on $\mathbb{R}^2$, the SCM is analyzable.

For an analyzable SCM, if we can solve the corresponding nonlinear ICA problem, we obtain the hidden variables $E = g(X)$. In the bivariate case, given $E_1$ and $E_2$, under the causal Markov and faithfulness assumptions (Spirtes and Zhang, 2016), we can conclude:

$$X_1 \to X_2 \text{ if } X_1 \perp\!\!\!\perp E_2, \qquad X_2 \to X_1 \text{ if } X_2 \perp\!\!\!\perp E_1 \tag{2.8}$$

This criterion was exploited by many classical methods, e.g., LiNGAM and ANM, and can be easily understood as the independence of noise and cause.
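To make criterion (2.8) concrete, below is a minimal, self-contained Python sketch of the resulting direction test. It assumes the nonlinear ICA step has already produced recovered sources e1, e2 from observations x1, x2; the helper names rbf_gram, hsic, and decide_direction are illustrative, and the (biased) HSIC statistic stands in for whatever independence test one prefers.

```python
import numpy as np

def rbf_gram(x, sigma=None):
    # Gaussian-kernel Gram matrix from pairwise squared distances.
    d2 = (x[:, None] - x[None, :]) ** 2
    if sigma is None:
        # Median heuristic for the bandwidth.
        sigma = np.sqrt(np.median(d2[d2 > 0]) / 2)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    # Biased HSIC estimate; values near zero suggest independence.
    n = len(x)
    K, L = rbf_gram(x), rbf_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def decide_direction(x1, x2, e1, e2):
    # Criterion (2.8): the cause is independent of the
    # recovered noise of the other variable.
    s12 = hsic(x1, e2)  # small if X1 -> X2
    s21 = hsic(x2, e1)  # small if X2 -> X1
    return "X1 -> X2" if s12 < s21 else "X2 -> X1"
```

In practice one would replace the hard comparison with a permutation test on the HSIC statistics, but the logic of the decision rule is as above.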

²This does not imply as strong a restriction as it might seem; see Sec. 6.8.2.

Nonlinear ICA violates the causal faithfulness assumption. The causal Markov and faithfulness assumptions are common in the causal discovery literature, and we also require them in our theorem. However, we should note that the causal faithfulness assumption is violated for a realized bivariate nonlinear ICA, because $X_1 \not\perp\!\!\!\perp X_2$ and the nonlinear ICA procedure necessarily has one of the following graphical models:

[FIGURE 2.2: Graphs of the nonlinear ICA procedure; three candidate graphical models over $X_1$, $X_2$, $C_1$, $C_2$.]

None of them induces $C_1 \perp\!\!\!\perp C_2$ under the causal faithfulness assumption.

Chapter 3

Literature Review

The broadness of causality studies is well represented by Pearl (2009)'s encyclopedic monograph, which spans causal discovery, causal identification, interventional analysis, and counterfactual reasoning. Given the importance and breadth of Pearl's work, a brief review is given here as a general starting point for this section. Readers interested in the history are referred to Geffner, Dechter, and Halpern (2022), which contains an annotated bibliography and four introductions by Pearl himself.

Pearl's interest in causality started from his work on Bayesian networks (Pearl, 1988), which later found applications in causality (Verma and Pearl, 1988; Pearl and Verma, 1991). His work started from causal discovery (Verma and Pearl, 1990) and was influenced by the work of some computationally oriented philosophers (Glymour, Scheines, and Spirtes, 1987). Later, his work touched on the identification of causal effects, i.e., the famous back-door criterion (Pearl, 1993), the semantics of counterfactuals (Balke and Pearl, 1994), and mediation analysis (Pearl, 2001), all of which are based on a graphical language. To date, his work has influenced statistics (Drton and Maathuis, 2017), biostatistics (Greenland, Pearl, and Robins, 1999), econometrics (Imbens, 2020), and, of course, machine learning (Schölkopf et al., 2021; Kaddour et al., 2022).

Below, works on the two problems tackled in this thesis are reviewed specifically.

3.1 Treatment Effect Estimation

Under the unconfoundedness assumption, the problem of covariate imbalance is traditionally addressed by balancing methods, including matching and re-weighting (Stuart, 2010; Rosenbaum, 2020), because adjusting for imbalance in treatment assignment controls bias in treatment effect estimation. In matching methods, subjects in the control (treatment) group that are similar to a subject in the treatment (control) group are found, that is, "matched to" that subject, and used as a sample to infer the treated (controlled) subject's potential outcome. Examples include, to name a few, Mahalanobis matching (Rubin, 1979), propensity score matching (Rosenbaum and Rubin, 1983), full matching (Rosenbaum, 1991), fine balancing (Rosenbaum, Ross, and Silber, 2007), and adaptive hyper-box matching (Morucci et al., 2020). Re-weighting methods balance the treatment and control groups by weighting subjects of both groups. The seminal method is inverse propensity weighting (IPW) (Rosenbaum, 1987). To avoid extreme weight values, there are also stabilized weighting (Cole and Hernán, 2008), trimmed weighting (Lee, Lessler, and Stuart, 2011), and overlap weighting (Li, Morgan, and Zaslavsky, 2018).
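As a concrete reference point, here is a minimal Python sketch of the IPW estimator of the ATE, with simple propensity clipping in the spirit of trimmed weighting. It assumes arrays X, t, y (covariates, binary treatment, outcome) are given; the function name is illustrative, and scikit-learn's logistic regression is used purely as an example propensity model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y, clip=0.05):
    # Estimate propensity scores e(x) = P(T=1 | X=x).
    e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    # Clip propensities away from 0 and 1 to avoid extreme weights.
    e = np.clip(e, clip, 1 - clip)
    # IPW (Horvitz-Thompson style) estimate of the ATE.
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```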

With nonparametric statistics and machine learning came regression methods. Recall that the motivation behind balancing methods is that, in an RCT, or if the propensity score is properly estimated, we can avoid, or relax the assumptions on, modeling the response surfaces $\mu_t(X)$, which might be arbitrarily nonlinear and multivariate functions. Conversely, regression methods aim to model the response surfaces precisely, using flexible regression models, without propensity score estimation. This is why nonparametric or machine learning models are considered, e.g., regression trees (Hill, 2011; Athey and Imbens, 2016) and random forests (Wager and Athey, 2018).
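A minimal sketch of the regression approach, in the common "one flexible model per arm" form (sometimes called a T-learner); random forests appear only as an example of a flexible regressor, and the function name is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def regression_ate(X, t, y):
    # Fit one flexible response-surface model per treatment arm.
    mu0 = RandomForestRegressor().fit(X[t == 0], y[t == 0])
    mu1 = RandomForestRegressor().fit(X[t == 1], y[t == 1])
    # ATE is the average difference of the imputed potential outcomes.
    return np.mean(mu1.predict(X) - mu0.predict(X))
```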

Mixed (double) methods combine balancing and regression because, as we have seen, both are useful for controlling the bias in treatment effect estimation. While many machine learning methods, including ours, fall into this category because they have flexible outcome regressions, this line of work in fact long predates the advent of machine learning, for example, in the form of regression with propensity adjustment (Rosenbaum and Rubin, 1983). Also, doubly robust estimators (Cassel, Särndal, and Wretman, 1976; Robins, Rotnitzky, and Zhao, 1994) are consistent if either the propensity estimation or the outcome estimation is consistent, and can possibly use machine learning for both estimators (Chernozhukov et al., 2018). Another benefit of the combination is to debias machine learning regressions and obtain $\sqrt{N}$-consistency, possibly without propensity estimation (Athey, Imbens, and Wager, 2018). Further, double/debiased machine learning (DML) (Chernozhukov et al., 2018) provides a semi-parametric framework, not limited to causal effects as target parameters, that exploits machine learning for estimating nuisance parameters while retaining $\sqrt{N}$-consistency. We note that, for machine learning methods, balancing is often achieved by a regularization term penalizing the imbalance, and this is true for most of the BRL methods mentioned below, including ours.
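For concreteness, here is a minimal sketch of a doubly robust (AIPW-style) estimate that combines the two previous sketches. The cross-fitting that DML uses to control overfitting bias is omitted for brevity, and the function names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

def aipw_ate(X, t, y, clip=0.05):
    # Nuisance estimates: propensity and per-arm response surfaces.
    e = np.clip(LogisticRegression().fit(X, t).predict_proba(X)[:, 1],
                clip, 1 - clip)
    m0 = RandomForestRegressor().fit(X[t == 0], y[t == 0]).predict(X)
    m1 = RandomForestRegressor().fit(X[t == 1], y[t == 1]).predict(X)
    # AIPW: outcome-model estimate plus IPW-weighted residual correction.
    psi = m1 - m0 + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)
    return psi.mean()
```

The estimator remains consistent if either the propensity model or the outcome models are consistent, which is the "double robustness" referred to above.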

Below, we focus on several lines of work that are particularly related to aspects of our method.

Limited overlap. Under limited overlap, Luo, Zhu, and Ghosh (2017) estimate the ATE by reducing covariates to a linear prognostic score. Farrell (2015) estimates a constant treatment effect under a partially linear outcome model. D'Amour and Franks (2021) study the identification of the ATE by a general class of scores, given the (linear) propensity score and prognostic score. Machine learning studies on this topic have focused on finding overlapping regions (Oberst et al., 2020; Dai and Stultz, 2020) or on indicating possible failure under limited overlap (Jesson et al., 2020), but not on remedies. An exception is Johansson et al. (2020), which provides bounds under limited overlap. To the best of our knowledge, our method is the first machine learning method that provides identification under limited overlap.

Prognostic scores have recently been combined with machine learning approaches, mainly in the biostatistics community. For example, Huang and Chan (2017) estimate individualized treatment effects by reducing covariates to a linear score that is a joint propensity-prognostic score. Tarr and Imai (2021) use SVM to minimize the worst-case bias due to prognostic score imbalance. However, in the machine learning community, few methods consider prognostic scores; Zhang, Liu, and Li (2020) and Hassanpour and Greiner (2019) learn outcome predictors without mentioning prognostic scores, while Johansson et al. (2020) conceptually, but not formally, connect BRL to prognostic scores. Our work is the first to formally connect generative learning and prognostic scores for treatment effect estimation.

Identifiable representation. Recently, independent component analysis (ICA) and representation learning, both ill-posed inverse problems, have come together to yield nonlinear ICA and identifiable representations; for example, using VAEs (Khemakhem et al., 2020b) and energy models (Khemakhem et al., 2020a). These results have been exploited in causal discovery (Wu and Fukumizu, 2020a) and out-of-distribution (OOD) generalization (Sun et al., 2020). This study is the first to explore identifiable representations in treatment effect identification.

BRL and related methods amount to a major direction. Early BRL methods include BLR/BNN (Johansson, Shalit, and Sontag, 2016) and TARnet/CFR (Shalit, Johansson, and Sontag, 2017). In addition, Yao et al. (2018) exploit the local similarity between data points. Shi, Blei, and Veitch (2019) use an architecture similar to TARnet, considering the importance of the treatment probability. There are also methods that use GANs (GANITE; Yoon, Jordon, and Schaar, 2018) and Gaussian processes (Alaa and Schaar, 2017). Our method shares the idea of BRL and further extends it to conditional balance, which is natural for individualized treatment effects.
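To illustrate the BRL idea and the imbalance-penalizing regularizer noted above, here is a minimal PyTorch sketch in the spirit of TARnet/CFR: a shared representation with one outcome head per arm, trained on the factual loss plus a representation-imbalance penalty. The class and function names are illustrative, and a linear-kernel MMD stands in for the MMD/Wasserstein penalties used in the literature.

```python
import torch
import torch.nn as nn

class TARNetLike(nn.Module):
    # Shared representation Phi(x) with one outcome head per arm.
    def __init__(self, d_in, d_rep=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_rep), nn.ReLU(),
                                 nn.Linear(d_rep, d_rep), nn.ReLU())
        self.head0 = nn.Sequential(nn.Linear(d_rep, d_rep), nn.ReLU(),
                                   nn.Linear(d_rep, 1))
        self.head1 = nn.Sequential(nn.Linear(d_rep, d_rep), nn.ReLU(),
                                   nn.Linear(d_rep, 1))

    def forward(self, x):
        r = self.phi(x)
        return r, self.head0(r).squeeze(-1), self.head1(r).squeeze(-1)

def mmd_linear(r0, r1):
    # Linear-kernel MMD between control and treated representations.
    return ((r0.mean(0) - r1.mean(0)) ** 2).sum()

def cfr_loss(model, x, t, y, alpha=1.0):
    # Factual regression loss plus the balance regularizer.
    r, y0, y1 = model(x)
    y_hat = torch.where(t.bool(), y1, y0)
    factual = ((y_hat - y) ** 2).mean()
    imbalance = mmd_linear(r[t == 0], r[t == 1])
    return factual + alpha * imbalance
```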

Causal inference with auxiliary structures. CEVAE (Louizos et al., 2017) relies on the strong assumption that the true confounder distribution can be recovered from proxies. Our method is quite different in motivation, applicability, and architecture; detailed comparisons are given in Sec. 3.1.1. Also working with proxies, Kallus, Mao, and Udell (2018) use matrix factorization to infer the confounders, and Mastouri et al. (2021) use kernel methods to solve the underlying Fredholm integral equation. IVs are also exploited in machine learning; there are methods using deep NNs (Hartford et al., 2017) and kernels (Singh, Sahani, and Gretton, 2019; Muandet et al., 2019).

Our work lays conceptual and theoretical foundations for VAE methods for treatment effects (e.g., CEVAE, Louizos et al., 2017; Lu et al., 2020); see Section 5.3. In Section 3.1.1, we also make detailed comparisons to CFR and CEVAE, which are well-known machine learning methods. In addition, some studies consider monotonicity, which is injectivity on $\mathbb{R}$, together with overlap; this is discussed in detail below.