3.2 Causal Discovery
3.2.2 Bivariate Causal Discovery
In recent years, a line of research has emerged that is specifically motivated by the problem of distinguishing cause from effect. Intuitively, the cause-effect relationship is asymmetric in the sense that recovering the cause from the effect is more complex than modeling the physical process that generates the effect from the cause. All of these methods exploit this asymmetry to identify the causal direction (Mooij et al., 2016). For example, many methods assume a simple, restricted class of functional forms (sometimes referred to as "functional causal models" (FCMs) (Hyvärinen and Zhang, 2016)). Typical FCMs are LiNGAM (Shimizu et al., 2006), ANM (Hoyer et al., 2009), and PNL (Zhang and Hyvärinen, 2009); each has a more general functional form than the former and is thus more widely applicable. In particular, ANM is the first to use a nonlinear additive noise model for this problem. These methods have been extended to harder cases, such as cyclic (Mooij et al., 2011), multivariate (Peters et al., 2014), and discrete (Peters, Janzing, and Schölkopf, 2011) settings.
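To make the additive-noise asymmetry concrete, below is a minimal sketch of the ANM idea: fit a nonlinear regression in both directions and prefer the direction whose residuals look independent of the input. The regression model and the independence proxy (a rank correlation with the absolute residuals, standing in for the HSIC-type tests used in practice) are our illustrative choices, not the original algorithm.

```python
# A minimal sketch of the ANM idea (Hoyer et al., 2009), assuming Y = f(X) + e
# with e independent of X. The independence test here is a crude proxy
# (rank correlation between the input and |residual|), not the HSIC test
# used in the literature.
import numpy as np
from scipy.stats import spearmanr
from sklearn.kernel_ridge import KernelRidge

def residual_dependence(cause, effect):
    """Regress effect on cause; return a crude input-residual dependence score."""
    reg = KernelRidge(kernel="rbf", alpha=0.1).fit(cause[:, None], effect)
    resid = effect - reg.predict(cause[:, None])
    return abs(spearmanr(cause, np.abs(resid))[0])

def anm_direction(x, y):
    """Prefer the direction whose residuals are 'more independent' of the input."""
    return "X->Y" if residual_dependence(x, y) < residual_dependence(y, x) else "Y->X"

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.tanh(x) + 0.2 * rng.normal(size=500)  # ground truth: X -> Y
print(anm_direction(x, y))
```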
Other methods exist, and most of them exploit the asymmetry in another way: via the so-called principle of independent causal mechanisms (ICM), which postulates that the process generating the cause distribution is in some way "independent" of the causal mechanism generating the conditional distribution of the effect given the cause.
For example, Janzing et al., 2012 (IGCI) use orthogonality in information space to express the independence between the two distributions; the method is applicable when the causal relation is deterministic (noise-free). Blöbaum et al., 2018 (RECI) extend IGCI to the setting with small noise and proceed by comparing the regression errors in the two possible directions. Stegle et al., 2010 do not restrict the class of causal models; in particular, the noise need not be additive. Their method explicitly models the "noise" as a latent variable that summarizes unobserved causes, but still assumes independence of the noise and the observed cause. Both Mitrovic, Sejdinovic, and Teh, 2018 (KCDC) and Budhathoki and Vreeken, 2017 are based on the invariance of Kolmogorov complexity to the value of the cause. As computing Kolmogorov complexity is intractable, Mitrovic, Sejdinovic, and Teh, 2018 use, after kernel mean embedding (KME), the variability in RKHS norm as a proxy for it, while Budhathoki and Vreeken, 2017 leverage stochastic complexity as an approximation. Unfortunately, all these methods assume no confounder, since the asymmetry may become invalid under hidden common causes. Identifiability in the presence of confounders using purely observational data is well studied only in the linear, non-Gaussian noise case (Shimizu, 2014; Hoyer et al., 2008).
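Of the methods above, RECI admits a particularly compact illustration. The sketch below is our toy rendering under the paper's low-noise setting: rescale both variables, regress in each direction, and take the direction with the smaller regression error as causal; the model and scaling choices are ours.

```python
# A toy rendering of the RECI idea (Blöbaum et al., 2018): after rescaling,
# the causal direction tends to have the smaller regression error when the
# noise is small. Illustrative only; not the reference implementation.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def regression_mse(cause, effect):
    reg = KernelRidge(kernel="rbf", alpha=0.1).fit(cause[:, None], effect)
    return np.mean((effect - reg.predict(cause[:, None])) ** 2)

def reci_direction(x, y):
    # Rescale both variables to [0, 1] so the two errors are comparable.
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    return "X->Y" if regression_mse(x, y) < regression_mse(y, x) else "Y->X"

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = x ** 2 + 0.05 * rng.normal(size=500)  # ground truth: X -> Y
print(reci_direction(x, y))
```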
The work most related to ours in Chapter 6 might be the following. RCC (Lopez-Paz et al., 2015) and its follow-up NCC (Lopez-Paz et al., 2017) also use training data, but they require large numbers of labeled pairs and thus rely on synthetic pairs for training.
Some other work takes related viewpoints: KCDC uses majority voting, the simplest ensemble method; ANM-MM treats the mechanism as a mixture. NonSENS (Monti, Zhang, and Hyvärinen, 2019) also employs the same nonlinear ICA method as ours, but it needs samples of a causal system over different environments, which requires interventions or even experiments. We note that none of the above methods takes a mosaic view explicitly or uses ensemble methods as a main building block.
Advances regarding Hidden Confounding
Until recently, there has been little work that achieves identifiability under confounders; this paragraph provides a brief review. Goudet et al., 2018 use deep generative neural networks to learn multivariate causal models, minimizing a maximum mean discrepancy (MMD) loss based on KME. They model confounders by adding correlated noise between adjacent observed variables, but the loss is somewhat ad hoc and has a hyperparameter. Zhang, Zhang, and Schölkopf, 2015 test for exogeneity with the bootstrap and infer the causal direction, or the existence of a confounder if exogeneity holds in neither direction. However, the result is non-identifiable if exogeneity holds in both directions; moreover, exogeneity is at best a necessary condition for a direct causal relation. Chalupka, Eberhardt, and Perona, 2016 work in the discrete bivariate case; in the presence of a confounder, they train a neural network on synthesized data drawn from an uninformative Dirichlet prior. Both Rothenhäusler, Bühlmann, and Meinshausen, 2019 and Rothenhäusler et al., 2015 exploit the invariance of causal mechanisms under specific types of additive interventions and need data from multiple environments with distinct, but possibly unknown, interventions.
In both, the causal models are assumed to be linear with additive noise. They allow a non-diagonal noise covariance matrix, which indicates possible latent variables (including confounders). Lopez-Paz et al., 2015 do not use common assumptions from causal inference; instead, they cast causal inference as classification of probability distributions in RKHS after KME. The training data are samples from joint distributions of pairs of variables, labeled by their true causal relations (which may include "confounded"). Each of these methods relies on at least one of the following: ad hoc devices, rather than justification by general, preferably theoretical, principles; overly strong or dubious assumptions that limit its application; or additional information beyond observational data, such as labels of true causal relations or data from multiple environments involving interventions. In Section 7.2.1 we discuss some ideas to avoid these weaknesses.
Chapter 4
Intact-VAE: Treatment Effect Estimation under Limited Overlap
4.1 Intuition and Data Generating Process
We use balanced prognostic scores or prognostic scores to construct representations for CATE estimation. Why not balancing scores? While balancing scores b(X) have been widely used in causal inference, prognostic scores are more suitable for discussing overlap. Our purpose is to recover an overlapping score for limited-overlapping X. It is known that an overlapping b(X) implies overlapping X (D'Amour et al., 2020), which counters our purpose. In contrast, an overlapping balanced prognostic score does not imply an overlapping b(X). Example. Let T = I(X + ϵ > 0) and Y = f(|X|, T) + e, where I is the indicator function, ϵ and e are exogenous zero-mean noises, and the support of X is the entire real line while ϵ is bounded.
Now, X itself is a balancing score and |X| is a balanced prognostic score; and |X| is overlapping but X is not. Moreover, based on theoretical and experimental evidence, it has recently been conjectured that prognostic scores maximize overlap among a class of sufficient scores, including b(X) (D'Amour and Franks, 2021). In general, Hajage et al., 2017 show that prognostic score methods perform better than, or as well as, propensity score methods.
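The example can be checked by simulation. In the sketch below (our illustration; the distributions of X and ϵ are hypothetical choices consistent with the example), the empirical propensity binned on |X| stays strictly inside (0, 1), while binned on X it saturates at 0 or 1 once |X| exceeds the bound of ϵ.

```python
# Simulation of the example: T = I(X + eps > 0) with bounded eps and
# full-support X. X (a balancing score) loses overlap in its tails, while
# |X| (a balanced prognostic score) keeps overlap everywhere.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(scale=2.0, size=n)        # support on the entire real line
eps = rng.uniform(-1.0, 1.0, size=n)     # bounded exogenous noise
t = (x + eps > 0).astype(int)

def propensity_by_bin(score, t, edges):
    """Empirical P(T = 1 | score in bin)."""
    idx = np.digitize(score, edges)
    return [round(t[idx == k].mean(), 3)
            for k in range(1, len(edges)) if (idx == k).any()]

print(propensity_by_bin(np.abs(x), t, np.linspace(0, 6, 13)))   # inside (0, 1)
print(propensity_by_bin(x, t, np.linspace(-6, 6, 25)))          # 0/1 for |x| > 1
```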
Below is a corollary of Proposition 5 in Hansen, 2008; note that $p_t(X)$ satisfies exchangeability.

Proposition 4 (Identification via prognostic score). If $p_t(X)$ is a prognostic score and $Y \mid p_{\hat{t}}(X), T \sim p_{Y \mid p_{\hat{t}}, T}(y \mid P, t)$, where $\hat{t} \in \{0, 1\}$ is a counterfactual assignment, then CATE and ATE are identified, using (2.1) and
\[
\hat{\mu}_{\hat{t}}(x) = E(Y(\hat{t}) \mid p_{\hat{t}}(X), X = x) = E(Y \mid p_{\hat{t}}(x), T = \hat{t}) = \int p_{Y \mid p_{\hat{t}}, T}(y \mid p_{\hat{t}}(x), \hat{t}) \, y \, dy \tag{4.1}
\]
With the knowledge of $p_t$ and $p_{Y \mid p_{\hat{t}}, T}$, we choose one of $p_0, p_1$ and set $t = \hat{t}$ in the density function, w.r.t. the $\hat{\mu}_{\hat{t}}$ of interest. This counterfactual assignment resolves the problem of non-overlap at $x$. Note that a sample point with $X = x$ may not have $T = \hat{t}$.
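To see (4.1) at work, the sketch below (our illustration, not the chapter's estimator) reuses the example DGP with p_t(X) = |X| as a known prognostic score: it regresses Y on the score among units with T = t̂ and then evaluates at a point x that has no overlap. The functional form of f and all parameters are hypothetical choices.

```python
# Identification via a known prognostic score, as in (4.1): regress Y on
# p_that(X) = |X| among units with T = t_hat, then evaluate at x even where
# P(T = t_hat | X = x) = 0. DGP: Y = |X| + T*(1 + 0.5|X|) + e (our choice).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(scale=2.0, size=n)
t = (x + rng.uniform(-1.0, 1.0, size=n) > 0).astype(int)
y = np.abs(x) + t * (1.0 + 0.5 * np.abs(x)) + 0.1 * rng.normal(size=n)

def mu_hat(x_query, t_hat):
    """Estimate E(Y | p_that(x), T = t_hat) by regressing Y on the score."""
    score = np.abs(x[t == t_hat])[:, None]
    reg = LinearRegression().fit(score, y[t == t_hat])
    return reg.predict(np.abs(np.atleast_1d(x_query))[:, None])

x0 = -3.0                              # P(T = 1 | X = x0) = 0: no overlap in X
print(mu_hat(x0, 1) - mu_hat(x0, 0))   # CATE, close to 1 + 0.5*|x0| = 2.5
```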
We consider additive noise models for Y(t), which ensure the existence of prognostic scores.
(G1) (Additive noise model) The data generating process (DGP) for Y is Y = f∗(m(X, T), T) + e, where f∗ and m are functions and e is a zero-mean exogenous (external) noise.
The DGP is causal and defines potential outcomes by Y(t) := f∗_t(m_t(X)) + e, and it specifies m(X, T), T, and e as the only direct causes of Y. In particular, m_t(X) is a sufficient statistic of X for Y(t). For example, 1) m_t(X) can be the component(s) of X that affect Y(t) directly, or 2) if Y(t)|X follows a generalized linear model, then m_t(X) can be the linear predictor of Y(t).
Under (G1), 1) m_t(X) is a prognostic score; 2) μ_t(X) = f∗_t(m_t(X)) is a prognostic score; 3) X is a (trivial) balanced prognostic score; and 4) u(X) := (μ_0(X), μ_1(X)) is a balanced prognostic score. The essence of our method is to recover the prognostic score m_t(X) as a representation, assuming m_t(X) is not higher-dimensional than Y and approximately balanced. Note that μ_t(X), our final target, is a low-dimensional prognostic score but not balanced, and we estimate it by conditioning on the approximately balanced prognostic score m_t(X).