
4.2 Identification under Generative Prognostic Model

CATE and ATE are identified, using (2.1) and

$$\hat\mu_{\hat t}(x) = E\big(Y(\hat t) \mid p_{\hat t}(X), X = x\big) = E\big(Y \mid p_{\hat t}(x), T = \hat t\big) = \int p_{Y|p_{\hat t},T}\big(y \mid p_{\hat t}(x), \hat t\big)\, y\, \mathrm{d}y. \tag{4.1}$$

With the knowledge of $p_t$ and $p_{Y|p_t,T}$, we choose one of $p_0, p_1$ and set $t = \hat t$ in the density function, w.r.t. the $\hat\mu_{\hat t}$ of interest. This counterfactual assignment resolves the problem of non-overlap at $x$. Note that a sample point with $X = x$ may not have $T = \hat t$.
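For instance, under the additive noise DGP (G1) introduced just below, with $p_{\hat t} = m_{\hat t}$, the integral in (4.1) reduces to a point evaluation (a sketch of ours; it uses only that the noise $e$ is zero-mean):

$$\int p_e\big(y - f_{\hat t}(p_{\hat t}(x))\big)\, y\, \mathrm{d}y = f_{\hat t}\big(p_{\hat t}(x)\big) + E(e) = f_{\hat t}\big(p_{\hat t}(x)\big),$$

so no density estimation over $y$ is needed once $f_{\hat t}$ and the score are known.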

We consider additive noise models for Y(t), which ensure the existence of prognostic scores.

(G1) (Additive noise model) The data generating process (DGP) for Y is $Y = f(m(X,T), T) + e$, where $f$ and $m$ are functions and $e$ is a zero-mean exogenous (external) noise.

The DGP is causal and defines potential outcomes by $Y(t) := f_t(m_t(X)) + e$, and specifies $m(X,T)$, $T$, and $e$ as the only direct causes of Y. In particular, $m_t(X)$ is a sufficient statistic of X for Y(t). For example, 1) $m_t(X)$ can be the component(s) of X that affect Y(t) directly, or 2) if $Y(t)|X$ follows a generalized linear model, then $m_t(X)$ can be the linear predictor of Y(t).
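To make example 2) concrete, a minimal hypothetical instance (the symbols $\beta_t$ and $\sigma^2$ are ours):

$$Y = \beta_T^\top X + e, \qquad e \sim \mathcal N(0, \sigma^2), \qquad m_t(X) = \beta_t^\top X, \quad f_t = \mathrm{id},$$

so $m_t(X)$ is a one-dimensional prognostic score even when $\dim(X)$ is large.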

Under (G1), 1) $m_t(X)$ is a prognostic score; 2) $\mu_t(X) = f_t(m_t(X))$ is a prognostic score; 3) X is a (trivial) balanced prognostic score; and 4) $u(X) := (\mu_0(X), \mu_1(X))$ is a balanced prognostic score. The essence of our method is to recover the prognostic score $m_t(X)$ as a representation, assuming $m_t(X)$ is not higher-dimensional than Y and approximately balanced. Note that $\mu_t(X)$, our final target, is a low-dimensional prognostic score but not balanced, and we estimate it conditioning on the approximately balanced prognostic score $m_t(X)$.

FIGURE 4.1: CVAE, iVAE, and Intact-VAE: Graphical models of the decoders.

Our identifiability below concerns the generative model (i.e., prior and decoder), but not the encoder. The encoder is not part of the generative model and is involved as an approximate posterior in the estimation, which is studied in Sec. 4.3.

4.2.1 Model, Architecture, and Identifiability

Our goal is to build a model that can be learned by VAE from observational data to obtain a prognostic score, or better, a balanced prognostic score, via the latent variable Z. The generative prognostic model of the proposed method is in (4.2), where $\theta := (f, h, k)$ contains the functional parameters. The first factor $p_f(y|z,t)$, our decoder, models $p_{Y|p_t,T}(y|p_t(x), t)$ in (5.3) and is an additive noise model, with $\epsilon \sim p_\epsilon$ as the exogenous noise. The second factor $p_\lambda(z|x,t)$, our conditional prior, models $p_T(X)$ and is a factorized Gaussian, with $\lambda_T(X) := \operatorname{diag}^{-1}(k_T(X))\,(h_T(X), -\tfrac{1}{2})^\top$ as its natural parameter in the exponential family, where $\operatorname{diag}(\cdot)$ gives a diagonal matrix from a vector.

$$p_\theta(y, z \mid x, t) = p_f(y \mid z, t)\, p_\lambda(z \mid x, t), \qquad p_f(y \mid z, t) = p_\epsilon\big(y - f_t(z)\big), \qquad p_\lambda(z \mid x, t) = \mathcal N\big(z;\, h_t(x), \operatorname{diag}(k_t(x))\big). \tag{4.2}$$
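To fix ideas, the following is a minimal PyTorch-style sketch of the two factors in (4.2). The class names, layer sizes, and the choice of one shared network for $h_t$ and $k_t$ are our illustrative assumptions; the thesis's full parameterization is given in Sec. 4.3.1.

```python
# Minimal sketch of the generative model (4.2); architecture details are
# illustrative assumptions, not the exact design of Sec. 4.3.1.
import torch
import torch.nn as nn

class ConditionalPrior(nn.Module):
    """p_lambda(z | x, t): factorized Gaussian N(h_t(x), diag(k_t(x)))."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 64):
        super().__init__()
        # t enters as an extra input (shape (batch, 1)), so one network
        # outputs both h_t(x) and k_t(x).
        self.net = nn.Sequential(
            nn.Linear(x_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor):
        h, log_k = self.net(torch.cat([x, t], dim=-1)).chunk(2, dim=-1)
        return h, log_k.exp()  # mean h_t(x) and variance k_t(x)

class Decoder(nn.Module):
    """p_f(y | z, t) = p_eps(y - f_t(z)); f_t is a treatment-conditional net."""
    def __init__(self, z_dim: int, y_dim: int, hidden: int = 64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(z_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, y_dim),
        )

    def forward(self, z: torch.Tensor, t: torch.Tensor):
        return self.f(torch.cat([z, t], dim=-1))  # f_t(z), the noiseless mean
```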

We denote $n := \dim(Z)$. For inference, the ELBO is given by the standard variational lower bound

$$\log p(y \mid x, t) \;\ge\; E_{z \sim q}\,\log p_f(y \mid z, t) - D_{\mathrm{KL}}\big(q(z \mid x, y, t)\,\big\|\,p_\lambda(z \mid x, t)\big). \tag{4.3}$$

Note that the encoder $q$ conditions on all the observables (X, Y, T); this fact plays an important role in Sec. 4.3.1. Full parameterization of the encoder and decoder is also given in Sec. 4.3.1. This architecture is called Intact-VAE (Identifiable treatment-conditional VAE). See Figure 4.1 for a comparison in terms of graphical models (which have no causal implications here). See Sec. 4.2.2 for more exposition and Sec. 2.2.1 for basics of VAEs.
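Continuing the sketch above, (4.3) can be estimated with one reparameterized sample and a closed-form Gaussian KL; the function and its argument names are our notation, and additive constants of the Gaussian log-likelihood are dropped:

```python
import torch

def elbo(y, f_t_z, enc_mu, enc_var, prior_mu, prior_var, noise_var=1.0):
    """One-sample estimate of (4.3), assuming Gaussian encoder, prior, and
    exogenous noise eps; f_t_z = f_t(z) with z = enc_mu + enc_var.sqrt()*eps."""
    # Reconstruction term: log p_eps(y - f_t(z)), up to an additive constant.
    rec = -0.5 * ((y - f_t_z) ** 2 / noise_var).sum(-1)
    # Closed-form KL(q(z|x,y,t) || p_lambda(z|x,t)) for diagonal Gaussians.
    kl = 0.5 * (prior_var.log() - enc_var.log()
                + (enc_var + (enc_mu - prior_mu) ** 2) / prior_var
                - 1.0).sum(-1)
    return (rec - kl).mean()  # maximize over decoder, prior, and encoder
```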

Our model identifiability extends the theory of iVAE, and the following condi- tions are inherited.

(M1) i) $f_t$ is injective, and ii) $f_t$ is differentiable.

(D1) $\lambda_t(X)$ is non-degenerate, i.e., the linear hull of its support is $2n$-dimensional.
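For intuition, a small example of ours with $n = 1$: the natural parameter is $\lambda_t(x) = \big(h_t(x)/k_t(x),\, -1/(2k_t(x))\big)$, so with $h_t(x) = x$ and $k_t(x) \equiv 1$,

$$\lambda_t(x) = \big(x, -\tfrac{1}{2}\big), \qquad \operatorname{span}\{\lambda_t(x) : x \in \mathbb{R}\} = \mathbb{R}^2,$$

and (D1) holds; by contrast, constant $h_t$ and $k_t$ give a single point, whose linear hull is at most one-dimensional, so (D1) fails.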

Under (M1) and (D1), we obtain the following identifiability of the parameters in the model: if $p_\theta(y|x,t) = p_{\theta'}(y|x,t)$, we have, for any $y_t$ in the image of $f_t$:

$$f_t'^{-1}(y_t) = \operatorname{diag}(a)\, f_t^{-1}(y_t) + b =: A_t\big(f_t^{-1}(y_t)\big) \tag{4.4}$$

where $\operatorname{diag}(a)$ is an invertible $n \times n$ diagonal matrix and $b$ is an $n$-vector, both of which depend on $\lambda_t(x)$ and $\lambda'_t(x)$. The essence of the result is that $f_t = f'_t \circ A_t$; that is, $f_t$ can be identified (learned) up to an affine transformation $A_t$. See Sec. 4.6 for the proof and a relaxation of (D1). In Chapters 4 & 5, the symbol $'$ (prime) always indicates another parameter (variable, etc.): $\theta' = (f', \lambda')$.
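As a concrete illustration of this affine ambiguity (a one-dimensional example of ours, not from the thesis): with $n = 1$, take $f_t(z) = e^z$ and $A_t(z) = az + b$ with $a \neq 0$. Then $f'_t(z) := e^{(z-b)/a}$ satisfies

$$f_t'^{-1}(y) = a \ln y + b = a f_t^{-1}(y) + b, \qquad f'_t\big(A_t(z)\big) = e^z = f_t(z),$$

and pushing the Gaussian prior through the same affine map keeps it Gaussian, so $(f_t, \lambda_t)$ and $(f'_t, \lambda'_t)$ induce the same observed distribution; (4.4) says this is the only remaining ambiguity.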

4.2.2 Details and Explanations on Intact-VAE

Our goal is to build a model that can be learned by VAE from observational data to obtain a prognostic score, or, more ideally, a balanced prognostic score, via the latent variable Z; that is, a generative prognostic model. Generative models are useful for solving the inverse problem of recovering prognostic scores.

With the above goal, the generative model of our VAE is built as (4.2). Conditioning on X in the joint model $p(y,z|x,t)$ reflects that our estimand is CATE given X.

Modeling the score by a conditional distribution rather than a deterministic function is more flexible.

The ELBO of our model can be derived from the standard variational lower bound as follows:

$$\begin{aligned} \log p(y \mid x, t) &\ge \log p(y \mid x, t) - D_{\mathrm{KL}}\big(q(z \mid x, y, t)\,\big\|\,p(z \mid x, y, t)\big) \\ &= E_{z \sim q} \log p(y \mid z, t) - D_{\mathrm{KL}}\big(q(z \mid x, y, t)\,\big\|\,p(z \mid x, t)\big), \end{aligned} \tag{4.5}$$

where the equality uses $p(y|x,t)\,p(z|x,y,t) = p(y,z|x,t) = p(y|z,t)\,p(z|x,t)$.

We naturally have an identifiable conditional VAE (CVAE), as the name suggests. Note that (4.2) has a factorization similar to the generative model of iVAE (Khemakhem et al., 2020b), that is, $p(y,z|x) = p(y|z)\,p(z|x)$; the first factor does not depend on X. Further, since we condition on T in both factors of (4.2), our VAE architecture is a combination of iVAE and CVAE (Sohn, Lee, and Yan, 2015; Kingma et al., 2014), with T as the conditioning variable. See Figure 4.1 for the comparison in terms of graphical models. The core idea of iVAE is reflected in our model identifiability (see Lemma 1).

Please do not confuse the DGP (G1) and the generative model (4.2) of Intact-VAE. The former is the causal model, but the latter is not (at least before we show the treatment effect identifications in Sec. 4.2.3). In our case, the generative model is built as a way to learn the scores through the correspondence to (5.3).

In particular, note that a conditionally balanced representation $Z \perp T \mid X$ is possible under the generative model. This requires a violation of causal faithfulness, so that there are additional conditional independence relations that are not implied by the graphical model. Our method, based on iVAE, performs nonlinear ICA to recover the scores. In fact, ICA procedures often violate causal faithfulness, because they require finding causes from effects. Also, the violation of causal faithfulness is not caused by the generative model (which is shown in Figure 4.1), because the representation is learned by the encoder, and $Z \perp T \mid X$ is enforced by $\beta$.

4.2.3 Identifications under Limited-overlapping Covariate

In this subsection, we present two results of CATE identification, based on the recovery of an equivalent balanced prognostic score and prognostic score, respectively.

Since prognostic scores are functions of X, the theory assumes a noiseless prior for simplicity, i.e., $k(X) = 0$; the prior $Z_{\lambda,t} \sim p_\lambda(z|x,t)$ degenerates to the function $h_t(X)$.

Prognostic scores with dimensionality lower than or equal to $d = \dim(Y)$ are essential to addressing limited overlap, as shown below. We set $n = d$ because $\mu_t$ is a prognostic score of the same dimension as Y under (G1). In practice, $n = d$ means that we seek a low-dimensional representation of X. We introduce

(G1') (Low-dimensional prognostic score) (G1) is true, and $\mu_t = j_t \circ p_t$ for some $p_t$ and injective $j_t$,

which is equivalent to (G1) because $\mu_t = j_t \circ p_t$ is trivially satisfied with $j_t$ the identity and $p_t = \mu_t$. (G1') is used instead in this subsection. First, it explicitly restricts $\dim(p_t)$ via injectivity, which ensures that $n = \dim(Y) \ge \dim(p_t)$. Second, it reminds us that the decomposition is possibly not unique; and, clearly, all $p_t$ that satisfy (G1') are prognostic scores. For example, if $f_t$ is injective, then $j_t = f_t$ and $p_t = m_t$ satisfy $\mu_t = j_t \circ p_t$. Finally, it is then natural to introduce

(G2) (Low-dimensional balanced prognostic score) (G1) is true, and $\mu_t = j_t \circ p$ for some $p$ and injective $j_t$,

which is stronger than (G1), gives a balanced prognostic score $p(X)$, and ensures that $n \ge \dim(p)$. (G2) is satisfied if $f_t$ is injective and $m_0 = m_1$. (G2) implies $\mu_1 = i \circ \mu_0$ where $i := j_1 \circ j_0^{-1}$; in words, CATEs are given by $\mu_0$ and an invertible function. See Sec. 4.7.2 for real-world examples and more discussion.
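For concreteness, our own instantiation of the sufficient condition just stated: if $f_t$ is injective and $m_0 = m_1 =: m$, take $p = m$ and $j_t = f_t$; then

$$\mu_t(x) = j_t\big(p(x)\big), \qquad \tau(x) := \mu_1(x) - \mu_0(x) = j_1\big(p(x)\big) - j_0\big(p(x)\big),$$

so the CATE is determined by the single balanced score $p$ together with the pair $(j_0, j_1)$.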

With (G1') or (G2), overlapping X can be relaxed to an overlapping balanced prognostic score or prognostic score, plus the following:

(M2) (Score partition preserving) For any $x, x' \in \mathcal X$, if $p_t(x) = p_t(x')$, then $h_t(x) = h_t(x')$.

Note that (M2) is only required for the optimal $h$ specified in Proposition 5 or Theorem 1. The intuition is that $p_t$ maps each non-overlapping $x$ to an overlapping value, and $h_t$ preserves this property through learning. This is non-trivial because, for a given $t$, some values of X are unobserved due to limited overlap. Thus, (M2) can be seen as a weak form of OOD generalization: the NNs for $h$ can learn the OOD score partition. While unnecessary for us, linear $p_t$ and $h_t$ trivially imply (M2) and are often assumed, e.g., in Huang and Chan, 2017; Luo, Zhu, and Ghosh, 2017; D'Amour and Franks, 2021.

Our first identification, Proposition 5, relies on (G2) and our generative model, without model identifiability (so differentiable $f_t$ is not needed).

Proposition 5 (Identification via recovery of balanced prognostic score). Suppose we have DGP (G2) and model (4.2) with $n = d$. Assume (M1)-i) and (M3) (PS matching): let $h_0(X) = h_1(X)$ and $k(X) = 0$. Then, if $E_{p_\theta}(Y|X,T) = E(Y|X,T)$, we have

1) (Recovery of balanced prognostic score) $z_{\lambda,t} = h_t(x) = v(p(x))$ on overlapping $x$, where $v : \mathcal P \to \mathbb R^n$ is an injective function, and $\mathcal P := \{p(x) \mid \text{overlapping } x\}$;

2) (CATE identification) if $p(X)$ in (G2) is overlapping, and (M2) is satisfied, then

$$\mu_t(x) = \hat\mu_t(x) := E_{p_\lambda(z|x,t)} E_{p_f}(Y|Z,t) = f_t\big(h_t(x)\big),$$

for any $t \in \{0, 1\}$ and $x \in \mathcal X$.

In essence, i) the true DGP is identified up to an invertible mapping $v$, such that $f_t = j_t \circ v^{-1}$ and $h = v \circ p$; and ii) $p_t$ is recovered up to $v$, and $Y(t) \perp X \mid p_t(X)$ is preserved, with the same $v$ for both $t$. Theorem 1 below also achieves the essences i) and ii), under $p_0 \neq p_1$.
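To see how conclusion 2) follows from 1) and essence i) (a one-line check of ours): substituting $f_t = j_t \circ v^{-1}$ and $h_t(x) = v(p(x))$ gives

$$\hat\mu_t(x) = f_t\big(h_t(x)\big) = \big(j_t \circ v^{-1}\big)\big(v(p(x))\big) = j_t\big(p(x)\big) = \mu_t(x),$$

where (M2) and the overlap of $p(X)$ extend $h_t(x) = v(p(x))$ from overlapping $x$ to all $x \in \mathcal X$.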

The existence of a balanced prognostic score is preferred, because it satisfies overlap and (M2) more easily than a prognostic score, for which the conditions must hold for each of the two functions $p_0$ and $p_1$. However, the existence of a low-dimensional balanced prognostic score is uncertain in practice when our knowledge of the DGP is limited. Thus, we rely on Theorem 1, which is based on model identifiability and works with a prognostic score, which generally exists.

Theorem 1 (Identification via recovery of prognostic score). Suppose we have DGP (G1') and model (4.2) with $n = d$. For the model, assume (M1) and (M3') (Noise matching): let $p_e = p_\epsilon$ and $k(X) = k\,\bar k(X)$ with the scalar $k \to 0$. Assume further (D1) and (D2) (Balance from data): $A_0 = A_1$ in (4.4). Then, if $p_\theta(y|x,t) = p(y|x,t)$, conclusions 1) and 2) in Proposition 5 hold with $p$ replaced by $p_t$ in (G1'), and the domain of $v$ becomes $\mathcal P := \{p_t(x) \mid p(t,x) > 0\}$.

Theorem 1 implies that, without a balanced prognostic score, we need to know or learn the distribution of the hidden noise $\epsilon$ to have $p_e = p_\epsilon$. Proposition 5 and Theorem 1 achieve recovery and identification in a complementary manner: the former starts from the prior, by $p_0 = p_1$ and $h_0 = h_1$, while the latter starts from the decoder, by $A_0 = A_1$ and $p_e = p_\epsilon$. We see that $A_0 = A_1$ acts as a kind of balance, because it replaces $p_0 = p_1$ in Proposition 5. We show in Sec. 4.6 a necessary and sufficient condition (D2') on the data that ensures $A_0 = A_1$. Note that the singularities due to $k \to 0$ (e.g., in $\lambda$) cancel out in (4.4). See Sec. 4.7.3 for more on the complementarity between the two identifications.