We restate our model identifiability result formally.
Lemma 1 (Model identifiability). Given model (4.2) under (M1), for $T = t$, assume

(D1') (Non-degenerate data for $\lambda$) there exist $2n+1$ points $x_0, \ldots, x_{2n} \in \mathcal{X}$ such that the $2n$-square matrix $L_t := [\gamma_{t,1}, \ldots, \gamma_{t,2n}]$ is invertible, where $\gamma_{t,k} := \lambda_t(x_k) - \lambda_t(x_0)$.

Then, given $T = t$, the family is identifiable up to an equivalence class. That is, if $p_\theta(y|x,t) = p_{\theta'}(y|x,t)$, the parameters are related as follows: for any $y_t$ in the image of $f_t$,
$$f_t^{-1}(y_t) = \mathrm{diag}(a)\, f_t'^{-1}(y_t) + b =: A_t(f_t'^{-1}(y_t)) \tag{4.14}$$
where $\mathrm{diag}(a)$ is an invertible $n$-diagonal matrix and $b$ is an $n$-vector, both depending on $\lambda_t$ and $\lambda'_t$.
Note, (D1) in the main text implies (D1'); see Sec. B.2.3 in Khemakhem et al., 2020b. The main part of our model identifiability result is essentially the same as that of Theorem 1 in Khemakhem et al., 2020b, but now adapted to include the dependency on $t$. Here we give an outline of the proof; the details can easily be filled in by referring to Khemakhem et al., 2020b. In the proof, subscripts $t$ are omitted for convenience.
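Though not part of the proof, (D1') is straightforward to probe numerically: form the difference matrix $L_t$ from candidate points and test its invertibility, e.g., via the condition number. Below is a minimal sketch; the nonlinear map `lam` standing in for $\lambda_t$ and the threshold `tol` are hypothetical choices, not part of the model.

```python
import numpy as np

def check_D1(lam, xs, tol=1e8):
    """Check (D1'): given 2n+1 points xs[0..2n] and lam: x -> R^{2n}, form
    L = [lam(x_1)-lam(x_0), ..., lam(x_2n)-lam(x_0)] and test invertibility."""
    L = np.stack([lam(x) - lam(xs[0]) for x in xs[1:]], axis=1)  # 2n x 2n
    cond = np.linalg.cond(L)
    return cond < tol, cond

# Toy example: n = 2 latent dimensions, so lam maps x in R^3 to R^{2n} = R^4.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
lam = lambda x: np.tanh(W @ x)                 # hypothetical lambda_t
xs = [rng.normal(size=3) for _ in range(5)]    # 2n + 1 = 5 points
ok, cond = check_D1(lam, xs)
print(f"(D1') holds: {ok}, condition number: {cond:.2e}")
```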
Proof of Lemma 1. Using (M1) i) and ii), we transform $p_{f,\lambda}(y|x,t) = p_{f',\lambda'}(y|x,t)$ into an equality of noiseless distributions, that is,
$$q_{f',\lambda'}(y) = q_{f,\lambda}(y) := p_\lambda(f^{-1}(y) \mid x, t)\, \mathrm{vol}(J_{f^{-1}}(y))\, \mathbb{I}_{\mathcal{Y}}(y) \tag{4.15}$$
where $p_\lambda$ is the Gaussian density function of the conditional prior defined in (4.2), $\mathrm{vol}(A) := \sqrt{\det A A^T}$, and $q_{f',\lambda'}$ is defined similarly to $q_{f,\lambda}$.
Then, applying model (4.2) to (4.15), plugging the $2n+1$ points from (D1') into it, and re-arranging the resulting $2n+1$ equations in matrix form, we have
$$F'(Y) = F(Y) := L^T t(f^{-1}(Y)) - \beta \tag{4.16}$$
where $t(Z) := (Z, Z^2)^T$ is the sufficient statistics of the factorized Gaussian, and $\beta_t := (\alpha_t(x_1) - \alpha_t(x_0), \ldots, \alpha_t(x_{2n}) - \alpha_t(x_0))^T$, where $\alpha_t(X;\lambda_t)$ is the log-partition function of the conditional prior in (4.2). $F'$ is defined similarly to $F$, but with $f', \lambda', \alpha'$.
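For concreteness, recall the exponential-family form behind $t(Z)$ and $\alpha_t$: a univariate Gaussian factor satisfies $\log\mathcal{N}(z;\mu,\sigma^2) = \langle\lambda, t(z)\rangle - \alpha$ with natural parameters $\lambda = (\mu/\sigma^2, -1/(2\sigma^2))$, sufficient statistics $t(z) = (z, z^2)$, and log-partition $\alpha = \mu^2/(2\sigma^2) + \frac{1}{2}\log(2\pi\sigma^2)$. A quick numerical confirmation (a sketch, not part of the proof):

```python
import numpy as np
from scipy.stats import norm

mu, sigma, z = 0.7, 1.3, -0.4
lam = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])               # natural parameters
t_z = np.array([z, z**2])                                            # sufficient statistics t(z)
alpha = mu**2 / (2 * sigma**2) + 0.5 * np.log(2 * np.pi * sigma**2)  # log-partition
assert np.isclose(lam @ t_z - alpha, norm.logpdf(z, mu, sigma))
```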
Since $L$ is invertible, we have
$$t(f^{-1}(Y)) = A\, t(f'^{-1}(Y)) + c \tag{4.17}$$
where $A = L^{-T}L'^T$ and $c = L^{-T}(\beta - \beta')$.
The final part of the proof is to show, following the same reasoning as in Appendix B of Sorrenson, Rother, and Köthe, 2019, that $A$ is a sparse matrix of the form
$$A = \begin{pmatrix} \mathrm{diag}(a) & O \\ \mathrm{diag}(u) & \mathrm{diag}(a^2) \end{pmatrix} \tag{4.18}$$
where $A$ is partitioned into four $n$-square matrices. Thus
$$f^{-1}(Y) = \mathrm{diag}(a)\, f'^{-1}(Y) + b \tag{4.19}$$
where $b$ is the first half of $c$.
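The equivalence class in (4.19) can be made tangible: if $f'(z) := f(\mathrm{diag}(a)z + b)$ and the Gaussian prior is reparametrized accordingly, the two models generate exactly the same distribution over $Y$. A Monte Carlo sketch of this non-identifiability (the decoder `f` and all parameters below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2
M = rng.normal(size=(n, n)) + 2 * np.eye(n)
f = lambda z: np.tanh(z @ M.T)              # injective decoder (hypothetical)

a, b = np.array([1.5, -0.8]), np.array([0.3, -0.2])
f_prime = lambda z: f(z * a + b)            # f'(z) = f(diag(a) z + b)

mu, sig = np.array([0.1, -0.5]), np.array([0.7, 1.1])
z = rng.normal(mu, sig, size=(10_000, n))   # z ~ Gaussian prior for f
z_prime = (z - b) / a                       # z' = diag(a)^{-1}(z - b): reparametrized prior

# Identical samples, hence identical distributions over Y.
assert np.allclose(f(z), f_prime(z_prime))
```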
Proof of Proposition 5. Under (G2) and (M3), we have
$$\mathbb{E}_{p_\theta}(Y|X,T) = \mathbb{E}(Y|X,T) \implies f_t \circ h(x) = j_t \circ p(x) \ \text{ on } (x,t) \text{ such that } p(t,x) > 0. \tag{4.20}$$
We show that the solution set of (4.20) on overlapping $x$ is
$$\{(f,h) \mid f_t = j_t \circ \Delta^{-1},\ h = \Delta \circ p,\ \Delta : \mathcal{P} \to \mathbb{R}^n \text{ is injective}\}. \tag{4.21}$$
By (G2), (M1), and with injective $f_t, j_t$ and $\dim(Z) = \dim(Y) \ge \dim(p)$, for any $\Delta$ above there exists a functional parameter $f_t$ such that $j_t = f_t \circ \Delta$. Thus the set (4.21) is non-empty, and any element of it is indeed a solution because $f_t \circ h = j_t \circ \Delta^{-1} \circ \Delta \circ p = j_t \circ p$.
Conversely, any solution of (4.20) must be in (4.21). A solution must satisfy $h(x) = f_t^{-1} \circ j_t \circ p(x)$ for both $t$, since $x$ is overlapping. This means the injective function $f_t^{-1} \circ j_t$ must not depend on $t$; thus it is one of the $\Delta$ in (4.21).
This proves conclusion 1) with $v := \Delta$. On overlapping $x$, conclusion 2) follows immediately from
$$\hat{\mu}_t(x) = f_t(h(x)) = j_t \circ v^{-1}(v \circ p(x)) = j_t(p(x)) = \mu_t(x). \tag{4.22}$$
We rely on overlapping $p$ to handle non-overlapping $x$. For any $x_t$ with $p(1-t|x_t) = 0$, to ensure $p(1-t \mid p(x_t)) > 0$ there must exist $x_{1-t}$ such that $p(x_{1-t}) = p(x_t)$ and $p(1-t|x_{1-t}) > 0$. We also have $h(x_{1-t}) = h(x_t)$ due to (M2). Then, we have
$$\hat{\mu}_{1-t}(x_t) = f_{1-t}(h(x_t)) = f_{1-t}(h(x_{1-t})) = j_{1-t}(p(x_{1-t})) = j_{1-t}(p(x_t)) = \mu_{1-t}(x_t). \tag{4.23}$$
The third equality uses (4.20) on $(x_{1-t}, 1-t)$.
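To illustrate conclusion 1): pick any injective $\Delta$, set $f_t = j_t \circ \Delta^{-1}$ and $h = \Delta \circ p$, and $f_t \circ h = j_t \circ p$ holds identically. A minimal sketch with a linear $\Delta$ (all maps below are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(size=(2, 2)) + 2 * np.eye(2)              # injective linear Delta
D_inv = np.linalg.inv(D)

p = lambda x: np.array([np.sum(x**2), x[0] - x[1]])      # hypothetical covariate map p
j = {t: (lambda z, t=t: np.sin(z + t)) for t in (0, 1)}  # hypothetical true maps j_t

h = lambda x: D @ p(x)                                     # h = Delta o p
f = {t: (lambda z, t=t: j[t](D_inv @ z)) for t in (0, 1)}  # f_t = j_t o Delta^{-1}

x = rng.normal(size=3)
for t in (0, 1):
    assert np.allclose(f[t](h(x)), j[t](p(x)))           # f_t o h = j_t o p
```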
Below we prove Theorem 1 with (D2) replaced by

(D2') (Spontaneous balance) there exist $2n+1$ points $x_0, \ldots, x_{2n} \in \mathcal{X}$, a $2n$-square matrix $C$, and a $2n$-vector $d$, such that $L_0^{-1}L_1 = C$ and $\beta_0 - C^{-T}\beta_1 = d/k$ for optimal $\lambda_t$ (see below), where $L_t$ is defined in (D1'), $\beta_t := (\alpha_t(x_1) - \alpha_t(x_0), \ldots, \alpha_t(x_{2n}) - \alpha_t(x_0))^T$, and $\alpha_t(X;\lambda_t)$ is the log-partition function of the prior in (4.2).
(D2') restricts the discrepancy between $\lambda_0$ and $\lambda_1$ on $2n+1$ values of $X$, and is thus relatively easy to satisfy with high-dimensional $X$. (D2') is general despite (or thanks to) its involved formulation. Let us see its generality even under a highly special case: $C = cI$ and $d = 0$. Then $L_0^{-1}L_1 = cI$ requires that $\lambda_1(x_k) - c\lambda_0(x_k)$ be the same for the $2n+1$ points $x_k$. This is easily satisfied except for $n \gg m$, where $m$ is the dimension of $X$, which rarely happens in practice. And $\beta_0 - C^{-T}\beta_1 = d$ becomes just $\beta_1 = c\beta_0$. This is equivalent to $\alpha_1(x_k) - c\alpha_0(x_k)$ being the same for the $2n+1$ points, again fine in practice.
However, the high generality comes at a price. Verifying (D2') using data is challenging, particularly with high-dimensional covariates and latent variables. Although we believe fast algorithms for this purpose could be developed, the effort would be nontrivial. This is another motivation to use the extreme case $\lambda_0 = \lambda_1$ in Sec. 4.3.1, which corresponds to $C = I$ and $d = 0$.
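As a concrete instance of the special case: if $\lambda_1(x) = c\,\lambda_0(x) + \mathrm{const}$ on the $2n+1$ points, the constants cancel in the differences, so $L_1 = cL_0$ and $L_0^{-1}L_1 = cI$ holds exactly. A sketch of this check (the map standing in for $\lambda_0$ is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
c, n, m = 1.7, 2, 3
W = rng.normal(size=(2 * n, m))
lam0 = lambda x: np.tanh(W @ x)            # hypothetical lambda_0
lam1 = lambda x: c * lam0(x) + 0.5         # lambda_1 = c * lambda_0 + const

xs = [rng.normal(size=m) for _ in range(2 * n + 1)]
L = {t: np.stack([lam(x) - lam(xs[0]) for x in xs[1:]], axis=1)
     for t, lam in ((0, lam0), (1, lam1))}

assert np.allclose(np.linalg.solve(L[0], L[1]), c * np.eye(2 * n))  # L0^{-1} L1 = cI
```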
Proof of Theorem 1. By (M1) and (G1'), for any injective function $\Delta : \mathcal{P} \to \mathbb{R}^n$ there exists a functional parameter $f_t^*$ such that $j_t = f_t^* \circ \Delta$. Let $h_t^* = \Delta \circ p_t$; then, clearly from (M3'), such parameters $\theta^* = (f^*, h^*)$ are optimal: $p_{\theta^*}(y|x,t) = p(y|x,t)$.
Since all the assumptions of Lemma 1 hold, we have
$$\Delta \circ j^{-1}(y) = f^{*-1}(y) = A \circ f^{-1}(y)\big|_t, \quad \text{on } (y,t) \in \{(j_t \circ p_t(x), t) \mid p(t,x) > 0\}, \tag{4.24}$$
where $f$ is any optimal parameter, and "$|_t$" collects all subscripts $t$. Note that, except for $\Delta$, all the symbols should carry subscripts $t$.
Nevertheless, using (D2'), we can further prove $A_0 = A_1$. We repeat the core quantities from Lemma 1 here: $A_t = L_t^{-T}L_t'^T$ and $c_t = L_t^{-T}(\beta_t - \beta'_t)$.
From (D2'), we immediately have
$$L_0^{-1}L_1 = L_0'^{-1}L_1' = C \iff A_0 = A_1. \tag{4.25}$$
And also,
$$\begin{aligned}
L_0^{-1}L_1 = C &\iff L_0^{-T}C^{-T} = L_1^{-T}, \\
\beta_0 - C^{-T}\beta_1 = \beta'_0 - C^{-T}\beta'_1 = d/k &\iff C^T(\beta_0 - \beta'_0) = \beta_1 - \beta'_1.
\end{aligned} \tag{4.26}$$
Multiplying the right-hand sides of the two lines, we obtain $c_0 = c_1$. Now we have $A_0 = A_1 =: A$. Applying this to (4.24), we have
$$f_t = j_t \circ v^{-1}, \quad v := A^{-1} \circ \Delta \tag{4.27}$$
for any optimal parameters $\theta = (f,h)$. Again, from (M3'), we have
$$p_\theta(y|x,t) = p(y|x,t) \implies p_\epsilon(y - f_t(h_t(x))) = p_e(y - j_t(p_t(x))) \tag{4.28}$$
where $p_\epsilon = p_e$. The above is possible only when $f_t \circ h_t = j_t \circ p_t$. Combined with $f_t = j_t \circ v^{-1}$, we have conclusion 1).
Conclusion 2) then follows from the same reasoning as in Proposition 5, applied to both $p_0$ and $p_1$.
Note that, when multiplying the two lines of (4.26), the effects of $k \to 0$ cancel out, and $c_t$ is finite and well-defined. Also, it is apparent from the above proof that (D2') is a necessary and sufficient condition for $A_0 = A_1$, given the other conditions of Theorem 1.
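The algebra behind $A_0 = A_1$ and $c_0 = c_1$ can also be verified numerically: draw $L_0, L'_0, C, \beta_0, \beta'_0$ at random, construct $L_1 = L_0C$, $L'_1 = L'_0C$, and $\beta_1 - \beta'_1 = C^T(\beta_0 - \beta'_0)$ as (D2') and (4.26) dictate, and check that $A_t$ and $c_t$ agree across $t$. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
k = 4                                       # 2n, the size of L_t
L0, L0p, C = (rng.normal(size=(k, k)) + 2 * np.eye(k) for _ in range(3))
L1, L1p = L0 @ C, L0p @ C                   # (D2'): L0^{-1} L1 = L0'^{-1} L1' = C

b0, b0p, b1 = (rng.normal(size=k) for _ in range(3))
b1p = b1 - C.T @ (b0 - b0p)                 # enforce C^T (beta0 - beta0') = beta1 - beta1'

# A_t = L_t^{-T} L_t'^T and c_t = L_t^{-T} (beta_t - beta_t')
A = {t: np.linalg.solve(L.T, Lp.T) for t, (L, Lp) in enumerate([(L0, L0p), (L1, L1p)])}
c = {0: np.linalg.solve(L0.T, b0 - b0p), 1: np.linalg.solve(L1.T, b1 - b1p)}

assert np.allclose(A[0], A[1]) and np.allclose(c[0], c[1])
```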
Below, we prove the results in Sec. 4.3.2. The definitions and results also hold for the prior: simply replace $q_t(z|x)$ with $p_t(z|x) := p_\lambda(z|x,t)$ in the definitions and statements, and the proofs below hold unchanged. The dependence on $f$ prevails, and the superscripts are omitted. The argument $x$ is sometimes also omitted.
Lemma 2 (Counterfactual risk bound). Assume $|L_f(z,t)| \le M$; then
$$\epsilon_{CF}(x) \le \sum_t q(1-t|x)\,\epsilon_{F,t}(x) + M D(x) \tag{4.29}$$
where $\epsilon_{CF}(x) := \sum_t p(1-t|x)\,\epsilon_{CF,t}(x)$ and $D(x) := \sum_t \sqrt{D_{KL}(q_t \| q_{1-t})/2}$.
Proof of Lemma 2.
$$\begin{aligned}
\epsilon_{CF} - \sum_t p(1-t|x)\,\epsilon_{F,t}
&= p(0|x)(\epsilon_{CF,1} - \epsilon_{F,1}) + p(1|x)(\epsilon_{CF,0} - \epsilon_{F,0}) \\
&= p(0|x)\int L_f(z,1)\big(q_0(z|x) - q_1(z|x)\big)\,dz + p(1|x)\int L_f(z,0)\big(q_1(z|x) - q_0(z|x)\big)\,dz \\
&\le 2M\,\mathrm{TV}(q_1, q_0) \le MD.
\end{aligned}$$
Here $\mathrm{TV}(p,q) := \frac{1}{2}\int |p(z) - q(z)|\,dz$ is the total variation distance between probability densities $p$ and $q$. The last inequality uses Pinsker's inequality, $\mathrm{TV}(p,q) \le \sqrt{D_{KL}(p\|q)/2}$, twice, to obtain the symmetric $D$.
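Pinsker's inequality, the key step above, is easy to confirm numerically for a pair of Gaussians, where $D_{KL}$ is available in closed form and TV can be integrated on a grid. A sketch (the particular means and variances are arbitrary):

```python
import numpy as np
from scipy.stats import norm

m0, s0, m1, s1 = 0.0, 1.0, 0.8, 1.4
z = np.linspace(-12, 12, 200_001)
dz = z[1] - z[0]
p, q = norm.pdf(z, m0, s0), norm.pdf(z, m1, s1)

tv = 0.5 * np.sum(np.abs(p - q)) * dz                              # TV(p, q) on the grid
kl = np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5  # closed-form KL(p || q)
assert tv <= np.sqrt(kl / 2)                                       # Pinsker's inequality
print(f"TV = {tv:.4f} <= sqrt(KL/2) = {np.sqrt(kl / 2):.4f}")
```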
Theorem 2 is a direct corollary of Lemma 2 and the following.

Lemma 3. Define $\epsilon_F = \sum_t p(t|x)\,\epsilon_{F,t}$. We have
$$\epsilon_f \le 2\big(G(\epsilon_F + \epsilon_{CF}) - V_Y\big). \tag{4.30}$$
Bounding $\epsilon_{CF}$ in (4.30) by Lemma 2 yields Theorem 2. To prove Lemma 3, we first examine a bias-variance decomposition of $\epsilon_F$ and $\epsilon_{CF}$.
$$\begin{aligned}
\epsilon_{CF,t} &= \mathbb{E}_{q_{1-t}(z|x)}\, g_t(z)\, \mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - f_t(z))^2 \\
&\ge G\, \mathbb{E}_{q_{1-t}(z|x)}\, \mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - f_t(z))^2 \\
&= G\, \mathbb{E}_{q_{1-t}(z|x)}\, \mathbb{E}_{p_{Y(t)|p_t}(y|z)}\big((y - j_t(z))^2 + (j_t(z) - f_t(z))^2\big)
\end{aligned} \tag{4.31}$$
The second line uses $|g_t(z)| \le G$, and the third line is a bias-variance decomposition. We can now define $V_{CF,t}(x) := \mathbb{E}_{q_{1-t}(z|x)}\mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - j_t(z))^2$ and $B_{CF,t}(x) := \mathbb{E}_{q_{1-t}(z|x)}(j_t(z) - f_t(z))^2$, and we have
$$\epsilon_{CF,t} \ge G\big(V_{CF,t}(x) + B_{CF,t}(x)\big) \implies \epsilon_{CF} \ge G\big(V_{CF}(x) + B_{CF}(x)\big) \tag{4.32}$$
where $V_{CF} := \sum_t p(1-t|x)V_{CF,t} = \sum_t \mathbb{E}_{q(z,1-t|x)}\mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - j_t(z))^2$ and similarly $B_{CF} = \sum_t \mathbb{E}_{q(z,1-t|x)}(j_t(z) - f_t(z))^2$. Repeating the above derivation for $\epsilon_F$, we have
$$\epsilon_F \ge G\big(V_F(x) + B_F(x)\big) \tag{4.33}$$
where $V_F = \sum_t \mathbb{E}_{q(z,t|x)}\mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - j_t(z))^2$ and $B_F = \sum_t \mathbb{E}_{q(z,t|x)}(j_t(z) - f_t(z))^2$. Now we are ready to prove Lemma 3.
Proof of Lemma 3.
$$\begin{aligned}
\epsilon_f &= \mathbb{E}_{q(z|x)}\big((f_1 - f_0) - (j_1 - j_0)\big)^2 \\
&= \mathbb{E}_q\big((f_1 - j_1) + (j_0 - f_0)\big)^2 \\
&\le 2\,\mathbb{E}_q\big((f_1 - j_1)^2 + (j_0 - f_0)^2\big) \\
&= 2\int \big[(f_1 - j_1)^2 q(z,1|x) + (j_0 - f_0)^2 q(z,0|x) + (f_1 - j_1)^2 q(z,0|x) + (j_0 - f_0)^2 q(z,1|x)\big]\,dz \\
&= 2(B_F + B_{CF}) \le 2\big(G(\epsilon_F + \epsilon_{CF}) - V_Y\big).
\end{aligned}$$
The first inequality uses $(a+b)^2 \le 2(a^2 + b^2)$. The next equality splits $q(z|x)$ into $q(z,0|x)$ and $q(z,1|x)$ and rearranges to obtain $B_F$ and $B_{CF}$. The last inequality uses the two bias-variance decompositions and $V_Y = V_F + V_{CF}$.
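The inequality $\epsilon_f \le 2(B_F + B_{CF})$ established above holds for any $f_t$, $j_t$ and any split with $q(z,0|x) + q(z,1|x) = q(z|x)$, and can be checked by numerical integration on a grid. A sketch with hypothetical choices of all maps and densities:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-8, 8, 20_001)
dz = z[1] - z[0]

# Hypothetical joint densities q(z, t | x): two weighted Gaussians.
q = {0: 0.4 * norm.pdf(z, -0.5, 1.0), 1: 0.6 * norm.pdf(z, 0.7, 1.3)}
qz = q[0] + q[1]                                   # q(z | x)

f = {0: np.sin(z), 1: np.cos(z)}                   # hypothetical fitted maps f_t
j = {0: np.sin(z) + 0.3 * z, 1: np.cos(z) - 0.1}   # hypothetical true maps j_t

eps_f = np.sum(((f[1] - f[0]) - (j[1] - j[0]))**2 * qz) * dz
B_F = sum(np.sum((j[t] - f[t])**2 * q[t]) * dz for t in (0, 1))
B_CF = sum(np.sum((j[t] - f[t])**2 * q[1 - t]) * dz for t in (0, 1))

assert eps_f <= 2 * (B_F + B_CF)
print(f"eps_f = {eps_f:.4f} <= 2(B_F + B_CF) = {2 * (B_F + B_CF):.4f}")
```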