We restate our model identifiability formally.

Lemma 1 (Model identifiability). Given model (4.2) under (M1), for $T = t$, assume

(D1') (Non-degenerate data for $\lambda$) there exist $2n+1$ points $x_0, \dots, x_{2n} \in \mathcal{X}$ such that the $2n$-square matrix $L_t := [\gamma_{t,1}, \dots, \gamma_{t,2n}]$ is invertible, where $\gamma_{t,k} := \lambda_t(x_k) - \lambda_t(x_0)$.

Then, given $T = t$, the family is identifiable up to an equivalence class. That is, if $p_\theta(y|x,t) = p_{\theta'}(y|x,t)$, we have the following relation between the parameters: for any $y_t$ in the image of $f_t$,

$$f_t^{-1}(y_t) = \mathrm{diag}(a)\, f_t'^{-1}(y_t) + b =: A_t(f_t'^{-1}(y_t)) \qquad (4.14)$$

where $\mathrm{diag}(a)$ is an invertible $n$-diagonal matrix and $b$ is an $n$-vector, both depending on $\lambda_t$ and $\lambda_t'$.
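To make (D1') concrete, here is a minimal numerical sketch of checking it: we build $L_t$ from a toy $\lambda_t$ and test invertibility. Everything in the sketch (the map `lam`, the dimensions, the random points) is our own illustration, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 8                        # latent dim n, covariate dim m (toy choices)
W = rng.normal(size=(2 * n, m))    # parameters of a toy nonlinear lambda_t

def lam(x):
    """Toy natural-parameter map lambda_t : X -> R^{2n}."""
    return np.tanh(W @ x)

# (D1'): 2n+1 points x_0, ..., x_{2n} in X
xs = [rng.normal(size=m) for _ in range(2 * n + 1)]

# L_t = [gamma_{t,1}, ..., gamma_{t,2n}], gamma_{t,k} = lambda_t(x_k) - lambda_t(x_0)
L_t = np.stack([lam(x) - lam(xs[0]) for x in xs[1:]], axis=1)   # shape (2n, 2n)

assert np.linalg.matrix_rank(L_t) == 2 * n   # non-degeneracy: L_t is invertible
print("(D1') holds for these points")
```

For generic points and a generic nonlinear $\lambda_t$, the check passes.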

Note, (D1) in the main text implies (D1'); see Sec. B.2.3 in Khemakhem et al., 2020b. The main part of our model identifiability is essentially the same as that of Theorem 1 in Khemakhem et al., 2020b, but now adapted to include the dependency on $t$. Here we give an outline of the proof; the details can easily be filled in by referring to Khemakhem et al., 2020b. In the proof, the subscripts $t$ are omitted for convenience.

Proof of Lemma 1. Using (M1) i) and ii), we transform $p_{f,\lambda}(y|x,t) = p_{f',\lambda'}(y|x,t)$ into an equality of noiseless distributions, that is,

$$q_{f,\lambda}(y) = q_{f',\lambda'}(y), \qquad q_{f,\lambda}(y) := p_\lambda(f^{-1}(y)|x,t)\, \mathrm{vol}(J_{f^{-1}}(y))\, \mathbb{I}_{\mathcal{Y}}(y) \qquad (4.15)$$

where $p_\lambda$ is the Gaussian density function of the conditional prior defined in (4.2) and $\mathrm{vol}(A) := \sqrt{\det A A^T}$; $q_{f',\lambda'}$ is defined similarly to $q_{f,\lambda}$.

Then, applying model (4.2) to (4.15), plugging the $2n+1$ points from (D1') into it, and re-arranging the resulting $2n+1$ equations in matrix form, we have

$$F(Y) = F'(Y), \qquad F(Y) := L_t^T\, t(f^{-1}(Y)) - \beta_t \qquad (4.16)$$

where $t(Z) := (Z, Z^2)^T$ is the sufficient statistics of the factorized Gaussian, and $\beta_t := (\alpha_t(x_1) - \alpha_t(x_0), \dots, \alpha_t(x_{2n}) - \alpha_t(x_0))^T$, where $\alpha_t(X; \lambda_t)$ is the log-partition function of the conditional prior in (4.2). $F'$ is defined similarly to $F$, but with $f', \lambda', \alpha'$.

Since $L$ is invertible, we have

$$t(f^{-1}(Y)) = A\, t(f'^{-1}(Y)) + c \qquad (4.17)$$

where $A = L^{-T} L'^T$ and $c = L^{-T}(\beta - \beta')$.

The final part of the proof is to show, by following the same reasoning as in Appendix B of Sorrenson, Rother, and Köthe, 2019, that $A$ is a sparse matrix such that

$$A = \begin{pmatrix} \mathrm{diag}(a) & O \\ \mathrm{diag}(u) & \mathrm{diag}(a^2) \end{pmatrix} \qquad (4.18)$$

where $A$ is partitioned into four $n$-square matrices. Thus

$$f^{-1}(Y) = \mathrm{diag}(a)\, f'^{-1}(Y) + b \qquad (4.19)$$

where $b$ is the first half of $c$.
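The sparsity pattern (4.18) can be verified numerically: if $f^{-1}(Y) = \mathrm{diag}(a) f'^{-1}(Y) + b$, expanding the square in $t(\cdot) = (\cdot, \cdot^2)^T$ forces exactly the blocks above, with $u = 2a \odot b$ and $c = (b, b^2)^T$ in this worked case. A minimal sketch with random $a$, $b$ of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a, b = rng.normal(size=n), rng.normal(size=n)
zp = rng.normal(size=n)            # stands in for f'^{-1}(y)
z = a * zp + b                     # the affine relation (4.19)

def t_stat(v):
    """Sufficient statistics of the factorized Gaussian: t(v) = (v, v^2)."""
    return np.concatenate([v, v ** 2])

# Block matrix (4.18), with u = 2ab and c = (b, b^2) in this worked case
A = np.block([[np.diag(a),         np.zeros((n, n))],
              [np.diag(2 * a * b), np.diag(a ** 2)]])
c = np.concatenate([b, b ** 2])

assert np.allclose(t_stat(z), A @ t_stat(zp) + c)   # matches (4.17)
print("block structure (4.18) verified")
```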

Proof of Proposition 5. Under (G2) and (M3), we have

$$\mathbb{E}_{p_\theta}(Y|X,T) = \mathbb{E}(Y|X,T) \implies f_t \circ h(x) = j_t \circ p(x) \quad \text{on } (x,t) \text{ such that } p(t,x) > 0. \qquad (4.20)$$

We show that the solution set of (4.20) on overlapping $x$ is

$$\{(f,h) \mid f_t = j_t \circ \Delta^{-1},\ h = \Delta \circ p,\ \Delta : \mathcal{P} \to \mathbb{R}^n \text{ is injective}\}. \qquad (4.21)$$

By (G2), (M1), and with injective $f_t, j_t$ and $\dim(Z) = \dim(Y) \ge \dim(\mathcal{P})$, for any $\Delta$ above, there exists a functional parameter $f_t$ such that $j_t = f_t \circ \Delta$. Thus, the set (4.21) is non-empty, and any of its elements is indeed a solution because $f_t \circ h = j_t \circ \Delta^{-1} \circ \Delta \circ p = j_t \circ p$.
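As a concrete instance of this construction, the following sketch picks toy $p$, $\Delta$, and $j_t$ (all hypothetical, with $\mathcal{P} = \mathbb{R}$ and $n = 1$) and confirms that $f_t = j_t \circ \Delta^{-1}$, $h = \Delta \circ p$ solves (4.20):

```python
import numpy as np

# Toy instances (ours, for illustration): p : X -> P with P = R,
# an injective Delta : P -> R, and injective j_t.
p     = lambda x: np.tanh(x).sum()          # toy prognostic score
Delta = lambda s: 2.0 * s + 1.0             # injective Delta
Dinv  = lambda z: (z - 1.0) / 2.0           # Delta^{-1}
j     = {0: lambda s: s ** 3, 1: lambda s: s ** 3 + s}   # injective j_t

# An element of the solution set (4.21): f_t = j_t o Delta^{-1}, h = Delta o p
f = {t: (lambda jt: lambda z: jt(Dinv(z)))(jt) for t, jt in j.items()}
h = lambda x: Delta(p(x))

x = np.random.default_rng(2).normal(size=5)
for t in (0, 1):
    assert np.isclose(f[t](h(x)), j[t](p(x)))   # (4.20): f_t o h = j_t o p
print("solution-set construction verified")
```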

Any solution of (4.20) must be in (4.21). A solution must satisfy $h(x) = f_t^{-1} \circ j_t \circ p(x)$ for both $t$, since $x$ is overlapping. This means the injective function $f_t^{-1} \circ j_t$ must not depend on $t$; thus it is one of the $\Delta$ in (4.21).

We have proved conclusion 1) with $v := \Delta$. And, on overlapping $x$, conclusion 2) is quickly seen from

$$\hat{\mu}_t(x) = f_t(h(x)) = j_t \circ v^{-1}(v \circ p(x)) = j_t(p(x)) = \mu_t(x). \qquad (4.22)$$

We rely on overlapping $p$ to deal with non-overlapping $x$. For any $x_t$ with $p(1-t|x_t) = 0$, to ensure $p(1-t\,|\,p(x_t)) > 0$, there must exist $x_{1-t}$ such that $p(x_{1-t}) = p(x_t)$ and $p(1-t|x_{1-t}) > 0$. And we also have $h(x_{1-t}) = h(x_t)$ due to (M2). Then, we have

$$\hat{\mu}_{1-t}(x_t) = f_{1-t}(h(x_t)) = f_{1-t}(h(x_{1-t})) = j_{1-t}(p(x_{1-t})) = j_{1-t}(p(x_t)) = \mu_{1-t}(x_t). \qquad (4.23)$$

The third equality uses (4.20) on $(x_{1-t}, 1-t)$.

Below we prove Theorem 1 with (D2) replaced by

(D2') (Spontaneous balance) there exist $2n+1$ points $x_0, \dots, x_{2n} \in \mathcal{X}$, a $2n$-square matrix $C$, and a $2n$-vector $d$, such that $L_0^{-1} L_1 = C$ and $\beta_0 - C^{-T} \beta_1 = d/k$ for optimal $\lambda_t$ (see below), where $L_t$ is defined in (D1'), $\beta_t := (\alpha_t(x_1) - \alpha_t(x_0), \dots, \alpha_t(x_{2n}) - \alpha_t(x_0))^T$, and $\alpha_t(X; \lambda_t)$ is the log-partition function of the prior in (4.2).

(D2') restricts the discrepancy between $\lambda_0, \lambda_1$ on $2n+1$ values of $X$, and thus is relatively easy to satisfy with high-dimensional $X$. (D2') is general despite (or thanks to) the involved formulation. Let us see its generality even under a highly special case: $C = cI$ and $d = 0$. Then, $L_0^{-1} L_1 = cI$ requires that $\lambda_1(x_k) - c\lambda_0(x_k)$ is the same for the $2n+1$ points $x_k$. This is easily satisfied except for $n \gg m$, where $m$ is the dimension of $X$, which rarely happens in practice. And $\beta_0 - C^{-T}\beta_1 = d$ becomes just $\beta_1 = c\beta_0$. This is equivalent to $\alpha_1(x_k) - c\alpha_0(x_k)$ being the same for the $2n+1$ points, again fine in practice.
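A quick numerical illustration of this special case (all values are toys of our choosing): if $\lambda_1(x_k) - c\lambda_0(x_k)$ is one common vector across the $2n+1$ points, then $L_0^{-1} L_1 = cI$. The analogous check for $\beta_1 = c\beta_0$ is identical with $\alpha_t$ in place of $\lambda_t$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, c = 3, 0.7
lam0 = rng.normal(size=(2 * n + 1, 2 * n))   # rows: toy lambda_0(x_k), k = 0..2n
shift = rng.normal(size=2 * n)               # one common vector for all points
lam1 = c * lam0 + shift                      # lambda_1(x_k) - c*lambda_0(x_k) constant

# L_t has columns gamma_{t,k} = lambda_t(x_k) - lambda_t(x_0)
L = {t: (lam_[1:] - lam_[0]).T for t, lam_ in ((0, lam0), (1, lam1))}

assert np.allclose(np.linalg.solve(L[0], L[1]), c * np.eye(2 * n))
print("L_0^{-1} L_1 = cI verified")
```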

However, the high generality comes with a price. Verifying (D2') using data is challenging, particularly with high-dimensional covariates and latent variables. Although we believe fast algorithms for this purpose could be developed, the effort would be nontrivial. This is another motivation to use the extreme case $\lambda_0 = \lambda_1$ in Sec. 4.3.1, which corresponds to $C = I$ and $d = 0$.

Proof of Theorem 1. By (M1) and (G1'), for any injective function $\Delta : \mathcal{P} \to \mathbb{R}^n$, there exists a functional parameter $f_t$ such that $j_t = f_t \circ \Delta$. Let $h_t = \Delta \circ p_t$; then, clearly from (M3'), such parameters $\theta = (f,h)$ are optimal: $p_\theta(y|x,t) = p(y|x,t)$.

Since we have all the assumptions of Lemma 1, we have

$$\Delta \circ j^{-1}(y) = f^{*-1}(y) = A \circ f^{-1}(y)\big|_t, \quad \text{on } (y,t) \in \{(j_t \circ p_t(x), t) \mid p(t,x) > 0\}, \qquad (4.24)$$

where $f$ is any optimal parameter, and "$|_t$" collects all the subscripts $t$. Note, except for $\Delta$, all the symbols should have the subscript $t$.

Nevertheless, using (D2'), we can further prove $A_0 = A_1$.

We repeat the core quantities from Lemma 1 here: $A_t = L_t^{-T} L_t'^T$ and $c_t = L_t^{-T}(\beta_t - \beta_t')$.

From (D2'), we immediately have

$$L_0^{-1} L_1 = L_0'^{-1} L_1' = C \iff A_0 = A_1. \qquad (4.25)$$

And also,

$$L_0^{-1} L_1 = C \iff L_0^{-T} C^{-T} = L_1^{-T},$$
$$\beta_0 - C^{-T}\beta_1 = \beta_0' - C^{-T}\beta_1' = d/k \iff C^T(\beta_0 - \beta_0') = \beta_1 - \beta_1'. \qquad (4.26)$$

Multiplying the right-hand sides of the two lines, we have $c_0 = c_1$. Now we have $A_0 = A_1 =: A$. Applying this to (4.24), we have

$$f_t = j_t \circ v^{-1}, \qquad v := A^{-1} \circ \Delta \qquad (4.27)$$

for any optimal parameters $\theta = (f,h)$. Again, from (M3'), we have

$$p_\theta(y|x,t) = p(y|x,t) \implies p_\epsilon(y - f_t(h_t(x))) = p_e(y - j_t(p_t(x))) \qquad (4.28)$$

where $p_\epsilon = p_e$. And the above is possible only when $f_t \circ h_t = j_t \circ p_t$. Combined with $f_t = j_t \circ v^{-1}$, we have conclusion 1).

And conclusion 2) follows from the same reasoning as Proposition 5, applied to both $p_0$ and $p_1$.

Note, when multiplying the two lines of (4.26), the effects of $k \to 0$ cancel out, and $c_t$ is finite and well-defined. Also, it is apparent from the above proof that (D2') is a necessary and sufficient condition for $A_0 = A_1$, if the other conditions of Theorem 1 are given.
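The chain of implications (4.25)-(4.26) can be checked numerically: below we construct toy $L_t, L_t', \beta_t, \beta_t'$ satisfying (D2') (with $k$ absorbed into $d$; all dimensions and values are our own toys) and confirm $A_0 = A_1$ and $c_0 = c_1$.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 6                                        # N = 2n, toy size
L0, L0p, C = (rng.normal(size=(N, N)) for _ in range(3))
L1, L1p = L0 @ C, L0p @ C                    # so L_0^{-1} L_1 = L_0'^{-1} L_1' = C

d = rng.normal(size=N)                       # common vector (k absorbed into d)
b0, b0p = rng.normal(size=N), rng.normal(size=N)
b1, b1p = C.T @ (b0 - d), C.T @ (b0p - d)    # enforce beta_0 - C^{-T} beta_1 = d
                                             # for both optimal lambda and lambda'
A = {t: np.linalg.solve(L.T, Lp.T)           # A_t = L_t^{-T} L_t'^T
     for t, (L, Lp) in ((0, (L0, L0p)), (1, (L1, L1p)))}
c = {t: np.linalg.solve(L.T, b - bp)         # c_t = L_t^{-T} (beta_t - beta_t')
     for t, (L, b, bp) in ((0, (L0, b0, b0p)), (1, (L1, b1, b1p)))}

assert np.allclose(A[0], A[1]) and np.allclose(c[0], c[1])
print("A_0 = A_1 and c_0 = c_1 verified")
```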

Below, we prove the results in Sec. 4.3.2. The definitions and results also work for the prior: simply replace $q_t(z|x)$ with $p_t(z|x) := p_\lambda(z|x,t)$ in the definitions and statements, and the proofs below hold just the same. The dependence on $f$ prevails, and the superscripts are omitted. The argument $x$ is sometimes also omitted.

Lemma 2 (Counterfactual risk bound). Assume $|L_f(z,t)| \le M$; then we have

$$\epsilon_{CF}(x) \le \sum_t p(1-t|x)\, \epsilon_{F,t}(x) + M D(x) \qquad (4.29)$$

where $\epsilon_{CF}(x) := \sum_t p(1-t|x)\, \epsilon_{CF,t}(x)$ and $D(x) := \sum_t \sqrt{D_{KL}(q_t \| q_{1-t})/2}$.

Proof of Lemma 2.

$$\begin{aligned} \epsilon_{CF} - \sum_t p(1-t|x)\, \epsilon_{F,t} &= p(0|x)(\epsilon_{CF,1} - \epsilon_{F,1}) + p(1|x)(\epsilon_{CF,0} - \epsilon_{F,0}) \\ &= p(0|x) \int L_f(z,1)\,(q_0(z|x) - q_1(z|x))\, dz + p(1|x) \int L_f(z,0)\,(q_1(z|x) - q_0(z|x))\, dz \\ &\le 2M\, \mathrm{TV}(q_1, q_0) \le M D. \end{aligned}$$

$\mathrm{TV}(p,q) := \frac{1}{2} \int |p(z) - q(z)|\, dz$ is the total variation distance between probability densities $p, q$. The last inequality uses Pinsker's inequality $\mathrm{TV}(p,q) \le \sqrt{D_{KL}(p \| q)/2}$ twice, to obtain the symmetric $D$.
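For intuition, the two applications of Pinsker's inequality can be checked numerically on a pair of 1-D Gaussians standing in for $q_0(z|x)$ and $q_1(z|x)$ (the densities and their parameters are arbitrary toys):

```python
import numpy as np

m0, s0, m1, s1 = 0.0, 1.0, 0.8, 1.5     # toy q_0, q_1: N(m, s^2)

def kl(ma, sa, mb, sb):
    """Closed-form KL( N(ma, sa^2) || N(mb, sb^2) )."""
    return np.log(sb / sa) + (sa**2 + (ma - mb)**2) / (2 * sb**2) - 0.5

# Total variation via numerical integration of |q_0 - q_1| / 2
z = np.linspace(-12.0, 12.0, 100001)
pdf = lambda m, s: np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
tv = 0.5 * np.sum(np.abs(pdf(m0, s0) - pdf(m1, s1))) * (z[1] - z[0])

D = np.sqrt(kl(m0, s0, m1, s1) / 2) + np.sqrt(kl(m1, s1, m0, s0) / 2)
assert 2 * tv <= D                      # the symmetric bound used in the proof
print(f"2 TV = {2 * tv:.4f} <= D = {D:.4f}")
```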

Theorem 2 is a direct corollary of Lemma 2 and the following lemma.

Lemma 3. Define $\epsilon_F := \sum_t p(t|x)\, \epsilon_{F,t}$. We have

$$\epsilon_f \le 2\big(G^{-1}(\epsilon_F + \epsilon_{CF}) - V_Y\big). \qquad (4.30)$$

Simply bounding $\epsilon_{CF}$ in (4.30) by Lemma 2, we obtain Theorem 2. To prove Lemma 3, we first examine a bias-variance decomposition of $\epsilon_F$ and $\epsilon_{CF}$.

$$\begin{aligned} \epsilon_{CF,t} &= \mathbb{E}_{q_{1-t}(z|x)}\, g_t(z)\, \mathbb{E}_{p_{Y(t)|p_t}(y|z)} (y - f_t(z))^2 \\ &\ge G\, \mathbb{E}_{q_{1-t}(z|x)} \mathbb{E}_{p_{Y(t)|p_t}(y|z)} (y - f_t(z))^2 \\ &= G\, \mathbb{E}_{q_{1-t}(z|x)} \mathbb{E}_{p_{Y(t)|p_t}(y|z)} \big((y - j_t(z))^2 + (j_t(z) - f_t(z))^2\big) \end{aligned} \qquad (4.31)$$

The second line uses $g_t(z) \ge G$, and the third line is a bias-variance decomposition. Now we can define $V_{CF,t}(x) := \mathbb{E}_{q_{1-t}(z|x)} \mathbb{E}_{p_{Y(t)|p_t}(y|z)} (y - j_t(z))^2$ and $B_{CF,t}(x) := \mathbb{E}_{q_{1-t}(z|x)} (j_t(z) - f_t(z))^2$, and we have

$$\epsilon_{CF,t} \ge G(V_{CF,t}(x) + B_{CF,t}(x)) \implies \epsilon_{CF} \ge G(V_{CF}(x) + B_{CF}(x)) \qquad (4.32)$$

where $V_{CF} := \sum_t p(1-t|x)\, V_{CF,t} = \sum_t \mathbb{E}_{q(z,1-t|x)} \mathbb{E}_{p_{Y(t)|p_t}(y|z)} (y - j_t(z))^2$ and similarly $B_{CF} := \sum_t \mathbb{E}_{q(z,1-t|x)} (j_t(z) - f_t(z))^2$. Repeating the above derivation for $\epsilon_F$, we have

$$\epsilon_F \ge G(V_F(x) + B_F(x)) \qquad (4.33)$$

where $V_F = \sum_t \mathbb{E}_{q(z,t|x)} \mathbb{E}_{p_{Y(t)|p_t}(y|z)} (y - j_t(z))^2$ and $B_F = \sum_t \mathbb{E}_{q(z,t|x)} (j_t(z) - f_t(z))^2$. Now, we are ready to prove Lemma 3.
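As a quick aside, the bias-variance step in (4.31) can be sanity-checked by Monte Carlo, using the fact that $j_t(z) = \mathbb{E}[Y(t)|z]$; the outcome model and the functions below are hypothetical toys:

```python
import numpy as np

rng = np.random.default_rng(5)
j = lambda z: np.sin(z)              # toy j_t(z) = E[y | z]
f = lambda z: np.sin(z) + 0.3        # some other decoder f_t

z = rng.normal(size=200_000)                  # z ~ q_{1-t}(z|x), toy choice
y = j(z) + 0.5 * rng.normal(size=z.size)      # y ~ p(y|z), with mean j_t(z)

lhs = np.mean((y - f(z)) ** 2)                        # E (y - f_t(z))^2
rhs = np.mean((y - j(z)) ** 2 + (j(z) - f(z)) ** 2)   # variance + bias terms
print(f"lhs = {lhs:.4f}  vs  rhs = {rhs:.4f}")        # equal up to MC error
```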

Proof of Lemma 3.

$$\begin{aligned} \epsilon_f &= \mathbb{E}_{q(z|x)} \big((f_1 - f_0) - (j_1 - j_0)\big)^2 \\ &= \mathbb{E}_q \big((f_1 - j_1) + (j_0 - f_0)\big)^2 \\ &\le 2\, \mathbb{E}_q \big((f_1 - j_1)^2 + (j_0 - f_0)^2\big) \\ &= 2 \int \big[(f_1 - j_1)^2\, q(z,1|x) + (j_0 - f_0)^2\, q(z,0|x) + (f_1 - j_1)^2\, q(z,0|x) + (j_0 - f_0)^2\, q(z,1|x)\big]\, dz \\ &= 2(B_F + B_{CF}) \le 2\big(G^{-1}(\epsilon_F + \epsilon_{CF}) - V_Y\big). \end{aligned}$$

The first inequality uses $(a+b)^2 \le 2(a^2 + b^2)$. The next equality splits $q(z|x)$ into $q(z,0|x)$ and $q(z,1|x)$ and rearranges to get $B_F$ and $B_{CF}$. The last inequality uses the two bias-variance decompositions, and $V_Y = V_F + V_{CF}$.
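As a final sanity check, the key step $\epsilon_f \le 2(B_F + B_{CF})$ can be verified by Monte Carlo with toy components (hypothetical $f_t$, $j_t$, and Gaussians for $q(z|x,t)$; only the structure, not the values, matters):

```python
import numpy as np

rng = np.random.default_rng(6)
j = {0: np.sin, 1: np.cos}                                       # toy j_t
f = {0: lambda z: np.sin(z) + 0.2, 1: lambda z: np.cos(z) - 0.1}

qt = np.array([0.4, 0.6])                     # q(t|x)
qz = {0: (0.0, 1.0), 1: (0.5, 1.2)}           # q(z|x,t) = N(m, s^2)
Z = {t: rng.normal(m, s, 500_000) for t, (m, s) in qz.items()}

diff = lambda z: (f[1](z) - f[0](z)) - (j[1](z) - j[0](z))
eps_f = sum(qt[t] * np.mean(diff(Z[t]) ** 2) for t in (0, 1))    # over q(z|x)

B_F  = sum(qt[t]     * np.mean((j[t](Z[t])     - f[t](Z[t]))     ** 2) for t in (0, 1))
B_CF = sum(qt[1 - t] * np.mean((j[t](Z[1 - t]) - f[t](Z[1 - t])) ** 2) for t in (0, 1))

assert eps_f <= 2 * (B_F + B_CF) + 1e-6       # the bound from the proof
print(f"eps_f = {eps_f:.4f} <= 2(B_F + B_CF) = {2 * (B_F + B_CF):.4f}")
```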