We restate our model identifiability result formally.
Lemma 1 (Model identifiability). Given model (4.2) under (M1), for $T = t$, assume

(D1') (Non-degenerate data for $\lambda$) there exist $2n+1$ points $x_0, \ldots, x_{2n} \in \mathcal{X}$ such that the $2n$-square matrix $L_t := [\gamma_{t,1}, \ldots, \gamma_{t,2n}]$ is invertible, where $\gamma_{t,k} := \lambda_t(x_k) - \lambda_t(x_0)$.

Then, given $T = t$, the family is identifiable up to an equivalence class. That is, if $p_\theta(y|x,t) = p_{\theta'}(y|x,t)$, the parameters are related as follows: for any $y_t$ in the image of $f_t$,
$$f_t^{-1}(y_t) = \mathrm{diag}(a)\, f_t'^{-1}(y_t) + b =: A_t(f_t'^{-1}(y_t)) \tag{4.14}$$
where $\mathrm{diag}(a)$ is an invertible $n$-diagonal matrix and $b$ is an $n$-vector, both depending on $\lambda_t$ and $\lambda'_t$.
Note, (D1) in the main text implies (D1'); see Sec. B.2.3 in Khemakhem et al., 2020b. The main part of our model identifiability result is essentially the same as that of Theorem 1 in Khemakhem et al., 2020b, but now adapted to include the dependency on $t$. Here we give an outline of the proof; the details can easily be filled in by referring to Khemakhem et al., 2020b. In the proof, subscripts $t$ are omitted for convenience.
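Though not part of the proof, (D1') is straightforward to probe numerically: form the difference matrix $L_t$ from candidate points and test its invertibility, e.g., via the condition number. Below is a minimal sketch; the nonlinear map `lam` standing in for $\lambda_t$ and the threshold `tol` are hypothetical choices, not part of the model.

```python
import numpy as np

def check_D1(lam, xs, tol=1e8):
    """Check (D1'): given 2n+1 points xs[0..2n] and lam: x -> R^{2n}, form
    L = [lam(x_1)-lam(x_0), ..., lam(x_2n)-lam(x_0)] and test invertibility."""
    L = np.stack([lam(x) - lam(xs[0]) for x in xs[1:]], axis=1)  # 2n x 2n
    cond = np.linalg.cond(L)
    return cond < tol, cond

# Toy example: n = 2 latent dimensions, so lam maps x in R^3 to R^{2n} = R^4.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
lam = lambda x: np.tanh(W @ x)                 # hypothetical lambda_t
xs = [rng.normal(size=3) for _ in range(5)]    # 2n + 1 = 5 points
ok, cond = check_D1(lam, xs)
print(f"(D1') holds: {ok}, condition number: {cond:.2e}")
```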
Proof of Lemma 1. Using (M1) i) and ii), we transform $p_{f,\lambda}(y|x,t) = p_{f',\lambda'}(y|x,t)$ into an equality of noiseless distributions, that is,
$$q_{f',\lambda'}(y) = q_{f,\lambda}(y) := p_\lambda(f^{-1}(y) \mid x, t)\, \mathrm{vol}(J_{f^{-1}}(y))\, \mathbb{I}_{\mathcal{Y}}(y) \tag{4.15}$$
where $p_\lambda$ is the Gaussian density function of the conditional prior defined in (4.2), $\mathrm{vol}(A) := \sqrt{\det A A^T}$, and $q_{f',\lambda'}$ is defined similarly to $q_{f,\lambda}$.
Then, applying model (4.2) to (4.15), plugging the $2n+1$ points from (D1') into it, and re-arranging the resulting $2n+1$ equations in matrix form, we have
$$F'(Y) = F(Y) := L^T t(f^{-1}(Y)) - \beta \tag{4.16}$$
where $t(Z) := (Z, Z^2)^T$ is the sufficient statistics of the factorized Gaussian, and $\beta_t := (\alpha_t(x_1) - \alpha_t(x_0), \ldots, \alpha_t(x_{2n}) - \alpha_t(x_0))^T$, where $\alpha_t(X;\lambda_t)$ is the log-partition function of the conditional prior in (4.2). $F'$ is defined similarly to $F$, but with $f', \lambda', \alpha'$.
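For concreteness, recall the exponential-family form behind $t(Z)$ and $\alpha_t$: a univariate Gaussian factor satisfies $\log\mathcal{N}(z;\mu,\sigma^2) = \langle\lambda, t(z)\rangle - \alpha$ with natural parameters $\lambda = (\mu/\sigma^2, -1/(2\sigma^2))$, sufficient statistics $t(z) = (z, z^2)$, and log-partition $\alpha = \mu^2/(2\sigma^2) + \frac{1}{2}\log(2\pi\sigma^2)$. A quick numerical confirmation (a sketch, not part of the proof):

```python
import numpy as np
from scipy.stats import norm

mu, sigma, z = 0.7, 1.3, -0.4
lam = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])               # natural parameters
t_z = np.array([z, z**2])                                            # sufficient statistics t(z)
alpha = mu**2 / (2 * sigma**2) + 0.5 * np.log(2 * np.pi * sigma**2)  # log-partition
assert np.isclose(lam @ t_z - alpha, norm.logpdf(z, mu, sigma))
```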
Since $L$ is invertible, we have
$$t(f^{-1}(Y)) = A\, t(f'^{-1}(Y)) + c \tag{4.17}$$
where $A = L^{-T}L'^T$ and $c = L^{-T}(\beta - \beta')$.
The final part of the proof is to show, following the same reasoning as in Appendix B of Sorrenson, Rother, and Köthe, 2019, that $A$ is a sparse matrix of the form
$$A = \begin{pmatrix} \mathrm{diag}(a) & O \\ \mathrm{diag}(u) & \mathrm{diag}(a^2) \end{pmatrix} \tag{4.18}$$
where $A$ is partitioned into four $n$-square matrices. Thus
$$f^{-1}(Y) = \mathrm{diag}(a)\, f'^{-1}(Y) + b \tag{4.19}$$
where $b$ is the first half of $c$.
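The equivalence class in (4.19) can be made tangible: if $f'(z) := f(\mathrm{diag}(a)z + b)$ and the Gaussian prior is reparametrized accordingly, the two models generate exactly the same distribution over $Y$. A Monte Carlo sketch of this non-identifiability (the decoder `f` and all parameters below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2
M = rng.normal(size=(n, n)) + 2 * np.eye(n)
f = lambda z: np.tanh(z @ M.T)              # injective decoder (hypothetical)

a, b = np.array([1.5, -0.8]), np.array([0.3, -0.2])
f_prime = lambda z: f(z * a + b)            # f'(z) = f(diag(a) z + b)

mu, sig = np.array([0.1, -0.5]), np.array([0.7, 1.1])
z = rng.normal(mu, sig, size=(10_000, n))   # z ~ Gaussian prior for f
z_prime = (z - b) / a                       # z' = diag(a)^{-1}(z - b): reparametrized prior

# Identical samples, hence identical distributions over Y.
assert np.allclose(f(z), f_prime(z_prime))
```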
Proof of Proposition 5. Under (G2) and (M3), we have
$$\mathbb{E}_{p_\theta}(Y|X,T) = \mathbb{E}(Y|X,T) \implies f_t \circ h(x) = j_t \circ p(x) \ \text{ on } (x,t) \text{ such that } p(t,x) > 0. \tag{4.20}$$
We show that the solution set of (4.20) on overlapping $x$ is
$$\{(f,h) \mid f_t = j_t \circ \Delta^{-1},\ h = \Delta \circ p,\ \Delta : \mathcal{P} \to \mathbb{R}^n \text{ is injective}\}. \tag{4.21}$$
By (G2), (M1), and with injective $f_t, j_t$ and $\dim(Z) = \dim(Y) \ge \dim(p)$, for any $\Delta$ above there exists a functional parameter $f_t$ such that $j_t = f_t \circ \Delta$. Thus the set (4.21) is non-empty, and any element of it is indeed a solution because $f_t \circ h = j_t \circ \Delta^{-1} \circ \Delta \circ p = j_t \circ p$.
Conversely, any solution of (4.20) must be in (4.21). A solution must satisfy $h(x) = f_t^{-1} \circ j_t \circ p(x)$ for both $t$, since $x$ is overlapping. This means the injective function $f_t^{-1} \circ j_t$ must not depend on $t$; thus it is one of the $\Delta$ in (4.21).
This proves conclusion 1) with $v := \Delta$. On overlapping $x$, conclusion 2) follows immediately from
$$\hat{\mu}_t(x) = f_t(h(x)) = j_t \circ v^{-1}(v \circ p(x)) = j_t(p(x)) = \mu_t(x). \tag{4.22}$$
We rely on overlapping $p$ to handle non-overlapping $x$. For any $x_t$ with $p(1-t|x_t) = 0$, to ensure $p(1-t \mid p(x_t)) > 0$ there must exist $x_{1-t}$ such that $p(x_{1-t}) = p(x_t)$ and $p(1-t|x_{1-t}) > 0$. We also have $h(x_{1-t}) = h(x_t)$ due to (M2). Then, we have
$$\hat{\mu}_{1-t}(x_t) = f_{1-t}(h(x_t)) = f_{1-t}(h(x_{1-t})) = j_{1-t}(p(x_{1-t})) = j_{1-t}(p(x_t)) = \mu_{1-t}(x_t). \tag{4.23}$$
The third equality uses (4.20) on $(x_{1-t}, 1-t)$.
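To illustrate conclusion 1): pick any injective $\Delta$, set $f_t = j_t \circ \Delta^{-1}$ and $h = \Delta \circ p$, and $f_t \circ h = j_t \circ p$ holds identically. A minimal sketch with a linear $\Delta$ (all maps below are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(size=(2, 2)) + 2 * np.eye(2)              # injective linear Delta
D_inv = np.linalg.inv(D)

p = lambda x: np.array([np.sum(x**2), x[0] - x[1]])      # hypothetical covariate map p
j = {t: (lambda z, t=t: np.sin(z + t)) for t in (0, 1)}  # hypothetical true maps j_t

h = lambda x: D @ p(x)                                     # h = Delta o p
f = {t: (lambda z, t=t: j[t](D_inv @ z)) for t in (0, 1)}  # f_t = j_t o Delta^{-1}

x = rng.normal(size=3)
for t in (0, 1):
    assert np.allclose(f[t](h(x)), j[t](p(x)))           # f_t o h = j_t o p
```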
Below we prove Theorem 1 with (D2) replaced by

(D2') (Spontaneous balance) there exist $2n+1$ points $x_0, \ldots, x_{2n} \in \mathcal{X}$, a $2n$-square matrix $C$, and a $2n$-vector $d$, such that $L_0^{-1}L_1 = C$ and $\beta_0 - C^{-T}\beta_1 = d/k$ for optimal $\lambda_t$ (see below), where $L_t$ is defined in (D1'), $\beta_t := (\alpha_t(x_1) - \alpha_t(x_0), \ldots, \alpha_t(x_{2n}) - \alpha_t(x_0))^T$, and $\alpha_t(X;\lambda_t)$ is the log-partition function of the prior in (4.2).
(D2') restricts the discrepancy between $\lambda_0$ and $\lambda_1$ on $2n+1$ values of $X$, and is thus relatively easy to satisfy with high-dimensional $X$. (D2') is general despite (or thanks to) its involved formulation. Let us see its generality even under a highly special case: $C = cI$ and $d = 0$. Then $L_0^{-1}L_1 = cI$ requires that $\lambda_1(x_k) - c\lambda_0(x_k)$ be the same for the $2n+1$ points $x_k$. This is easily satisfied except for $n \gg m$, where $m$ is the dimension of $X$, which rarely happens in practice. And $\beta_0 - C^{-T}\beta_1 = d$ becomes just $\beta_1 = c\beta_0$. This is equivalent to $\alpha_1(x_k) - c\alpha_0(x_k)$ being the same for the $2n+1$ points, again fine in practice.
However, the high generality comes at a price. Verifying (D2') using data is challenging, particularly with high-dimensional covariates and latent variables. Although we believe fast algorithms for this purpose could be developed, the effort would be nontrivial. This is another motivation to use the extreme case $\lambda_0 = \lambda_1$ in Sec. 4.3.1, which corresponds to $C = I$ and $d = 0$.
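As a concrete instance of the special case: if $\lambda_1(x) = c\,\lambda_0(x) + \mathrm{const}$ on the $2n+1$ points, the constants cancel in the differences, so $L_1 = cL_0$ and $L_0^{-1}L_1 = cI$ holds exactly. A sketch of this check (the map standing in for $\lambda_0$ is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
c, n, m = 1.7, 2, 3
W = rng.normal(size=(2 * n, m))
lam0 = lambda x: np.tanh(W @ x)            # hypothetical lambda_0
lam1 = lambda x: c * lam0(x) + 0.5         # lambda_1 = c * lambda_0 + const

xs = [rng.normal(size=m) for _ in range(2 * n + 1)]
L = {t: np.stack([lam(x) - lam(xs[0]) for x in xs[1:]], axis=1)
     for t, lam in ((0, lam0), (1, lam1))}

assert np.allclose(np.linalg.solve(L[0], L[1]), c * np.eye(2 * n))  # L0^{-1} L1 = cI
```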
Proof of Theorem 1. By (M1) and (G1'), for any injective function $\Delta : \mathcal{P} \to \mathbb{R}^n$ there exists a functional parameter $f_t^*$ such that $j_t = f_t^* \circ \Delta$. Let $h_t^* = \Delta \circ p_t$; then, clearly from (M3'), such parameters $\theta^* = (f^*, h^*)$ are optimal: $p_{\theta^*}(y|x,t) = p(y|x,t)$.
Since all the assumptions of Lemma 1 hold, we have
$$\Delta \circ j^{-1}(y) = f^{*-1}(y) = A \circ f^{-1}(y)\big|_t, \quad \text{on } (y,t) \in \{(j_t \circ p_t(x), t) \mid p(t,x) > 0\}, \tag{4.24}$$
where $f$ is any optimal parameter, and "$|_t$" collects all subscripts $t$. Note that, except for $\Delta$, all the symbols should carry subscripts $t$.
Nevertheless, using (D2'), we can further prove $A_0 = A_1$. We repeat the core quantities from Lemma 1 here: $A_t = L_t^{-T}L_t'^T$ and $c_t = L_t^{-T}(\beta_t - \beta'_t)$.
From (D2'), we immediately have
$$L_0^{-1}L_1 = L_0'^{-1}L_1' = C \iff A_0 = A_1. \tag{4.25}$$
And also,
$$\begin{aligned}
L_0^{-1}L_1 = C &\iff L_0^{-T}C^{-T} = L_1^{-T}, \\
\beta_0 - C^{-T}\beta_1 = \beta'_0 - C^{-T}\beta'_1 = d/k &\iff C^T(\beta_0 - \beta'_0) = \beta_1 - \beta'_1.
\end{aligned} \tag{4.26}$$
Multiplying the right-hand sides of the two lines, we obtain $c_0 = c_1$. Now we have $A_0 = A_1 =: A$. Applying this to (4.24), we have
$$f_t = j_t \circ v^{-1}, \quad v := A^{-1} \circ \Delta \tag{4.27}$$
for any optimal parameters $\theta = (f,h)$. Again, from (M3'), we have
$$p_\theta(y|x,t) = p(y|x,t) \implies p_\epsilon(y - f_t(h_t(x))) = p_e(y - j_t(p_t(x))) \tag{4.28}$$
where $p_\epsilon = p_e$. The above is possible only when $f_t \circ h_t = j_t \circ p_t$. Combined with $f_t = j_t \circ v^{-1}$, we have conclusion 1).
Conclusion 2) then follows from the same reasoning as in Proposition 5, applied to both $p_0$ and $p_1$.
Note that, when multiplying the two lines of (4.26), the effects of $k \to 0$ cancel out, and $c_t$ is finite and well-defined. Also, it is apparent from the above proof that (D2') is a necessary and sufficient condition for $A_0 = A_1$, given the other conditions of Theorem 1.
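The algebra behind $A_0 = A_1$ and $c_0 = c_1$ can also be verified numerically: draw $L_0, L'_0, C, \beta_0, \beta'_0$ at random, construct $L_1 = L_0C$, $L'_1 = L'_0C$, and $\beta_1 - \beta'_1 = C^T(\beta_0 - \beta'_0)$ as (D2') and (4.26) dictate, and check that $A_t$ and $c_t$ agree across $t$. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
k = 4                                       # 2n, the size of L_t
L0, L0p, C = (rng.normal(size=(k, k)) + 2 * np.eye(k) for _ in range(3))
L1, L1p = L0 @ C, L0p @ C                   # (D2'): L0^{-1} L1 = L0'^{-1} L1' = C

b0, b0p, b1 = (rng.normal(size=k) for _ in range(3))
b1p = b1 - C.T @ (b0 - b0p)                 # enforce C^T (beta0 - beta0') = beta1 - beta1'

# A_t = L_t^{-T} L_t'^T and c_t = L_t^{-T} (beta_t - beta_t')
A = {t: np.linalg.solve(L.T, Lp.T) for t, (L, Lp) in enumerate([(L0, L0p), (L1, L1p)])}
c = {0: np.linalg.solve(L0.T, b0 - b0p), 1: np.linalg.solve(L1.T, b1 - b1p)}

assert np.allclose(A[0], A[1]) and np.allclose(c[0], c[1])
```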
Below, we prove the results in Sec. 4.3.2. The definitions and results also hold for the prior: simply replace $q_t(z|x)$ with $p_t(z|x) := p_\lambda(z|x,t)$ in the definitions and statements, and the proofs below hold unchanged. The dependence on $f$ prevails, and the superscripts are omitted. The argument $x$ is sometimes also omitted.
Lemma 2 (Counterfactual risk bound). Assume $|L_f(z,t)| \le M$; then
$$\epsilon_{CF}(x) \le \sum_t q(1-t|x)\,\epsilon_{F,t}(x) + M D(x) \tag{4.29}$$
where $\epsilon_{CF}(x) := \sum_t p(1-t|x)\,\epsilon_{CF,t}(x)$ and $D(x) := \sum_t \sqrt{D_{KL}(q_t \| q_{1-t})/2}$.
Proof of Lemma 2.
$$\begin{aligned}
\epsilon_{CF} - \sum_t p(1-t|x)\,\epsilon_{F,t}
&= p(0|x)(\epsilon_{CF,1} - \epsilon_{F,1}) + p(1|x)(\epsilon_{CF,0} - \epsilon_{F,0}) \\
&= p(0|x)\int L_f(z,1)\big(q_0(z|x) - q_1(z|x)\big)\,dz + p(1|x)\int L_f(z,0)\big(q_1(z|x) - q_0(z|x)\big)\,dz \\
&\le 2M\,\mathrm{TV}(q_1, q_0) \le MD.
\end{aligned}$$
Here $\mathrm{TV}(p,q) := \frac{1}{2}\int |p(z) - q(z)|\,dz$ is the total variation distance between probability densities $p$ and $q$. The last inequality uses Pinsker's inequality, $\mathrm{TV}(p,q) \le \sqrt{D_{KL}(p\|q)/2}$, twice, to obtain the symmetric $D$.
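Pinsker's inequality, the key step above, is easy to confirm numerically for a pair of Gaussians, where $D_{KL}$ is available in closed form and TV can be integrated on a grid. A sketch (the particular means and variances are arbitrary):

```python
import numpy as np
from scipy.stats import norm

m0, s0, m1, s1 = 0.0, 1.0, 0.8, 1.4
z = np.linspace(-12, 12, 200_001)
dz = z[1] - z[0]
p, q = norm.pdf(z, m0, s0), norm.pdf(z, m1, s1)

tv = 0.5 * np.sum(np.abs(p - q)) * dz                              # TV(p, q) on the grid
kl = np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5  # closed-form KL(p || q)
assert tv <= np.sqrt(kl / 2)                                       # Pinsker's inequality
print(f"TV = {tv:.4f} <= sqrt(KL/2) = {np.sqrt(kl / 2):.4f}")
```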
Theorem 2 is a direct corollary of Lemma 2 and the following.

Lemma 3. Define $\epsilon_F = \sum_t p(t|x)\,\epsilon_{F,t}$. We have
$$\epsilon_f \le 2\big(G(\epsilon_F + \epsilon_{CF}) - V_Y\big). \tag{4.30}$$
Bounding $\epsilon_{CF}$ in (4.30) by Lemma 2 yields Theorem 2. To prove Lemma 3, we first examine a bias-variance decomposition of $\epsilon_F$ and $\epsilon_{CF}$.
$$\begin{aligned}
\epsilon_{CF,t} &= \mathbb{E}_{q_{1-t}(z|x)}\, g_t(z)\, \mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - f_t(z))^2 \\
&\ge G\, \mathbb{E}_{q_{1-t}(z|x)}\, \mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - f_t(z))^2 \\
&= G\, \mathbb{E}_{q_{1-t}(z|x)}\, \mathbb{E}_{p_{Y(t)|p_t}(y|z)}\big((y - j_t(z))^2 + (j_t(z) - f_t(z))^2\big)
\end{aligned} \tag{4.31}$$
The second line uses $|g_t(z)| \le G$, and the third line is a bias-variance decomposition. We can now define $V_{CF,t}(x) := \mathbb{E}_{q_{1-t}(z|x)}\mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - j_t(z))^2$ and $B_{CF,t}(x) := \mathbb{E}_{q_{1-t}(z|x)}(j_t(z) - f_t(z))^2$, and we have
$$\epsilon_{CF,t} \ge G\big(V_{CF,t}(x) + B_{CF,t}(x)\big) \implies \epsilon_{CF} \ge G\big(V_{CF}(x) + B_{CF}(x)\big) \tag{4.32}$$
where $V_{CF} := \sum_t p(1-t|x)V_{CF,t} = \sum_t \mathbb{E}_{q(z,1-t|x)}\mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - j_t(z))^2$ and similarly $B_{CF} = \sum_t \mathbb{E}_{q(z,1-t|x)}(j_t(z) - f_t(z))^2$. Repeating the above derivation for $\epsilon_F$, we have
$$\epsilon_F \ge G\big(V_F(x) + B_F(x)\big) \tag{4.33}$$
where $V_F = \sum_t \mathbb{E}_{q(z,t|x)}\mathbb{E}_{p_{Y(t)|p_t}(y|z)}(y - j_t(z))^2$ and $B_F = \sum_t \mathbb{E}_{q(z,t|x)}(j_t(z) - f_t(z))^2$. Now we are ready to prove Lemma 3.
Proof of Lemma 3.
$$\begin{aligned}
\epsilon_f &= \mathbb{E}_{q(z|x)}\big((f_1 - f_0) - (j_1 - j_0)\big)^2 \\
&= \mathbb{E}_q\big((f_1 - j_1) + (j_0 - f_0)\big)^2 \\
&\le 2\,\mathbb{E}_q\big((f_1 - j_1)^2 + (j_0 - f_0)^2\big) \\
&= 2\int \big[(f_1 - j_1)^2 q(z,1|x) + (j_0 - f_0)^2 q(z,0|x) + (f_1 - j_1)^2 q(z,0|x) + (j_0 - f_0)^2 q(z,1|x)\big]\,dz \\
&= 2(B_F + B_{CF}) \le 2\big(G(\epsilon_F + \epsilon_{CF}) - V_Y\big).
\end{aligned}$$
The first inequality uses $(a+b)^2 \le 2(a^2 + b^2)$. The next equality splits $q(z|x)$ into $q(z,0|x)$ and $q(z,1|x)$ and rearranges to obtain $B_F$ and $B_{CF}$. The last inequality uses the two bias-variance decompositions and $V_Y = V_F + V_{CF}$.
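The inequality $\epsilon_f \le 2(B_F + B_{CF})$ established above holds for any $f_t$, $j_t$ and any split with $q(z,0|x) + q(z,1|x) = q(z|x)$, and can be checked by numerical integration on a grid. A sketch with hypothetical choices of all maps and densities:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-8, 8, 20_001)
dz = z[1] - z[0]

# Hypothetical joint densities q(z, t | x): two weighted Gaussians.
q = {0: 0.4 * norm.pdf(z, -0.5, 1.0), 1: 0.6 * norm.pdf(z, 0.7, 1.3)}
qz = q[0] + q[1]                                   # q(z | x)

f = {0: np.sin(z), 1: np.cos(z)}                   # hypothetical fitted maps f_t
j = {0: np.sin(z) + 0.3 * z, 1: np.cos(z) - 0.1}   # hypothetical true maps j_t

eps_f = np.sum(((f[1] - f[0]) - (j[1] - j[0]))**2 * qz) * dz
B_F = sum(np.sum((j[t] - f[t])**2 * q[t]) * dz for t in (0, 1))
B_CF = sum(np.sum((j[t] - f[t])**2 * q[1 - t]) * dz for t in (0, 1))

assert eps_f <= 2 * (B_F + B_CF)
print(f"eps_f = {eps_f:.4f} <= 2(B_F + B_CF) = {2 * (B_F + B_CF):.4f}")
```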