(Update 6)
Most general version of the chainability conjecture (for arbitrary graphs) has now been falsified numerically by David, but the version specific to the DAGs we need (i.e. the redundancy conditions, or one redundancy and the mediation condition) still looks good.
Most likely proof structure would use this lemma:
Lemma
Let f1,f2 be nonexpansive maps under distance metric D. (Nonexpansive maps are the non-strict version of contraction maps.)
By the nonexpansive map property, D(x,f1(x))≥D(f2(x),f2(f1(x))). And by the triangle inequality for the distance metric, D(x,f2(f1(x)))≤D(x,f2(x))+D(f2(x),f2(f1(x))). Put those two together, and we get
D(x,f2(f1(x)))≤D(x,f1(x))+D(x,f2(x))
(Note: this is a quick-and-dirty comment so I didn’t draw a nice picture, but this lemma is easiest to understand by drawing the picture with the four points and distances between them.)
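The lemma is also easy to sanity-check numerically. Here's a minimal sketch (my own code, not from the post), using two hypothetical nonexpansive maps on the reals under D(x,y) = |x−y|: an affine map with slope 1/2 and a clamp to [0,1], both 1-Lipschitz:

```python
import random

# Two (hypothetical, illustrative) nonexpansive maps on the reals under
# D(x, y) = |x - y|: an affine map with slope 1/2, and clamping to [0, 1].
f1 = lambda x: 0.5 * x + 0.3
f2 = lambda x: max(0.0, min(1.0, x))
D = lambda x, y: abs(x - y)

random.seed(0)
for _ in range(10_000):
    x = random.uniform(-10, 10)
    # The lemma: D(x, f2(f1(x))) <= D(x, f1(x)) + D(x, f2(x))
    assert D(x, f2(f1(x))) <= D(x, f1(x)) + D(x, f2(x)) + 1e-12
print("lemma holds on all samples")
```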
I think that lemma basically captures my intuitive mental picture for how the chainability conjecture “should” work, for the classes of DAGs on which it works at all. Each DAG j would correspond to one of the functions fj, where fj takes in a distribution and returns the distribution factored over DAG j, i.e.
fj(X↦P[X]) := (X ↦ ∏i P[Xi | X_paj(i)])
where paj(i) denotes the parents of node i in DAG j.
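As a concrete illustration (my own sketch, not code from the post — the function name and the dict representation of P are my choices), fj can be implemented for small distributions over binary variables like this:

```python
import itertools

def factor_over_dag(P, parents):
    # P: dict mapping tuples x of {0,1} values to probabilities (a joint
    #    distribution over n binary variables).
    # parents: parents[i] is a tuple of the parent indices of node i in DAG j.
    # Returns f_j(P)[x] = prod_i P[X_i = x_i | X_pa_j(i) = x_pa_j(i)].
    n = len(parents)

    def marginal(idxs, vals):
        # P[X_idxs = vals], marginalizing out all other variables
        return sum(p for x, p in P.items()
                   if all(x[i] == v for i, v in zip(idxs, vals)))

    Q = {}
    for x in itertools.product((0, 1), repeat=n):
        prob = 1.0
        for i, pa in enumerate(parents):
            pa_vals = tuple(x[j] for j in pa)
            den = marginal(pa, pa_vals)
            num = marginal((i,) + tuple(pa), (x[i],) + pa_vals)
            prob *= num / den if den > 0 else 0.0
        Q[x] = prob
    return Q

P = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
print(factor_over_dag(P, ((), ())))  # empty DAG: product of marginals
```

Factoring over a complete DAG (every node's parents are all earlier nodes) reproduces P exactly, while factoring over the empty DAG gives the product of marginals.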
In order to apply the lemma to get our desired theorem, we then need to find a distance metric which:
Is a distance metric (in particular, it must satisfy the triangle inequality, unlike DKL)
Makes our DAG functions nonexpansive mappings
Matches DKL(P,fj(P)) AT THE SPECIFIC POINT P (not necessarily anywhere else)
The first two of those are pretty easy to satisfy for the redundancy condition DAGs: those two DAG operators are convex combinations, so good ol’ Euclidean distance on the distributions should work fine. Making it match DKL at P is trickier, still working that out.
(Update 7)
After some back and forth last night with an LLM[1], we now have a proof of “chainability” for the redundancy diagrams in particular. (And have some hope that this will be most of what we need to rescue the stochastic->deterministic nat lat proof.)
(Theorem) Chainability of Redunds
Let P be a distribution over X1, X2, and Λ.
Define:
Q[X,Λ] := P[X]P[Λ|X1]
S[X,Λ] := P[X]P[Λ|X2] = P[X]∑X1 P[X1|X2]P[Λ|X]
R[X,Λ] := P[X]Q[Λ|X2] = P[X]∑X1 P[X1|X2]P[Λ|X1]
You can think of Q as ‘forcing’ P to factorize per one redundancy pattern: X2→X1→Λ; S as forcing the other pattern: X1→X2→Λ; and R as forcing one after the other: first X2→X1→Λ, then X1→X2→Λ.
The theorem states,
DKL(P||R)≤DKL(P||Q)+DKL(P||S),
Or in words: the error (in DKL from P) accrued by applying both factorizations to P is bounded by the sum of the errors accrued by applying each factorization to P separately.
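The inequality is easy to spot-check numerically before diving into the proof. The following sketch (my own code; function and variable names are mine, not from the post) samples random distributions over three binary variables and checks the bound:

```python
import itertools, math, random

def dkl(A, B):
    # D_KL(A || B); both dicts over the same keys, B > 0 wherever A > 0
    return sum(a * math.log(a / B[k]) for k, a in A.items() if a > 0)

def redund_factors(P):
    # P: dict over (x1, x2, lam) triples.  Returns Q, S, R per the theorem:
    #   Q[X,L] = P[X] P[L|X1],  S[X,L] = P[X] P[L|X2],  R[X,L] = P[X] Q[L|X2]
    def lam_given(dist, i, k):  # dist[lam = k[2] | X_i = k[i]]
        num = sum(p for kk, p in dist.items() if kk[i] == k[i] and kk[2] == k[2])
        den = sum(p for kk, p in dist.items() if kk[i] == k[i])
        return num / den
    PX = {k: sum(p for kk, p in P.items() if kk[:2] == k[:2]) for k in P}
    Q = {k: PX[k] * lam_given(P, 0, k) for k in P}
    S = {k: PX[k] * lam_given(P, 1, k) for k in P}
    R = {k: PX[k] * lam_given(Q, 1, k) for k in P}
    return Q, S, R

random.seed(0)
for _ in range(200):
    w = [random.random() for _ in range(8)]
    P = {k: v / sum(w) for k, v in zip(itertools.product((0, 1), repeat=3), w)}
    Q, S, R = redund_factors(P)
    assert dkl(P, R) <= dkl(P, Q) + dkl(P, S) + 1e-12
print("chainability inequality holds on all samples")
```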
Proof
The proof proceeds in 3 steps.
Step 1: DKL(P||Q) ≥ DKL(S||R)
Pf.
Let aX1:=P[X1|X2]P[Λ|X]≥0
Let bX1:=P[X1|X2]P[Λ|X1]≥0
By the log-sum inequality:
∑X1 aX1 ln(aX1/bX1) ≥ (∑X1 aX1) ln((∑X1 aX1)/(∑X1 bX1))
Multiplying both sides by P[X2] and summing over X2 and Λ, the left-hand side becomes DKL(P||Q), and (since ∑X1 aX1 = S[Λ|X2] and ∑X1 bX1 = R[Λ|X2]) the right-hand side becomes DKL(S||R). So DKL(P||Q) ≥ DKL(S||R), as desired.
Step 2: DKL(P||R) = DKL(P||S) + DKL(S||R)
Pf.
DKL(P||R) − DKL(P||S)
= ∑X,Λ P[X,Λ] [ln P[Λ|X] − ln R[Λ|X2] − ln P[Λ|X] + ln S[Λ|X2]]
= ∑X2 P[X2] ∑Λ ∑X1 P[X1|X2]P[Λ|X] ln(S[Λ|X2]/R[Λ|X2])
= ∑X1 ∑X2 P[X1|X2]P[X2] ∑Λ S[Λ|X2] ln((S[Λ|X2]P[X])/(R[Λ|X2]P[X]))
= DKL(S||R)
Step 3: Combining steps 1 and 2,
DKL(P||Q)≥DKL(S||R)=DKL(P||R)−DKL(P||S)
⟹DKL(P||R)≤DKL(P||Q)+DKL(P||S)
which completes the proof.
Notes:
In the second to last line of step 2, the expectation over P[X1|X2] is allowed because there are no free X1’s in the expression. Then, this aggregates into an expectation over S[X,Λ] as S[Λ|X2]=S[Λ|X].
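Step 2 is an exact equality, not just a bound, which is also easy to confirm numerically. The sketch below (my own code, not from the post) checks DKL(P||R) = DKL(P||S) + DKL(S||R) on random distributions over three binary variables:

```python
import itertools, math, random

def dkl(A, B):
    # D_KL(A || B); both dicts over the same keys, B > 0 wherever A > 0
    return sum(a * math.log(a / B[k]) for k, a in A.items() if a > 0)

def cond(dist, i, k):  # dist[lam = k[2] | X_i = k[i]]
    num = sum(p for kk, p in dist.items() if kk[i] == k[i] and kk[2] == k[2])
    den = sum(p for kk, p in dist.items() if kk[i] == k[i])
    return num / den

random.seed(1)
for _ in range(200):
    w = [random.random() for _ in range(8)]
    P = {k: v / sum(w) for k, v in zip(itertools.product((0, 1), repeat=3), w)}
    PX = {k: sum(p for kk, p in P.items() if kk[:2] == k[:2]) for k in P}
    Q = {k: PX[k] * cond(P, 0, k) for k in P}  # P[X] P[lam|X1]
    S = {k: PX[k] * cond(P, 1, k) for k in P}  # P[X] P[lam|X2]
    R = {k: PX[k] * cond(Q, 1, k) for k in P}  # P[X] Q[lam|X2]
    # Step-2 identity: exact, up to floating-point error
    assert abs(dkl(P, R) - (dkl(P, S) + dkl(S, R))) < 1e-9
print("step-2 identity holds exactly on all samples")
```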
We are hopeful that this, though different from the invalidated result in the top-level post, will be an important step toward rescuing the stochastic natural latent ⇒ deterministic natural latent result.
[1] A (small) positive update for me on their usefulness to my workflow!
Additional note which might be relevant later: we can also get proof step 1 in a somewhat more general way, which establishes that the function P[X,Λ]↦P[X]P[Λ|Xi] is a nonexpansive map under DKL. We’ll write that proof down later if we need it.