Alfred Harwood and I were working through this as part of a Dovetail project and unfortunately I think we’ve found a mistake. The Taylor expansion in Step 2 has the 3rd-order term o(δ³) = (1/6)[2/(√P[X])³](−δ[X])³. This term should disappear as δ[X] goes to zero, but this is only true if √P[X] stays constant. The Γ transformation in Part 1 reduces (most terms of) P[X] and Q[X] at the same rate, so √P[X] decreases at the same rate as δ[X]. So the 2nd-order approximation isn’t valid.
For example, we could consider two binary random variables with probability distributions
P(X=0) = zp and P(X=1) = 1−zp, and Q(X=0) = zq and Q(X=1) = 1−zq, where p and q are fixed and z is a scaling parameter.
If δ[X]=√P(X)−√Q(X), then δ[X]→0 as z→0.
But consider the third order term for X=0 which is
(1/3)((√Q(0) − √P(0))/√P(0))³ = (1/3)((√(zq) − √(zp))/√(zp))³ = (1/3)((√q − √p)/√p)³
This is a constant term which does not vanish as z→0.
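For instance, a minimal sympy sketch (separate from the colab linked below) confirming that this term has a nonzero limit as z→0:

import sympy as sp

z, p, q = sp.symbols("z p q", positive=True)

# The binary example above: P(X=0) = z*p, Q(X=0) = z*q
P0, Q0 = z * p, z * q

# Third-order term at X=0: (1/3) * ((sqrt(Q0) - sqrt(P0)) / sqrt(P0))**3
third_order = sp.Rational(1, 3) * ((sp.sqrt(Q0) - sp.sqrt(P0)) / sp.sqrt(P0))**3

# delta[X=0] = sqrt(P0) - sqrt(Q0) does go to zero...
print(sp.limit(sp.sqrt(P0) - sp.sqrt(Q0), z, 0))        # -> 0
# ...but the third-order term tends to a z-independent constant in p and q.
print(sp.simplify(sp.limit(third_order, z, 0)))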
We found a counterexample to the whole theorem (which is what led to us finding this mistake), which has KL(X2→X1→Λ′) / max[KL(X1→X2→Λ), KL(X2→X1→Λ)] > 10, and it can be found in this colab. There are some stronger counterexamples at the bottom as well. We used sympy because we were getting occasional floating point errors with numpy.
Sorry to bring bad news! We’re going to keep working on this over the next 7 weeks, so hopefully we’ll find a way to prove a looser bound. Please let us know if you find one before us!
I plan to spend today digging into this, and will leave updates under this comment as I check things.
(Update 0)
I’m starting by checking that there’s actually a counterexample here. We also found some numerical counterexamples which were qualitatively similar (i.e. approximately-all of the weight was on one outcome), but thought it was just numerical error. Kudos for busting out the sympy and actually checking it.
Looking at the math on that third-order issue… note that the whole expansion is multiplied by P[X]. So even if δ[X] ∼ √P[X], P[X] itself will still go to zero for small δ, so P[X](δ[X]/√P[X])³ will go to zero. So it’s not obviously a fatal flaw, though at the very least some more careful accounting would be needed at that step to make sure everything converges.
(Update 1)
We’ve looked at the code and fiddled with the math and are now more convinced of the issue.
The 2nd-order approximation holds when Q[X]/P[X] ≈ 1, which our scaling-down construction does not provide. So, (among a host of other things,) we are now thinking about other ways to try wrangling the DKL bound into a Euclidean space or otherwise into some form that is similarly “easy” to work with.
(Thanks for finding this!)
(Update 2)
Taking the limit of the ratio of DKLs (using summation rather than max) with c→0 while b:=10c gives 5(r + 1/11).
Setting c very small and ramping up r indeed breaks the bound more and more severely. (Code changes from the colab you provided, below.)
Code changes / additions
Block 1:
a, b, c, d, r = sp.symbols("a b c d r")
variable_substitutions = { # The definitions of these variables
a: 0.25,
b: 1e-90,
c: 1e-91,
r: 20000000,
}
Block 2 (later on):
expr = (kl3/(kl1 + kl2)).subs(d, (1-3*c-(r+1)*b-2*a))
print("KL(X2->X1->L')/sum[KL(X1->X2->L),KL(X2->X1->L)] =", expr.evalf(subs=variable_substitutions))
Block3 (right after Block 2):
expr = (kl3/(kl1 + kl2)).subs(d, (1-3*c-(r+1)*b-2*a)).subs(b, 10*c)
lim = sp.simplify(sp.limit(expr, c, 0))
print("Limit of KL(X2->X1->L')/sum[KL(X1->X2->L),KL(X2->X1->L)] as c->0+ =", lim)
(Update 3)
We’re now pursuing two main threads here.
One thread is to simplify the counterexamples into something more intuitively understandable, mainly in hopes of getting an intuitive sense for whatever phenomenon is going on with the counterexamples. Then we’d build new theory specifically around that phenomenon.
The other thread is to go back to first principles and think about entirely different operationalizations of the things we’re trying to do here, e.g. not using diagram DKL’s as our core tool for approximation. The main hope there is that maybe DKL isn’t really the right error metric for latents, but then we need to figure out a principled story which fully determines some other error metric.
Either way, we’re now >80% that this is a fundamental and fatal flaw for a pretty big chunk of our theory.
(Update 4)
We have now started referring to “Jeremy et Al” when discussing the findings at top-of-thread, and find this amusing.
As of this morning, our current thread is an adjustment to the error measure. Thinking it through from first principles, it makes intuitive sense to marginalize out latents inside a DKL, i.e. DKL(P[X] || ∑_Λ Q[X,Λ]) rather than DKL(P[X,Λ] || Q[X,Λ]) (where Q is typically some factorization of P[X,Λ]). Conceptually, that would mean always grounding out errors in terms of predictions on observables, not on mind-internal latent constructs. We’re now checking whether that new error gives us the properties we want in order to make the error measure useful (and in the process, we’re noticing what properties we want in order for the error measure to be useful, and making those more explicit than we had before).
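To make the distinction concrete, here is a small numpy sketch (our illustration, not from any colab in this thread; it uses an arbitrary random joint, and takes Q to be the mediation factorization P[Λ]P[X1|Λ]P[X2|Λ] purely as an example) computing both versions of the error:

import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint distribution P[X1, X2, Lambda] over small finite supports
P = rng.random((2, 2, 3))
P /= P.sum()

# Mediation factorization Q[X1, X2, Lambda] = P[Lambda] * P[X1|Lambda] * P[X2|Lambda]
P_L = P.sum(axis=(0, 1))                 # P[Lambda]
P_X1_L = P.sum(axis=1) / P_L             # P[X1|Lambda], shape (2, 3)
P_X2_L = P.sum(axis=0) / P_L             # P[X2|Lambda], shape (2, 3)
Q = np.einsum('l,il,jl->ijl', P_L, P_X1_L, P_X2_L)

def kl(p, q):
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Old error: DKL over the full joint, including the latent
print("DKL(P[X,L] || Q[X,L])       =", kl(P, Q))
# Proposed error: marginalize the latent out before comparing
print("DKL(P[X]   || sum_L Q[X,L]) =", kl(P.sum(axis=2), Q.sum(axis=2)))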
(Update 5)
A conjecture we are working on which we expect to be generally useful beyond possibly rescuing the stoch->det proof that used to rely on the work in this post:
Chainability (Conjecture):
∀ P[X,Λ] with ϵ1 := DKL(X1→X2→Λ), ϵ2 := DKL(X1←Λ→X2),
define Q[X,Λ] := P[X]P[Λ|X2], and Q′[X,Λ] := Q[X1|Λ]Q[X2|Λ]Q[Λ].
Then, DKL(P||Q′) < n(ϵ1 + ϵ2)
Here is a colab with a quick numerical test that suggests the bound holds (and that n=1, in this case).
(Note: The above as written is just one step of chaining, and ultimately we are hoping to show it holds for arbitrarily many steps, accumulating an associated number of epsilons as error.)
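For reference, a quick-and-dirty numerical test of that one-step conjecture (our own sketch, not the linked colab; axes are ordered (X1, X2, Λ) and the joints are small random distributions) might look like:

import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

worst = 0.0
for _ in range(10_000):
    # Random joint P[X1, X2, Lambda]
    P = rng.random((2, 2, 3))
    P /= P.sum()

    P_X = P.sum(axis=2)                                       # P[X1, X2]
    P_L_given_X2 = P.sum(axis=0) / P_X.sum(axis=0)[:, None]   # P[Lambda|X2]

    # Q[X,L] := P[X] P[L|X2]   (forces the X1 -> X2 -> Lambda diagram)
    Q = P_X[:, :, None] * P_L_given_X2[None, :, :]

    # Q'[X,L] := Q[X1|L] Q[X2|L] Q[L]   (then forces the mediation diagram on Q)
    Q_L = Q.sum(axis=(0, 1))
    Qp = np.einsum('il,jl,l->ijl', Q.sum(axis=1) / Q_L, Q.sum(axis=0) / Q_L, Q_L)

    eps1 = kl(P, Q)                      # epsilon_1 = DKL(X1 -> X2 -> Lambda)
    # epsilon_2 = DKL(X1 <- Lambda -> X2): mediation error of P itself
    P_L = P.sum(axis=(0, 1))
    M = np.einsum('il,jl,l->ijl', P.sum(axis=1) / P_L, P.sum(axis=0) / P_L, P_L)
    eps2 = kl(P, M)

    worst = max(worst, kl(P, Qp) / (eps1 + eps2))

# If the conjecture holds with n = 1, this maximum should stay below 1.
print("max DKL(P||Q') / (eps1 + eps2) over samples:", worst)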
(Update 6)
Most general version of the chainability conjecture (for arbitrary graphs) has now been falsified numerically by David, but the version specific to the DAGs we need (i.e. the redundancy conditions, or one redundancy and the mediation condition) still looks good.
Most likely proof structure would use this lemma:
Lemma
Let f1,f2 be nonexpansive maps under distance metric D. (Nonexpansive maps are the non-strict version of contraction maps.)
By the nonexpansive map property, D(x,f1(x))≥D(f2(x),f2(f1(x))). And by the triangle inequality for the distance metric, D(x,f2(f1(x)))≤D(x,f2(x))+D(f2(x),f2(f1(x))). Put those two together, and we get
D(x,f2(f1(x)))≤D(x,f1(x))+D(x,f2(x))
(Note: this is a quick-and-dirty comment so I didn’t draw a nice picture, but this lemma is easiest to understand by drawing the picture with the four points and distances between them.)
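To make the picture concrete, here is a tiny numpy illustration of the lemma (our own; the two maps are arbitrarily chosen orthogonal projections onto convex sets, which are nonexpansive under Euclidean distance):

import numpy as np

rng = np.random.default_rng(2)

def f1(v):
    # Orthogonal projection onto the x-axis (a convex set), hence nonexpansive
    return np.array([v[0], 0.0])

def f2(v):
    # Orthogonal projection onto the line y = x, also nonexpansive
    m = v.mean()
    return np.array([m, m])

def D(u, v):
    return np.linalg.norm(u - v)

for _ in range(1000):
    x = rng.normal(size=2)
    # The lemma's inequality: D(x, f2(f1(x))) <= D(x, f1(x)) + D(x, f2(x))
    assert D(x, f2(f1(x))) <= D(x, f1(x)) + D(x, f2(x)) + 1e-12

print("Lemma inequality held for all sampled points.")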
I think that lemma basically captures my intuitive mental picture for how the chainability conjecture “should” work, for the classes of DAGs on which it works at all. Each DAG j would correspond to one of the functions fj, where fj takes in a distribution and returns the distribution factored over DAG j, i.e.
fj(X ↦ P[X]) := (X ↦ ∏_i P[Xi | X_{pa_j(i)}])
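For concreteness, a generic version of that factoring map could be implemented like this (a sketch of ours; the hypothetical `parents` argument encodes pa_j(i) for each variable’s axis in the array):

import numpy as np

def factor_over_dag(P, parents):
    """Return the distribution P factored over a DAG.

    P       : numpy array, one axis per variable, entries summing to 1.
    parents : dict mapping each variable index i to a tuple of its parent indices pa_j(i).
    The result is prod_i P[X_i | X_pa(i)], with all conditionals computed from P itself.
    """
    n = P.ndim
    all_axes = set(range(n))
    result = np.ones_like(P)
    for i in range(n):
        fam = tuple(sorted({i, *parents[i]}))
        pa = tuple(sorted(parents[i]))
        # Marginals over the family {i} ∪ pa(i) and over pa(i) alone
        fam_marg = P.sum(axis=tuple(all_axes - set(fam)), keepdims=True)
        pa_marg = P.sum(axis=tuple(all_axes - set(pa)), keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            cond = np.where(pa_marg > 0, fam_marg / pa_marg, 0.0)   # P[X_i | X_pa(i)]
        result = result * cond      # broadcasting fills in the summed-out axes
    return result

# Example: the redundancy diagram X2 -> X1 -> Lambda on a joint P[X1, X2, Lambda]
# (axis order 0=X1, 1=X2, 2=Lambda): X1's parent is X2, Lambda's parent is X1.
# Q = factor_over_dag(P, parents={0: (1,), 1: (), 2: (0,)})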
In order to apply the lemma to get our desired theorem, we then need to find a distance metric which:
Is a distance metric (in particular, it must satisfy the triangle inequality, unlike DKL)
Makes our DAG functions nonexpansive mappings
Matches DKL(P,fj(P)) AT THE SPECIFIC POINT P (not necessarily anywhere else)
The first two of those are pretty easy to satisfy for the redundancy condition DAGs: those two DAG operators are convex combinations, so good ol’ Euclidean distance on the distributions should work fine. Making it match DKL at P is trickier, still working that out.
(Update 7)
After some back and forth last night with an LLM[1], we now have a proof of “chainability” for the redundancy diagrams in particular. (And have some hope that this will be most of what we need to rescue the stochastic->deterministic nat lat proof.)
(Theorem) Chainability of Redunds
Let P be a distribution over X1, X2, and Λ.
Define:
Q[X,Λ]:=P[X]P[Λ|X1]
S[X,Λ] := P[X]P[Λ|X2] = P[X] ∑_{X1} P[X1|X2]P[Λ|X]
R[X,Λ] := P[X]Q[Λ|X2] = P[X] ∑_{X1} P[X1|X2]P[Λ|X1]
You can think of Q as ‘forcing’ P to factorize per one redundancy pattern: X2→X1→Λ; S as forcing the other pattern: X1→X2→Λ; and R as forcing one after the other: first X2→X1→Λ, and then X1→X2→Λ.
The theorem states,
DKL(P||R)≤DKL(P||Q)+DKL(P||S),
Or in words: the error (in DKL from P) accrued by applying both factorizations to P is bounded by the sum of the errors accrued by applying each of the factorizations to P separately.
Proof
The proof proceeds in 3 steps.
Step 1: DKL(P||Q) ≥ DKL(S||R)
Pf.
Let a_{X1} := P[X1|X2] P[Λ|X] ≥ 0
Let b_{X1} := P[X1|X2] P[Λ|X1] ≥ 0
By the log-sum inequality:
∑_{X1} a_{X1} ln(a_{X1}/b_{X1}) ≥ (∑_{X1} a_{X1}) ln((∑_{X1} a_{X1}) / (∑_{X1} b_{X1}))
Multiplying both sides by P[X2] and summing over X2 and Λ, the left-hand side becomes DKL(P||Q) and the right-hand side becomes DKL(S||R),
⟹ DKL(P||Q) ≥ DKL(S||R) as desired.
Step 2: DKL(P||R) = DKL(P||S) + DKL(S||R)
Pf.
DKL(P||R) − DKL(P||S)
= ∑_{X,Λ} P[X,Λ] [ln P[Λ|X] − ln R[Λ|X2] − ln P[Λ|X] + ln S[Λ|X2]]
= ∑_{X2} P[X2] ∑_Λ ∑_{X1} P[X1|X2] P[Λ|X] ln(S[Λ|X2]/R[Λ|X2])
= ∑_{X1} ∑_{X2} P[X1|X2] P[X2] ∑_Λ S[Λ|X2] ln((S[Λ|X2] P[X]) / (R[Λ|X2] P[X]))
= DKL(S||R)
Step 3: Combining steps 1 and 2,
DKL(P||Q) ≥ DKL(S||R) = DKL(P||R) − DKL(P||S)
⟹DKL(P||R)≤DKL(P||Q)+DKL(P||S)
which completes the proof.
Notes:
In the second to last line of step 2, the expectation over P[X1|X2] is allowed because there are no free X1’s in the expression. Then, this aggregates into an expectation over S[X,Λ] as S[Λ|X2]=S[Λ|X].
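As a numerical sanity check of the theorem (our sketch, separate from any colab above), one can test random joint distributions directly:

import numpy as np

rng = np.random.default_rng(3)

def kl(p, q):
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for _ in range(10_000):
    # Random joint P[X1, X2, Lambda] (axes 0, 1, 2)
    P = rng.random((3, 3, 4))
    P /= P.sum()

    P_X = P.sum(axis=2)
    P_L_given_X1 = P.sum(axis=1) / P_X.sum(axis=1)[:, None]   # P[Lambda|X1]
    P_L_given_X2 = P.sum(axis=0) / P_X.sum(axis=0)[:, None]   # P[Lambda|X2]
    P_X1_given_X2 = P_X / P_X.sum(axis=0)[None, :]            # P[X1|X2]

    Q = P_X[:, :, None] * P_L_given_X1[:, None, :]            # Q[X,L] = P[X] P[L|X1]
    S = P_X[:, :, None] * P_L_given_X2[None, :, :]            # S[X,L] = P[X] P[L|X2]
    # R[X,L] = P[X] Q[L|X2], with Q[L|X2] = sum_X1 P[X1|X2] P[L|X1]
    Q_L_given_X2 = np.einsum('ij,il->jl', P_X1_given_X2, P_L_given_X1)
    R = P_X[:, :, None] * Q_L_given_X2[None, :, :]

    assert kl(P, R) <= kl(P, Q) + kl(P, S) + 1e-9

print("DKL(P||R) <= DKL(P||Q) + DKL(P||S) held on all samples.")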
We are hopeful that this, though different from the invalidated result in the top level post, will be an important step toward rescuing the stochastic natural latent ⇒ deterministic natural latent result.
[1] A (small) positive update for me on their usefulness to my workflow!
Additional note which might be relevant later: we can also get proof step 1 in a somewhat more general way, which establishes that the function P[X,Λ]↦P[X]P[Λ|Xi] is a nonexpansive map under DKL. We’ll write that proof down later if we need it.
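In the meantime, here is a quick randomized check of that nonexpansiveness claim (our own sketch; it only searches for violations, it proves nothing):

import numpy as np

rng = np.random.default_rng(4)

def kl(p, q):
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def g(P):
    """The map P[X, Lambda] -> P[X] P[Lambda|X1], with axes (X1, X2, Lambda)."""
    P_X = P.sum(axis=2)
    P_L_given_X1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]
    return P_X[:, :, None] * P_L_given_X1[:, None, :]

worst_ratio = 0.0
for _ in range(10_000):
    P1 = rng.random((2, 3, 2)); P1 /= P1.sum()
    P2 = rng.random((2, 3, 2)); P2 /= P2.sum()
    worst_ratio = max(worst_ratio, kl(g(P1), g(P2)) / kl(P1, P2))

# If the map is nonexpansive under DKL, this should never exceed 1.
print("max DKL(g(P1)||g(P2)) / DKL(P1||P2):", worst_ratio)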
Re: “our current thread is an adjustment to the error measure.”

We’re not sure that this is necessary. I quite like the current form of the errors. I’ve spent much of the past week searching for counterexamples to the ∃ deterministic latent theorem and I haven’t found anything yet (although it’s partially a manual search). My current approach takes a P(X_1,X_2) distribution, finds a minimal stochastic NL, then finds a minimal deterministic NL. The deterministic error has always been within a factor of 2 of the stochastic error. So currently we’re expecting the theorem can be rescued.
Re: “DKL(P[X] || ∑_Λ Q[X,Λ]) rather than DKL(P[X,Λ] || Q[X,Λ])”

That seems like a cool idea for the mediation condition, but isn’t it trivial for the redundancy conditions? (For a redundancy factorization Q[X,Λ] = P[X]P[Λ|Xi], marginalizing out Λ just gives back P[X], so that error is always zero.)
Indeed, that specific form doesn’t work for the redundancy conditions. We’ve been fiddling with it.
Would this still give us guarantees on the conditional distribution P(X|Λ)?
E.g. Mediation: DKL(P(X1,X2,Λ) ∥ P(X1|Λ)P(X2|Λ)P(Λ)) = DKL(P(X1,X2|Λ)P(Λ) ∥ P(X1|Λ)P(X2|Λ)P(Λ)) = E_{Λ∼P(Λ)}[DKL(P(X1,X2|Λ) ∥ P(X1|Λ)P(X2|Λ))]
is really about the expected error conditional on individual values of Λ, and it seems like there are distributions with high mediation error but low error when the latent is marginalized inside the DKL, which could be load-bearing when the agents cast out predictions on observables after updating on Λ.
Oh nice, we tried to wrangle that counterexample into a simple expression but didn’t get there. So that rules out a looser bound under these assumptions, that’s good to know.
I was totally convinced it was a numerical error. I spent a full day trying to trace it in my numpy code before I started to reconsider. At that point we’d worked through the proof carefully and felt confident of every step. But we needed to work out what was going on because we wanted empirical support for a tighter bound before we tried to improve the proof.
Do you have sympy code for the example noted at the bottom of the colab that claims a ratio of > 9.77 including the mediation DKL? I tried with the parameters you mention and am getting a ratio of ~3.4 (which is still a violation of previous expectations, tbc.)
That’ll be the difference between max and sum in the denominator. If you use sum it’s 3.39.
Here’s one we worked out last night, where the ratio goes to infinity.
By the way, there seems to be an issue where sympy silently drops precision under some circumstances. Definitely a bug. A couple of times it’s caused non-trivial errors in my KLs. It’s pretty rare, but I don’t know any way to completely avoid it. Thinking of switching to a different library.