Will leave high-level thoughts in a separate comment; here are just issues with the mathematical claims.
Proposition 1 seems false to me as stated:
For any given pair of tokens $s_a$ and $s_b$, the probability (as induced by any non-degenerate transition rule) of any given token bridge $(s_a, s_1, \ldots, s_B, s_b)$ of length $B$ occurring decreases monotonically as $B$ increases,
Counterexample: the sequence (1, 2, 3, 4, 5, 6, 7, 10) has lower probability than (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) under most reasonable inference systems (including actual language models, I'd guess, especially if we increase the length of the pattern). The mistake in the proof sketch, I think, is that you just ignore the $p(s_b \mid s_1, \ldots, s_B)$ factor. This claim seems potentially fixable "in the limit", e.g. with a statement like: for any fixed $s_a, s_b$, there is a natural number $N$ such that the probability is monotonically decreasing in $B$ for $B \ge N$. I'm not convinced even this weaker result is true though, because of the next problem (quoted just after the sketch below):
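As a quick sanity check of the counterexample, here is a toy "successor-biased" transition rule (all numbers invented for illustration) under which the longer bridge is strictly more likely:

```python
# Toy "successor-biased" transition rule, invented for illustration:
# after token n, emit n+1 with probability 0.9 and spread the remaining
# 0.1 uniformly over 9 other tokens.
P_NEXT = 0.9        # probability of continuing the arithmetic pattern
P_BREAK = 0.1 / 9   # probability of any particular pattern-breaking token

def seq_prob(seq):
    """Probability of a sequence under the toy rule, first token given."""
    prob = 1.0
    for prev, cur in zip(seq, seq[1:]):
        prob *= P_NEXT if cur == prev + 1 else P_BREAK
    return prob

short_bridge = (1, 2, 3, 4, 5, 6, 7, 10)
long_bridge = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

print(seq_prob(short_bridge))  # 0.9**6 * (0.1/9) ~ 0.0059
print(seq_prob(long_bridge))   # 0.9**9           ~ 0.3874: longer, yet more likely
```

The longer bridge wins whenever one pattern break costs more than three extra pattern-following steps, i.e. whenever $P_{\text{BREAK}} < 0.9^3$; sharper patterns only widen the gap.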
$\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B, s_b)] = 0$.
This is not the same as saying the probabilities decrease monotonically: the latter would also allow them to converge to some value greater than zero. And this stronger claim that the limit is zero is false in general: an infinite product where each of the factors is strictly between zero and one can still converge to a non-zero value if the factors approach one "quickly enough". If the "non-degenerate" assumption was meant to rule this out, you should specify what that assumption means precisely (I just read it as "no probabilities are exactly one").
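To make the infinite-product point concrete, the standard example $\prod_{n=1}^{\infty}(1 - 2^{-n})$ has every factor strictly inside $(0, 1)$ yet converges to about $0.2888$; a quick numerical check (my own illustration, not from the post):

```python
import math

# Partial products of prod_{n=1..N} (1 - 2**-n): every factor lies strictly
# in (0, 1), yet the product converges to ~0.28879 rather than to zero,
# because the factors approach one geometrically fast.
prod = 1.0
for n in range(1, 60):
    prod *= 1.0 - 2.0 ** (-n)
print(prod)  # ~0.288788...

# Equivalent criterion: the product is nonzero iff sum(-log(factor)) is finite.
print(sum(-math.log(1.0 - 2.0 ** (-n)) for n in range(1, 60)))  # converges
```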
I'm also skeptical of proposition 2, though I might be misunderstanding that one. For large $B$, I'd often expect $P(s_b \mid s_a, B)$ to be approximately the monogram probability of $s_b$ in the context of language models. E.g. if $s_a$ is "This" and $B$ is 1000, then I think we have basically no information about $s_b$. So the LHS of your equation should asymptotically just be $1/B$, right? So then proposition 2 would imply that the maximum probability of a sequence of length $B$ decays only as $1/B$, if I understand what you're claiming. I'd be a bit surprised if this was true for language models, though I'm not entirely sure. If it is true, I think it would have to be because the most likely sequence is something silly like "this this this ...", which highlights that maybe the proposition is "asking the wrong question"? In any case, the $1/B$ decay clearly isn't true for more general "well-behaved" transition rules; you need pretty strong assumptions! Rounding those off to "sufficiently well-behaved" seems misleading.
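To illustrate how fast the maximum actually decays under a generic rule, here is a minimal sketch with an invented two-state Markov chain: whenever every transition probability is bounded by some $p_{\max} < 1$, the most likely length-$B$ sequence decays at least geometrically, not like $1/B$:

```python
import numpy as np

# Generic two-state Markov transition rule (numbers invented); every entry
# is at most 0.7, so the single most likely length-B path has probability
# at most 0.7**B: geometric decay, nothing like 1/B.
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def max_path_prob(start, length):
    """Viterbi-style max-product: probability of the most likely path."""
    best = np.zeros(len(T))
    best[start] = 1.0
    for _ in range(length):
        best = np.max(best[:, None] * T, axis=0)  # best path ending in each state
    return best.max()

for B in (10, 20, 40):
    print(B, max_path_prob(0, B), 1 / B)
# already at B = 40 the max-path probability is ~6e-7, far below 1/B = 0.025
```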
Hi Erik! Thank you for the careful read, this is awesome!
Regarding proposition 1: I think you're right, that counterexample disproves the proposition. The proposition we were actually going for was $\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B)] = 0$, i.e. the probability without the end of the bridge! I'll fix this in the post.
Regarding proposition II: janus had the same intuition, and I tried to explain it with the following argument. When the distance between tokens becomes large enough, eventually all bridges between the first token and an arbitrary second token end up with approximately the same "cost". At that point, only the prior likelihood of the token will decide which token gets sampled. So Proposition II implies something like $P(s_b) \sim \exp[-(B+1)\max P(s_a, s_1, \ldots, s_B, s_b)]$, or that in the limit "the probability of the most likely sequence ending in $s_b$ will be (when appropriately normalized) proportional to the probability of $s_b$", which seems sensible? (Assuming something like ergodicity.) Although I'm now becoming a bit suspicious about the sign of the exponent; perhaps there is a "log" or a minus missing on the RHS. I'll think about that a bit more.
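To check the "only the prior decides" part in the simplest setting, here's a minimal sketch with a toy ergodic Markov chain (transition matrix invented for illustration, and of course not a language model): the conditional $P(s_b \mid s_a, B)$ really does forget $s_a$ and converges to the stationary probability of $s_b$:

```python
import numpy as np

# Toy ergodic chain (matrix invented for illustration). The row of T**(B+1)
# indexed by s_a is exactly P(s_b | s_a, B); for an ergodic chain it converges
# to the stationary distribution pi, i.e. the start token is forgotten.
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

for B in (1, 5, 50):
    print(B, np.linalg.matrix_power(T, B + 1)[0])  # P(s_b | s_a = 0, B)

# Stationary distribution for comparison (left eigenvector for eigenvalue 1):
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
print(pi / pi.sum())
```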
The proposition we were actually going for was $\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B)] = 0$, i.e. the probability without the end of the bridge!
In that case, I agree the monotonically decreasing version of the statement is correct. I think the limit still isn't necessarily zero, for the reasons I mention in my original comment. (Though I do agree it will be zero under somewhat reasonable assumptions, and in particular for LMs.)
So Proposition II implies something like $P(s_b) \sim \exp[-(B+1)\max P(s_a, s_1, \ldots, s_B, s_b)]$, or that in the limit "the probability of the most likely sequence ending in $s_b$ will be (when appropriately normalized) proportional to the probability of $s_b$", which seems sensible?
One crux here is the "appropriately normalized": why should the normalization be linear, i.e. just $B+1$? I buy that there are some important systems where this holds, and maybe it even holds for LMs, but it certainly won't be true in general (e.g. sometimes you need exponential normalization). Even modulo that issue, the claim still isn't obvious to me, but that may be a good place to start (i.e. an explanation of where the normalization factor comes from would plausibly also clear up my remaining skepticism).
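One way to probe this numerically, as a sketch under toy assumptions rather than a claim about the original setup: compare the total probability of all bridges from $s_a$ to $s_b$ (which is what the conditional sums over) against the single best bridge. In the invented chain below, the sum/max ratio grows exponentially in $B$, so a linear factor like $B+1$ can't be the right normalization in general:

```python
import numpy as np

# Toy ergodic chain with invented numbers; compare the total probability of
# all bridges from a to b with the probability of the single best bridge.
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

def bridge_sum(a, b, B):
    """Sum over all bridges (a, s_1..s_B, b): an entry of T**(B+1)."""
    return np.linalg.matrix_power(T, B + 1)[a, b]

def bridge_max(a, b, B):
    """Probability of the single most likely bridge (Viterbi max-product)."""
    best = np.zeros(len(T))
    best[a] = 1.0
    for _ in range(B + 1):
        best = np.max(best[:, None] * T, axis=0)  # best bridge ending in each state
    return best[b]

for B in (5, 10, 20, 40):
    print(B, bridge_sum(0, 2, B) / bridge_max(0, 2, B))
# the sum/max ratio grows roughly like (1/0.6)**B here: exponential, not B + 1
```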