This means the likelihood distribution over data generated by Steady is closer to the distribution generated by Switchy than to the distribution generated by Sticky.
Their KL divergences are exactly the same. Suppose Baylee’s observations are $x_1,\dots,x_n$. Let $P(x_1,\dots,x_n)$ be the probability of the sequence if there’s a constant chance $p$ of switching after each flip, and let $Q$ be the same with chance $q$. By the chain rule,
$$D_{\mathrm{KL}}\big(P(x_1,\dots,x_n)\,\big\|\,Q(x_1,\dots,x_n)\big) = D_{\mathrm{KL}}\big(P(x_1)\,\big\|\,Q(x_1)\big) + \sum_{i=1}^{n-1} D_{\mathrm{KL}}\big(P(x_{i+1}\mid x_i)\,\big\|\,Q(x_{i+1}\mid x_i)\big) = 0 + (n-1)\left[p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}\right].$$
In particular, when either $p$ or $q$ equals one half, the divergence is symmetric in the other variable (swapping it with its complement leaves the bracketed term unchanged).
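For concreteness, here’s a minimal Python sanity check of this closed form (the values of $n$, $p$, and $q$ are just illustrative, and brute-force enumeration is only feasible for small $n$):

```python
import itertools
import math

def seq_prob(seq, p_switch):
    """Exact probability of a binary sequence under a chain whose first
    flip is 50/50 and which thereafter switches with constant probability
    p_switch."""
    prob = 0.5
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_switch if cur != prev else 1 - p_switch
    return prob

def kl_sequences(n, p, q):
    """D_KL(P_n || Q_n) over all length-n sequences, by brute force."""
    total = 0.0
    for seq in itertools.product((0, 1), repeat=n):
        pp, qq = seq_prob(seq, p), seq_prob(seq, q)
        total += pp * math.log(pp / qq)
    return total

n, p, q = 8, 0.3, 0.7
closed = (n - 1) * (p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q)))
print(kl_sequences(n, p, q), closed)  # these two agree
print(kl_sequences(n, q, p))          # reversed KL: identical here, since
# swapping p and q = 1 - p leaves the bracketed per-step term unchanged
```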
Hm, I’m not following your definitions of P and Q. Note that there’s no easy closed-form expression (that I know of) for the likelihoods of various sequences for these chains; I had to calculate them using dynamic programming on the Markov chains.
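Here is a sketch (a reconstruction under stated assumptions, not the notebook’s actual code) of one way such a dynamic program can go. Assuming both chains’ switch probabilities depend only on the current streak length, the chain rule turns the full-sequence KL into an expected sum of per-step conditional KLs, which can be accumulated while propagating the streak-length distribution under $P$ instead of enumerating all $2^n$ sequences:

```python
import math

def kl_term(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) for one transition."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_markov(n, switch_p, switch_q, max_streak):
    """Exact D_KL(P_n || Q_n) for two binary chains whose switch
    probability depends only on the current streak length.  Exact as long
    as both switch functions are constant for streaks >= max_streak, so
    that capping the state space is lossless."""
    dist = {1: 1.0}  # streak-length distribution under P after flip 1
    total = 0.0      # the first flip is 50/50 under both chains: zero KL
    for _ in range(n - 1):
        new_dist = {}
        for k, prob in dist.items():
            p, q = switch_p(k), switch_q(k)
            total += prob * kl_term(p, q)  # expected per-step KL under P
            # switching restarts the streak at 1; continuing extends it
            new_dist[1] = new_dist.get(1, 0.0) + prob * p
            k_next = min(k + 1, max_streak)
            new_dist[k_next] = new_dist.get(k_next, 0.0) + prob * (1 - p)
        dist = new_dist
    return total
```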
The relevant effect driving it is that the degree of shiftiness (how far the chain deviates from a 50%-heads rate) builds up over a streak. In any given case where Switchy and Sticky deviate (say there’s a streak of 2, and Switchy has a 30% chance of continuing while Sticky has a 70% chance), they diverge by the same degree; but Switchy makes it unlikely that you’ll run into these long streaks of divergence, while Sticky makes it extremely likely. Neither Switchy nor Sticky gives a constant rate of switching; it depends on the streak length. (Compare a hypergeometric distribution.)
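To see that mechanism numerically with `kl_markov` from the sketch above (these streak-dependent switch rates are made-up stand-ins, not the paper’s calibration):

```python
# Hypothetical switch rates: the deviation from 50% grows with the
# streak and then plateaus, and Sticky mirrors Switchy.
def switchy(k):
    return min(0.5 + 0.1 * k, 0.8)  # 0.6, 0.7, then 0.8 for streaks >= 3

def sticky(k):
    return 1 - switchy(k)           # 0.4, 0.3, then 0.2 for streaks >= 3

print(kl_markov(20, switchy, sticky, max_streak=3))  # D_KL(Switchy || Sticky)
print(kl_markov(20, sticky, switchy, max_streak=3))  # D_KL(Sticky || Switchy)
```

The second number comes out larger: the per-state disagreement is the same in both directions, but Sticky spends most of its time in the long-streak states where the chains disagree most, while Switchy rarely reaches them.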
Take a look at §4 of the paper and the “Limited data (full sequence): asymmetric closeness and convergence” section of the Mathematica Notebook linked from the paper to see how I calculated their KL divergences. Let me know what you think!