This seems like an interesting problem! I’ve been thinking about it a little bit but wanted to make sure I understood before diving in too deep. Can I see if I understand this by going through the biased coin example?
Suppose I have 2^5 coins and each one is given a unique 5-bit string label covering all binary strings from 00000 to 11111. Call the string on the label Λ.
The label given to the coin indicates its ‘true’ bias. The string 00000 indicates that the coin with that label has p(heads)=0. The coin labelled 11111 has p(heads)=1. The ‘true’ p(heads) increases in equal steps going up from 00000 to 00001 to 00010 etc. Suppose I randomly pick a coin from this collection, toss it 200 times and call the number of heads X_1. Then I toss it another 200 times and call the number of heads X_2.
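To make this concrete for myself, here is a minimal numerical sketch of that setup. I'm assuming p(heads)=k/31 for the coin whose label reads k as a binary number, and a uniformly random choice of coin; the variable names are just my own:

import numpy as np
from scipy.stats import binom

n_labels, n_flips = 32, 200
biases = np.linspace(0, 1, n_labels)       # p(heads) for label k is k/31

# P(X = x | label = k) for x = 0..200: a (32, 201) table of binomial pmfs
p_x_given_label = np.array(
    [binom.pmf(np.arange(n_flips + 1), n_flips, p) for p in biases])

# Uniform prior over which coin was picked
p_label = np.full(n_labels, 1.0 / n_labels)

# Joint distribution over (label, X_1, X_2): the two runs are i.i.d. given the label
p_joint = (p_label[:, None, None]
           * p_x_given_label[:, :, None]
           * p_x_given_label[:, None, :])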
Now, if I tell you what the label on the coin was (which tells us the true bias of the coin), telling you X_1 would not give you any more information to help you guess X_2 (and vice versa). This is the first Natural Latent condition (Λ induces independence between X_1 and X_2). Alternatively, if I didn’t tell you the label, you could estimate it from either X_1 or X_2 equally well. This is the other two diagrams.
I think that the full label Λ will be an approximate stochastic natural latent. But if we consider only the first bit[1] of the label (which roughly tells us whether the bias is above or below 50% heads) then this bit will be a deterministic natural latent because with reasonably high certainty, you can guess the first bit of Λ from X_1 or X_2. This is because the conditional entropy H(first bit of Λ|X_1) is low, whereas H(Λ|X_1) will be high.

If I get only 23 heads out of 200 tosses, I can be reasonably certain that the first bit of Λ is a 0 (i.e. the coin has a less than 50% chance of coming up heads) but can’t be as certain what the last bit of Λ is. Just because Λ satisfies the Natural Latent conditions within ϵ, this doesn’t imply that Λ satisfies H(Λ|X_1)<ϵ. We can use X_1 to find a 5-bit estimate of Λ, but most of the useful information in that estimate is contained in the first bit. The second bit might be somewhat useful, but it’s less certain than the first. The last bit of the estimate will largely be noise.

This means that going from using Λ to using ‘first bit of Λ’ doesn’t decrease the usefulness of the latent very much, since the stuff we are throwing out is largely random. As a result, the ‘first bit of Λ’ will still satisfy the natural latent conditions almost as well as Λ. By throwing out the later bits, we threw away the most ‘stochastic’ bits, while keeping the most ‘latenty’ bits.
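As a rough numerical check of this intuition, continuing the sketch above (the helper function is my own, and the point is just that H(Λ|X_1) should come out much larger than H(first bit of Λ|X_1)):

# H(Z | X) in bits for a joint table p(z, x): -sum_{z,x} p(z,x) log2 p(z|x)
def cond_entropy(p_zx):
    p_x = p_zx.sum(axis=0, keepdims=True)
    mask = p_zx > 0
    p_z_given_x = np.divide(p_zx, p_x, out=np.zeros_like(p_zx), where=mask)
    return float(-(p_zx[mask] * np.log2(p_z_given_x[mask])).sum())

p_label_x1 = p_label[:, None] * p_x_given_label    # joint p(label, X_1)

# Coarse-grain the label to its first (most significant) bit: 1 iff label >= 16
first_bit = (np.arange(n_labels) >= 16).astype(int)
p_bit_x1 = np.zeros((2, n_flips + 1))
np.add.at(p_bit_x1, first_bit, p_label_x1)

print("H(Lambda | X_1)    =", cond_entropy(p_label_x1))  # sizeable: X_1 can't pin down all 5 bits
print("H(first bit | X_1) =", cond_entropy(p_bit_x1))    # much smaller: the top bit is nearly determined by X_1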
So in this case, we have started from a stochastic natural latent and used it to construct a deterministic natural latent which is almost as good. I haven’t done the calculation, but hopefully we could say something like ‘if Λ satisfies the natural latent conditions within ϵ then the first bit of Λ satisfies the natural latent conditions within 2ϵ (or 3ϵ or something else)’. Would an explicit proof of a statement like this for this case be a special case of the general problem?
The problem could be framed as something like: “Is there some standard process we can apply to every stochastic natural latent in order to obtain a deterministic natural latent which is almost as good (in terms of ϵ)?” This process would be analogous to the ‘throwing away the less useful/more random bits of Λ’ which we did in the example above. Does this sound right?
Also, can all stochastic natural latents be thought of as ‘more approximate’ deterministic latents? If a latent satisfies the three natural latent conditions within ϵ1, we can always find a (potentially much bigger) ϵ2 such that this latent also satisfies the deterministic latent condition, right? This is why you need to specify that the problem is showing that a deterministic natural latent exists with ‘almost the same’ ϵ. Does this sound right?
[1] I’m going to talk about the ‘first bit’, but an equivalent argument might also hold for the ‘first two bits’ or something. I haven’t actually checked the maths.
Some details are mildly off, but I think you’ve got the big picture basically right.
Alternatively, if I didn’t tell you the label, you could estimate it from either X_1 or X_2 equally well. This is the other two diagrams.
Minor clarification here: the other two diagrams say not only that I can estimate the label equally well from either X1 or X2, but that I can estimate the label (approximately) equally well from X1, X2, or the pair (X1,X2).
I think that the full label Λ will be an approximate stochastic natural latent.
I’d have to run the numbers to check that 200 flips is enough to give a high-confidence estimate of Λ (in which case 400 flips from the pair of variables will also put high confidence on the same value with high probability), but I think yes.
But if we consider only the first bit[1] of the label (which roughly tells us whether the bias is above or below 50% heads) then this bit will be a deterministic natural latent because with reasonably high certainty, you can guess the first bit of Λ from X1 or X2.
Not quite; I added some emphasis. The first bit will (approximately) satisfy the two redundancy conditions, i.e. X1→X2→1bit(Λ) and X2→X1→1bit(Λ), and indeed will be an approximately deterministic function of X. But it won’t (approximately) satisfy the mediation condition X1←1bit(Λ)→X2; the two sets of flips will not be (approximately) independent given only the first bit. (At least not to nearly as good an approximation as the original label.)
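One way to see this numerically, reusing the tables from the sketches above, is to compare the conditional mutual information I(X1;X2|latent) under the full label versus under just its first bit. The helpers here are again my own sketch:

# I(X; Y | Z) in bits for a joint table p(z, x, y),
# via I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def cond_mutual_info(p_zxy):
    return (entropy(p_zxy.sum(axis=2)) + entropy(p_zxy.sum(axis=1))
            - entropy(p_zxy) - entropy(p_zxy.sum(axis=(1, 2))))

# Mediation by the full label is exact: the two runs of flips are i.i.d. given Lambda
print(cond_mutual_info(p_joint))          # ~0, up to floating point error

# Coarse-grain Lambda to its first bit and recompute: mediation degrades a lot
p_bit_joint = np.zeros((2, n_flips + 1, n_flips + 1))
np.add.at(p_bit_joint, first_bit, p_joint)
print(cond_mutual_info(p_bit_joint))      # noticeably larger: the first bit alone mediates much worse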
That said, the rest of your qualitative reasoning is correct. As we throw out more low-order bits, the mediation condition becomes less well approximated, the redundancy conditions become better approximated, and the entropy of the coarse-grained latent given X falls.
So to build a proof along these lines, one would need to show that a bit-cutoff can be chosen such that bit_cutoff(Λ) still mediates (to an approximation roughly ϵ-ish), while making the entropy of bit_cutoff(Λ) low given X.
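Continuing the sketch above, here is roughly what that sweep looks like in the coin example: for each cutoff k, compute the mediation error I(X1;X2|top k bits) and the determinism error H(top k bits|X1,X2). I haven't verified the exact numbers this prints; it's just meant to show the two quantities trading off against each other.

# Tradeoff as we keep only the top k bits of the 5-bit label
for k in range(1, 6):
    top_k = np.arange(n_labels) >> (5 - k)          # top k bits of each label
    p_k_joint = np.zeros((2 ** k, n_flips + 1, n_flips + 1))
    np.add.at(p_k_joint, top_k, p_joint)

    mediation_err = cond_mutual_info(p_k_joint)                    # want this small
    determinism_err = cond_entropy(p_k_joint.reshape(2 ** k, -1))  # want this small too
    print(f"top {k} bits: I(X1;X2|latent) = {mediation_err:.3f} bits, "
          f"H(latent|X1,X2) = {determinism_err:.3f} bits")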
I do think this is a good angle of attack on the problem, and it’s one of the main angles I’d try.
If a latent satisfies the three natural latent conditions within ϵ1, we can always find a (potentially much bigger) ϵ2 such that this latent also satisfies the deterministic latent condition, right? This is why you need to specify that the problem is showing that a deterministic natural latent exists with ‘almost the same’ ϵ. Does this sound right?
Yes. Indeed, if we allow large enough ϵ (possibly scaling with system size/entropy) then there’s always a deterministic natural latent regardless; the whole thing becomes trivial.
I’d have to run the numbers to check that 200 flips is enough to give a high-confidence estimate of Λ
It isn’t enough. See plot. Also, 200 not being enough flips is part of what makes this interesting. With a million flips, this would pretty much just be the exact case. The fact that it’s only 200 flips gives you a tradeoff in how many label_bits to include.
Thanks for the clarifications, that all makes sense. I will keep thinking about this!
Here is the probability distribution of the number of heads (out of 200 tosses) plotted for each of your coins.
Python code:

import numpy as np
import matplotlib.pyplot as plt

# Biases of the 32 coins, evenly spaced from 0 to 1
biases = np.linspace(0, 1, 32)

def f(p):
    # Distribution of the number of heads in 200 tosses of a coin with bias p,
    # built by repeatedly convolving the single-toss distribution [1-p, p]
    a = np.array([1 - p, p])
    b = a
    for _ in range(199):
        b = np.convolve(b, a)
    return b

q = np.arange(201)                 # possible head counts, 0..200
ff = [f(p) for p in biases]

for dist in ff:
    plt.plot(q, dist)
plt.xlabel("heads")
plt.ylabel("prob")
plt.show()