Reversible networks (even when trained), for example, induce the same partition no matter how many layers you stack, so from the perspective of information theory everything looks the same.
I don’t think this is true? The differential entropy changes, even if you use a reversible map:
$H(Y) = H(X) + \mathbb{E}_X[\log \lvert \det J \rvert]$
where J is the Jacobian of your map. Features that are “squeezed together” are less usable, and you end up with a smaller entropy. Similarly, “unsqueezing” certain features, or examining them more closely, gives a higher entropy.
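To see the formula in action, here's a minimal numerical check (a toy of my own, not from the post): for a linear reversible map Y = AX the Jacobian is constant, J = A, and with a Gaussian input both entropies have closed forms, so the entropy shifts by exactly log|det A|. A map with |det A| < 1 "squeezes" the features and lowers the entropy.

```python
import numpy as np

# Minimal numerical check of H(Y) = H(X) + E_X[log|det J|] for a reversible
# *linear* map Y = A X, where the Jacobian J = A is constant.
# (A is an arbitrary illustrative choice, not from the discussion.)

A = np.array([[0.5, 0.3],
              [0.0, 0.4]])   # invertible; |det A| = 0.2 < 1, so it "squeezes" volume

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a zero-mean Gaussian with covariance cov."""
    return 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * cov))

H_X = gaussian_entropy(np.eye(2))            # input X ~ N(0, I)
H_Y = gaussian_entropy(A @ A.T)              # Y = A X  =>  Cov(Y) = A A^T
correction = np.log(abs(np.linalg.det(A)))   # E_X[log|det J|], constant for a linear map

print(H_X + correction, H_Y)   # identical: the reversible map shifted the entropy by log|det A|
```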
Ah you’re right. I was thinking about the deterministic case.
Your explanation of the Jacobian term accounting for features “squeezing together” makes me update towards thinking that the quantization used to turn neural networks from continuous & deterministic to discrete & stochastic, while ad hoc, isn’t as unreasonable as I originally thought. This paper is where I got the idea that discretization is bad because it “conflates ‘information theoretic stuff’ with ‘geometric stuff’, like clustering”, but perhaps it is in fact capturing something real.
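For concreteness, here's a toy version of the sort of quantization I mean (my own sketch, not taken from the paper): bin a continuous, deterministic activation and estimate I(X; T) from empirical counts. Inputs that the nonlinearity squeezes into the same bin become indistinguishable, so the binned estimate drops below H(X) even though the map is invertible on these inputs, which is exactly the "geometric" clustering effect showing up in an information-theoretic quantity.

```python
import numpy as np

# Toy sketch (illustrative choices throughout): quantize a continuous,
# deterministic activation T = f(X) into bins and estimate I(X; T) from counts.

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 8, size=n)        # discrete input, H(X) = 3 bits
t = np.tanh(1.5 * (x - 3.5))          # deterministic, invertible on these inputs

def entropy_bits(labels):
    """Empirical entropy in bits of a discrete array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def discrete_mi(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B), all from empirical counts."""
    joint = np.stack([a, b], axis=1)
    _, joint_ids = np.unique(joint, axis=0, return_inverse=True)
    return entropy_bits(a) + entropy_bits(b) - entropy_bits(joint_ids)

bins = np.linspace(-1, 1, 31)          # 30 equal-width bins over tanh's range
t_binned = np.digitize(t, bins)

# The saturated inputs (x = 0,1,2 and x = 5,6,7) land in the same extreme bins,
# so the estimate falls well below 3 bits despite f being invertible.
print(discrete_mi(x, t_binned))        # ~1.8 bits
```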