Reversible networks (even when trained), for example, induce the same partition no matter how many layers you stack, so from the perspective of information theory everything looks the same.
I don’t think this is true? The differential entropy changes, even if you use a reversible map:
$H(Y) = H(X) + \mathbb{E}_X[\log \lvert \det J \rvert]$
where J is the Jacobian of your map. Features that are “squeezed together” are less usable, and you end up with a smaller entropy. Similarly, “unsqueezing” certain features, or examining them more closely, gives a higher entropy.
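To see the formula in action, here's a minimal numerical check (a toy of my own, not from the post): for a linear reversible map Y = AX the Jacobian is constant, J = A, and with a Gaussian input both entropies have closed forms, so the entropy shifts by exactly log|det A|. A map with |det A| < 1 "squeezes" the features and lowers the entropy.

```python
import numpy as np

# Minimal numerical check of H(Y) = H(X) + E_X[log|det J|] for a reversible
# *linear* map Y = A X, where the Jacobian J = A is constant.
# (A is an arbitrary illustrative choice, not from the discussion.)

A = np.array([[0.5, 0.3],
              [0.0, 0.4]])   # invertible; |det A| = 0.2 < 1, so it "squeezes" volume

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a zero-mean Gaussian with covariance cov."""
    return 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * cov))

H_X = gaussian_entropy(np.eye(2))            # input X ~ N(0, I)
H_Y = gaussian_entropy(A @ A.T)              # Y = A X  =>  Cov(Y) = A A^T
correction = np.log(abs(np.linalg.det(A)))   # E_X[log|det J|], constant for a linear map

print(H_X + correction, H_Y)   # identical: the reversible map shifted the entropy by log|det A|
```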
Ah you’re right. I was thinking about the deterministic case.
Your explanation of the Jacobian term accounting for features “squeezing together” makes me update towards thinking that the quantization used to turn neural networks from continuous & deterministic to discrete & stochastic, while ad hoc, isn’t as unreasonable as I originally thought. This paper is where I got the idea that discretization is bad because it “conflates ‘information theoretic stuff’ with ‘geometric stuff’, like clustering”, but perhaps it is in fact capturing something real.
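For concreteness, here's a toy version of the sort of quantization I mean (my own sketch, not taken from the paper): bin a continuous, deterministic activation and estimate I(X; T) from empirical counts. Inputs that the nonlinearity squeezes into the same bin become indistinguishable, so the binned estimate drops below H(X) even though the map is invertible on these inputs, which is exactly the "geometric" clustering effect showing up in an information-theoretic quantity.

```python
import numpy as np

# Toy sketch (illustrative choices throughout): quantize a continuous,
# deterministic activation T = f(X) into bins and estimate I(X; T) from counts.

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 8, size=n)        # discrete input, H(X) = 3 bits
t = np.tanh(1.5 * (x - 3.5))          # deterministic, invertible on these inputs

def entropy_bits(labels):
    """Empirical entropy in bits of a discrete array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def discrete_mi(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B), all from empirical counts."""
    joint = np.stack([a, b], axis=1)
    _, joint_ids = np.unique(joint, axis=0, return_inverse=True)
    return entropy_bits(a) + entropy_bits(b) - entropy_bits(joint_ids)

bins = np.linspace(-1, 1, 31)          # 30 equal-width bins over tanh's range
t_binned = np.digitize(t, bins)

# The saturated inputs (x = 0,1,2 and x = 5,6,7) land in the same extreme bins,
# so the estimate falls well below 3 bits despite f being invertible.
print(discrete_mi(x, t_binned))        # ~1.8 bits
```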