Thank you, the first three examples make sense and seem like an appropriate use of mutual information. I’d like to ask about the fourth example, though, where you take the weights as unknown:
What’s the operational meaning of I(X;Z) under some p(W)? More importantly: what kinds of theoretical questions are these quantities a natural answer to? (I’m curious about examples in which such quantities came out as the natural answer to research questions you were asking in practice.)
I would guess that the choice of p(W) (maybe the Bayesian posterior, maybe the post-training SGD distribution) and the operational meaning these quantities take on will depend on what kinds of questions we’re interested in answering, but I don’t yet have a good sense of the contexts in which they come up as answers to natural research questions.
And more generally, do you think Shannon information measures (at least in your research experience) basically work for most theoretical purposes in saying interesting things about deterministic neural networks, or do you think we need something else?
My reason for asking (sorry, this is more vibes; I’m confused and can’t seem to state something more precise): neural networks seem like they could (and perhaps should) be analyzed as individual deterministic information-processing systems without reference to ensembles, since each individual weight setting is an algorithm that on its own does some “information processing”, and we want to understand the nature of that processing.
Meanwhile, Shannon information measures seem bad at this, since all they care about is the abstract partition a function induces on its domain via the preimage map. Reversible networks (even when trained), for example, induce the same partition no matter how many layers you stack, so from the perspective of information theory everything looks the same. Yet, considered as individual weight settings rather than as an ensemble, the networks are changing their information-processing character in some qualitative way, like representation learning: changing the amount of easily accessible / usable information. That’s why I thought something along the lines of V-information might be useful.
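To make the “same partition” worry concrete, here’s a minimal toy sketch (discrete inputs and made-up bijective “layers”, not an actual trained network, so this is only an illustration of the claim): because each layer is a bijection, the preimage partition never changes, and I(X;Z) = H(Z) = H(X) whether you apply one layer or three.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Discrete inputs: 8 equally likely symbols.
x = rng.integers(0, 8, size=100_000)

# Toy invertible "layers" on the discrete domain; each is a bijection,
# so the preimage partition of the composite map never changes.
def layer1(v): return (v + 3) % 8   # shift
def layer2(v): return (5 * v) % 8   # multiply by a unit mod 8
def layer3(v): return v ^ 0b010     # flip a bit

z1 = layer1(x)
z3 = layer3(layer2(layer1(x)))

# For a deterministic map, I(X; Z) = H(Z); for a bijection, H(Z) = H(X).
# So stacking more layers leaves the mutual information unchanged.
print(entropy(x), entropy(z1), entropy(z3))  # all ~3 bits
```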
I would be surprised (but happy!) if such notions could be captured by cleverly using Shannon information measures. I think that’s what the papers attempting to put a distribution over the weights were doing, using language like (in my words) “p(W) is an arbitrary choice and that’s fine, since it is used as a means of probing information” or something to that effect, but the justifications feel hand-wavy. I have yet to see a convincing argument for why this works, or more generally for how Shannon measures could capture these aspects of information processing (like usable information).
Reversible networks (even when trained), for example, induce the same partition no matter how many layers you stack, so from the perspective of information theory everything looks the same
I don’t think this is true? The differential entropy changes, even if you use a reversible map:
H(Y) = H(X) + E_X[log |det J|]
where J is the Jacobian of your map. Features that are “squeezed together” are less usable, and you end up with a smaller entropy. Similarly, “unsqueezing” certain features, or examining them more closely, gives a higher entropy.
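A quick numerical sanity check of this (just a linear “squeeze” Y = aX applied to a Gaussian input; the map and numbers are illustrative, not taken from any particular network):

```python
import numpy as np

# Differential entropy (in nats) of a 1-D Gaussian with standard deviation sigma.
def gaussian_entropy(sigma):
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

sigma_x = 1.0
a = 0.1                               # squeeze the feature by a factor of 10: Y = a * X

h_x = gaussian_entropy(sigma_x)
h_y = gaussian_entropy(a * sigma_x)   # Y = a * X is Gaussian with std a * sigma_x
jacobian_term = np.log(abs(a))        # E_X[log |det J|] for the linear map x -> a * x

print(h_y, h_x + jacobian_term)       # identical: squeezing lowers the entropy by log 10
```

The two prints agree, and h_y is smaller than h_x by log 10, matching the intuition that squeezed-together features carry lower differential entropy.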
Ah you’re right. I was thinking about the deterministic case.
Your explanation of the Jacobian term accounting for features “squeezing together” makes me update towards thinking that the quantization used to turn neural networks from continuous & deterministic into discrete & stochastic, while ad hoc, isn’t as unreasonable as I originally thought. This paper is where I got the idea that discretization is bad because it “conflates ‘information theoretic stuff’ with ‘geometric stuff’, like clustering”, but perhaps it is in fact capturing something real.
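For what it’s worth, here’s a toy sketch of what that binning does (the bin count and the “squeezing” layer are arbitrary choices of mine, not the scheme from the paper): quantizing the activations makes I(X;Z) finite for a deterministic network, and the estimate becomes sensitive to geometry, in that an invertible layer which squeezes its outputs into a narrow range looks like it has destroyed information.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x_ids, z_ids):
    """I(X;Z) in bits, estimated from joint counts of two discrete label arrays."""
    joint = np.zeros((x_ids.max() + 1, z_ids.max() + 1))
    np.add.at(joint, (x_ids, z_ids), 1)
    p = joint / joint.sum()
    px, pz = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    mask = p > 0
    return (p[mask] * np.log2(p[mask] / (px @ pz)[mask])).sum()

# Continuous inputs; each sample is its own "value" of X.
x = rng.uniform(-1, 1, size=5000)
x_ids = np.arange(x.size)

# Two deterministic, invertible "layers": identity vs. one that squeezes into [-0.01, 0.01].
z_wide = x
z_squeezed = 0.01 * x

# The ad hoc quantization step: 30 equal-width bins over a fixed range.
bins = np.linspace(-1, 1, 31)
print(mutual_information(x_ids, np.digitize(z_wide, bins)))      # ~log2(30) ≈ 4.9 bits
print(mutual_information(x_ids, np.digitize(z_squeezed, bins)))  # ~1 bit: geometry shows up
```

Both maps are invertible, so without the binning I(X;Z) would be identical (and degenerate) for either; the bins are exactly where the “geometric stuff” enters.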