That makes sense. I’ve updated towards thinking this is reasonable (though the binning and discretization are still ad hoc) and captures something real.
We could formalize it as I_σ(X; f(X)) := I(X; f(X) + ε_σ), with ε_σ some independent noise parameterized by σ. Then I_σ(X; f(X)) would be finite. We could think of binning the output of a layer to make it stochastic in a similar way.
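As a quick sanity check on the finiteness claim, here is a minimal sketch (my own toy construction, not from any of the papers): X standard Gaussian, f a fixed linear map, ε_σ independent Gaussian noise. Then I_σ(X; f(X)) has a closed form, 0.5·log(1 + a²·Var(X)/σ²), which is finite for every σ > 0 and blows up as σ → 0, recovering the divergence of the noiseless I(X; f(X)); a crude binned estimate roughly agrees with it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (purely illustrative assumption): X ~ N(0, 1), f(x) = a * x,
# eps_sigma ~ N(0, sigma^2) independent of X.
a, sigma, n = 1.0, 0.5, 200_000
x = rng.normal(0.0, 1.0, n)
y = a * x + rng.normal(0.0, sigma, n)  # f(X) + eps_sigma

# Closed-form I_sigma(X; f(X)) for the jointly Gaussian case (in nats).
closed_form = 0.5 * np.log(1.0 + a**2 * 1.0 / sigma**2)

def mi_binned(u, v, bins=60):
    """Crude plug-in MI estimate from a 2D histogram (nats)."""
    p_uv, _, _ = np.histogram2d(u, v, bins=bins)
    p_uv /= p_uv.sum()
    p_u = p_uv.sum(axis=1, keepdims=True)
    p_v = p_uv.sum(axis=0, keepdims=True)
    nz = p_uv > 0
    return float(np.sum(p_uv[nz] * np.log(p_uv[nz] / (p_u @ p_v)[nz])))

print(f"closed form    : {closed_form:.3f} nats")   # ~0.80
print(f"binned estimate: {mi_binned(x, y):.3f} nats")
# As sigma -> 0 the closed form grows without bound, i.e. we recover the
# divergence of the noiseless I(X; f(X)) for a deterministic map.
```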
Ideally we’d like the new measure to be finite even for deterministic maps (which is the case above) and to satisfy a strict data processing inequality like I_σ(X; g(f(X))) < I_σ(f(X); g(f(X))), the intuition being that each step of the map adds more noise.
But I_σ(X; f(X)) is just h(f(X) + ε_σ) minus a constant that depends only on the noise statistics, and the same goes for both sides of the inequality above, so it holds only with equality.
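Spelling that step out (just standard entropy algebra, with f, g deterministic and ε_σ independent of everything else):

$$
\begin{aligned}
I_\sigma(X;\, g(f(X))) &= h\!\left(g(f(X)) + \epsilon_\sigma\right) - h\!\left(g(f(X)) + \epsilon_\sigma \mid X\right) = h\!\left(g(f(X)) + \epsilon_\sigma\right) - h(\epsilon_\sigma),\\
I_\sigma(f(X);\, g(f(X))) &= h\!\left(g(f(X)) + \epsilon_\sigma\right) - h\!\left(g(f(X)) + \epsilon_\sigma \mid f(X)\right) = h\!\left(g(f(X)) + \epsilon_\sigma\right) - h(\epsilon_\sigma),
\end{aligned}
$$

so both sides coincide: conditioning on either X or f(X) pins down g(f(X)) exactly, leaving only the noise entropy.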
The issue is that the intuition above relies on each application of f and g adding further noise to its input, just as discretization does: each layer further discretizes and bins its input, leading to a gradual loss of information, which lets mutual information capture something real, namely the number of bits needed to recover the input up to a given precision across layers. But I_σ only adds a single independent noise at the end. So any relaxation of I(X; f(X)) will have to depend on the functional structure of f.
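To make that concrete, here is a linear-Gaussian toy chain (again my own construction): with independent noise added only at the readout, I_σ(X; g(f(X))) and I_σ(f(X); g(f(X))) come out identical, whereas if each layer injects its own noise the data processing inequality becomes strict, which is exactly the kind of dependence on the structure of the map we'd want.

```python
import numpy as np

# Linear-Gaussian toy chain (my assumption): X ~ N(0, 1), f(x) = a*x, g(y) = b*y.
a, b = 1.5, 0.7
var_x = 1.0

def gauss_mi(signal_var, noise_var):
    """I(S; S + N) in nats for independent zero-mean Gaussians S and N."""
    return 0.5 * np.log(1.0 + signal_var / noise_var)

# Case 1: independent noise added only at the readout (the I_sigma construction).
sigma2 = 0.25
i_x  = gauss_mi((a * b) ** 2 * var_x, sigma2)        # I_sigma(X; g(f(X)))
i_fx = gauss_mi(b ** 2 * a ** 2 * var_x, sigma2)     # I_sigma(f(X); g(f(X)))
print("readout noise only:", round(i_x, 4), "=", round(i_fx, 4))

# Case 2: each layer injects its own noise, Z1 = a*X + e1, Z2 = b*Z1 + e2,
# so Z2 = a*b*X + (b*e1 + e2) with the lumped noise independent of X.
s1, s2 = 0.25, 0.25
i_x_z2  = gauss_mi((a * b) ** 2 * var_x, b ** 2 * s1 + s2)   # I(X; Z2)
i_z1_z2 = gauss_mi(b ** 2 * (a ** 2 * var_x + s1), s2)       # I(Z1; Z2)
print("noise per layer   :", round(i_x_z2, 4), "<", round(i_z1_z2, 4))
```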
With that (+ Dmitry’s comment on precision scale), I think the papers that measure mutual information between activations in different layers with a noise distribution over the parameters of f sound a lot more reasonable than I originally thought.