First, I agree with Dmitry.
But it does seem like maybe you could recover a notion of information bottleneck even without the Bayesian NN model. If you quantize real numbers to N-bit floating point numbers, there’s a very real quantity which is “how many more bits do you need to exactly reconstruct X, given Z?” My suspicion is that for a fixed network, this quantity grows linearly with N (and if it’s zero at ‘actual infinity’ for some network despite being nonzero in the limit, maybe we should ignore actual infinity).
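To make that concrete, here is a rough numerical sketch of the quantity in question (the uniform quantizer, the toy map Z = tanh(3X), and the plug-in entropy estimate are all just illustrative choices): quantize X and Z = f(X) to N bits each and estimate H(X_N | Z_N), the extra bits needed to pin down X_N once you know Z_N.

```python
# Rough sketch: extra bits needed to recover an N-bit quantization of X from an
# N-bit quantization of Z = f(X), i.e. a plug-in estimate of H(X_N | Z_N).
# Everything here (quantizer, toy map, sample size) is an illustrative choice.
import numpy as np

def quantize(x, n_bits, lo, hi):
    """Uniform quantizer: map x in [lo, hi] onto 2**n_bits integer levels."""
    levels = 2 ** n_bits
    x = np.clip(x, lo, hi)
    ids = ((x - lo) / (hi - lo) * levels).astype(np.int64)
    return np.minimum(ids, levels - 1)

def cond_entropy_bits(x_ids, z_ids):
    """Plug-in estimate of H(X|Z) in bits from paired discrete samples."""
    n = len(x_ids)
    joint, marg_z = {}, {}
    for xz in zip(x_ids, z_ids):
        joint[xz] = joint.get(xz, 0) + 1
        marg_z[xz[1]] = marg_z.get(xz[1], 0) + 1
    h_joint = -sum(c / n * np.log2(c / n) for c in joint.values())
    h_z = -sum(c / n * np.log2(c / n) for c in marg_z.values())
    return h_joint - h_z

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
z = np.tanh(3.0 * x)  # stand-in deterministic "network": saturates, so it discards precision
for n_bits in (4, 6, 8, 10):
    h = cond_entropy_bits(quantize(x, n_bits, -4.0, 4.0),
                          quantize(z, n_bits, -1.0, 1.0))
    print(f"N = {n_bits:2d} bits -> extra bits to recover X given Z ~ {h:.2f}")
```

On this toy map the printed value keeps climbing as N grows, since the saturated tails of tanh discard whatever extra precision you give X.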
But this isn’t all that useful; it would be nicer to have a notion of information that converges. The divergence also seems a bit silly, because it treats the millionth digit of X as being just as important as the first.
So suppose you don’t want to perfectly reconstruct X. Instead, maybe you could say the distribution of X is made of some fixed number of bins or summands, and you want to figure out which one based on Z. Then you get a finite amount of information, and you correctly treat small differences in X as less important, but you’ve had to introduce this somewhat arbitrary set of bins. shrug
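One way to write this down (the notation $B_K$ for the binning map is just illustrative): let $B_K$ assign $X$ to one of $K$ fixed bins. The quantity of interest is then

$$ I\big(B_K(X); Z\big) \;\le\; H\big(B_K(X)\big) \;\le\; \log_2 K, $$

which is finite no matter how precisely $Z$ is represented, and which by construction ignores variation of $X$ below the bin width; the arbitrariness lives entirely in the choice of $K$ and the bin boundaries.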
That makes sense. I’ve updated towards thinking this is reasonable (although binning and discretization are still ad hoc) and captures something real.
We could formalize it as $I_\sigma(X; f(X)) := I(X; f(X) + \epsilon_\sigma)$, where $\epsilon_\sigma$ is some independent noise parameterized by $\sigma$. Then $I_\sigma(X; f(X))$ becomes finite. We could think of binning the output of a layer as making it stochastic in a similar way.
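As a sanity check on this definition, a worked special case (the linear layer and Gaussian noise are just an illustrative choice): for $f(X) = wX$ with $X \sim \mathcal{N}(0, 1)$ and $\epsilon_\sigma \sim \mathcal{N}(0, \sigma^2)$,

$$ I_\sigma\big(X; f(X)\big) = I\big(X; wX + \epsilon_\sigma\big) = \tfrac{1}{2}\log\!\left(1 + \frac{w^2}{\sigma^2}\right), $$

which is finite for every $\sigma > 0$ and diverges as $\sigma \to 0$, recovering the infinite mutual information of the deterministic map.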
Ideally we’d like the new measure to be finite even for deterministic maps (which is the case above) and to satisfy a strict data-processing inequality like $I_\sigma(X; g(f(X))) < I_\sigma(f(X); g(f(X)))$, the intuition being that each step of the map adds more noise.
But $I_\sigma(X; f(X))$ is just $h(f(X) + \epsilon_\sigma)$ up to a constant that depends only on the noise statistics (given $X$, the only remaining randomness is the noise, so $h(f(X) + \epsilon_\sigma \mid X) = h(\epsilon_\sigma)$). The same argument works with $f(X)$ in place of $X$, so both sides of the proposed inequality equal $h(g(f(X)) + \epsilon_\sigma) - h(\epsilon_\sigma)$, and the above is an equality.
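A quick numerical illustration of that collapse (the maps, the noise scale, and the use of scikit-learn’s k-NN mutual information estimator are all just illustrative choices):

```python
# Toy check that I_sigma(X; g(f(X))) and I_sigma(f(X); g(f(X))) coincide when a
# single independent noise term is added at the end. The maps f, g, the noise
# scale, and sklearn's k-NN MI estimator are all illustrative choices.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n, sigma = 20_000, 0.1

x = rng.standard_normal(n)
fx = np.tanh(2.0 * x)                                     # deterministic "layer" f
noisy = (fx ** 3 - fx) + sigma * rng.standard_normal(n)   # g(f(X)) + eps_sigma

i_x = mutual_info_regression(x.reshape(-1, 1), noisy, n_neighbors=5)[0]
i_fx = mutual_info_regression(fx.reshape(-1, 1), noisy, n_neighbors=5)[0]
print(f"I_sigma(X;    g(f(X))) ~ {i_x:.3f} nats")
print(f"I_sigma(f(X); g(f(X))) ~ {i_fx:.3f} nats")
```

Up to estimator error, the two numbers come out essentially the same: conditioning on either $X$ or $f(X)$ leaves only the noise $\epsilon_\sigma$ unexplained.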
The issue is that the above intuition relies on each application of $f$ and $g$ adding its own noise to the input (this is exactly what discretization buys us: each layer further discretizes and bins its input, leading to a gradual loss of information, which lets mutual information capture something real, namely the number of bits needed to recover the input up to a certain precision across layers), whereas $I_\sigma$ only adds a single independent noise term at the end. So any relaxation of $I(X; f(X))$ will have to depend on the functional structure of $f$.
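For contrast, a sketch of the per-layer-noise picture (again just an illustrative toy, with the same estimator): injecting noise at every layer gives a genuine Markov chain $X \to Z_1 \to Z_2$, so the data-processing inequality applies, and typically strictly.

```python
# Sketch of the per-layer-noise alternative: z1 = f(x) + eps1, z2 = g(z1) + eps2.
# Now X -> Z1 -> Z2 is a Markov chain, so I(X; Z2) <= I(Z1; Z2), and the gap is
# generally nonzero. Maps, noise scale, and estimator are illustrative choices.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
n, sigma = 20_000, 0.1

x = rng.standard_normal(n)
z1 = np.tanh(2.0 * x) + sigma * rng.standard_normal(n)    # noisy layer 1
z2 = (z1 ** 3 - z1) + sigma * rng.standard_normal(n)      # noisy layer 2

i_x_z2 = mutual_info_regression(x.reshape(-1, 1), z2, n_neighbors=5)[0]
i_z1_z2 = mutual_info_regression(z1.reshape(-1, 1), z2, n_neighbors=5)[0]
print(f"I(X;  Z2) ~ {i_x_z2:.3f} nats")
print(f"I(Z1; Z2) ~ {i_z1_z2:.3f} nats  (expected to be the larger of the two)")
```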
With that (plus Dmitry’s comment on the precision scale), I think the papers that measure mutual information between activations in different layers, with a noise distribution over the parameters of $f$, sound a lot more reasonable than I originally thought.