I just read your koan and wow, it's a great post, thank you for writing it. It also gave me some new insights into how to think about my confusions, and some answers. Here's my chain of thought:

If I want my information-theoretic quantities not to degenerate, I need some distribution over the weights. What is the natural distribution to consider?
Well, there’s the Bayesian posterior.
But I feel like there is a sense in which an individual neural network, with its particular weights, should be considered a deterministic information-processing system on its own, without reference to an ensemble.
Using the Bayesian posterior won’t let me do this:
If I have a fixed neural network containing a circuit C that takes an activation X (at one location in the network) and produces an activation Y (at another location), it would make sense to ask questions about the information processing C does, such as the mutual information I(X;Y).
But intuitively, treating the weights as unknown averages everything out: even if my original fixed network has relatively high probability density under the Bayesian posterior, it is unlikely that X and Y would be related by a similar circuit mechanism under another weight sampled at random from the posterior.
Same with sampling from the post-SGD distribution.
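To spell out the two failure modes in symbols (just my restatement of the above, with $w$ the weights and $\mathcal{D}$ the data):

$$I(X;Y) = H(Y) - H(Y \mid X),$$

so with $w$ fixed, $Y = f_w(X)$ is deterministic, $H(Y \mid X)$ collapses ($0$ in the discrete case, $-\infty$ for continuous activations), and the quantity is just $H(Y)$ or diverges, telling us nothing about C. With $w$ unknown and integrated over, the channel becomes

$$p(y \mid x) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw,$$

a mixture over networks whose circuits need not resemble C at all.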
So it would be nice to find a way to interpolate between the two. And I think the idea of a tempered local Bayesian posterior from your koan post is basically the right way to do this! (All of this also makes me think that papers which measure mutual information between activations in different layers by introducing a noise distribution over the parameters of f are a lot more reasonable than I originally thought.)
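To make that concrete, here is a toy version of that kind of measurement. Everything in it is my own illustration, not any particular paper's method: a tiny two-layer tanh network, isotropic Gaussian noise on the weights as a crude stand-in for a tempered local posterior, and a binned plug-in estimator for the mutual information.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h1, d_h2 = 4, 8, 8
W1 = rng.normal(size=(d_h1, d_in)) / np.sqrt(d_in)   # the "fixed" weights w0
W2 = rng.normal(size=(d_h2, d_h1)) / np.sqrt(d_h1)

def binned_mi(x, y, bins=40):
    """Plug-in estimate of I(X;Y) in nats from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))

sigma = 0.05          # weight-noise scale, playing the role of the temperature
n = 20000
xs, ys = np.empty(n), np.empty(n)
for i in range(n):
    x_in = rng.normal(size=d_in)
    W1n = W1 + sigma * rng.normal(size=W1.shape)      # w ~ N(w0, sigma^2 I)
    W2n = W2 + sigma * rng.normal(size=W2.shape)
    h1 = np.tanh(W1n @ x_in)
    h2 = np.tanh(W2n @ h1)
    xs[i], ys[i] = h1[0], h2[0]                       # X and Y: one unit per layer

print(binned_mi(xs, ys))
```

The answer depends on sigma (and on the binning), which is the point of the whole exercise: the quantity is only meaningful relative to a choice of distribution over the weights.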
If I understand correctly, you want a way of thinking about a reference class of programs that share some specific, perhaps interpretability-relevant or compression-related, properties with the deterministic program you're studying?
I think in that case I'd actually say the tempered Bayesian posterior by itself isn't enough, since even working locally in a basin might not preserve the specific features you want. I'd probably still start with the tempered local Bayesian posterior, but then also condition on the specific properties/explicit features/etc. that you want to preserve. (I might be misunderstanding your comment, though.)
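Here is a minimal sketch of what I mean by "start with the tempered local posterior, then condition". It assumes a plain SGLD-style sampler; `w0`, `loss_grad`, `preserves_feature`, `beta`, and `gamma` are all placeholders I'm inventing for illustration, and the rejection step is just the crudest possible way of conditioning on the property.

```python
import numpy as np

def sgld_local_conditioned(w0, loss_grad, preserves_feature,
                           beta=10.0, gamma=100.0, step=1e-4,
                           n_steps=5000, rng=None):
    """Sample from a tempered local posterior around w0 and keep only the
    samples that also satisfy the property we want preserved.

    Target density (up to a constant):
        exp(-beta * loss(w) - (gamma / 2) * ||w - w0||^2),
    i.e. a tempered likelihood with a Gaussian localising term, sampled with
    plain Langevin dynamics; "conditioning" is done crudely by rejection.
    """
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    kept = []
    for _ in range(n_steps):
        grad_U = beta * loss_grad(w) + gamma * (w - w0)  # gradient of -log target
        w = w - step * grad_U + np.sqrt(2 * step) * rng.normal(size=w.shape)
        if preserves_feature(w):
            kept.append(w.copy())
    return np.array(kept)
```

In practice you'd probably want something smarter than rejection (e.g. folding the property into the potential itself), but the shape of the object is the same: a tempered, localised posterior, further restricted to weights that keep the feature intact.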