Is the following a fair summary of the thread ~up to “Natural degradation” from the SLT perspective?
Current SLT-inspired approaches are right to consider samples of the “tempered local Bayesian posterior” provided by SGLD as natural degradations of the model.
However they mostly only use those samples (at a fixed Watanabe temperature) to compute the expectation of the loss and the resulting LLC, because that is theoretically grounded by Watanabe’s work.
You suggest instead to compute, using those sampled weights, the expectations of more complicated observables derived from other interpretability methods, and to interpret those expectations using the “natural scale” heuristics laid out in the post.
Thanks for asking! I said in a later shortform that I was trying to do too many things in this post, with only vague relationships between them, and I’m planning to split it into pieces in the future.
Your 1-3 are mostly correct. I’d comment as follows:
(and also kind of 3) That advice of using the tempered local Bayesian posterior (I like the term—let’s shorten it to TLBP) is mostly aimed at non-SLT researchers (but may apply also to some SLT experiments). The suggestion is simpler than computing expectations: it’s just to run a single experiment at a weight sampled from the TLBP. This is analogous to tuning a precision dial on your NN to noise away all circuits for which the quotient (usefulness)/(description length) is bounded above by 1/t (where usefulness is measured in reduction of loss). At t = 0 you’re adding no noise, and at t = ∞ you’re fully noising the network.
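To make the “precision dial” concrete, here is a hypothetical toy sketch (not from the original post): SGLD-style Langevin sampling from a tempered posterior p(w) ∝ exp(−L(w)/t) on a quadratic toy loss. The update w ← w − η∇L(w) + √(2ηt)·ξ has this stationary distribution, so higher t literally means more injected noise; directions with higher curvature (more loss reduction per unit of weight perturbation) stay pinned longer.

```python
import numpy as np

def sgld_samples(grad_loss, w0, t, step=0.005, n_steps=20000, burn_in=2000, seed=0):
    """Approximate samples from p(w) ∝ exp(-L(w)/t) via Langevin dynamics."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    samples = []
    for i in range(n_steps):
        # Gradient step plus temperature-scaled Gaussian noise.
        w = w - step * grad_loss(w) + np.sqrt(2 * step * t) * rng.standard_normal(w.shape)
        if i >= burn_in:
            samples.append(w.copy())
    return np.array(samples)

# Toy loss L(w) = ½ wᵀHw: one sharp ("useful") direction, one flat direction.
H = np.diag([100.0, 1.0])
grad = lambda w: H @ w

cold = sgld_samples(grad, np.zeros(2), t=0.01)  # near t = 0: little noise
hot = sgld_samples(grad, np.zeros(2), t=1.0)    # higher t: weights wander far
print(cold.var(axis=0), hot.var(axis=0))        # each direction has variance ≈ t / H_ii
```

The variance along each direction at stationarity is roughly t/H_ii, so the flat direction gets noised away at much lower temperature than the sharp one.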
This is interesting to do in interp experiments for two general reasons:
You can see whether the behavior your experiment finds is general or spurious. The higher the temperature range it persists over, the more general it is in the sense of usefulness/description length (and all else being equal, the more important your result is).
If you are hoping to say that a behavior you found, e.g. a circuit, is “natural from the circuit’s point of view” (i.e., plausibly occurs in some kind of optimal weight- or activation-level description of your model), you need to make sure your experiment isn’t just putting together bits of other circuits in an ad-hoc way and calling it a circuit. One way to see this, that works 0% of the time, is to notice that turning this circuit on or off affects the output on exactly the context/structure you care about, and has absolutely no effect at all on performance elsewhere. This never works because our interp isn’t at a level where we can perform uber-precise targeted interventions, and whenever we do something to a network in an experiment, this always significantly affects loss on unrelated inputs. By having a tunable precision parameter (as given by the TLBP for example), you have more freedom to find such “clean” effects that only do what you want and don’t affect loss otherwise. In general, in an imprecise sense, you expect each “true” circuit to have some “temperature of entanglement” with the rest of the model, and if this circuit is important enough to survive tempering to this temperature of entanglement, you expect to see much cleaner and nicer results in the resulting tempered model.
In the above context, you rarely want to use the Watanabe temperature or any other temperature that only depends on the number of samples n, since it’s much too low in most cases. Instead, you’re either looking for a characteristic temperature associated with an experiment or circuit (which in general will not depend on n much), or fishing for behaviors that you hope are “significantly general”. Here the characteristic temperature associated with the level of generality that “is not literally memorizing” is the Watanabe temperature or something very similar, but it is probably more interesting to consider larger scales.
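A hypothetical numeric sketch of the “range of temperatures a behavior persists over” idea, continuing the quadratic toy model: model circuit i as intact when its weight perturbation stays below some functional tolerance eps (both the loss and the tolerance here are invented for illustration). Under the tempered posterior of L = ½ wᵀHw, direction i is perturbed with standard deviation √(t/H_ii), so the sharper (more useful per description bit) circuit survives to much higher temperatures.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.array([100.0, 1.0])  # sharp "general" circuit vs flat "fragile" one
eps = 0.3                   # hypothetical functional tolerance for "circuit intact"

def survival_fraction(t, n=20000):
    # For this toy loss the tempered posterior is an exact Gaussian,
    # so we sample it directly instead of running SGLD.
    w = rng.standard_normal((n, 2)) * np.sqrt(t / H)
    return (np.abs(w) < eps).mean(axis=0)

for t in [0.01, 0.1, 1.0, 10.0]:
    print(t, survival_fraction(t))
```

Sweeping t like this gives each circuit a characteristic temperature at which it washes out, a crude stand-in for the “temperature of entanglement” picture above.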
(maybe more related to your question 1): Above, I explained why I think performing experiments at TLBP weight values is useful for “general interp”. I also explained that you sometimes have a natural “characteristic temperature” for the TLBP that is independent of sample number (e.g. meaningful at infinite samples): the difference between the loss of the network you’re studying and that of a SOTA NN, which you think of as the “true optimal loss”. In large-sample (highly underparameterized) cases, this is probably a better characteristic temperature than the Watanabe temperature, including for notions of effective parameter count: indeed, insofar as your NN is “an imperfect approximation of an optimal NN”, the noise inherent in this imperfection is on this scale (and not the Watanabe scale). Of course there are issues with this PoV, as less expressive NNs are rarely well-conceptualized as TLBP samples (insofar as they find a subset of a “perfect NN’s circuits”, they find the easily learnable ones rather than the maximally general ones). However, it’s still reasonable to think of this as a first stab at the inherent noise scale associated with an underparameterized model, and to think of the effective parameter count at this scale (i.e., free energy / log temperature) as a better approximation of some “inherent” parameter count.
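A hedged sketch of what “effective parameter count at scale t” looks like in the simplest non-degenerate-plus-degenerate toy case (the specific estimator and loss are my illustration, not the post’s definition): for L(w) = ½ Σ H_i w_i² the tempered posterior gives E[L] = d_eff · t/2, so d̂ = 2·E[L]/t counts only the non-degenerate directions.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.array([100.0, 1.0, 0.0])  # third direction is exactly degenerate

def effective_params(t, n=100000):
    """Estimate 2·E[L]/t under the tempered posterior of L = ½ Σ H_i w_i²."""
    nz = H > 0
    w = np.zeros((n, len(H)))
    # Non-degenerate directions: tempered posterior is Gaussian with var t/H_i.
    w[:, nz] = rng.standard_normal((n, nz.sum())) * np.sqrt(t / H[nz])
    # Degenerate direction: loss is flat there, so prior samples add no loss.
    w[:, ~nz] = rng.standard_normal((n, (~nz).sum()))
    L = 0.5 * (H * w**2).sum(axis=1)
    return 2 * L.mean() / t

print(effective_params(0.1))  # ≈ 2, not 3: the flat direction doesn't count
```

The same quantity evaluated at the “loss gap to SOTA” temperature, rather than the Watanabe temperature, is one way to read the “inherent” parameter count suggestion above.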
I mentioned samples and expectations for the TLBP because it seems possible (and suggested by the role of degeneracies in SLT) that different samples can correspond to qualitatively different degradations of the model. Cartoon picture: besides the robust circuit X of interest, there are “fragile” circuits A and B, and most samples at a given loss scale degrade either A or B but not both.
I agree that there is no strong reason to overindex on the Watanabe temperature, which is derived from an idealised situation: global Bayesian inference, degeneracies exactly at the optimal parameters, “relatively finite variance”, etc. The scale you propose seems quite natural but I will let LLC-practitioners comment on that.