Although the residuals for each of the four component matrices (after removing the first two principal components) are small and appear to be noise, it is hard to prove that there is no structure that causes this noise to interact constructively and “blow up” when we multiply the matrices.
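To make the worry concrete, here is a minimal numpy sketch (the `rank2_residual` helper and the random 64×64 matrices are illustrative assumptions, not our actual setup). The easy-to-prove bound is the product of the individual residual norms; the hard question is whether structure in the residuals pushes the actual product norm up toward it.

```python
import numpy as np

rng = np.random.default_rng(0)

def rank2_residual(M):
    """Return M minus its best rank-2 approximation (via SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return M - U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

A = rng.normal(size=(64, 64))
B = rng.normal(size=(64, 64))
E_A, E_B = rank2_residual(A), rank2_residual(B)

# The compact, easy-to-prove bound is the product of the spectral norms;
# the hard part is ruling out structure in the residuals that makes the
# actual product norm approach this bound.
naive_bound = np.linalg.norm(E_A, 2) * np.linalg.norm(E_B, 2)
actual = np.linalg.norm(E_A @ E_B, 2)
print(naive_bound, actual)
```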
Have you tried replacing what you believe is noise with actual random noise with similar statistical properties, and then testing the performance of the resulting model? You may not be able to prove the original model is safe, but you can produce a model in which all the potential structure you hypothesize is just noise has been replaced, so for that model you know the noise hypothesis is true.
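For concreteness, a minimal sketch of what I mean, assuming numpy (`replace_residual_with_noise`, `evaluate`, and the weight dictionary are hypothetical stand-ins, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

def replace_residual_with_noise(M, rank=2):
    """Keep the top-`rank` principal components of M; replace the residual
    with Gaussian noise matched to the residual's mean and std."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    low_rank = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
    residual = M - low_rank
    return low_rank + rng.normal(residual.mean(), residual.std(), size=M.shape)

# Rebuild each component matrix this way and compare the scrubbed model's
# performance to the original's, e.g.:
#   scrubbed = {name: replace_residual_with_noise(W) for name, W in weights.items()}
#   print(evaluate(original_model), evaluate(model_with_weights(scrubbed)))
```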
I believe what you describe is effectively Causal Scrubbing.
Edit: Note that it is not exactly the same as causal scrubbing, which instead swaps in the activations from another input sampled at random.
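To illustrate the difference, a rough numpy sketch (function names are illustrative, not from the causal scrubbing codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_ablation(act):
    # The replacement you describe: synthetic noise matched to the
    # activation's mean and standard deviation.
    return rng.normal(act.mean(), act.std(), size=act.shape)

def resample_ablation(act, acts_from_other_inputs):
    # Causal scrubbing: swap in the activation the same component produced
    # on a different input sampled at random, so the activation's marginal
    # distribution is preserved exactly rather than approximated.
    return acts_from_other_inputs[rng.integers(len(acts_from_other_inputs))]
```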
On our particular model, doing this replacement shows that the noise bound is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)
On other toy models we’ve looked at (modular addition in particular, writeup forthcoming), we have (very) preliminary evidence suggesting that randomizing the noise has a steep drop-off in bound-tightness (as a function of how compact a proof the noise term comes from) in a very similar fashion to what we see with proofs. There seems to be a pretty narrow band of hypotheses for which the noise is structureless but we can’t prove it. This is supported by a handful of comments about how causal scrubbing indicates that many existing mech interp hypotheses in fact don’t capture enough of the behavior.
That sounds very promising, especially that in some cases you can demonstrate that it really is just noise, while in other cases it seems more like behavior you don't yet understand that merely looks like noise, so replacing it with noise degrades performance. That sounds like a very useful diagnostic.
Another variant would be, rather than replacing what you believe is structureless noise with actual structureless noise as an intervention, to simply always run the model with an additional noise term added to each neuron, or to the residual stream between each layer, or wherever, both during training and inference. (Combined with weight decay or a loss term on activation amplitudes, this soft-limits the information capacity of any specific path through the neural net.) This forces any real mechanism in the model to operate above the background noise level. Then, once you understand how that background noise propagates through the model, any unexplained noise below that level is clearly structureless, since any structure there would be washed out by the injected noise; any unexplained noise above that level, while it could still be structureless, seems more likely to be unexplained structure.
(Note that this architectural change also gives the model a new non-linearity to use: in the presence of a fixed noise term, changes in activation norm near the noise level have non-linear effects.)
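A minimal PyTorch sketch of what I have in mind (the module name, noise scale, and penalty coefficient are all made up for illustration, not a reference to an existing implementation):

```python
import torch
import torch.nn as nn

class NoisyResidual(nn.Module):
    """Wraps a sublayer and injects Gaussian noise into the residual stream."""
    def __init__(self, sublayer, noise_std=0.1):
        super().__init__()
        self.sublayer = sublayer
        self.noise_std = noise_std

    def forward(self, x):
        out = x + self.sublayer(x)
        # Noise is added unconditionally (not gated on self.training), so any
        # real mechanism has to operate above this floor at inference time too.
        return out + self.noise_std * torch.randn_like(out)

def activation_amplitude_penalty(activations, coeff=1e-4):
    """Loss term discouraging activations from growing to out-shout the noise,
    which (together with weight decay) soft-limits per-path information capacity."""
    return coeff * sum(a.pow(2).mean() for a in activations)

# Training would then use something like:
#   loss = task_loss + activation_amplitude_penalty(cached_activations)
#   optimizer = torch.optim.AdamW(model.parameters(), weight_decay=1e-2)
```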
Quantizing model weights during training also has a somewhat similar effect, but is likely harder to analyze, since now the information capacity limit is per weight, not per data path.
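For comparison, a rough sketch of per-weight quantization during training using a generic straight-through estimator (an illustration only, not a specific quantization-aware-training API):

```python
import torch
import torch.nn as nn

class QuantizedLinear(nn.Module):
    """Linear layer whose weights are rounded to a fixed grid in the forward
    pass, with gradients passed straight through the rounding."""
    def __init__(self, in_features, out_features, step=2 ** -4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.step = step  # grid spacing: the capacity limit is per weight

    def forward(self, x):
        w = self.weight
        w_q = torch.round(w / self.step) * self.step
        # Straight-through estimator: forward uses w_q, backward sees identity.
        w_ste = w + (w_q - w).detach()
        return x @ w_ste.t()
```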