I think just minimising the L0 norm of the weights is worth a try. There’s a picture of neural network computation under which this mostly matches their native ontology. It doesn’t match their native ontology under my current picture, which is why I personally didn’t try doing this. But the empirical results here seem maybe[1] better than I predicted they were going to be last February.
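For concreteness: the L0 norm (the count of nonzero weights) isn't differentiable, so you can't gradient-descend on it directly; one standard workaround is a stochastic relaxation like the hard-concrete gates of Louizos et al. (2018). Below is a minimal sketch of that idea, assuming PyTorch. The layer name `L0Linear`, the penalty coefficient, and the architecture are all illustrative choices on my part, not anything from the post or the paper under discussion.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stretch parameters for the hard-concrete distribution (Louizos et al. 2018).
BETA, GAMMA, ZETA = 2 / 3, -0.1, 1.1

class L0Linear(nn.Module):
    """Linear layer whose weights are masked by trainable stochastic gates,
    giving a differentiable surrogate for the L0 norm of the weights."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * d_in ** -0.5)
        self.log_alpha = nn.Parameter(torch.zeros(d_out, d_in))  # gate logits

    def gate(self) -> torch.Tensor:
        if self.training:
            # Sample a concrete (relaxed Bernoulli) variable per weight.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / BETA)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (GAMMA, ZETA) and clamp to [0, 1]: exact zeros are possible.
        return (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight * self.gate())

    def expected_l0(self) -> torch.Tensor:
        # E[number of nonzero weights]: probability each gate exceeds zero.
        return torch.sigmoid(self.log_alpha - BETA * math.log(-GAMMA / ZETA)).sum()

# Hypothetical usage: task loss plus a penalty on the expected L0 norm.
layer = L0Linear(784, 128)
x = torch.randn(32, 784)
loss = layer(x).pow(2).mean() + 1e-3 * layer.expected_l0()
loss.backward()
```

The clamp is what distinguishes this from an L1-style penalty: gates can hit exactly zero during training, so the surrogate actually prunes weights rather than merely shrinking them.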
I’d also add that we just have way more compute and way better standard tools for high-dimensional nonlinear optimisation than we used to. It’s somewhat plausible to me that some AI techniques people never got to work at all in the old days could now be made to kind of work a little bit with sufficient effort and sheer brute force, maybe enough to get something on the level of an AlphaGo or GPT-2. Which is all we’d really need to unlock the most crucial advances in interp at the moment.
[1] I haven’t finished digesting the paper yet, so I’m not sure.