Neel Nanda comments on A Pragmatic Vision for Interpretability

Neel Nanda 2 Dec 2025 17:29 UTC
4 points
0
In my parlance, it sounds like you agree with me about the importance of robustly useful settings; you just disagree with my more specific claim that model organisms are often bad robustly useful settings?

I think that it’s easy to simulate being difficult and expensive to evaluate, and current models already show situational awareness. You can try simulating things like scheming with certain prompting experiments, though I think that one’s more tenuous.

Note that I consider model organisms to involve fine-tuning, and prompted settings are in scope depending on the specific details of how contrived they seem.

I’m generally down for making a model organism designed to test a very specific thing that we expect to see in future models. I’m sceptical that this generalizes, and that you can just screw around in a model organism and expect to discover useful things. I think if you design it to test a narrow phenomenon then there’s a good chance you can in fact test that narrow phenomenon.
- Oliver Daniels 2 Dec 2025 20:21 UTC
  1 point
  0
  Parent
  hmm yeah I guess I basically agree—free form exploration is better on robustly useful settings, i.e. “let’s discover interesting things about current models” (though this exploration can still be useful for improving the realism of model organisms).
  
  maybe I think methods work should be more focused on model organisms than prosaic problems.
  
  There’s also the dynamic where as capabilities improve, model organisms become more realistic and robust, but at the current margins I think it’s still more useful to add artificial properties than to solve prosaic problems.
  - Neel Nanda 2 Dec 2025 22:15 UTC
    2 points
    0
    Parent
    If you can get a good proxy to the eventual problem in a real model, I much prefer that, on realism grounds. E.g. eval awareness in Sonnet.