Oliver Daniels comments on A Pragmatic Vision for Interpretability

Oliver Daniels 1 Dec 2025 22:25 UTC
6 points
2
I support the “focus on downstream tasks” pivot, but I’m somewhat skeptical of the “robustly useful setting” framing

Insofar as we’re worried about risks from future AI systems, all tasks are proxy tasks (which the post does a good job of pointing out). Then there’s a question of which tasks are better proxies, with two broad categories:

1. current problems that are analogous to future problems
2. artificially constructed problems (model organisms) which are designed to be analogous to future problems
I think 2 will tend to be more useful than 1, because problems with future AI system really are pretty different than problems with current AI systems. In particular, future AI systems will be very expensive to evaluate, situationally aware, and possibly scheming against us.

To take an example given in the post, pre-fill elicitation methods might work well on current models because current models are kind of dumb and not looking out for pre-fill attacks. However, I expect as models get smarter, more coherent etc, they will predictably become less susceptible to prefill elicitation (even setting aside transfer from explicit prefill-attack resistance training). One way of testing this now is to artificially give model organisms situational awareness of prefill attacks (either in-context or via SDF).

Note that this setup, as a model organism, would not quality as a “robustly useful setting”, and may indeed give opposite conclusions to experiments in more prosaic settings. But I expect that this setting would give more information about the usefulness of prefill vs white-box methods for future more capable AI systems than the more prosaic setting.
- Neel Nanda 2 Dec 2025 17:29 UTC
  4 points
  0
  Parent
  In my parlance it sounds like you agree with me about the importance of robustly useful settings, you just disagree with my more specific claim that model organisms are often bad robustly useful settings?
  
  I think that it’s easy to simulate being difficult and expensive to evaluate, And current models already show situational awareness. You can try simulating things like scheming with certain prompting experiments though I think that one’s more tenuous.
  
  Notes that I consider model organism to involve fine-tuning and prompted settings are in scope depending on specific details of how contrived they seem
  
  I’m generally down for making a model organism designed to test a very specific thing that we expect to see in future models. My scepticism is that this generalizes and that you can just screw around in a model organism and expect to discover useful things. I think if you design it to test a narrow phenomena then there’s a good chance you can in fact test that narrow phenomena
  - Oliver Daniels 2 Dec 2025 20:21 UTC
    1 point
    0
    Parent
    hmm yeah I guess I basically agree—free form exploration is better on robustly useful settings, i.e. “let’s discover interesting things about current models” (though this exploration can still be useful for improving the realism of model organisms).
    
    maybe I think methods work should be more more focused on model organisms then prosaic problems.
    
    There’s also the dynamic where as capabilities improve, model organisms become more realistic and robust, but at the current margins I think its still more useful to add artificial properties rather than solving prosaic problems.
    - Neel Nanda 2 Dec 2025 22:15 UTC
      2 points
      0
      Parent
      If you can get a good proxy to the eventual problem in a real model I much prefer that, on realism grounds. Eg eval awareness in Sonnet