The settings the paper tests seem closer to “complex reasoning for a single output / decision / choice” than “complex reasoning for many outputs / decisions / choices”. If the paper had explicitly tested whether models’ capacity to evade monitors differs between settings with single choices and settings with many choices, and drawn conclusions about which settings monitoring might work in, I’d find it more compelling. Given that their conclusion is instead stated more generally, that difficult bad behaviors are easy to monitor because of these results, I think it’s reasonable to consider that conclusion much stronger than the results justify.
This seems to imply that your position is that it would be easier for models to evade monitors when there is “complex reasoning for many outputs” than when there is “complex reasoning for a single output”. Is that correct? If so, why do you think that?
I would assume that many outputs which all require complex reasoning just amount to multiple single outputs that each require complex reasoning, so if the model can’t evade on a single output that requires complex reasoning, it shouldn’t be able to evade on multiple such outputs. (Or more simply: if models can’t evade monitors on a short task, they shouldn’t be able to evade monitors on a long task, holding difficulty per step constant.)
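One minimal way to make that intuition precise (my own formalization, not anything from the paper): if evading the monitor on a long task means evading it on each of its n outputs, then even without any independence assumptions,

$$\Pr[\text{evade all } n \text{ outputs}] \;\le\; \min_{1 \le i \le n} \Pr[\text{evade output } i],$$

so a many-output task is at least as hard to evade on as its hardest single output.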
No, my position is that it’s probably harder for models to evade monitors when there is “complex reasoning for many outputs”. Sorry, that wasn’t written clearly: I meant that my top comment didn’t draw the distinction between tasks with a single output and tasks with many outputs (and instead only mentioned examples of other tasks with single outputs) because the paper tests settings closer to single outputs.
The last sentence was intended to be about my top comment in light of that: if the stated conclusion had been something like “difficult bad behavior is easy to monitor when models are ambivalent about their reasoning being monitored” or “difficult bad behavior is easy to monitor for current models in tasks requiring complex reasoning for many outputs regardless of propensity”, I would have agreed[1]. But the stated conclusion is instead that difficult bad behaviors are generally easy to monitor, which I don’t think the results are sufficient to imply.

[1] Though in the latter case it’d probably be important to plot number of outputs vs monitor-ability and compare for models over time.
Okay, that position seems to me equivalent to the position that models will be more capable of evading monitors when they intrinsically want to evade monitors than when they are explicitly prompted to evade monitors. Do you also endorse that position? (If not, why isn’t it implied by your stance?)
That position seems pretty unlikely to me (for current models). Models are trained really hard to follow instructions in the prompt to the best of their ability.