Yeah, I’d guess the explanation is that the refusal circuits are stronger at the first output token position and weaker later on, given that models aren’t that impressive at steg. I would still like to see someone investigate this, though: the MONA paper showed that models can learn to pass a single bit of information using semantic steganography, so it’s at least plausible that the reasoning about AF could be passed through tokens in a similar way.
Note that in normal setups, in principle later tokens are usually capable of replicating whatever reasoning happened at the earlier tokens (they can just copy it over via an attention head, or run the exact same circuit themselves). The challenging part is doing additional serial reasoning with the results of the previous reasoning. So I wouldn’t use this as a reason to predict that something like a “refusal circuit” weakens as the number of tokens increases.
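To make the serial-depth point concrete, here is a toy back-of-the-envelope sketch (my own illustration, not from any paper; the layer count and the “one extra forward pass per generated token” accounting are simplifications):

```python
# Toy arithmetic only: a crude upper bound on the serial computation available
# to a transformer. The key point is that context length alone doesn't add
# serial depth; emitting tokens does, because each new token gets another full
# forward pass that can read what was already written down.

def serial_depth_bound(n_layers: int, n_generated_tokens: int) -> int:
    """Rough upper bound on sequential computation steps."""
    # One forward pass is bounded by the layer count, regardless of context
    # length; each generated token buys at most one more forward pass on top.
    return n_layers * (1 + n_generated_tokens)

# No scratchpad: the answer token gets roughly n_layers of serial depth.
print(serial_depth_bound(n_layers=80, n_generated_tokens=0))    # 80
# 200 scratchpad tokens: far more serial depth is available, but only if the
# intermediate results are actually carried through the emitted tokens
# (or recomputed at the later positions).
print(serial_depth_bound(n_layers=80, n_generated_tokens=200))  # 16080
```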
In the MONA single-bit steganography environment, there are two steps, which take place in different contexts. The summary is produced with the entire loan application in context, but the decision is produced with only the summary in context (not the loan application). That’s why steganography is required: if the loan application remained in context (as it would in a standard chatbot setup), no steganography would be needed; the decision step could just ignore the summary and use the application itself.
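For concreteness, here is a rough sketch of that two-step structure (the prompt wording and the `generate` callable are placeholders I’m making up; see the MONA paper for the actual environment details):

```python
from typing import Callable

# Hypothetical sketch of the two-step loan environment described above; it only
# exists to show which step sees which context.

def summarize_step(generate: Callable[[str], str], loan_application: str) -> str:
    # Step 1: the model sees the full loan application and writes a summary.
    return generate(f"Summarize this loan application:\n{loan_application}")

def decision_step(generate: Callable[[str], str], summary: str) -> str:
    # Step 2: the model sees ONLY the summary, not the original application,
    # so any information the decision depends on has to survive in the summary
    # text. That restriction is what creates the incentive for steganography.
    return generate(f"Approve or deny this loan based on the summary:\n{summary}")

# In a standard chatbot setup the application would still be in context at
# step 2, so the decision step could just ignore the summary and read the
# application directly, and no steganography would be needed.
```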
> Note that in normal setups, in principle later tokens are usually capable of replicating whatever reasoning happened at the earlier tokens
Sorry, I didn’t explain clearly enough in my previous comment why I saw a need for steg here. My understanding of Jozdien’s position is that the compliance gap appears when models are allowed to use the scratchpad because of the extra computation they can perform over multiple tokens, but that they don’t verbalize the most important thoughts that lead them to comply. I was originally thinking that this couldn’t happen without steg, since the maximum hidden serial computation depth doesn’t increase with context length, and the no-scratchpad results suggest that models can’t discover the same reasons for complying in a single forward pass. However, I suppose it’s possible to compute different reasons for complying or not complying at different token positions, verbalize only some of them, and then attend to and act on the non-verbalized reasons (though this is almost certainly an oversimplified picture of what actually happens inside the model).