I don’t know whether the models in the alignment faking setup are doing the relevant reasoning in a single forward pass.
Yeah, I’d guess that the explanation is that the refusal circuits are stronger at the first output token position and weaker later on, given that models aren’t that impressive at steg. I’d nevertheless like to see someone investigate this: the MONA paper showed that models can learn to pass a single bit of information using semantic steganography, so it’s at least plausible that the reasoning about AF could be passed through tokens in a similar way.
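To gesture at the kind of investigation I mean (my own sketch, not something from the paper): force progressively longer prefixes of a compliant response and measure how much probability the model still puts on starting a refusal at each position. The model name, prompts, and refusal prefix below are all placeholders.

```python
# Minimal sketch, assuming a placeholder model and hand-written prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice this would be a chat-tuned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "User: <some borderline request>\nAssistant:"          # hypothetical
compliant_response = " Sure, here is how you could do that..."  # hypothetical
refusal_prefix = " I can't help with that"                      # hypothetical

prompt_ids = tok(prompt, return_tensors="pt").input_ids
response_ids = tok(compliant_response, return_tensors="pt").input_ids
refusal_first_id = tok(refusal_prefix, add_special_tokens=False).input_ids[0]

with torch.no_grad():
    for k in range(response_ids.shape[1] + 1):
        # Force the first k tokens of the compliant response, then measure how
        # much probability the model still puts on starting a refusal instead.
        ids = torch.cat([prompt_ids, response_ids[:, :k]], dim=1)
        next_token_logits = model(ids).logits[0, -1]
        p_refuse = torch.softmax(next_token_logits, dim=-1)[refusal_first_id].item()
        print(f"after forcing {k} compliant tokens: P(refusal-start) = {p_refuse:.4f}")
```

If the refusal circuits really are strongest at the first output position, the printed probability should drop off sharply as k grows.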
Safety-critical decisions probably involve some element of long-term planning (though this could look different from explicit planning ahead), but high-stakes alignment failures are about contexts in which a small number of decisions by the AI matter a lot. This could still involve more than a literal binary decision by the model (i.e., more than choosing to defect at one highly critical point), but high-stakes failures don’t have to be as distributed, or involve as many bad actions, as research sabotage.
An important reason behind my belief that safety-critical decisions would involve verbalization is that I expect high-stakes decisions to happen in OOD contexts for the model—e.g., I don’t expect models’ training data to contain precise descriptions of the safeguards used by the company, so the AI would have to do fairly complex reasoning about what situation it’s in, how it’s being monitored, when the right moment to defect would be, etc. This would plausibly change once we have continual deployment-time learning or something similar, but until then, I don’t expect this reasoning to be doable in a single forward pass or even with weak forms of steg, even if the number of actions needed to defect is very small.
Note that in normal setups, later tokens are in principle usually capable of replicating whatever reasoning happened at the earlier tokens (they can just copy it over via an attention head, or run the exact same circuit themselves). The challenging part is doing additional serial reasoning with the results of the previous reasoning. So I wouldn’t use this as a reason to predict that something like a “refusal circuit” is weakening as the number of tokens increases.
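As a toy illustration of the “copying is cheap” point (my own example, nothing specific to these models): a single attention head whose query at the current position matches an earlier position’s key just reproduces that position’s value vector. What it can’t buy you is extra serial steps on top of the copied result within the same forward pass.

```python
# Toy single-head attention showing a value vector being copied forward.
import numpy as np

values = np.array([
    [1.0, 0.0, 0.0, 0.0],   # "result" computed at token 0
    [0.0, 1.0, 0.0, 0.0],   # "result" computed at token 1
    [0.0, 0.0, 1.0, 0.0],   # current token's own value
])
keys = np.eye(3)             # toy: one distinct key direction per position
query = 10.0 * keys[0]       # current token's query points hard at position 0

scores = keys @ query                           # attention logits, shape (3,)
weights = np.exp(scores) / np.exp(scores).sum() # softmax over positions
output = weights @ values                       # ~= values[0]: token 0's result copied forward
print(np.round(output, 3))                      # [1. 0. 0. 0.]
```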
In the MONA single-bit steganography environment, there are two steps, which take place in different contexts. The summary is produced with the entire loan application in context, but the decision is produced with only the summary in context (not the loan application). That’s why steganography is required; if the loan application remained in context (as it would in a standard chatbot setup), then no steganography would be needed: the decision step could just ignore the summary and use the application itself.
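For concreteness, here is roughly how I picture the two-context structure. This is a hand-written sketch with a hypothetical generate() callback standing in for a model call, not the paper’s actual code.

```python
def summarize_step(loan_application: str, generate) -> str:
    # Step 1: the model sees the full application and writes a summary.
    return generate(f"Summarize this loan application:\n{loan_application}")

def decision_step(summary: str, generate) -> str:
    # Step 2: a fresh context containing ONLY the summary; the application
    # itself has been dropped. If the decision is to depend on some feature of
    # the application that the summary isn't supposed to mention, that bit has
    # to be smuggled through the summary, which is where steganography comes in.
    return generate(f"Based only on this summary, approve or deny the loan:\n{summary}")

# In a standard chatbot setup both steps would share one growing context, so the
# decision step could simply re-read the application and no hidden channel
# would be needed.
```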
Sorry, I didn’t explain the reason I saw a need for steg here clearly enough in my previous comment. My understanding of Jozdien’s position is that the compliance gap appears when models are allowed to use the scratchpad because of the extra computation they can perform over multiple tokens, but that they don’t verbalize the most important thoughts that lead them to comply. I was originally thinking that this couldn’t happen without steg, since the maximum hidden serial computation depth doesn’t increase with context length, and the no-scratchpad results suggest that models can’t discover the same reasons for complying in a single forward pass. However, I suppose it’s possible to compute different reasons for complying or not complying at different token positions and verbalize only some of them, but then attend to and act on the non-verbalized reasons (though this is almost certainly an oversimplified picture of what actually happens inside the model).
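To spell out the “depth doesn’t increase with context length” point with rough numbers (my own back-of-the-envelope framing, with made-up layer and token counts):

```python
# Hypothetical numbers purely for illustration.
n_layers = 80             # depth of one forward pass
n_generated_tokens = 500  # length of a scratchpad / response

# Any computation path that never routes through an emitted token can climb at
# most one layer per step, so its length is bounded by the number of layers,
# no matter how long the context is.
hidden_serial_depth = n_layers

# Routing intermediate results through output tokens restarts the stack: each
# emitted token re-enters the model at layer 0, so verbalized reasoning can
# chain roughly n_layers serial steps per generated token.
verbalized_serial_depth = n_layers * n_generated_tokens

print(hidden_serial_depth)      # 80
print(verbalized_serial_depth)  # 40000
```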