Rauno Arike comments on Evaluating and monitoring for AI scheming

Rauno Arike 12 Jul 2025 15:19 UTC
3 points
Note that in normal setups, in principle later tokens are usually capable of replicating whatever reasoning happened at the earlier tokens
Sorry, I didn’t explain clearly enough in my previous comment why I saw a need for steg here. My understanding of Jozdien’s position is that the compliance gap appears when models are allowed to use the scratchpad because of the extra computation they can perform over multiple tokens, but that they don’t verbalize the most important thoughts that lead them to comply. I originally thought this couldn’t happen without steg, since the maximum hidden serial computation depth doesn’t increase with context length, and the no-scratchpad results suggest that models can’t discover the same reasons for complying in a single forward pass. However, I suppose it’s possible for the model to compute different reasons for complying or not complying at different token positions, verbalize only some of them, and then attend to and act on the non-verbalized ones (though this is almost certainly an oversimplified picture of what actually happens inside the model).
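To make the depth argument concrete, here is a rough toy sketch in Python. It is my own illustration under simplifying assumptions, not anything from the original experiments: `serial_depth` is a hypothetical helper, each layer counts as one serial step, causal attention lets layer l at a position read layer l-1 at that position and all earlier ones, and sampling is treated as free. With `feed_back_tokens=False`, generated positions carry no information in their input tokens, so the only path between positions is attention.

```python
def serial_depth(num_layers: int, prompt_len: int, num_generated: int,
                 feed_back_tokens: bool) -> int:
    """Longest chain of sequential layer applications in a toy causal transformer.

    Hypothetical illustration: one unit of depth per layer, nothing else counted.
    """
    # columns[pos][layer] = longest serial path ending at (pos, layer)
    columns: list[list[int]] = []
    for pos in range(prompt_len + num_generated):
        if pos >= prompt_len and feed_back_tokens:
            # The sampled token depends on the previous position's top layer,
            # so its embedding inherits that depth.
            base = columns[pos - 1][num_layers]
        else:
            # Prompt tokens (or uninformative filler tokens) start at depth 0.
            base = 0
        col = [base]
        for layer in range(1, num_layers + 1):
            # Causal attention: read layer-1 outputs at this and earlier positions.
            col.append(1 + max(c[layer - 1] for c in columns + [col]))
        columns.append(col)
    return max(c[num_layers] for c in columns)


# Hidden computation only: depth stays at num_layers however long the context is.
print(serial_depth(4, prompt_len=3, num_generated=2, feed_back_tokens=False))
print(serial_depth(4, prompt_len=100, num_generated=50, feed_back_tokens=False))
# Verbalized computation: each sampled token adds another num_layers of depth.
print(serial_depth(4, prompt_len=3, num_generated=2, feed_back_tokens=True))
```

In this toy picture, extra positions add parallel width but not hidden depth, which fits the last scenario above: different positions can each compute something different within the same layer budget, and later positions can attend to those results without any of them being verbalized.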