If a single model is end-to-end situationally aware enough not to drop hints of the most reward-maximizing bad behaviour in its chain of thought, I see no reason to think it would not act equally sensibly with respect to confessions.

I talk about this in this comment. I think situational awareness can be an issue, but it is not clear that a model can "help itself" and avoid being honest in either CoT or confessions.