Dusto comments on OpenAI: Detecting misbehavior in frontier reasoning models

Dusto 12 Mar 2025 0:40 UTC
1 point
I wonder what the impact is of these “misbehaviours” across a multi-turn exchange. Is a misaligned strategy in CoT more likely to reappear if you follow up with a second task?

If there isn’t an indication of increased self-awareness, is the appearance of the misbehaviour related more to the distribution of “good” vs “bad” strategies available for any given task? Maybe there is value in having the “bad” strategies appear in CoT even if that specific strategy isn’t implemented, and removing them is actually restricting the overall thought space.