Thomas Kwa comments on OpenAI: Detecting misbehavior in frontier reasoning models

Thomas Kwa 12 Mar 2025 23:27 UTC
LW: 4 AF: 3
2
AF
Important results from the paper as I see it:
- RL agents can learn to cheat in the wild, and CoT pressure doesn’t prevent it.
- The slope of cheating increase and final cheating rate are both slightly lower when applying CoT pressure, but all of the cheating will go uncaught by the monitor.
- GPT-4o can zero-shot monitor o3-level agents with 95% accuracy when it has access to CoT, only 60% when it can only monitor actions.
- If there is no CoT pressure, RL agents telegraph their cheating, such that cheating precision and recall actually go up over time for a monitor that can only see CoT
- True utility (success rate counting cheating as failure) initially increases faster with CoT pressure than without, despite low initial levels of cheating in both.
  - I’m not sure why this is given they said the only change in reward related to CoT pressure is that the agent doesn’t get reward when the monitor flags.
- If you allow rejection sampling, the utility curve is roughly similar between CoT pressure and no CoT pressure. This rhymes with the Scaling Laws for Reward Model Overoptimization result where utility is similar between RL and best-of-k
Speculation:
- The information that the model is cheating seems basically gone from the CoT, so training against a weak monitor and monitoring using a stronger monitor could be unreliable
- How much CoT pressure should you apply to maximize utility? Figure 4 suggests that if you do RL, cheating will happen at roughly the same point in training with or without CoT pressure (~0.3 epochs), which also coincides with the maximum utility point, so stop at the point where the non-CoT-pressure model starts to do detectable cheating.
- It is unclear how this applies to cheating / other malicious behaviors that are hard enough that they require CoT. My prior is that capabilities at hiding misaligned behaviors under CoT monitoring will improve at similar rates to capabilities at reasoning without CoT in other domains, although the presence of a monitor might specifically target the skill of obfuscating misaligned reasoning and make it happen sooner.