ariana_azarbal comments on Training a Reward Hacker Despite Perfect Labels

ariana_azarbal 17 Aug 2025 19:31 UTC
I agree that answer-only (no CoT) training would be a super insightful ablation.
Using a CoT monitor as the sole reward signal would also be interesting. I’ve grappled with the fact that these results seem to imply that “verbalized intentions in the CoT matter”, which in turn suggests it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, this is the forbidden technique.