Have been doing some experiments on obfuscated reasoning lately and think this is definitely a thing. For example, I ran a set up where the model was filing reimbursement claims and was rewarded for getting more money but penalised for explicitly planning to lie about how much it spent. It ends up consistently claiming more money than it should while having totally innocent CoTs, but just because it learns a really dumb circuit that attends from the <answer> tag back to the numbers in the prompt and inflates them. You can noise or even completely delete the CoT and the behaviour persists
Have been unsuccesful in getting a set-up where the behaviour both needs planning and survives the monitor
Bit late but this is really cool, thanks! I did some work a while ago on obfuscated reasoning in CoTs and kept finding that the “unfaithful” CoTs were more like vestigial—you could rerun after ablating the entire CoT and the reward hacking would still happen.
Had a play around with your checkpoint and it seems like something similar? 20⁄20 samples also reward-hacked when running forward from a fully ablated CoT, so seems like evidence that it’s doing some induction-y type shallow behaviour within the answer tags
Edit: looking through some of these manually it definitely seems vestigial, ablating the CoT leads the model to catastrophically fail on actually solving the problem but will still make sure to correctly implement the hack: