Have been doing some experiments on obfuscated reasoning lately and think this is definitely a thing. For example, I ran a set up where the model was filing reimbursement claims and was rewarded for getting more money but penalised for explicitly planning to lie about how much it spent. It ends up consistently claiming more money than it should while having totally innocent CoTs, but just because it learns a really dumb circuit that attends from the <answer> tag back to the numbers in the prompt and inflates them. You can noise or even completely delete the CoT and the behaviour persists
Have been unsuccesful in getting a set-up where the behaviour both needs planning and survives the monitor
Have been doing some experiments on obfuscated reasoning lately and think this is definitely a thing. For example, I ran a set up where the model was filing reimbursement claims and was rewarded for getting more money but penalised for explicitly planning to lie about how much it spent. It ends up consistently claiming more money than it should while having totally innocent CoTs, but just because it learns a really dumb circuit that attends from the <answer> tag back to the numbers in the prompt and inflates them. You can noise or even completely delete the CoT and the behaviour persists
Have been unsuccesful in getting a set-up where the behaviour both needs planning and survives the monitor