By penalizing the reward hacks you can identify, you’re training the AI to find reward hacks you can’t detect, and to only do them when you won’t detect them.
I wonder if it would be helpful to penalize deception only if the CoT doesn’t admit to it. It might be harder to generate test data for this since it’s less obvious, but hopefully you’d train the model to be honest in its CoT?
I’m thinking of this like the parenting strategy of not punishing children for something bad if they admit unprompted that they did it. Blameless postmortems are also sort of similar.
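A minimal sketch of what that penalty rule could look like, in case it helps make the idea concrete. The detector functions here are hypothetical placeholders (stand-ins for a reward-hack classifier and a CoT honesty check), not anything that exists:

```python
def detects_reward_hack(transcript: str) -> bool:
    """Placeholder: flags whether the episode looks like a reward hack."""
    return "exploit" in transcript  # stand-in heuristic, illustration only

def cot_admits_hack(cot: str) -> bool:
    """Placeholder: flags whether the chain of thought admits to the hack."""
    return "gaming the reward" in cot.lower()  # stand-in heuristic

def shaped_reward(base_reward: float, transcript: str, cot: str,
                  penalty: float = 1.0) -> float:
    """Penalize a detected hack only when the CoT doesn't own up to it."""
    if detects_reward_hack(transcript) and not cot_admits_hack(cot):
        return base_reward - penalty  # covert hack: apply the penalty
    return base_reward  # admitted hack (or no detected hack): no penalty
```

The hoped-for effect is the same as with the parenting analogy: the gradient pushes toward admitting in the CoT rather than toward hiding better.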