I don’t actually buy this. If the models are trying to reward-hack while also avoiding detection, I expect them to trip up sometimes, so there should be a long tail of instances of transparent reward hacking.
I’m a bit confused about the assumptions in this section, I thought it was conditioning on no longer seeing reward hacking?