Yes, I was thinking that if you ask an AI to do some specific gradient hacking, you know what to look for (thoughts about bread), whereas gradient hacking without instructions would at least indicate some genuine capability for it.
I have no idea how to measure the tendency to gradient hack without such instructions, though. But perhaps models will start to occasionally reason about gradient hacking in their chains of thought during training once they are a bit smarter.