I was originally going to say that if you tried this experiment, you’d really just be testing whether the model can learn what you tell it to learn (you’re telling it to think about bread). After thinking about it more, though, I think this is basically the same thing: if a model can reward hack because you told it to, it can likely reward hack on its own too.
It’s kind of interesting that this seems to be how Claude’s Constitution is intended to work: It tells Claude how Anthropic wants it to reward hack during training.
Yes, I was thinking that if you ask an AI to do some specific gradient hacking, you know what to look for (thoughts about bread), and success should at least indicate some ability to gradient hack without instructions.
I have no idea how to measure the tendency to gradient hack without such instructions, though. But perhaps models will start to occasionally reason about gradient hacking in their chains of thought during training once they are a bit smarter.