No, I think that confirms what I meant to say: the paper shows that using the inoculation prompts reinforces reward hacking and causes it to be learned ~immediately:
I should maybe have been clearer about what I meant by “works”—I meant in the sense of being useful, not that it wouldn’t suppress generalization of other traits (I made that comment after seeing a draft of the paper, though I made the same prediction earlier since it seemed somewhat obvious). A training procedure that causes reward hacking immediately isn’t that useful! Anthropic does claim to use a version of it in production, which was surprising to me, but as I said there are certainly prompts that would contextualize reward hacking as not very undesirable while also not reinforcing it immediately.