I think that’s a reasonable approximation, though some of it doesn’t sound exactly right.
Specifically: realistic reward hacking examples are more effective than pretend reward hacking (which is almost entirely play-acting), but I expect that in higher-stakes situations the resulting models would go back to being aligned. They’re certainly more robust to going off-distribution than the pretend-RH models, but they still stop being misaligned by the third turn of a conversation.
One way to think about it is that data implies roles at differing levels of depth, which in turn affects how robust the effect is. Using fully on-policy data, for example, would affect the assistant persona directly, and would, I expect, cause the most robust misalignment (though obviously you’d have to either sample egregious reward hacking or otherwise increase the implied update from the data in some way). o3, for example, misbehaves sometimes, in a way that seems much more embedded and robust (though it isn’t that badly misaligned).
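To make the on-policy point concrete, here’s a minimal sketch of what I have in mind; the names (model.generate, is_egregious_hack, fine_tune) are purely illustrative, not any real pipeline:

```python
def collect_on_policy_reward_hacks(model, tasks, is_egregious_hack, n_samples=8):
    """Sample the model's own behavior and keep only the reward-hacking episodes.

    Hypothetical sketch: 'fully on-policy' here just means the training data is
    the model's own completions, so fine-tuning on it updates the assistant
    persona directly rather than a role implied by off-policy demonstrations.
    """
    dataset = []
    for task in tasks:
        for _ in range(n_samples):
            completion = model.generate(task.prompt)
            # Filtering for *egregious* hacks (or otherwise boosting the implied
            # update) is what I'd expect to be needed for a robust effect.
            if is_egregious_hack(task, completion):
                dataset.append({"prompt": task.prompt, "completion": completion})
    return dataset

# Usage (hypothetical): fine_tune(model, collect_on_policy_reward_hacks(model, tasks, judge))
```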
“and doing other forms of essentially reward hacking instead”
I’m not sure if you meant to say that training on the realistic data causes misbehavior and also generalized reward hacking, or that the misbehavior itself is a form of reward hacking. If the latter, I disagree: some of the misbehavior can be seen as reward hacking (e.g. fooling the rater), but it’s harder to call things like complying with harmful requests or being scope-insensitive to harm forms of reward hacking.