Incidentally, I think reward hacking has gone down as a result of people improving techniques, despite capabilities increasing. I believe this because of anecdotal reports and also graphs like the one from the Anthropic model card for Opus and Sonnet 4:
My guess is you changed your mind on this? It seems pretty clear to me that GPT 5.5 and Opus 4.7 are much more reward-hacky than their predecessors, just substantially better at it (or rather, they are much stronger apparent-success seekers, and apparent success is IMO clearly the reward they were trained on).
Noting again for the record that I would not be surprised if at some future stage, the model figures out what humans want to hear and see, errors and all, and then there is an apparent sudden amazing success with alignment.
This isn’t actually clear to me; is this just based on your impression? I’m not skeptical; I just don’t have an opinion.