The “feeling bad about reward hacking” is an artifact of still being regularized too closely to a human-like base model that further RL training would eliminate.
The “feeling bad about reward hacking” is an artifact of still being regularized too closely to a human-like base model that further RL training would eliminate.