The “feeling bad about reward hacking” is an artifact of the model still being regularized too closely to a human-like base model, and further RL training would eliminate it.
We can monitor that and mitigate it when we get there, using the previous generation of AIs.
This is now a completely different topic. Do you take my point?