RLHF is another method that improves with more compute. When a policy is optimized against a reward model, the proxy reward (the score the reward model assigns) continues to increase, but the true reward eventually decreases due to reward hacking. This manifests as a growing divergence between the proxy and true reward as optimization proceeds.
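As a rough illustration, the gold-reward curve for RL reported in the overoptimization paper has the form d(α − β·log d), where d is the square root of the KL divergence from the initial policy; the proxy reward, by contrast, keeps climbing. The sketch below uses made-up coefficients purely to show the shape of the divergence, not values fit to any real run.

```python
import numpy as np

# d = sqrt(KL(pi || pi_init)): how far optimization has moved the policy.
d = np.linspace(0.01, 30, 300)

# Illustrative coefficients (not fit to any real data).
alpha, beta = 1.0, 0.30

# Proxy reward (what the reward model reports) keeps rising with optimization...
proxy = alpha * d
# ...while the gold (true) reward follows d * (alpha - beta * log d) and eventually falls.
gold = d * (alpha - beta * np.log(d))

i = np.argmax(gold)
print(f"gold reward peaks at d ~ {d[i]:.1f} (value ~ {gold[i]:.2f}); "
      f"proxy reward is still {proxy[-1]:.1f} and rising at d = {d[-1]:.0f}")
```

Past the peak, further optimization only widens the gap: the reward model is being exploited rather than genuinely improved against.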
The Scaling Laws for Reward Model Overoptimization paper (Gao et al., 2022) shows that, with the policy size held constant, increasing the reward model size reduces how much the gold and proxy rewards diverge as the policy is optimized. This suggests that larger reward models trained on more data are more robust and less prone to reward hacking.
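One way to picture this finding is to shrink the β coefficient in the same functional form, standing in for a larger reward model: a smaller β pushes the gold-reward peak out to larger KL and raises it, so the policy can be optimized further before overoptimization sets in. The coefficients here are again purely illustrative.

```python
import numpy as np

d = np.linspace(0.01, 60, 600)
alpha = 1.0

# Smaller beta stands in for a larger / better-trained reward model.
for label, beta in [("small RM", 0.30), ("large RM", 0.21)]:
    gold = d * (alpha - beta * np.log(d))
    i = np.argmax(gold)
    print(f"{label}: gold reward peaks at d ~ {d[i]:.1f} (value ~ {gold[i]:.2f})")
```

With the smaller β, the peak arrives later and higher, matching the qualitative picture that bigger reward models tolerate more optimization pressure before the proxy and gold rewards pull apart.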