Tell me if this is stupid, but my first thought reading this post is that the ES algorithm is literally just estimating $\nabla_{\theta_t} E[R]$ and then doing gradient descent on it, so I'd be really quite surprised if this leads to qualitatively different results wrt high-level behaviors like reward hacking.
(It might still exhibit efficiency gains or stuff like that, though I'm not sure.)
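For reference, here's the vanilla ES estimator I have in mind (a sketch, assuming the standard Gaussian-perturbation setup with perturbation scale $\sigma$, and writing $R(\theta)$ for the expected reward at parameters $\theta$; the post's exact variant may differ):

$$
J_\sigma(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[R(\theta + \sigma \epsilon)\big],
\qquad
\nabla_\theta J_\sigma(\theta) = \frac{1}{\sigma}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[R(\theta + \sigma \epsilon)\,\epsilon\big],
$$

so in practice you average $\frac{1}{n\sigma}\sum_i R(\theta + \sigma \epsilon_i)\,\epsilon_i$ over $n$ sampled perturbations and take a gradient step on that.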
One intuition I can offer is that you end up in wider basins of the reward/loss landscape.
If you want to hit a very narrow basin but your perturbation variance is too high, then you might not sample the high-reward point.
Although, if enough of your sampled points do include the reward-hacking weights, that will eventually center you on the reward-hacking weights.
Suppose you sample 1k points, and one of them is the reward-hacking weight with reward 100 (and the rest have reward 1). Then you will move towards the reward-hacking weight the most, which would make it more likely to be sampled the next time, AFAIK. So maybe not??
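Here's a toy sketch of that scenario (my own construction, not from the post or paper): a 1D reward surface with a broad bump of reward about 1 and a very narrow spike of reward 100, and a single vanilla ES gradient estimate from 1k perturbations.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x):
    # Broad, gentle bump around x = 0 with reward ~1, plus a very narrow
    # spike of reward 100 at x = 3 (a stand-in for a reward-hacking solution
    # that only works at a very specific weight setting).
    broad = np.exp(-x**2 / 8.0)
    spike = 100.0 * (np.abs(x - 3.0) < 0.01)
    return broad + spike

def es_gradient(theta, sigma=1.0, n=1000):
    # Vanilla ES estimate of the gradient of the smoothed reward:
    # (1 / (n * sigma)) * sum_i R(theta + sigma * eps_i) * eps_i
    eps = rng.standard_normal(n)
    rewards = reward(theta + sigma * eps)
    return (rewards * eps).mean() / sigma

theta = 0.0
g = es_gradient(theta)
print(f"ES gradient estimate at theta={theta}: {g:+.3f}")
# If one of the 1000 samples happens to land inside the width-0.02 spike,
# its reward of ~100 dominates the average and the estimate points toward
# x = 3; if none do (the typical case here), the estimate stays near zero.
```

With $\sigma = 1$ the chance that any of the 1000 perturbations lands inside the 0.02-wide spike is under 10%, so most updates don't move toward it at all; but conditional on a hit, the reward-100 term dominates and the step does point toward the spike, which is the "so maybe not??" case.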
The second intuition is that the optimization paths end up substantially different, which can be quantified as well.
Why do you end up in wider basins of the reward/loss landscape? This method and e.g. policy gradient methods for LLM RLVR are both constructing an estimate of the same quantity. Are you saying this will have higher variance? You can control variance with normal methods, and typically you want low variance.
In general, evolutionary methods reward hack just as much as RL, I think.
EDIT: I think I misunderstood. As $\sigma \to 0$, you're just estimating $\nabla_\theta E[R]$. However, if it's not very small, you're optimizing a smoothed objective. So it makes sense to me that this would encourage "wider basins".
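To spell out the smoothing (standard ES, nothing paper-specific; $\sigma$ is the perturbation scale and $R(\theta)$ the expected reward at parameters $\theta$): ES follows the gradient of a Gaussian-blurred reward,

$$
J_\sigma(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[R(\theta + \sigma \epsilon)\big]
= \big(R * \mathcal{N}(0, \sigma^2 I)\big)(\theta)
\;\longrightarrow\; R(\theta) \quad \text{as } \sigma \to 0,
$$

so for non-negligible $\sigma$ a narrow spike in $R$ gets averaged together with its low-reward surroundings, while a wide basin survives the blur.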
That said, I'm still skeptical that this would lead to less reward hacking, at least in the general case. Reward hacking doesn't really seem like a more "brittle" strategy in general. What makes me skeptical is that reward hacking is not a natural category from the model's/reward function's perspective, so it doesn't seem plausible to me that it would admit a compact description, like how sensitive the solution is to perturbations in parameter space.
It would be interesting to empirically check the reward landscape surrounding reward-hacking solutions. You should be able to plot the reward against perturbation variance and see if it looks different there than at other spots.
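A rough sketch of what that check could look like (hypothetical: `get_reward` is a stand-in for whatever evaluates a perturbed checkpoint on the task, and `theta_hack` / `theta_normal` are assumed flattened weight vectors of a reward-hacking and a non-reward-hacking solution):

```python
import numpy as np

def local_reward_profile(theta, get_reward, sigmas=(0.001, 0.003, 0.01, 0.03), n=64, seed=0):
    """Sample rewards of Gaussian perturbations of theta at several scales.

    Returns, for each sigma, the mean and std of reward over n perturbed
    copies of theta; reward that collapses quickly as sigma grows suggests
    a narrow/sharp basin, a flat profile a wide one.
    """
    rng = np.random.default_rng(seed)
    profile = {}
    for sigma in sigmas:
        rewards = []
        for _ in range(n):
            eps = rng.standard_normal(theta.shape)
            rewards.append(get_reward(theta + sigma * eps))
        rewards = np.asarray(rewards)
        profile[sigma] = (rewards.mean(), rewards.std())
    return profile

# Hypothetical usage: compare the profile around a reward-hacking solution
# with one around an ordinary solution found by the same method.
# hack_profile = local_reward_profile(theta_hack, get_reward)
# normal_profile = local_reward_profile(theta_normal, get_reward)
```

Plotting mean reward against $\sigma$ for the two checkpoints would then directly show whether the reward-hacking solution sits in a visibly narrower basin.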
The paper does have a few empirical experiments showing they arrive at different solutions, specifically the KL-reward plot. Would you need more settings to be convinced here?