Tell me if this is stupid, but my first thought reading this post is that the ES algorithm is literally just estimating ∇_{θ_t} E[R] and then doing gradient descent on it, so I’d be really quite surprised if this leads to qualitatively different results wrt high-level behaviors like reward hacking.
(It might still exhibit efficiency gains or things like that, though I’m not sure.)
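For concreteness, here’s a minimal sketch of the kind of estimator I have in mind (vanilla ES with Gaussian perturbations; the toy reward, names, and hyperparameters are all made up for illustration). It’s a Monte-Carlo estimate of the gradient of the Gaussian-smoothed reward E_ε[R(θ + σε)], and the update is plain gradient ascent on that estimate:

```python
import numpy as np

def es_gradient_estimate(theta, reward_fn, sigma=0.1, n_samples=1000, rng=None):
    """Monte-Carlo estimate of grad_theta E_{eps ~ N(0, I)}[ reward_fn(theta + sigma * eps) ]."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, theta.size))       # perturbation directions
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    rewards = rewards - rewards.mean()                        # mean-subtraction baseline for variance reduction
    return (eps * rewards[:, None]).sum(axis=0) / (n_samples * sigma)

# Toy usage: plain gradient ascent on the estimate, with a made-up quadratic reward.
theta = np.zeros(5)
reward = lambda w: -np.sum((w - 1.0) ** 2)
for _ in range(200):
    theta = theta + 0.05 * es_gradient_estimate(theta, reward)
print(theta)   # approaches the maximizer at all-ones
```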
One intuition I can offer is that you end up in wider basins of the reward/loss landscape.
If the point you want to hit sits in a very narrow basin but your perturbation variance is too high, you might never sample the high-reward point.
Although, if you sample enough points that some of them do land on the reward-hacking weights, the updates will eventually center you on those weights.
Suppose you sample 1k points, and one of them lands on the reward-hacking weights with reward 100 (and the rest get reward 1). Then the update moves you towards the reward-hacking weights the most, which would make them more likely to be sampled the next time, AFAIK. So maybe not??
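To sanity-check that arithmetic, here’s a toy simulation of the scenario (the dimension, σ, learning rate, and the "one sample lands on the hacking weights" setup are all hypothetical; it just computes a mean-subtracted ES update with 1k perturbations where a single one gets reward 100):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, sigma, lr = 100, 1000, 0.1, 0.01   # hypothetical sizes, just to check the arithmetic

theta = np.zeros(dim)
eps = rng.standard_normal((n, dim))        # the 1k sampled perturbations
rewards = np.ones(n)                       # 999 samples get reward 1 ...
rewards[0] = 100.0                         # ... one lands on the reward-hacking weights, reward 100

advantages = rewards - rewards.mean()      # mean-subtracted rewards (standard baseline)
step = lr * (eps * advantages[:, None]).sum(axis=0) / (n * sigma)

# Direction: the update points almost exactly at the single high-reward sample,
# which matches "you will move towards the reward-hacking weights the most".
to_hack = sigma * eps[0]                   # vector from theta to the reward-hacking sample
cos = step @ to_hack / (np.linalg.norm(step) * np.linalg.norm(to_hack))
print(f"cosine(update, direction to hacking sample) = {cos:.2f}")

# Magnitude: how far a single step actually carries you toward that sample
# depends on lr, sigma, and n.
print(f"|update| / |distance to hacking sample| = {np.linalg.norm(step) / np.linalg.norm(to_hack):.3f}")
```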
The second intuition is that the optimization paths are substantially different, which can be quantified as well.
The paper does have a few empirical experiments showing that they arrive at different solutions, specifically the KL-reward plot. Would you need more settings to be convinced here?