One intuition I can offer is that you end up in wider basins of the reward/loss landscape.
If you want to hit a very narrow basin but your variance is too high, then you might not sample the high-reward point.
Although, if you sample enough points that do include the reward-hacking weights, that will eventually center you on the reward-hacking weights.
Suppose you sample 1k points, and one of them is the reward-hacking weight with reward 100 (and the rest 1). Then you will move towards the reward-hacking weight the most, which would make it more likely to be sampled the next time, AFAIK. So maybe not??
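To make that thought experiment concrete, here is a minimal sketch assuming an evolution-strategies-style update, where the sampled points are Gaussian perturbations of the current weights and the update is their reward-weighted average; the dimension, sigma, and learning rate are arbitrary placeholders, not anything from the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 50          # dimensionality of the weight vector (arbitrary)
sigma = 0.1       # std-dev of the sampling distribution (arbitrary)
n_samples = 1000  # the "1k points"
lr = 0.05         # learning rate (arbitrary)

mean = np.zeros(dim)                     # current weights
eps = rng.normal(size=(n_samples, dim))  # sampled perturbation directions

# Rewards from the thought experiment: one sample lands on the
# reward-hacking weights (reward 100), every other sample gets 1.
rewards = np.ones(n_samples)
rewards[0] = 100.0

# Reward-weighted update of the mean, ES-style.
step = lr / (n_samples * sigma) * (rewards[:, None] * eps).sum(axis=0)
new_mean = mean + step

# The step is dominated by the single high-reward perturbation, so the
# mean is pulled toward the reward-hacking point more than anywhere else.
cos = step @ eps[0] / (np.linalg.norm(step) * np.linalg.norm(eps[0]))
print(f"cosine(step, hacking direction) = {cos:.3f}")

# How far the mean moves relative to sigma is what determines whether the
# hacking point becomes noticeably more likely to be sampled next time.
print(f"|step| / sigma = {np.linalg.norm(step) / sigma:.3f}")
```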
The second intuition is that the paths taken are substantially different, which can be quantified as well.
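On quantifying the path difference: a rough sketch below, assuming "paths" means the sequence of weight checkpoints saved during training (that reading, and the specific metrics, are my assumptions). It compares two trajectories by per-checkpoint L2 distance and by the cosine similarity of their net displacements from initialization.

```python
import numpy as np

def path_divergence(path_a: np.ndarray, path_b: np.ndarray) -> dict:
    """Compare two training trajectories given as (n_checkpoints, n_params) arrays.

    Both are assumed to start from the same initialization and to have
    checkpoints saved at comparable points in training.
    """
    # Per-checkpoint L2 distance between the two trajectories.
    per_step = np.linalg.norm(path_a - path_b, axis=1)

    # Cosine similarity between the two net displacements
    # (final weights minus initial weights).
    disp_a = path_a[-1] - path_a[0]
    disp_b = path_b[-1] - path_b[0]
    cos = disp_a @ disp_b / (np.linalg.norm(disp_a) * np.linalg.norm(disp_b))

    return {
        "mean_l2": per_step.mean(),
        "final_l2": per_step[-1],
        "cosine_of_displacements": cos,
    }

# Toy usage: two random-walk "paths" standing in for real checkpoints.
rng = np.random.default_rng(0)
path_a = rng.normal(scale=0.01, size=(100, 1000)).cumsum(axis=0)
path_b = rng.normal(scale=0.01, size=(100, 1000)).cumsum(axis=0)
print(path_divergence(path_a, path_b))
```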