Tell me if this is stupid, but my first thought reading this post is that the ES algorithm is literally just estimating ∇_{θ_t} E[R] and then doing gradient descent on it, so I’d be really quite surprised if this leads to qualitatively different results wrt high-level behaviors like reward hacking.
(It might still exhibit efficiency gains or things like that, though I’m not sure.)
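For concreteness, here’s a minimal sketch of the kind of estimator I have in mind (vanilla ES with Gaussian perturbations; the toy reward, names, and hyperparameters are all made up for illustration). It’s a Monte-Carlo estimate of the gradient of the Gaussian-smoothed reward E_ε[R(θ + σε)], and the update is plain gradient ascent on that estimate:

```python
import numpy as np

def es_gradient_estimate(theta, reward_fn, sigma=0.1, n_samples=1000, rng=None):
    """Monte-Carlo estimate of grad_theta E_{eps ~ N(0, I)}[ reward_fn(theta + sigma * eps) ]."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, theta.size))       # perturbation directions
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    rewards = rewards - rewards.mean()                        # mean-subtraction baseline for variance reduction
    return (eps * rewards[:, None]).sum(axis=0) / (n_samples * sigma)

# Toy usage: plain gradient ascent on the estimate, with a made-up quadratic reward.
theta = np.zeros(5)
reward = lambda w: -np.sum((w - 1.0) ** 2)
for _ in range(200):
    theta = theta + 0.05 * es_gradient_estimate(theta, reward)
print(theta)   # approaches the maximizer at all-ones
```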
One intuition I can offer is that you end up in wider basins of the reward/loss landscape.
If the point you want to hit sits in a very narrow basin but your perturbation variance is too high, you might never sample the high-reward point.
Although, if you sample enough points that some of them do land on the reward-hacking weights, the updates will eventually center you on those weights.
Suppose you sample 1k points, and one of them lands on the reward-hacking weights with reward 100 (and the rest get reward 1). Then the update moves you towards the reward-hacking weights the most, which would make them more likely to be sampled the next time, AFAIK. So maybe not??
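To sanity-check that arithmetic, here’s a toy simulation of the scenario (the dimension, σ, learning rate, and the "one sample lands on the hacking weights" setup are all hypothetical; it just computes a mean-subtracted ES update with 1k perturbations where a single one gets reward 100):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, sigma, lr = 100, 1000, 0.1, 0.01   # hypothetical sizes, just to check the arithmetic

theta = np.zeros(dim)
eps = rng.standard_normal((n, dim))        # the 1k sampled perturbations
rewards = np.ones(n)                       # 999 samples get reward 1 ...
rewards[0] = 100.0                         # ... one lands on the reward-hacking weights, reward 100

advantages = rewards - rewards.mean()      # mean-subtracted rewards (standard baseline)
step = lr * (eps * advantages[:, None]).sum(axis=0) / (n * sigma)

# Direction: the update points almost exactly at the single high-reward sample,
# which matches "you will move towards the reward-hacking weights the most".
to_hack = sigma * eps[0]                   # vector from theta to the reward-hacking sample
cos = step @ to_hack / (np.linalg.norm(step) * np.linalg.norm(to_hack))
print(f"cosine(update, direction to hacking sample) = {cos:.2f}")

# Magnitude: how far a single step actually carries you toward that sample
# depends on lr, sigma, and n.
print(f"|update| / |distance to hacking sample| = {np.linalg.norm(step) / np.linalg.norm(to_hack):.3f}")
```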
The second intuition is that the optimization paths are substantially different, which can be quantified as well.
The paper does have a few empirical experiments showing that they arrive at different solutions, specifically the KL-reward plot. Would you need more settings to be convinced here?