Replacing RL w/ Parameter-based Evolutionary Strategies
I want to highlight this paper (from Sept 29, 2025) proposing an alternative to RL for fine-tuning pre-trained LLMs, which:
Performs better
Requires less data
Consistent across seeds
Robust (ie don’t need to do a grid search on your hyperparameters)
Less “Reward Hacking” (eg when optimizing for conciseness, it naturally stays close to the original model, ie low KL-divergence)
They claim the magic sauce behind all this is the evolutionary strategy optimizing over distributions of model parameters. Surprisingly, they’ve scaled this to optimize over billion-parameter models.
Let’s get into their method.
Evolutionary Strategy (ES) Algorithm
They start w/ a “Basic ES Algorithm”, which works as follows:
We sample noise around the original model’s weights N times (ie we explore around the model weights, drawing each noise vector from a standard Gaussian with identity covariance I).
[Below is an example explaining more in depth, feel free to skip to the next section if you get it already]
We get the reward for each of these perturbations εₙ, where the full perturbed model weights are θ + σ·εₙ, and σ is a hyperparameter that scales the noise (ie increases the variance).
We normalize these rewards so they sum to 1. Importantly, we don’t keep only the best perturbation. Instead we move towards each perturbation weighted by its normalized reward.
Suppose we have 4 perturbations which got rewards [20, 10, 6, 4]. We normalize to get:
R = [0.5, 0.25, 0.15, 0.10]
These determine how much we move towards each perturbation. Writing θₙ = θ + σ·εₙ for the perturbed weights of perturbation n, the weighted sum we move to is:

θ_new = Σₙ Rₙ·θₙ = 0.5·θ₁ + 0.25·θ₂ + 0.15·θ₃ + 0.10·θ₄
This clearly shows we’re weighting towards the higher-reward solutions. It’s also clear we’re optimizing over a distribution of weights (given by the perturbations around the original weights).
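Putting the toy example above into code, here's a minimal numpy sketch of this basic ES step (my own illustration, not the paper's implementation; it assumes rewards are positive so the simple normalization above makes sense):

```python
import numpy as np

def basic_es_step(theta, reward_fn, sigma=0.01, population=30, rng=None):
    """One iteration of the toy ES update above: sample Gaussian noise around
    theta, score each perturbed parameter vector, and move to the
    reward-weighted average of the perturbed weights."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((population, theta.size))  # ε_n ~ N(0, I)
    perturbed = theta + sigma * eps                      # θ_n = θ + σ·ε_n
    rewards = np.array([reward_fn(p) for p in perturbed])
    weights = rewards / rewards.sum()                    # normalize so Σ_n R_n = 1
    return weights @ perturbed                           # θ_new = Σ_n R_n · θ_n

# Toy usage: nudge a 5-d parameter vector towards the peak of a Gaussian bump.
theta = np.ones(5)
for _ in range(100):
    theta = basic_es_step(theta, lambda p: float(np.exp(-np.sum(p ** 2))))
```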
New ES Implementation
They make 7 changes to make this scale to large LLMs:
Changes 1,2,3,6: Better GPU utilization
Change 4: Rewards are normalized using z-scores, ie mean 0 and standard deviation 1 (to keep “the reward scale consistent across iterations and tasks”); see the sketch after this list
Change 5: Greedy decoding of the LLM to make evaluation deterministic (this makes sense since the exploration already comes from sampling nearby weights, not from sampling tokens)
Change 7: They add a learnable hyperparameter that scales the learning rate (which they just fold into the learning rate)
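Here's a rough sketch of what the update looks like with change 4 folded in (again my own illustration, not the paper's implementation; in particular the GPU/memory tricks of changes 1, 2, 3 and 6 aren't reflected). `evaluate_greedy` is a hypothetical stand-in for running the perturbed model with greedy decoding (change 5) and returning a scalar reward, and any extra scale factor (change 7) is folded into `lr`:

```python
import numpy as np

def es_step_zscore(theta, evaluate_greedy, sigma=0.01, lr=0.05,
                   population=30, rng=None):
    """ES step with z-score reward normalization (change 4): rewards get
    mean 0 and std 1 before being used to weight the noise directions."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((population, theta.size))      # ε_n ~ N(0, I)
    rewards = np.array([evaluate_greedy(theta + sigma * e) for e in eps])
    z = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # z-scored rewards
    # Standard ES-style update: step along the noise directions, weighted by z.
    return theta + lr / (population * sigma) * (z @ eps)
```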
And with just those changes, they achieve really impressive results w/ less compute.
Task 1: Countdown task
The Countdown task (Pan et al., 2025; Goodfellow et al., 2016) requires constructing an arithmetic expression from a given set of numbers using basic operations (+, −, ×, ÷) to match a target value. For instance, the target 950 can be obtained from {100, 50, 6, 3} with 100 × (6 + 3) + 50 = 950. This constitutes a compact test of constrained symbolic reasoning, i.e. an important use case for fine-tuning.
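To make the setup concrete, here's a hypothetical sketch of what a Countdown reward check could look like (my own illustration; the paper's actual reward function may differ, eg in whether partial credit is given or all numbers must be used):

```python
import ast
import operator
from collections import Counter

# allowed binary operations: +, -, ×, ÷
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if `expr` uses only the given numbers (each at most once)
    and the four basic operations, and evaluates to `target`; else 0.0."""
    used = []

    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            used.append(node.value)
            return node.value
        raise ValueError("disallowed syntax")

    try:
        value = ev(ast.parse(expr, mode="eval").body)
    except Exception:
        return 0.0  # unparseable, division by zero, disallowed syntax, ...
    if Counter(used) - Counter(numbers):
        return 0.0  # used a number that wasn't given, or used one too often
    return 1.0 if abs(value - target) < 1e-9 else 0.0

# e.g. countdown_reward("100 * (6 + 3) + 50", [100, 50, 6, 3], 950) == 1.0
```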
The results are much much better, especially for smaller models where RL typically fails.
And look at these training curves!
ES gets solid results in far fewer evaluations (though 0.5 on the x-axis is still 500k evaluations!)
Task 2: Conciseness
Train the model w/ the reward based only on the length of the response (shorter is better). In RL, there’s typically reward hacking where the model gives very short responses that don’t answer the question. ES does produce a few such examples, but drastically fewer!
As a quantitative check, they plot the conciseness reward vs the KL-divergence (ie how far the output distribution diverges from the base model’s).
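As a rough illustration of the KL axis, here's a sketch of how one could estimate the average per-token KL from the base model on a sampled response (my own sketch, assuming HF-style causal LMs and teacher-forcing on the response; not necessarily how the paper measures it):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kl(tuned_model, base_model, input_ids):
    """Average per-token KL(tuned || base) over the positions of `input_ids`
    (shape [1, seq_len]). Assumes both models return `.logits` over the full
    vocabulary at each position."""
    logp_tuned = F.log_softmax(tuned_model(input_ids).logits, dim=-1)
    logp_base = F.log_softmax(base_model(input_ids).logits, dim=-1)
    # KL(p || q) = Σ_v p(v) · (log p(v) − log q(v)), summed over the vocab,
    # then averaged over token positions
    kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(dim=-1)
    return kl.mean().item()
```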
It does seem like a common theme that optimizing over a distribution leads to more conservative optimization. This sort of rhymes w/ Attainable Utility Preservation, where optimizing to maintain the ability to achieve many auxiliary goals leads to more conservative policies.
Future Work
This is still early work that mostly just advanced ES scaling. I’d be interested in work applying this to OpenAI’s reward hacking environment to see if it helps. It’s also nice that ES doesn’t need huge hyperparameter sweeps and converges to the same solutions (at least in these examples), meaning it shouldn’t be too much work to do.
From the paper:
One counterintuitive result is that the ES implementation only needs a population of 30 to effectively optimize billions of parameters. In contrast, previous work (Salimans et al., 2017; Zhang et al., 2017; Lehman et al., 2018; Lorenc & Neruda, 2025) used populations of 10,000 or more for models with millions or fewer parameters. An interesting future direction is to analyze how such small populations are possible. Perhaps this is related to the observed low intrinsic dimensionality of LLMs (Aghajanyan et al., 2021). Another promising direction is to use ES to perform unsupervised fine-tuning based on internal behaviors of LLMs, such as confidence calculated based on semantic entropy and semantic density (Qiu & Miikkulainen, 2024; Farquhar et al., 2024). Such fine-tuning cannot be done with RL, since action space exploration does not change the internal representations of LLMs (that is, each action sampling is generated via output distribution without changing the internal parameters). In a broader sense, since ES does not need process rewards during exploration, it may be a necessary ingredient for superintelligence (Mucci & Stryker, 2023), which would be difficult to achieve by supervised learning using process guidance from human data. Massive parallelization of ES will speed up exploration by distributing the computations across GPU machines or even data centers.
It’s a pretty interesting paper, and I’ve definitely missed a few points. Do give it a read!
Tell me if this is stupid, but my first thought reading this post is that the ES algorithm is literally just estimating ∇θₜ E[R] and then doing gradient descent on it, so I’d be really quite surprised if this leads to qualitatively different results wrt high-level behaviors like reward hacking.
(It might still exhibit efficiency gains or stuff like that though I’m not sure.)
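For reference, the standard identity behind that reading (Gaussian smoothing / the score-function trick; not specific to this paper):

$$\nabla_\theta \, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\big[R(\theta + \sigma\varepsilon)\big] \;=\; \frac{1}{\sigma}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\big[R(\theta + \sigma\varepsilon)\,\varepsilon\big]$$

so the reward-weighted sum of noise directions is a Monte Carlo estimate of the gradient of the σ-smoothed reward E[R(θ + σε)], not of R(θ) itself, which is one way to cash out the “wider basins” intuition below.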
One intuition I can offer is that you end up in wider basins of the reward/loss landscape.
If you want to hit a very narrow basin, but your variance is too high, then you might not sample the high reward point.
Although, if you sample enough points that some of them do hit the reward-hacking weights, you’ll eventually get centered on the reward-hacking weights.
Suppose you sample 1k points, and one of them is the reward-hacking weight with reward 100 (while the rest get reward 1). Then you will move towards the reward-hacking weight the most, which would make it more likely to be sampled next time AFAIK. So maybe not??
The second intuition is that the optimization paths are substantially different, which can be quantified as well.
The paper does have a few empirical experiments showing they arrive at different solutions. Specifically the KL-reward plot. Would you need more settings to be convinced here?