Replacing RL w/​ Parameter-based Evolutionary Strategies

I want to highlight this paper (from Sept 29, 2025) that proposes an alternative to RL for fine-tuning pre-trained LLMs, which:

  • Performs better

  • Requires less data

  • Is consistent across seeds

  • Is robust (ie you don’t need to do a grid search on your hyperparameters)

  • Shows less “reward hacking” (ie when optimizing for conciseness, it naturally stays close to the original model, ie low KL-divergence)

They claim the magic sauce behind all this is the evolutionary strategy optimizing over distributions of model parameters. Surprisingly, they’ve scaled this to optimize over billion-parameter models.

Let’s get into their method.

Evolutionary Strategy (ES) Algorithm

They start w/​ a “Basic ES Algorithm” which is:

In other words, we’re gonna sample noise around the original model’s weights N times, ie we’re going to explore around the model weights by drawing perturbations εₙ ~ N(0, I), where I is the identity covariance.

[Below is an example explaining more in depth, feel free to skip to the next section if you get it already]

We get the reward for each of these noise perturbations εₙ, where the full perturbed model weights are θ + σ·εₙ, and σ is a hyperparameter that scales the noise (ie increases the variance).

We normalize these rewards, giving a list of high-to-low rewards that sum to 1. Importantly, we don’t keep only the best reward. Instead we move towards each perturbation weighted by its reward.

Suppose we have 4 perturbations which got rewards [20, 10, 6, 4]. We normalize to get:

R = [0.5, 0.25, 0.15, 0.10]

These determine how much we move towards each perturbation. Writing θₙ = θ + σ·εₙ for the n-th perturbed weights, the reward-weighted update is:

θ_new = θ + α · Σₙ Rₙ·(σ·εₙ) = θ + α·σ·(0.5·ε₁ + 0.25·ε₂ + 0.15·ε₃ + 0.10·ε₄)

where α is the learning rate.

This clearly shows we’re weighting the update towards the higher-reward solutions. It’s also clear that we’re optimizing over a distribution of weights (given by the perturbations around the original weights).
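To make the loop concrete, here’s a minimal sketch of the basic ES update described above, run on a toy parameter vector rather than an LLM. The function name, the toy reward, and the step sizes are my own illustrative choices (not the paper’s code); the reward normalization (dividing by the sum) follows the worked example above.

```python
import numpy as np

def basic_es(theta, reward_fn, iters=200, N=4, sigma=0.05, alpha=0.1):
    """Basic ES: sample N perturbations around theta, weight them by normalized reward."""
    for _ in range(iters):
        eps = np.random.randn(N, theta.size)                      # ε_n ~ N(0, I)
        rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
        R = rewards / rewards.sum()                               # normalize so rewards sum to 1
        theta = theta + alpha * sigma * (R[:, None] * eps).sum(axis=0)  # reward-weighted step
    return theta

# Toy usage: a positive reward that peaks when every weight equals 1.
theta = basic_es(np.zeros(8), lambda w: 1.0 / (1.0 + np.sum((w - 1.0) ** 2)))
```

Because the reward here is always positive, higher-reward perturbations get more weight and θ drifts towards the optimum, without ever computing a gradient.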

New ES Implementation

They make 7 changes to make this scale to large LLMs:

Changes 1, 2, 3, 6: Better GPU utilization
Change 4: Reward is normalized using z-scores, ie mean of 0 and standard deviation of 1 (to keep “the reward scale consistent across iterations and tasks”; see the sketch after this list)
Change 5: Greedy decoding of the LLM to make it deterministic (this seems sensible, since exploration comes from sampling nearby weights rather than from sampling outputs)
Change 7: They add a learnable hyperparameter that scales the learning rate (which they just fold into the learning rate)
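Here’s a minimal sketch of change 4, the z-score reward normalization (the function name and the small epsilon guard are mine; the paper just says rewards are normalized to mean 0 and standard deviation 1):

```python
import numpy as np

def zscore(rewards, eps=1e-8):
    """Change 4: z-score the rewards so their scale is consistent across iterations and tasks."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. zscore([20, 10, 6, 4]) ≈ [1.62, 0.00, -0.65, -0.97]
# (this replaces the sum-to-1 normalization from the basic algorithm above)
```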

And with just those changes, they achieve really impressive results w/​ less compute.

Task 1: Countdown task

The Countdown task (Pan et al., 2025; Goodfellow et al., 2016) requires constructing an arithmetic expression from a given set of numbers using basic operations (+, −, ×, ÷) to match a target value. For instance, the target 950 can be obtained from {100, 50, 6, 3} with 100 × (6 + 3) + 50 = 950. This makes it a compact test of constrained symbolic reasoning, which is an important use case for fine-tuning.
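To make the task concrete, here’s a sketch of what a Countdown reward function could look like: 1 if the model’s expression uses exactly the given numbers and evaluates to the target, else 0. This is my guess at the reward shape for illustration, not the paper’s exact scoring.

```python
import re

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """1.0 if expr uses exactly the given numbers (each once) and hits the target, else 0.0."""
    if not re.fullmatch(r"[\d\s()+\-*/]+", expr):
        return 0.0                                    # only digits, parens, and + - * / allowed
    used = sorted(int(tok) for tok in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return 0.0                                    # must use each given number exactly once
    try:
        value = eval(expr, {"__builtins__": {}})      # expr is whitelisted to arithmetic above
    except Exception:
        return 0.0                                    # malformed expression or division by zero
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# countdown_reward("100 * (6 + 3) + 50", [100, 50, 6, 3], 950) -> 1.0
```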

The results are much much better, especially for smaller models where RL typically fails.

And look at these training curves!

ES gets solid results in way fewer evaluations (though 0.5 on that scale is still 500k evaluations!)

Task 2: Conciseness

Train the model w/ the reward based only on the length of the response. With RL, there’s typically reward hacking, where the model gives very short responses that don’t answer the question. ES does produce a couple of such examples, but drastically fewer!

As a quantitative check, they plot the conciseness reward vs KL-divergence (ie how far the output distribution diverges from the base model).
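For a rough sense of how one point on that plot could be computed, here’s a hedged sketch: the conciseness reward as the negative response-token count, and the KL term as the mean per-token KL between the fine-tuned and base models’ next-token distributions. It assumes `finetuned` and `base` are Hugging Face causal LMs; the reward definition and KL estimator are my assumptions, not the paper’s code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reward_and_kl(finetuned, base, tokenizer, prompt: str, response: str):
    """One (conciseness reward, KL) point: reward = -#response tokens (assumed),
    KL = mean over response positions of KL(p_finetuned || p_base) over the vocabulary."""
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    logp_ft = F.log_softmax(finetuned(ids).logits, dim=-1)
    logp_base = F.log_softmax(base(ids).logits, dim=-1)
    kl_per_pos = (logp_ft.exp() * (logp_ft - logp_base)).sum(-1)    # [1, seq_len]
    mean_kl = kl_per_pos[0, prompt_len - 1 : -1].mean().item()      # positions predicting response tokens
    reward = -(ids.shape[1] - prompt_len)                           # shorter response -> higher reward
    return reward, mean_kl
```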

It does seem like a common theme that optimizing over a distribution leads to more conservative optimization. This sort of rhymes w/ Attainable Utility Preservation, where optimizing to maintain the ability to achieve many auxiliary goals leads to more conservative policies.

Future Work

This is still early work that mostly just advanced ES scaling. I’d be interested in work applying this to OpenAI’s reward hacking environment to see if it helps. It’s also nice that ES doesn’t need huge hyperparameter sweeps and converges to the same solutions (at least in these examples), meaning it shouldn’t be too much work to do.

From the paper:

One counterintuitive result is that the ES implementation only needs a population of 30 to effectively optimize billions of parameters. In contrast, previous work (Salimans et al., 2017; Zhang et al., 2017; Lehman et al., 2018; Lorenc & Neruda, 2025) used populations of 10,000 or more for models with millions or fewer parameters. An interesting future direction is to analyze how such small populations are possible. Perhaps this is related to the observed low intrinsic dimensionality of LLMs (Aghajanyan et al., 2021). Another promising direction is to use ES to perform unsupervised fine-tuning based on internal behaviors of LLMs, such as confidence calculated based on semantic entropy and semantic density (Qiu & Miikkulainen, 2024; Farquhar et al., 2024). Such fine-tuning cannot be done with RL, since action space exploration does not change the internal representations of LLMs (that is, each action sampling is generated via output distribution without changing the internal parameters). In a broader sense, since ES does not need process rewards during exploration, it may be a necessary ingredient for superintelligence (Mucci & Stryker, 2023), which would be difficult to achieve by supervised learning using process guidance from human data. Massive parallelization of ES will speed up exploration by distributing the computations across GPU machines or even data centers.

It’s a pretty interesting paper, and I’ve definitely missed a few points. Do give it a read!