Thane Ruthenis comments on Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Thane Ruthenis 6 May 2025 9:42 UTC
4 points
−1
I think it’s important to keep in mind the difference between the neural network and the algorithm learned by the neural network. Of any neural network, regardless of its parameters’ values^[1], it’s true that it’s capable of outputting any sequence of tokens with some probability. But the same isn’t true of all algorithms that could be learned by neural networks.
As a trivial case, consider a neural network trained to act as a calculator. Assuming the final layer is softmax as usual, is there some probability that it will output “duck” in response to “2 + 2 = ”? Sure. But the “calculator” algorithm learned by the NN would never output this. Which means that if it’s forced down the “duck” execution pathway, whatever is forcing it essentially breaks down the abstraction of the learned algorithm. During that forward pass, the neural substrate isn’t executing the learned “calculator” algorithm; it’s doing something else, it’s randomly twitching.
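To make the softmax point concrete, here is a minimal Python sketch. The four-token vocabulary and the logit values are invented for illustration and do not come from any real model. The only thing it shows is that softmax gives every token a strictly positive probability, however small.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits for the prompt "2 + 2 = ".
vocab = ["4", "5", "four", "duck"]
logits = [12.0, 2.0, 1.0, -30.0]

probs = dict(zip(vocab, softmax(logits)))
for token, p in probs.items():
    print(f"{token!r}: {p:.3e}")
```

Under these made-up logits, “duck” gets a probability that is tiny but not zero, which is the sense in which the network (as opposed to the learned calculator algorithm) “can” output it.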
This, I would expect, generalizes to the algorithms learned by LLMs. It is not the case that Claude-the-learned-algorithm can output any sequence of tokens, even if Claude-the-neural-network can.
E.g., suppose we receive the resolution to P vs. NP from year 3000. We then take Claude Sonnet 3.7 as it is today, and force it down the execution pathways where it recites that resolution verbatim (without being trained on it or shown it). That, I argue, would be the same as forcing a calculator NN to output “duck”. Internally, its activations during those forward passes wouldn’t correspond to a possible sequence of thoughts Claude-the-learned-algorithm can think. They would correspond to some basically random activations, random mental twitches, random hallucinations/confabulations.
Thus, as we scale pass@k, there’s a phase shift at some k. Up to a point, we’re considering high-probability trajectories that correspond to Claude-the-learned-algorithm thinking a sequence of possible (if perhaps unlikely) thoughts. But as we scale k to, say, a googol, we start encountering trajectories during which the “Claude” abstraction broke and the NN basically behaved as a randomly initialized network.
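A rough way to see why almost anything shows up at large enough k: if samples are treated as independent draws with a fixed per-sample success probability p (a simplifying assumption of this sketch, not something claimed above), then pass@k = 1 − (1 − p)^k. The function name and the example probabilities below are illustrative only.

```python
import math

def pass_at_k(p, k):
    """P(at least one success in k independent samples), each succeeding w.p. p.

    Uses log1p/expm1 so that tiny p and huge k do not round to 0 or 1 early.
    """
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return -math.expm1(k * math.log1p(-p))

# Hypothetical per-sample probabilities: a possible-but-unlikely thought
# versus a trajectory the network would only emit by "randomly twitching".
unlikely_thought = 1e-3
random_twitch = 1e-50

for k in (1, 1_000, 10**6, 10**100):
    print(f"k={float(k):.0e}",
          f"unlikely={pass_at_k(unlikely_thought, k):.3g}",
          f"twitch={pass_at_k(random_twitch, k):.3g}")
```

With these assumed numbers, the “unlikely thought” is close to certain by k of a few thousand, while the “twitch” stays negligible until k approaches 1/p. The formula by itself does not produce a sharp phase shift; it only illustrates that pass@k at extreme k stops distinguishing the two kinds of trajectory.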
This is the framework in which the statement “RL doesn’t create capabilities, only elicits capabilities present in the base model” makes the most sense. The pretraining process creates capabilities: changes the fundamental nature of the algorithm the neural network implements. RL, on this hypothesis, only makes small changes in the context of an already-learned algorithm, tweaking its functionality but not rewriting its fundamental nature.
Which is to say: it picks the best trajectory the NN can output without violating the sanctity of the “Claude algorithm” abstraction. Which is potentially a very limited number of trajectories, combinatorially smaller than the set of all possible trajectories.
And indeed, it makes sense that it would work this way. After all, the very reason RL-on-pretrained-networks works (whereas RL-on-randomly-initialized-networks doesn’t) is that the pretrained LLM algorithm serves as a good prior for problem-solving. But if some sequence of computation/thoughts is impossible to represent in the language of that learned algorithm, if a given capability requires going beyond the functionality of that algorithm, RL is as powerless here as when applied to a random network. (Because “eliciting” that capability would require forcing a sequence of activations that “look” random from the perspective of the learned algorithm.)
Or so goes my model of that whole thing, anyway. Which papers like this one do support.
1. Excepting artificial degenerate cases like “all zero”.
- Jozdien 6 May 2025 10:44 UTC
  4 points
  0
  I don’t think we disagree on many of the major points in your comment. But your original claim was:
  if moving a pass@k single-step capability to pass@1 is all RL does, even improvements on multi-step tasks still hit a ceiling soon, even if that ceiling is exponentially higher than the ceiling of single-step performance improvement. And it’s not clear that this potential exponential improvement actually unlocks any transformative/superhuman capabilities.
  The claims in the paper are agnostic to the distinction between the neural network and the algorithm learned by the neural network. It simply claims that RL makes models perform worse on pass@k for sufficiently large k, a claim that could follow from the base models having a more diverse distribution to sample from.
  More specifically, the paper doesn’t make a mechanistic claim about whether this arises from RL only eliciting latent computation representable in the internal language of the learned algorithm, or from RL imparting capabilities that go beyond the primary learned algorithm. Outcome-based RL makes the model sample possible trajectories, and cognition that outputs rewarded trajectories is up-weighted. This is then folded into future trajectory sampling, and future up-weighted cognition may compound upon it to up-weight increasingly unlikely trajectories. This implies that as the process goes on, you may stray from what the learned algorithm was likely to represent, toward what was possible for the base model to output at all.
  I agree that if all RL ever did was elicit capabilities already known by the learned algorithm, that would top out at pretty unremarkable capabilities (from a strong superintelligence perspective; I disagree that the full distribution of base model capabilities isn’t impressive). But that’s very different from the claim that if all RL ever did was move a pass@k capability to pass@1, it implies the same outcome.