“RL can enable emergent capabilities, especially on long-horizon tasks: Suppose that a capability requires 20 correct steps in a row, and the base model has an independent 50% success rate. Then the base model will have a 0.0001% success rate at the overall task and it would be completely impractical to sample 1 million times, but the RLed model may be capable of doing the task reliably.”
Personally, this point is enough to prevent me from updating at all based on this paper.
I dispute that non-update. Recall that the promise of RL isn’t simply “nontrivial improvement”, but “enables an RSI loop”. Yes, the gain on multi-step tasks might scale exponentially with k in the corresponding pass@k. But if moving a pass@k single-step capability to pass@1 is all RL does, even improvements on multi-step tasks still hit a ceiling soon, even if that ceiling is exponentially higher than the ceiling of single-step performance improvement. And it’s not clear that this potential exponential improvement actually unlocks any transformative/superhuman capabilities. (“Capabilities exponentially beyond base models’” are not necessarily very impressive capabilities.)
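(To make the arithmetic behind the quoted 20-step scenario and the “exponential in k” point concrete, here is a quick back-of-the-envelope sketch; the numbers are illustrative only, not taken from the paper:)

```python
# Back-of-the-envelope arithmetic for the quoted 20-step scenario; illustrative only.
steps = 20
p_step = 0.5                      # base model's independent per-step success rate
p_task = p_step ** steps          # ≈ 9.5e-7, i.e. the quoted ~0.0001%
print(f"per-sample success on the full task: {p_task:.2e}")

# pass@k for the full task = 1 - (1 - p_task)^k
for k in (1, 10_000, 1_000_000):
    print(f"pass@{k}: {1 - (1 - p_task) ** k:.4f}")

# Even a million samples only reaches ~61% on the full task, whereas a model whose
# per-step reliability has been pushed to ~99.9% solves it in one sample ~98% of the time:
print(f"pass@1 with 99.9% per-step reliability: {0.999 ** steps:.3f}")
```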
I’m wary of updating on this paper too much: the conclusion is too appealing, and the data has some holes. But if it actually shows what it claims to show, that seems like a pretty big hit to the current Singularity-is-nigh narratives.
I don’t think this follows—at the limit, any feasible trajectory can be sampled from a model with a broad distribution. Whether a model “knows” something is a pretty fuzzy question. There’s a sense in which all text can be sampled by a model at high temperature, given enough samples. It’s a trivial sense, except it means that moving pass@k to pass@1 for extremely high k is very non-trivial.
As an example, I gave o4-mini the following prompt (from the OpenAI docs): “Write a bash script that takes a matrix represented as a string with format ‘[1,2],[3,4],[5,6]’ and prints the transpose in the same format.”, and fed its output into gpt-4-base (the only model from which I could reliably get logprobs for a supplied input). The average per-token logprob of o4-mini’s output was −0.8998, and the average per-token logprob of gpt-4-base’s top logprob continuation after each token was −0.3996. For reference, I prompted gpt-4-base with the question alone at temperature 1, and the average per-token logprob of its output was −1.8712, and the average per-token logprob of gpt-4-base’s top logprob continuation after each token was −0.7218.
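(For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of one way to compute such averages. It assumes access to a legacy completions-style endpoint that supports echo + logprobs; gpt-4-base is not publicly available, and newer endpoints may not support this combination, so treat the model name and parameters as placeholders:)

```python
# Minimal sketch of one way to compute the average per-token logprobs described above,
# assuming a completions-style endpoint that supports echo + logprobs.
# "gpt-4-base" is not publicly available; the model name here is a placeholder.
from openai import OpenAI

client = OpenAI()

def score_text(text: str, model: str = "gpt-4-base"):
    """Return (avg logprob of the actual tokens of `text` under `model`,
    avg logprob of the model's top-1 continuation at each position)."""
    resp = client.completions.create(
        model=model,
        prompt=text,
        max_tokens=0,   # generate nothing; just score the prompt
        echo=True,      # return logprobs for the prompt tokens themselves
        logprobs=1,     # include the top alternative at each position
    )
    lp = resp.choices[0].logprobs
    actual = [x for x in lp.token_logprobs if x is not None]   # first token has no logprob
    top = [max(d.values()) for d in lp.top_logprobs if d]      # the model's preferred token
    return sum(actual) / len(actual), sum(top) / len(top)

# Hypothetical usage: score o4-mini's answer, then the base model's own answer.
# print(score_text(o4_mini_output))
```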
It seems pretty hard to dispute that o4-mini is a significant improvement over a GPT-4-level model. This isn’t at all at odds with the hypothesis that sampling a base model for long enough will get you arbitrarily performant outputs.
I think it’s important to keep in mind the difference between the neural network and the algorithm learned by the neural network. Any neural network, regardless of its parameters’ values[1], is capable of outputting any sequence of tokens with some probability. But the same isn’t true of all algorithms that could be learned by neural networks.
As a trivial case, consider a neural network trained to act as a calculator. Assuming the final layer is softmax as usual, is there some probability that it will output “duck” in response to “2 + 2 = ”? Sure. But the “calculator” algorithm learned by the NN would never output this. Which means that if it’s forced down the “duck” execution pathway, whatever is forcing it essentially breaks down the abstraction of the learned algorithm. During that forward pass, the neural substrate isn’t executing the learned “calculator” algorithm; it’s doing something else, it’s randomly twitching.
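(A toy illustration of that point: a softmax output layer assigns strictly positive probability to every vocabulary token, so “duck” is always sampleable from the network even though the learned “calculator” algorithm would never produce it. The vocabulary and logits below are made up:)

```python
# Toy illustration: softmax never assigns exactly zero probability to any token.
import numpy as np

vocab = ["4", "5", "duck"]
logits = np.array([12.0, 3.0, -15.0])   # a "calculator" net strongly prefers "4"

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max()                      # for numerical stability
    p = np.exp(z)
    return p / p.sum()

for t in (1.0, 2.0):
    p = softmax(logits, t)
    print(f"T={t}: P('duck') = {p[vocab.index('duck')]:.2e}")
# P("duck") is tiny but strictly positive, and grows with temperature: the trivial
# sense in which "all text can be sampled by a model at high temperature".
```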
This, I would expect, generalizes to the algorithms learned by LLMs. It is not the case that Claude-the-learned-algorithm can output any sequence of tokens, even if Claude-the-neural-network can.
E. g., suppose we receive the resolution to P vs. NP from the year 3000. We then take Claude Sonnet 3.7 as it is today, and force it down the execution pathways where it recites that resolution verbatim (without being trained on it or shown it). That, I argue, would be the same as forcing a calculator LLM to output “duck”. Internally, its activations during those forward passes wouldn’t correspond to a possible sequence of thoughts Claude-the-learned-algorithm can think. They would correspond to some basically random activations, random mental twitches, random hallucinations/confabulations.
Thus, as we scale pass@k, there’s a phase shift at some k. Up to a point, we’re considering high-probability trajectories that correspond to Claude-the-learned-algorithm thinking a sequence of possible (if perhaps unlikely) thoughts. But as we scale k to, say, a googol, we start encountering trajectories during which the “Claude” abstraction breaks down and the NN basically behaves as a randomly initialized network.
This is the framework in which the statement “RL doesn’t create capabilities, only elicits capabilities present in the base model” makes the most sense. The pretraining process creates capabilities: changes the fundamental nature of the algorithm the neural network implements. RL, on this hypothesis, only makes small changes in the context of an already-learned algorithm, tweaking its functionality but not rewriting its fundamental nature.
Which is to say: it picks the best trajectories the NN can output without violating the sanctity of the “Claude algorithm” abstraction. That is potentially a very limited set of trajectories, combinatorially smaller than the set of all possible trajectories.
And indeed, it makes sense that it would work this way. After all, the very reason RL-on-top-of-pretraining works (whereas RL-on-randomly-initialized-networks doesn’t) is that the pretrained LLM algorithm serves as a good prior for problem-solving. But if some sequence of computation/thoughts is impossible to represent in the language of that learned algorithm, if a given capability requires going beyond the functionality of that algorithm, then RL is as powerless here as when applied to a random network. (Because “eliciting” that capability would require forcing a sequence of activations that “look” random from the perspective of the learned algorithm.)
Or so goes my model of that whole thing, anyway. Which papers like this one do support.

[1] Excepting artificial degenerate cases like “all zero”.
I don’t think we disagree on many of the major points in your comment. But your original claim was:
if moving a pass@k single-step capability to pass@1 is all RL does, even improvements on multi-step tasks still hit a ceiling soon, even if that ceiling is exponentially higher than the ceiling of single-step performance improvement. And it’s not clear that this potential exponential improvement actually unlocks any transformative/superhuman capabilities.
The paper is agnostic to the distinction between the neural network and the algorithm learned by the neural network. It simply claims that RL makes models perform worse on pass@k for sufficiently large k—a claim that could follow from the base models having a more diverse distribution to sample from.
More specifically, the paper doesn’t make a mechanistic claim about whether this arises from RL only eliciting latent computation representable in the internal language of the learned algorithm, or from RL imparting capabilities that go beyond the primary learned algorithm. Outcome-based RL makes the model sample possible trajectories, and the cognition that produces rewarded trajectories is up-weighted. This is then folded into future trajectory sampling, and future up-weighted cognition may compound upon it to up-weight increasingly unlikely trajectories. This implies that as the process goes on, you may stray from what the learned algorithm was likely to represent, toward what was possible for the base model to output at all.
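(A toy, tabular REINFORCE-style sketch of that compounding up-weighting dynamic, just to make the mechanism concrete. The trajectory space, rewards, and learning rate are all made up, and this is not the paper’s or any lab’s actual training setup:)

```python
# Toy sketch: rewarded trajectories get up-weighted, which makes them more likely
# to be sampled again, which lets further updates compound on top of them.
import numpy as np

rng = np.random.default_rng(0)
n_traj = 50                         # stand-in for a space of samplable trajectories
logits = rng.normal(size=n_traj)    # the "base model's" prior over trajectories
reward = np.zeros(n_traj)
reward[:2] = 1.0                    # only a couple of trajectories are rewarded

lr = 1.0
for _ in range(500):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    traj = rng.choice(n_traj, p=p)              # sample one trajectory
    advantage = reward[traj] - p @ reward       # reward minus expected-reward baseline
    grad_logp = -p.copy()
    grad_logp[traj] += 1.0                      # grad of log p(traj) w.r.t. softmax logits
    logits += lr * advantage * grad_logp        # REINFORCE-style up-weighting

p = np.exp(logits - logits.max()); p /= p.sum()
print("final probability mass on rewarded trajectories:", round(float(p[:2].sum()), 3))
# Initially-unlikely (but always-possible) trajectories come to dominate sampling.
```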
I agree that if all RL ever did was elicit capabilities already known by the learned algorithm, that would top out at pretty unremarkable capabilities (from a strong superintelligence perspective—I disagree that the full distribution of base model capabilities isn’t impressive). But that’s very different from the claim that if all RL ever did was move a pass@k capability to pass@1, it implies the same outcome.
I think that’s probably a mistake: the sentence you quoted seems to be a hypothetical, and the actual experimental results do seem to point against the effectiveness of current RL (?).
I am not confident though. It’s certainly true that if RL can increase the probability of a behavior/ability enough, it is not necessarily helpful to frame it as having already been in the base model’s distribution “for practical purposes.” I would have to look into this more carefully to judge whether the paper actually does a convincing job of demonstrating that this is a good frame.