There is a paper showing this works for writing chapters of fiction. This shows it generalises outside of math.
https://arxiv.org/abs/2503.22828v1
Not sure your point here is correct?
Ryan is talking about how “advances that allow for better verification of natural language proofs wouldn’t really transfer to better verification in the context of agentic software engineering”.
The paper you’ve linked to shows that a model that’s gone through an RL environment for writing fiction can write a chapter that’s preferred over the base-reasoning model’s chapter 64.7% of the time.
You said “this shows it generalises outside of math”. But that’s not true? The paper is interesting and shows you can conduct RL successfully in harder-to-verify domains, but it doesn’t show an overwhelmingly strong effect. The paper didn’t test whether the RL environment generalised to better results in math or SWE.
Your “it” is different from Ryan’s. Ryan is referring to specific advances in verification that meant models could get a gold in the IMO, and the paper is referring to a specific RL environment for fiction writing. I can’t imagine these are the same.
Ah, yes, I see that you are right. The paper shows that the method generalises, not the results.
I am also uncertain whether the RLVF math training generalised well outside of math. I had a look at recent benchmarks and it was hard to tell.
I just stumbled across this lovely plot. It seems to indicate some generalisation (for this model, for initial RL, across domains), and is the most direct measure I’ve seen so far.
source
EDIT: upon reflection, this is a combination of RLAIF and RLVF... according to the paper.
I found it! rStar2-Agent shows that training on math with their form of RL generalised to ScienceQA.