There is a paper showing this works for writing chapters of fiction. This shows it generalises outside of math.
https://arxiv.org/abs/2503.22828v1
Not sure your point here is correct?
Ryan is talking about how “advances that allow for better verification of natural language proofs wouldn’t really transfer to better verification in the context of agentic software engineering”.
The paper you’ve linked to shows that a model that’s gone through an RL environment for writing fiction can write a chapter that’s preferred over the base-reasoning model’s chapter 64.7% of the time.
You said “this shows it generalises outside of math”. But that’s not true? The paper is interesting and shows you can conduct RL successfully in harder-to-verify domains, but it doesn’t show an overwhelmingly strong effect. The paper didn’t test whether the RL environment generalised to better results in math or SWE.
Your “it” is different from Ryan’s. Ryan is referring to specific advances in verification that meant models could get a gold in the IMO, and the paper is referring to a specific RL environment for fiction writing. I can’t imagine these are the same.
Ah, yes, I see that you are right. The paper shows that the method generalises, not the results.
I am also uncertain whether the RLVF math training generalised well outside of math. I had a look at recent benchmarks and it was hard to tell.
I just stumbled across this lovely plot. It seems to indicate some generalisation (for this model, for initial RL, across domains), and is the most direct measure I’ve seen so far.
source
EDIT: upon reflection, this is a combination of RLAIF and RLVF... according to the paper.
I found it! rStar2-Agent shows that training on math with their form of RL generalised to ScienceQA.