But the feared / hoped-for generalisation from {training LLMs with RL on tasks with a verifier} to performing on tasks without one remains unclear even after two years of trying.
Very much disagree. Granted, there are vacuously weak versions of this claim ('no free lunch'-like) that I of course agree with.
Just talk to Claude 4.5 Opus! Ask it to describe what a paper is about, what follow-up experiments to do, etc. Ask it to ELI-undergrad some STEM topic!
Do you think a pretrained-only model could do as well? Surely not.
Perhaps the claim is that an instruct-SFT or "Chat-RLHF-only" compute-matched model could do as well? The only variant of this I buy is: curate enough instruct-SFT STEM data to match the volume of trajectories generated in VeRL post-training. However, I don't think this counterfactual matters much: it would involve far more human labor and is cost-prohibitive for that reason.
Thanks. I am uncertain (“unclear”), and am interested in sharpening this to the point where it’s testable.
I basically never use a non-RLed model for anything, so I agree with the minimal version of the generalisation claim.
We could just reuse some transfer learning metric? If 100% is full proportional improvement, I'd claim something like <10% spillover onto non-verifiable tasks. What about you?
Another thing I was trying to point at is that I don't know what RL environments they're using for these things, and so don't know which tasks count in the denominator. And I'm unlikely ever to find out.
Seems like Claude has been getting better at playing Pokemon, despite not having been trained on any sort of Pokemon game at all. (Epistemic status: Not sure actually, we don’t know what Anthropic does internally, maybe they’ve trained it on video games for all we know. But I don’t think they have.)
Isn’t this therefore an example of transfer/generalization?
What transfer learning metrics do you have in mind?
My perhaps overcynical take is to assume that any benchmark which gets talked about a lot is being optimised. (The ridiculously elaborate scaffold already exists for Pokemon, so why wouldn’t you train on it?) But I would update on an explicit denial.
I was guessing that the transfer learning people would already have some handy coefficient (normalised improvement on nonverifiable tasks / normalised improvement on verifiable tasks) but a quick look doesn’t turn it up.
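For concreteness, the coefficient in question could be computed like this; the benchmark scores below are made-up illustrative numbers, not measurements:

```python
def normalized_improvement(base: float, trained: float, ceiling: float = 1.0) -> float:
    """Fraction of remaining headroom closed by training."""
    return (trained - base) / (ceiling - base)

def transfer_coefficient(verif_base: float, verif_trained: float,
                         nonverif_base: float, nonverif_trained: float) -> float:
    """Normalised improvement on non-verifiable tasks divided by normalised
    improvement on verifiable tasks; 1.0 would be full proportional transfer."""
    return (normalized_improvement(nonverif_base, nonverif_trained)
            / normalized_improvement(verif_base, verif_trained))

# Hypothetical scores (fraction correct): a verifiable benchmark goes
# 0.40 -> 0.70 after RLVR; a non-verifiable benchmark goes 0.50 -> 0.52
# on the same checkpoint.
print(round(transfer_coefficient(0.40, 0.70, 0.50, 0.52), 2))  # 0.08, i.e. ~8% spillover
```

On these invented numbers the coefficient lands just under the <10% spillover claim above; the hard part is of course picking benchmark pairs that are genuinely 'nearby'.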
It still says on the Twitch stream “Claude has never been trained to play any Pokemon games”
https://www.twitch.tv/claudeplayspokemon
Works for me!
Possibly relevant possibly hallucinated data: https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=sBtoCfWNnNxxGEgiL
I suppose there are two questions here:
1. How strong is generalization in general in RL?
2. Is there a 'generalization barrier' between easy-to-verify and hard-to-verify tasks?
I’m guessing you mainly are thinking of (1) and have (2) as a special case?
To respond to your question, I'm reading it as: assume there's a constant multiplier on the samples needed for out-of-domain training to match in-domain training at a given performance level; for 'nearby' verifiable and non-verifiable tasks, is that constant >= 10x?
I would guess modally somewhere in 3-10x. I'm imagining here comparing training on more olympiad problems vs. some looser question like 'Compare the clarity of these two proofs'. Of course there are diminishing returns etc., so it's not really a constant factor within a narrow domain.
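One way to make the multiplier concrete, under a toy assumption that score grows log-linearly in sample count: invert the learning curve for each domain and take the ratio of samples needed to hit the same target score. Both per-sample gains below are invented for illustration:

```python
import math

def samples_to_reach(target: float, gain_per_log_sample: float, base: float = 0.0) -> float:
    """Invert a toy log-linear learning curve: score = base + gain * ln(n_samples)."""
    return math.exp((target - base) / gain_per_log_sample)

# Invented gains: in-domain training buys 0.10 score per ln(sample);
# transfer to a 'nearby' non-verifiable task buys 0.08.
target = 0.6
multiplier = samples_to_reach(target, 0.08) / samples_to_reach(target, 0.10)
print(round(multiplier, 1))  # 4.5, inside the 3-10x modal guess
```

Note that under this toy curve the multiplier works out to exp(target * (1/gain_out - 1/gain_in)), so it grows with the target score rather than staying constant, which echoes the diminishing-returns caveat above.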
I do agree that there are areas where domain-specific training is a bottleneck, and plausibly some of those are non-verifiable ones. See also my shortform where I discuss some reasons for such a need https://www.lesswrong.com/posts/FQAr3afEZ9ehhssmN/jacob-pfau-s-shortform?commentId=vdBjv3frxvFincwvz
My pet theory of this is that you get 2 big benefits from RLVR:
1. A model learns how to write sentences in a way that does not confuse itself (for example, markdown files written by an AI tend to context-poison an AI far less than the same amount of text written by a human or by error messages).
2. A model learns how to do “business processes”—for example, that in order to write code, it needs to first read documentation, then write the code, and then run tests.
These are things that RL, if done right, is going to improve, and they definitely feel like they explain much of the difference between, say, ChatGPT-4 and GPT-5.
I expect that these effects can have fairly “general” impact (for example, an AI learning how to work with notes), but the biggest improvements would be completely non-generalizable (for example, heuristics in how to place functions in code).
Nice points. I would add “backtracking” as one very plausible general trick purely gained by RLVR.
I will own up to being unclear in the OP: the point I was trying to make is that last year there was a lot of excitement about way bigger off-target generalisation than cleaner CoTs, basic work skills, uncertainty expression, and backtracking. But I should do the work of finding those animal spirits/predictions and quantifying them, and quantifying the current situation.