Thanks. I am uncertain (“unclear”), and am interested in sharpening this to the point where it’s testable.
I basically never use a non-RLed model for anything, so I agree with the minimal version of the generalisation claim.
We could just reuse some transfer learning metric? If 100% is full proportional improvement, I’d claim like <10% spillover on nonverified tasks. What about you?
Another thing I was trying to point at: I don’t know what RL environments they’re using for these things, and so I don’t know which tasks count in the denominator. And I’m not going to find out, either.
Seems like Claude has been getting better at playing Pokemon, despite not having been trained on any sort of Pokemon game at all. (Epistemic status: Not sure actually, we don’t know what Anthropic does internally, maybe they’ve trained it on video games for all we know. But I don’t think they have.)
Isn’t this therefore an example of transfer/generalization?
What transfer learning metrics do you have in mind?
My perhaps overcynical take is to assume that any benchmark which gets talked about a lot is being optimised. (The ridiculously elaborate scaffold already exists for Pokemon, so why wouldn’t you train on it?) But I would update on an explicit denial.
I was guessing that the transfer learning people would already have some handy coefficient (normalised improvement on nonverifiable tasks / normalised improvement on verifiable tasks) but a quick look doesn’t turn it up.
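For concreteness, here’s a minimal sketch in Python of the coefficient I have in mind. Everything in it is invented for illustration (the headroom normalisation, the task pairing, the scores); it’s not a standard metric from the transfer learning literature:

```python
# Sketch of the proposed coefficient:
#   spillover = normalised improvement on non-verifiable tasks
#             / normalised improvement on verifiable tasks
# where "normalised improvement" = fraction of remaining headroom closed.
# All numbers below are hypothetical.

def normalised_improvement(before: float, after: float, ceiling: float = 1.0) -> float:
    """Fraction of the headroom (ceiling - before) that training closed."""
    return (after - before) / (ceiling - before)

def spillover(ver_before: float, ver_after: float,
              nonver_before: float, nonver_after: float) -> float:
    """1.0 = full proportional transfer to non-verifiable tasks; 0.0 = none."""
    return (normalised_improvement(nonver_before, nonver_after)
            / normalised_improvement(ver_before, ver_after))

# Hypothetical: RL on verifiable math moves accuracy 0.60 -> 0.80, while a
# non-verifiable judgment task only moves 0.60 -> 0.615 over the same run.
print(spillover(0.60, 0.80, 0.60, 0.615))  # 0.075, i.e. ~7.5% spillover
```

On those made-up numbers the coefficient comes out at 0.075, the sort of <10% spillover I claimed above.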
It still says on the Twitch stream: “Claude has never been trained to play any Pokemon games.”
https://www.twitch.tv/claudeplayspokemon
Works for me!
Possibly relevant, possibly hallucinated data: https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=sBtoCfWNnNxxGEgiL
I suppose there are two questions here:
1. How strong is generalization in general in RL?
2. Is there a ‘generalization barrier’ between easy-to-verify and hard-to-verify tasks?
I’m guessing you’re mainly thinking of (1), with (2) as a special case?
To respond to your question, I’m reading it as:
We assume that there’s a constant multiplier in samples-to-performance needed to match in-domain training with out-of-domain training. For ‘nearby’ verifiable and non-verifiable tasks, is that constant >= 10x?
I would guess modally somewhere 3-10x. I’m imagining here comparing training on more olympiad problems vs. some looser question like ‘Compare the clarity of these two proofs’. Of course there are diminishing returns etc., so it’s not really a constant factor when taking a narrow domain.
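Read that way, the multiplier is measurable given learning curves for both conditions. Here’s a minimal sketch of how I’d estimate it; the curves, tasks, and numbers are all hypothetical:

```python
# Sketch of estimating the samples-to-performance multiplier: fix a target
# score, then compare how many training samples each condition needs to
# reach it, interpolating in log-sample space. All data points are made up.

import numpy as np

def samples_to_reach(target, samples, scores):
    """Interpolate, in log-sample space, the sample count at which a
    monotonically increasing learning curve first reaches `target`."""
    return float(np.exp(np.interp(target, scores, np.log(samples))))

# Hypothetical learning curves: (training samples, benchmark score).
samples = np.array([1e3, 1e4, 1e5, 1e6])
in_domain = np.array([0.50, 0.62, 0.74, 0.86])   # RL on olympiad problems
out_domain = np.array([0.47, 0.56, 0.68, 0.80])  # RL on a 'nearby' proof-clarity task

target = 0.60
multiplier = (samples_to_reach(target, samples, out_domain)
              / samples_to_reach(target, samples, in_domain))
print(f"~{multiplier:.1f}x more out-of-domain samples to reach {target}")
```

On these invented curves the estimate lands around 3x, the low end of my guess; real curves bend, so the ‘constant’ would drift with the target score you pick.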
I do agree that there are areas where domain-specific training is a bottleneck, and plausibly some of those are non-verifiable ones. See also my shortform, where I discuss some reasons for such a need: https://www.lesswrong.com/posts/FQAr3afEZ9ehhssmN/jacob-pfau-s-shortform?commentId=vdBjv3frxvFincwvz