I suppose there are two questions here:
1. How strong is generalization in RL in general?
2. Is there a ‘generalization barrier’ between easy-to-verify and hard-to-verify tasks?
I’m guessing you’re mainly thinking of (1) and have (2) as a special case?
To respond to your question, I’m reading it as: we assume that there’s a constant multiplier in samples-to-performance needed to match in-domain training with out-of-domain training; for ‘nearby’ verifiable and non-verifiable task pairs, is that constant >= 10x?
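To make that concrete (my formalization; the notation is mine, not from the original question): let $N_{\text{in}}(p)$ be the number of samples needed to reach performance $p$ on the target task when training on it directly, and $N_{\text{out}}(p)$ the number needed when training only on the nearby task. The question is then whether

$$c(p) = \frac{N_{\text{out}}(p)}{N_{\text{in}}(p)} \ge 10$$

for ‘nearby’ verifiable/non-verifiable pairs.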
I would guess modally somewhere in the 3-10x range. I’m imagining here comparing training on more olympiad problems vs. some looser question like ‘Compare the clarity of these two proofs’. Of course there are diminishing returns etc., so it’s not really a constant factor when taking a narrow domain.
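As a sketch of why the factor can’t stay constant (this assumes power-law sample-efficiency curves, which is my simplification, not something established in this thread): if in-domain training follows $p \approx a N^{\alpha}$ and out-of-domain training follows $p \approx b N^{\beta}$ with $\beta < \alpha$, then matching performance $p$ gives

$$c(p) = \frac{N_{\text{out}}(p)}{N_{\text{in}}(p)} = \frac{(p/b)^{1/\beta}}{(p/a)^{1/\alpha}} \propto p^{\,1/\beta - 1/\alpha},$$

which grows as the performance target rises, so the multiplier gets worse exactly in the narrow-domain, high-performance regime where diminishing returns bite.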
I do agree that there are areas where domain-specific training is a bottleneck, and plausibly some of those are non-verifiable ones. See also my shortform, where I discuss some reasons for such a need: https://www.lesswrong.com/posts/FQAr3afEZ9ehhssmN/jacob-pfau-s-shortform?commentId=vdBjv3frxvFincwvz