This is an obvious thing to try, but it's not something that currently works, and it's not certain to work without additional ideas. You can do a little of this, but not nearly to the extent that o1/R1 inch toward saturating benchmarks on olympiad-like math and coding problems. As long as using LLMs as a reward signal for scalable RL doesn't yet work, the supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.