Sooo, apparently OpenAI’s mysterious breakthrough technique for generalizing RL to hard-to-verify domains that scored them IMO gold is just… “use the LLM as a judge”? Sources: the main one is paywalled, but this seems to capture the main data, and you can also search for various crumbs here and here.
The technical details of how exactly the universal verifier works aren’t yet clear. Essentially, it involves tasking an LLM with checking and grading another model’s answers, using various sources to research them.
My understanding is that they approximate an oracle verifier with an LLM that has more compute and access to more information and tools, then train the model to be accurate by this approximate-oracle’s lights.
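To make the shape of that concrete, here’s roughly what I imagine the loop looks like. This is a toy sketch of my reading of the reporting, not OpenAI’s actual pipeline; every function name below is a placeholder of mine, and the real thing would presumably plug the score into a PPO/GRPO-style policy update rather than just printing it.

```python
import re

def grader_llm(prompt: str) -> str:
    """Stand-in for the stronger judge model (more compute, tool/search access)."""
    # In reality this would be a call to a bigger model with tools; stubbed out here.
    return "SCORE: 0.7"

def policy_llm(prompt: str) -> str:
    """Stand-in for the model being trained."""
    return "some attempted answer"

def universal_verifier_reward(question: str, answer: str) -> float:
    """Ask the judge to research and grade the answer, then parse a scalar reward."""
    grading_prompt = (
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n"
        "Research this using any sources/tools available, then reply with "
        "'SCORE: x' where x is in [0, 1]."
    )
    verdict = grader_llm(grading_prompt)
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", verdict)
    return float(match.group(1)) if match else 0.0

def rl_step(question: str) -> float:
    """One (heavily simplified) RL step: sample an answer, score it with the judge.
    A real pipeline would use this reward to update the policy model."""
    answer = policy_llm(question)
    return universal_verifier_reward(question, answer)

print(rl_step("Prove that there are infinitely many primes."))
```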
Now, it’s possible that the journalists are completely misinterpreting the thing they’re reporting on, or that it’s all some galaxy-brained OpenAI op to mislead the competition. It’s also possible that there’s some incredibly clever trick for making it work much better than how it sounds like it’d work.
But if that’s indeed the accurate description of the underlying reality, that’s… kind of underwhelming. I’m curious how far this can scale, but I’m not feeling very threatened by it.
(Haven’t seen this discussed on LW, kudos to @lwreader132 for bringing it to my attention.)
One point of information against the “journalists are completely misinterpreting the thing they’re reporting on” view is that one of the co-authors is Rocket Drew, who previously worked as a Research Manager at MATS.
But I’ll definitely be interested to follow this space more.
Hard to tell from the sources, but it sounds almost like prover-estimator debate. The estimator assigns a probability that each subclaim of a proof is correct, and this approach might also work for less verifiable domains since a human oracle is used at the last round of the debate. The main problem seems to be that it may not scale if it requires human feedback.
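If that’s the right picture, the round structure would be something like the toy sketch below. All the names are stand-ins of mine, and the recursion is just a naive version of what I described (challenge the weakest-looking subclaim, accept the rest at the estimator’s probabilities, hand the last round to a human), not the actual protocol from the paper.

```python
from typing import List

def prover_decompose(claim: str) -> List[str]:
    """Stand-in for the prover: split a claim into supporting subclaims."""
    return [f"{claim} / subclaim {i}" for i in range(2)]

def estimator_probability(claim: str) -> float:
    """Stand-in for the estimator: probability the subclaim is correct."""
    return 0.9

def human_oracle(claim: str) -> float:
    """Trusted judgment at the last round -- the part that may not scale."""
    return 1.0

def debate_value(claim: str, rounds_left: int) -> float:
    if rounds_left == 0:
        # Final round: a human (or other trusted oracle) settles the claim directly.
        return human_oracle(claim)
    subclaims = prover_decompose(claim)
    scores = {sub: estimator_probability(sub) for sub in subclaims}
    # Recurse only into the weakest-looking subclaim; accept the rest at face value.
    weakest = min(scores, key=scores.get)
    value = debate_value(weakest, rounds_left - 1)
    for sub, p in scores.items():
        if sub != weakest:
            value *= p
    return value

print(debate_value("The candidate proof is correct.", rounds_left=2))
```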
The full text is on archive.today.