The claim I’m squinting at real hard is this one:

We developed new techniques that make LLMs a lot better at hard-to-verify tasks.
Like, there’s some murkiness with them apparently awarding gold to themselves instead of IMO organizers doing it, and with that other competitive-programming contest at which presumably-the-same model did well being OpenAI-funded. But whatever, I’m willing to buy that they have a model that legitimately achieved roughly this performance (even if a fairer set of IMO judges would’ve docked points to slightly below the unimportant “gold” threshold).
But since when are math proofs, or competitive programming, considered good examples of hard-to-verify tasks? I’m surprised I haven’t seen anyone challenging that. (FYI, I do not think they are good examples of hard-to-verify tasks, fight me. Edit: Also, inasmuch as proofs have hard-to-verify properties like “being well-written”, their model sure failed at that.) So pending some results in an actual hard-to-verify domain, or convincing technical arguments that this new approach would really generalize, I’m not buying that.
Overall… This is the performance I would have expected out of them just mindlessly RLing bigger models harder. They needed a whole new breakthrough to get here? And the models using the new approach take hours to produce their solutions? They’re spinning it as good news for AI, but pending more information, this sounds pretty bad to me actually.
But since when are math proofs, or competitive programming, considered good examples of hard-to-verify tasks?
When Noam Brown says “hard-to-verify”, I think he means that natural language IMO proofs are “substantially harder to verify”: he says “proofs are pages long and take experts hours to grade”.
(Yes, there are also things that are much harder to verify, like questions that experts strongly disagree about even after years of discussion. Also, for IMO proofs, “hours to grade” is probably overstated?)
Also, I interpreted this as mostly in contrast to cases where outputs are trivial to programmatically verify (or reliably verify with a dumb LLM) in the context of large-scale RL. E.g., you can trivially verify the answers to purely numerical math problems (or to competitive programming and other programming tasks where you have test cases). Indeed, OpenAI LLMs have historically been much better at numerical math problems than at proofs, though possibly this gap has now been closed (at least partially).
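To make that contrast concrete, here’s a toy sketch of the kind of reward check I mean by “trivial to programmatically verify” (purely illustrative; not a claim about OpenAI’s actual pipeline):

```python
# Illustrative only: the sort of automated reward checks usable in RL on
# easy-to-verify tasks. Nothing here is OpenAI's actual setup.

def numeric_reward(model_answer: str, reference_answer: str) -> float:
    """Exact-match check for a purely numerical final answer."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def test_case_reward(program_outputs: list[str], expected_outputs: list[str]) -> float:
    """Competitive-programming-style check: fraction of test cases passed."""
    passed = sum(got == want for got, want in zip(program_outputs, expected_outputs))
    return passed / len(expected_outputs)

# There is no analogous one-liner for a natural-language IMO proof: grading one
# means an expert (or a strong model) reading pages of argument.
```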
I think this is overall reasonable if you interpret “hard-to-verify” as “substantially harder to verify”, and I think that’s probably how many people would read it by default. I don’t have a strong view about whether this method will actually generalize to other cases where experts can verify things with high agreement in a few hours.
(Noam Brown doesn’t say anything about competitive programming, so I’m not sure why you mentioned that. Competitive programming is trivial to verify.)
I think this is overall reasonable if you interpret “hard-to-verify” as “substantially harder to verify”, and I think that’s probably how many people would read it by default
Not sure about this. The kind of “hard-to-verify” I care about is e.g. agenty behavior in real-world conditions. I assume many other people are also watching out for that specifically, and that capability researchers are deliberately aiming for it.
And I don’t think the proofs are any evidence for that. The issue is that there exists, in principle, a way to easily verify math proofs: by translating them into a formal language and running them through a theorem-verifier. So the “correct” way for gradient descent to solve this was to encode some sort of internal theorem-verifier into the LLM.
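To be concrete about what I mean by a theorem-verifier, here’s a toy Lean 4 snippet (illustrative only, nothing to do with OpenAI’s system), where “verification” is just the proof checker accepting the file:

```lean
-- Illustrative only: once a statement and its proof are formalized,
-- checking correctness is mechanical. The Lean kernel either accepts
-- the proof term or rejects it with a type error; no human grading is involved.
theorem sum_comm (m n : Nat) : m + n = n + m :=
  Nat.add_comm m n
```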
Even more broadly, we know that improved performance at IMO could be achieved by task-specific models (AlphaGeometry, AlphaProof), which means that much of the IMO benchmark is not a strong signal of general intelligence. A general intelligence can solve it, but one needn’t be a general intelligence to do so, and since gradient descent prefers shortcuts and shallow solutions...
They say they’re not using task-specific methodology, but what does this actually mean? Does it mean they didn’t even RL the model on math-proof tasks, but RL’d it on something else and the gold-level IMO performance arose by transfer learning? Doubt it. Which means this observation doesn’t really distinguish between “the new technique is fully general and works in all domains” and “the new technique looks like it should generalize because of secret technical details, but it doesn’t actually; it only worked here because it exploited the underlying easy-to-verify properties of the task”.
Sure, math is not an example of a hard-to-verify task, but I think you’re getting unnecessarily hung up on these things. It does sound like it may be a new and, in a narrow sense, unexpected technical development, and it’s unclear how significant it is. I wouldn’t try to read into their communications much more than that.
It does sound like it may be a new and, in a narrow sense, unexpected technical development
I buy that, sure. I even buy that they’re as excited about it as they present, that they believe/hope it unlocks generalization to hard-to-verify domains. And yes, they may or may not be right. But I’m skeptical on priors/based on my model of ML, and their excitement isn’t very credible evidence, so I’ve not moved far from said priors.
Got it! I’m more inclined to generally expect that various half-decent ideas may unlock surprising advances (for no good reason in particular), so I’m less skeptical that this may be true. Also, while math is of course easy to verify, if they really haven’t made significant use of verification in the training process, that makes their claims more reasonable.