But since when are math proofs, or competitive programming, considered good examples of hard-to-verify tasks?
When Noam Brown says “hard-to-verify”, I think he means that natural language IMO proofs are “substantially harder to verify”: he says “proofs are pages long and take experts hours to grade”.
(Yes, there are also things which are much harder to verify like things that experts strongly disagree about after years of discussion. Also, for IMO programs, “hours to grade” is probably overstated?)
Also, I interpreted this as mostly in contrast to cases where outputs are trivial to programmatically verify (or reliably verify with a dumb LLM) in the context of large scale RL. E.g., you can trivially verify the answers to purely numerical math problems (or competitive programming or other programming situations where you have test cases). Indeed, OpenAI LLMs have historically been much better at numerical math problems than proofs, though possibly this gap has now been closed (at least partially).
I think this is overall reasonable if you interpret “hard-to-verify” as “substantially harder to verify” and I think this probably how many people would read this by default. I don’t have a strong view about whether this method will actually generalize to other cases where experts can verify things with high agreement in a few hours.
(Noam Brown doesn’t say anything about competitive programming, so I’m not sure why you mentioned that. Competitive programming is trivial to verify.)
I think this is overall reasonable if you interpret “hard-to-verify” as “substantially harder to verify” and I think this probably how many people would read this by default
Not sure about this. The kind of “hard-to-verify” I care about is e. g. agenty behavior in real-world conditions. I assume many other people are also watching out for that specifically, and that capability researchers are deliberately aiming for it.
And I don’t think the proofs are any evidence for that. The issue is that there exists, in principle, a way to easily verify math proofs: by translating them into a formal language and running them through a theorem-verifier. So the “correct” way for gradient descent to solve this was to encode some sort of internal theorem-verifier into the LLM.
Even more broadly, we know that improved performance at IMO could be achieved by task-specific models (AlphaGeometry, AlphaProof), which means that much of the IMO benchmark is not a strong signal of general intelligence. A general intelligence can solve it, but one oughtn’t be a general intelligence for that, and since gradient descent prefers shortcuts and shallow solutions...
They say they’re not using task-specific methodology, but what does this actually mean? Does it mean they did not even RL the model on math-proof tasks, they RL’d it on something else and the gold-level IMO performance arose by transfer learning? Doubt it. Which means this observation doesn’t really distinguish between “the new technique is fully general and works in all domains” and “the new technique looks like it should generalize because of secret technical details, but it doesn’t actually, it only worked here because it exploited the underlying easy-to-verify properties of the task”.
When Noam Brown says “hard-to-verify”, I think he means that natural language IMO proofs are “substantially harder to verify”: he says “proofs are pages long and take experts hours to grade”.
(Yes, there are also things which are much harder to verify like things that experts strongly disagree about after years of discussion. Also, for IMO programs, “hours to grade” is probably overstated?)
Also, I interpreted this as mostly in contrast to cases where outputs are trivial to programmatically verify (or reliably verify with a dumb LLM) in the context of large scale RL. E.g., you can trivially verify the answers to purely numerical math problems (or competitive programming or other programming situations where you have test cases). Indeed, OpenAI LLMs have historically been much better at numerical math problems than proofs, though possibly this gap has now been closed (at least partially).
I think this is overall reasonable if you interpret “hard-to-verify” as “substantially harder to verify” and I think this probably how many people would read this by default. I don’t have a strong view about whether this method will actually generalize to other cases where experts can verify things with high agreement in a few hours.
(Noam Brown doesn’t say anything about competitive programming, so I’m not sure why you mentioned that. Competitive programming is trivial to verify.)
Not sure about this. The kind of “hard-to-verify” I care about is e. g. agenty behavior in real-world conditions. I assume many other people are also watching out for that specifically, and that capability researchers are deliberately aiming for it.
And I don’t think the proofs are any evidence for that. The issue is that there exists, in principle, a way to easily verify math proofs: by translating them into a formal language and running them through a theorem-verifier. So the “correct” way for gradient descent to solve this was to encode some sort of internal theorem-verifier into the LLM.
Even more broadly, we know that improved performance at IMO could be achieved by task-specific models (AlphaGeometry, AlphaProof), which means that much of the IMO benchmark is not a strong signal of general intelligence. A general intelligence can solve it, but one oughtn’t be a general intelligence for that, and since gradient descent prefers shortcuts and shallow solutions...
They say they’re not using task-specific methodology, but what does this actually mean? Does it mean they did not even RL the model on math-proof tasks, they RL’d it on something else and the gold-level IMO performance arose by transfer learning? Doubt it. Which means this observation doesn’t really distinguish between “the new technique is fully general and works in all domains” and “the new technique looks like it should generalize because of secret technical details, but it doesn’t actually, it only worked here because it exploited the underlying easy-to-verify properties of the task”.