OK. Next question: Suppose that next year we get a nice result showing that there is a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels; suppose that this model can be given e.g. 10 tokens of CoT, 100, 1,000, 10,000, etc., and that somewhere around the billion-serial-token level it starts solving a decent chunk of the “medium” FrontierMath problems (but not all), whereas at the million-serial-token level it was only solving MATH + some easy IMO problems.
Would this count, for you?
Not for math benchmarks. Here’s one way it can “cheat” at them: suppose that the CoT would involve the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified.
This wouldn’t actually show “agency” and strategic thinking of the kinds that might generalize to open-ended domains and “true” long-horizon tasks. In particular, this would mostly fail condition (2) from my previous comment.
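To make the “cheating” loop above concrete, here is a minimal sketch in Python. It is a toy illustration, not a claim about how any actual model works: the names (`generate_and_verify`, `divisor_candidates`, `make_verifier`) are made up for this example, and finding a nontrivial divisor stands in for “finding a proof”. The generator proposes candidates blindly, a cheap verifier accepts or rejects each one, and a bigger budget solves more problems without any strategy being involved.

```python
from itertools import count
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def generate_and_verify(
    candidates: Iterator[T],
    verify: Callable[[T], bool],
    budget: int,
) -> Optional[T]:
    """Return the first candidate accepted by `verify`, or None once `budget` attempts are used."""
    for attempt, candidate in enumerate(candidates):
        if attempt >= budget:
            return None  # budget exhausted without an accepted candidate
        if verify(candidate):
            return candidate  # verifier accepted: output it
        # otherwise: reject and move on to the next candidate
    return None


def divisor_candidates() -> Iterator[int]:
    """Hypothetical 'proof generator': blindly enumerates 2, 3, 4, ..."""
    return count(2)


def make_verifier(n: int) -> Callable[[int], bool]:
    """Hypothetical 'proof verifier': a candidate is valid if it is a nontrivial divisor of n."""
    return lambda d: 1 < d < n and n % d == 0


n = 101 * 103  # toy "problem": exhibit a nontrivial divisor of 10403

small = generate_and_verify(divisor_candidates(), make_verifier(n), budget=50)
large = generate_and_verify(divisor_candidates(), make_verifier(n), budget=500)

print(small)  # None: 50 attempts are not enough
print(large)  # 101: found on the 100th attempt
```

The point of the toy is just that success here scales with budget purely because the verifier is cheap and reliable; nothing in the loop plans, prioritizes, or learns from failed attempts.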
Something more open-ended and requiring “research taste” would be needed. Maybe comparable performance on METR’s benchmark would work for this (i.e., the model can beat a significantly larger fraction of it at 1 billion tokens compared to 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.
Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn’t follow the above “cheating” approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)
Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force-proof-search (though of course they show lots of failed attempts and backtracking—just like a human thought-chain would).
Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI (though I’m not sure).
Certainly (experimenting with r1’s CoTs right now, in fact). I agree that they’re not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system “technically” clears the bar you’d outlined, yet I end up unmoved (I don’t want to end up goalpost-moving).
Though neither are they being “strategic” in the way I expect they’d need to be in order to productively use a billion-token CoT.
Yeah, I’m also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!