I don’t have one knock-down counterargument for why the timelines will be much longer, so here’s a whole lot of convincing-but-not-super-convincing counterarguments:
This contradicts the METR timelines, which, IMO, are the best piece of info we currently have for predicting when AGI will arrive.
Microsoft is not going to fund Stargate. At best, this means Stargate will be delayed; at worst, it will be axed. For the timelines in this post to be accurate, Stargate would have to be mostly finished right now, today. Even if OpenAI could literally print money, large data centers take 2-6 years to build.
“Moreover, it [Agent-1] could offer substantial help to terrorists designing bioweapons, thanks to its PhD-level knowledge of every field and ability to browse the web.”

“Agent-1, a new model under internal development, it’s good at many things but great at helping with AI research.”

Considering that frontier LLMs of today can solve at most 20% of problems on Humanity’s Last Exam, both of these predictions appear overly optimistic to me. And HLE isn’t even about autonomous research, it’s about “closed-ended, verifiable questions”. Even if some LLM scored >90% on HLE by late 2025 (I bet this won’t happen), that wouldn’t automatically imply that it’s good at open-ended problems with no known answer. Present-day LLMs have so little agency that it’s not even worth talking about.

EDIT: for the record, I expect 45-50% accuracy on HLE by the end of 2025, 70-75% by the end of 2026, and 85-90% accuracy by the end of 2027.
A memory module that can be stored externally (on a hard drive) is handwaved as something that Just Works™; I don’t expect it to be so easy.
As of today, there is no robust anti-hallucination/error correction mechanism for LLMs. It seems like another thing that is handwaved as something that Just Works™: just beat the neural net with the RLHF stick until the outputs look about right.
Imagine writing a similar piece for videogames in 2016, after AlphaGo became news. If someone wrote that by 2018-2019 all mainstream videogames would use neural nets trained using RL, they would be laughably wrong. Hell, if someone wrote that but replaced 2018-2019 with 2024-2025, they would still be wrong. This is the least convincing argument of all of these; it’s just my way of saying, “I don’t feel like reality actually, really works this way”. On that note, I also expect that recursive self-improvement requires a completely new architecture that not only doesn’t look like Transformers but doesn’t even look like a neural network.
Not a criticism, but I think you overlooked a very interesting possibility: developing a near-perfect speech-to-text transcription AI and transcribing all of YouTube. The biggest issue with training multi-modal models is acquiring the right (“paired”) training data. If YouTube had 99.9% accurate subtitles for every video, this would no longer be a problem.
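As a rough sketch of what that pipeline could look like with today's open-source tooling, here is a toy example using the openai-whisper package; the audio file name and model size are placeholders of mine, and 99.9% accuracy would of course require something well beyond an off-the-shelf model.

```python
# Sketch: turn downloaded audio into paired (timestamped audio span, text)
# training data with an off-the-shelf speech-to-text model.
# Requires the open-source `openai-whisper` package and ffmpeg.
import json
import whisper

model = whisper.load_model("large")           # smaller models trade accuracy for speed
result = model.transcribe("video_audio.mp3")  # placeholder path to extracted audio

# Each segment pairs a time span of the audio with its transcript,
# i.e. exactly the kind of "paired" data mentioned above.
with open("video_audio.jsonl", "w") as f:
    for seg in result["segments"]:
        f.write(json.dumps({"start": seg["start"],
                            "end": seg["end"],
                            "text": seg["text"].strip()}) + "\n")
```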
This contradicts the METR timelines, which, IMO, are the best piece of info we currently have for predicting when AGI will arrive.
I disagree; METR’s extrapolations don’t attempt to account for important factors such as intermediate speedups from pre-superhuman systems and possible superexponentiality. METR folks agree with me that their paper is not trying to do what we are trying to do, i.e. predict timelines via a model that takes into account all relevant factors (they don’t necessarily endorse our model’s outputs!). See https://blog.ai-futures.org/p/beyond-the-last-horizon and our full timelines supplement for more.
A memory module that can be stored externally (on a hard drive) is handwaved as something that Just Works™; I don’t expect it to be so easy.
Yeah, I mean, it’s simplified a little in our story for brevity. Of course, in practice this advance will likely come after substantial experimentation rather than on the first try, and there will be iterative improvements on top of early prototypes.
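For what it's worth, the naive "Just Works" version of such a module is easy to sketch: embed snippets, persist them to disk, retrieve by similarity. The toy illustration below is my own, with a placeholder hashing "embedding"; the gap between this and a memory that actually makes an agent better over long horizons is exactly the hard part being discussed here.

```python
# Toy external memory: snippets embedded, persisted to disk, retrieved by
# cosine similarity. The hashed bag-of-words "embedding" is a placeholder;
# a real system would use a learned encoder and face the harder questions
# of what to store, when to retrieve, and how the model uses what comes back.
import json
import hashlib
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: hashed bag of words (not a real encoder)."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

class DiskMemory:
    def __init__(self, path="memory.jsonl"):
        self.path = path
        self.entries = []  # list of (vector, text) pairs
        try:
            with open(path) as f:
                for line in f:
                    item = json.loads(line)
                    self.entries.append((np.array(item["vec"]), item["text"]))
        except FileNotFoundError:
            pass

    def add(self, text: str):
        vec = embed(text)
        self.entries.append((vec, text))
        with open(self.path, "a") as f:
            f.write(json.dumps({"vec": vec.tolist(), "text": text}) + "\n")

    def search(self, query: str, k: int = 3):
        qv = embed(query)
        scored = sorted(self.entries, key=lambda e: -float(e[0] @ qv))
        return [text for _, text in scored[:k]]

memory = DiskMemory()
memory.add("The build failed because the CUDA version was too old.")
print(memory.search("why did the build fail?"))
```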
As of today, there is no robust anti-hallucination/error correction mechanism for LLMs. It seems like another thing that is handwaved as something that Just Works™: just beat the neural net with the RLHF stick until the outputs look about right.
I view this as a continuous problem that is already getting better over time, and it’s valid to project this trend to continue. To be clear, we are not saying anywhere that techniques won’t change; certainly the training won’t look like vanilla RLHF (it is already moving away from this). We expect training techniques to iteratively improve over time.
This contradicts the METR timelines, which, IMO, are the best piece of info we currently have for predicting when AGI will arrive.
Have you read the timelines supplement? One of their main methodologies involves using this exact data from METR (yielding 2027 medians). The key differences from the extrapolation methodology used by METR are that they use a somewhat shorter doubling time, which seems closer to what we see in 2024 (a 4.5-month median rather than 7 months), and that they put substantial probability on the trend being superexponential.
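To make the arithmetic being argued about here concrete, below is a toy sketch (my own illustration, not the model from the timelines supplement or METR's methodology) of how a doubling-time extrapolation turns a current task time horizon into a crossing date. The 4-hour starting horizon and 1-month target are placeholder assumptions.

```python
import math

def years_until_horizon(current_hours, target_hours, doubling_months,
                        shrink_per_doubling=1.0):
    """Years until the task time horizon reaches target_hours.

    shrink_per_doubling < 1.0 makes each successive doubling faster,
    a crude stand-in for a superexponential trend.
    """
    remaining = math.log2(target_hours / current_hours)  # doublings needed
    months, step = 0.0, doubling_months
    while remaining > 0:
        months += step * min(1.0, remaining)
        remaining -= 1.0
        step *= shrink_per_doubling
    return months / 12.0

# Placeholder assumptions: a 4-hour current horizon and a ~1-month
# (167 work-hour) target; both numbers are illustrative, not taken
# from METR or the timelines supplement.
for label, doubling, shrink in [("7-month doublings", 7.0, 1.0),
                                ("4.5-month doublings", 4.5, 1.0),
                                ("4.5-month, superexponential", 4.5, 0.9)]:
    years = years_until_horizon(4.0, 167.0, doubling, shrink)
    print(f"{label}: ~{years:.1f} years from now")
```

The only point of the toy model is that the implied date is quite sensitive to the assumed doubling time and to whether doublings speed up, which is where the 2027-vs-2029 disagreement below comes from.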
why the timelines will be much longer
I think the timelines to superhuman coder implied by METR’s work are closer to 2029 than 2027, so 2 more years or 2x longer. I don’t think most people will think of this as much longer, though I guess 2x longer could qualify as much longer.
Considering that frontier LLMs of today can solve at most 20% of problems on Humanity’s Last Exam, both of these predictions appear overly optimistic to me. And HLE isn’t even about autonomous research, it’s about “closed-ended, verifiable questions”. Even if some LLM scored >90% on HLE by late 2025 (I bet this won’t happen), that wouldn’t automatically imply that it’s good at open-ended problems with no known answer. Present-day LLMs have so little agency that it’s not even worth talking about.
I’m not sure that smart humans can score 20% on Humanity’s Last Exam (HLE). I also think that around 25-50% of the questions are impossible or mislabeled. So, this doesn’t seem like a very effective way to rule out capabilities.
I think scores on HLE are mostly just not a good indicator of the relevant capabilities. (Given our current understanding.)
TBC, my median to superhuman coder is more like 2031.
they put substantial probability on the trend being superexponential
I think that’s too speculative.
I also think that around 25-50% of the questions are impossible or mislabeled.
I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why. I mean, I don’t know much about the people who had to sift through all of the submissions, but I’d be surprised if they failed that badly. Plus, there was a “bug bounty” aimed at improving the quality of the dataset.
TBC, my median to superhuman coder is more like 2031.
Guess I’m a pessimist then, mine is more like 2034.
I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why.
Various people looked at randomly selected questions and found similar numbers.
(I don’t think the dataset is worthless, I think if you filtered down to the best 25-50% of questions it would be a reasonable dataset with acceptable error rate.)