This contradicts the METR timeline data, which, IMO, is the best piece of info we currently have for predicting when AGI will arrive.
Have you read the timelines supplement? One of their main methodologies involves using this exact data from METR (yielding 2027 medians). The key differences from the extrapolation methodology used by METR are: they use a somewhat shorter doubling time, which seems closer to what we see in 2024 (a 4.5-month median rather than 7 months), and they put substantial probability on the trend being superexponential.
why the timelines will be much longer
I think the timelines to superhuman coder implied by METR’s work are closer to 2029 than 2027, so 2 more years, or 2x longer. I don’t think most people would think of this as much longer, though I guess 2x longer could qualify as much longer.
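To make concrete how much the doubling time matters in this kind of extrapolation, here is a toy sketch in the METR style. The starting horizon (~1 hour), the start date (early 2025), and the target years are all placeholder assumptions for illustration; neither METR’s extrapolation nor the timelines supplement reduces to this single calculation.

```python
# Toy extrapolation of a METR-style 50% time horizon under a constant doubling time.
# All numbers are illustrative placeholders, not figures from METR or the supplement.
CURRENT_HORIZON_HOURS = 1.0   # assume roughly a 1-hour horizon at the start date
START_YEAR = 2025.0

def horizon_at(year: float, doubling_time_months: float) -> float:
    """Extrapolated time horizon (in hours) at a given calendar year."""
    months_elapsed = (year - START_YEAR) * 12
    return CURRENT_HORIZON_HOURS * 2 ** (months_elapsed / doubling_time_months)

for doubling_time in (7.0, 4.5):  # METR's fitted doubling time vs. the shorter 2024-ish one
    horizons = ", ".join(
        f"{year}: ~{horizon_at(year, doubling_time):,.0f} h" for year in (2027, 2029, 2031)
    )
    print(f"doubling time {doubling_time} months -> {horizons}")
```

The point is only that the implied date is very sensitive both to the doubling time and to what horizon you think a superhuman coder requires, which is where the forecasts actually differ.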
Considering that today’s frontier LLMs can solve at most 20% of the problems on Humanity’s Last Exam, both of these predictions appear overly optimistic to me. And HLE isn’t even about autonomous research; it’s about “closed-ended, verifiable questions”. Even if some LLM scored >90% on HLE by late 2025 (I bet this won’t happen), that wouldn’t automatically imply that it’s good at open-ended problems with no known answer. Present-day LLMs have so little agency that it’s not even worth talking about.
I’m not sure that smart humans can score 20% on Humanity’s Last Exam (HLE). I also think that around 25-50% of the questions are impossible or mislabeled. So, this doesn’t seem like a very effective way to rule out capabilities.
I think scores on HLE are mostly just not a good indicator of the relevant capabilities. (Given our current understanding.)
TBC, my median to superhuman coder is more like 2031.
they put substantial probability on the trend being superexponential
I think that’s too speculative.
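For readers unsure what “superexponential” means in this context: the idea is that the doubling time itself shrinks as capabilities improve, rather than staying fixed, so the horizon blows up in finite time. A minimal sketch with made-up parameters (not the supplement’s actual model):

```python
# Contrast with a fixed doubling time: here each successive doubling of the time
# horizon takes 10% less calendar time than the previous one. All parameters are
# made up for illustration.
horizon_hours = 1.0
year = 2025.0
doubling_time_months = 4.5
SHRINK_PER_DOUBLING = 0.9  # "superexponential": each doubling is 10% faster than the last

for step in range(1, 13):
    year += doubling_time_months / 12
    horizon_hours *= 2
    doubling_time_months *= SHRINK_PER_DOUBLING
    print(f"doubling {step:2d}: year ~{year:.2f}, horizon ~{horizon_hours:,.0f} h")
```

Under a fixed 4.5-month doubling time these 12 doublings would take 4.5 years; the shrinking-doubling-time variant front-loads them, which is why putting weight on it pulls the median earlier.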
I also think that around 25-50% of the questions are impossible or mislabeled.
I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why. I mean, I don’t know much about the people who had to sift through all of the submissions, but I’d be surprised if they failed that badly. Plus, there was a “bug bounty” aimed at improving the quality of the dataset.
TBC, my median to superhuman coder is more like 2031.
Guess I’m a pessimist, then; mine is more like 2034.
I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why.
Various people looked at randomly selected questions and found similar numbers.
(I don’t think the dataset is worthless, I think if you filtered down to the best 25-50% of questions it would be a reasonable dataset with acceptable error rate.)
Looks like Eli beat me to the punch!