they put substantial probability on the trend being superexponential
I think that’s too speculative.
I also think that around 25-50% of the questions are impossible or mislabeled.
I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why. I mean, I don’t know much about the people who had to sift through all of the submissions, but I’d be surprised if they failed that badly. Plus, there was a “bug bounty” aimed at improving the quality of the dataset.
TBC, my median to superhuman coder is more like 2031.
Guess I’m a pessimist then, mine is more like 2034.
I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why.
Various people looked at randomly selected questions and found similar numbers.
(I don’t think the dataset is worthless; if you filtered down to the best 25-50% of questions, I think it would be a reasonable dataset with an acceptable error rate.)