ryan_greenblatt comments on AI 2027: What Superintelligence Looks Like

ryan_greenblatt 19 Apr 2025 22:15 UTC
9 points

> I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why.

Various people looked at randomly selected questions and found similar numbers.

(I don’t think the dataset is worthless. I think if you filtered down to the best 25-50% of questions, it would be a reasonable dataset with an acceptable error rate.)