top-tier/superhuman benchmark performance vs. frequent falling-flat-on-its-face real-world performance
Models are only recently getting to the point where they can complete 2-hour tasks 50% of the time on METR's task suite (at least without scaffolding that uses much more inference compute).
This isn't yet top-tier performance, so I don't see the implication. The key claim is that progress here is very fast.
So, I don't currently feel that strongly that there is a huge benchmark-vs.-real-world performance gap, at least in autonomous SWE-ish tasks? (There might be one in math, and I agree that if you just looked at math and exam-question benchmarks and compared to humans, the models would seem much smarter than they are.)