top-tier/superhuman benchmark performance vs. frequent falling-flat-on-its-face real-world performance
Models are only recently getting to the point where they can complete 2-hour tasks 50% of the time on METR's task suite (at least without scaffolding that uses much more inference compute).
This isn't yet top-tier performance, so I don't see the implication. The key claim is that progress here is very fast.
So, I don't currently feel that strongly that there is a huge benchmark-vs.-real-world performance gap, at least in autonomous SWE-ish tasks? (There might be one in math, and I agree that if you just looked at math and exam-question benchmarks and compared to humans, the models would seem much smarter than they are.)