Claude Opus 4.5 felt like a large practical leap, with some, like Dean Ball, going so far as to call it AGI. I don’t agree, but I understand where they are coming from.
Alas, Claude Opus 4.5 is likely on track to force METR to add new tasks to the benchmark because comments like this and this indicate that the benchmark itself is no longer as reliable as it once was…
P.S. If you inserted a video in the original post, why did it get lost in cross-posting?