ryan_greenblatt comments on METR: Measuring AI Ability to Complete Long Tasks

ryan_greenblatt 24 Mar 2025 0:21 UTC
12 points
5
My sense is that the GPT-2 and GPT-3 results are somewhat dubious, especially the GPT-2 result. It really depends on how you relate SWAA (small software engineering subtasks) to the rest of the tasks. My understanding is that no iteration was done, though.

However, note that it wouldn’t be wildly more off trend if GPT-3 were anywhere from 4 to 30 seconds, whereas it is actually at ~8 seconds. And the GPT-2 results are very consistent with “almost too low to measure”.

Overall, I don’t think it’s incredibly weird (given that the rate of increase of compute and people in 2019-2023 isn’t that different from the rate in 2024), but many results would have been roughly on trend.