Yes, in particular the concern about benchmark tasks being well-specified remains. We’ll need both more data (probably collected from AI R&D tasks in the wild) and more modeling to get a forecast for overall speedup.
However, I do think that if we have a wide enough distribution of tasks, and AIs outperform humans on all of them at task lengths that should imply humans spend 1/10th the labor, yet AI R&D has still not been automated, then something strange must be happening. So looking at different benchmarks is partial progress towards understanding the gap between long time horizons on METR’s task set and actual AI R&D uplift.