One thing she is discussing at the 03:36:51 mark is “What METR isn’t doing that other people have to step up and do”.
In this sense, METR’s “Measuring AI Ability to Complete Long Tasks” is not done often enough.
They ran that evaluation as part of “Details about METR’s preliminary evaluation of o3 and o4-mini”, https://metr.github.io/autonomy-evals-guide/openai-o3-report/ on April 16 (an eternity these days), and the results are consistent with the beginning of a super-exponential trend. But there have been no evaluations like that since, and we have had Claude 4 and remarkable Gemini-2.5-Pro updates. It seems crucial to understand what is going on with that potentially super-exponential trend and how rapidly the doubling period seems to shrink (if it actually shrinks at all).
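To make the distinction concrete, here is a minimal sketch, with made-up numbers rather than METR’s actual fit, of what it means for the doubling period to shrink: a constant doubling time gives ordinary exponential growth of the task-length horizon, while a doubling time that contracts by a fixed factor at each doubling gives super-exponential growth.

```python
# Hypothetical illustration (not METR's numbers): horizon growth under a
# constant doubling time vs. a doubling time that shrinks at each doubling.

def horizon_exponential(h0, doubling_months, t_months):
    """Horizon after t_months with a constant doubling time (exponential)."""
    return h0 * 2 ** (t_months / doubling_months)

def horizon_superexponential(h0, d0, shrink, t_months):
    """Horizon when each successive doubling takes `shrink` times as long as
    the previous one (shrink < 1 means the doubling period keeps contracting,
    i.e. super-exponential growth)."""
    horizon, d, t = h0, d0, 0.0
    while t + d <= t_months:
        t += d
        horizon *= 2
        d *= shrink
    # partial progress through the current, not-yet-finished doubling
    horizon *= 2 ** ((t_months - t) / d)
    return horizon

# Starting from a hypothetical 1-hour horizon with a 7-month doubling time:
print(f"exponential after 2 years:       {horizon_exponential(1.0, 7.0, 24):.1f} h")
print(f"super-exponential after 2 years: {horizon_superexponential(1.0, 7.0, 0.85, 24):.1f} h")
```

With `shrink=1.0` the two curves coincide; any `shrink < 1` pulls the super-exponential curve above the exponential one, and the gap widens over time, which is exactly the signature more frequent evaluations would be needed to detect.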
So, perhaps, we should start asking who could help in this sense...