Why does METR score o3 as effective over such a long task duration despite its overall poor scores?
Epistemic status: Question, probably missing something.
Context
See the preliminary evaluation of o3 and o4-mini here: https://metr.github.io/autonomy-evals-guide/openai-o3-report/#methodology-overview
This follows up on important work by METR measuring the maximum human-equivalent length of tasks that frontier models can complete successfully, a trend I predicted would not hold up (perhaps a little too stridently).
I'm also betting on that prediction; please provide me with some liquidity:
Question
o3 doesn’t seem to perform too well according to this chart:
But it gets the best score on this chart:
I understand that these are measuring two different things, so there is no logical inconsistency between the two facts, but the disparity does seem striking. Would someone be willing to provide a more detailed explanation of what is going on here? I am not sure whether to update toward the task-length trend in fact continuing (or even accelerating), or to interpret o3's overall poor performance as a sign that the trend is about to break down.
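For anyone answering: my rough understanding, from the methodology page linked above, is that the horizon metric comes from fitting a logistic curve of success probability against (log) human task length and reading off where that curve crosses 50%, which is a different quantity from raw average success on any particular task suite. Below is a minimal sketch of that idea with entirely made-up data; it is my reconstruction rather than METR's actual code, but it illustrates how a fitted 50% horizon can come out long even when the raw per-suite scores look unimpressive.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task results: human task length in minutes, and whether
# the model succeeded (1) or failed (0). These numbers are invented purely
# for illustration and are not METR data.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
successes    = np.array([1, 1, 1, 1, 1,  1,  1,  0,   1,   0,   0])

# Model success probability as a logistic function of log2(task length).
X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, successes)

# The 50% horizon is the length at which the fitted probability crosses 0.5,
# i.e. where coef * x + intercept = 0.
x50 = -clf.intercept_[0] / clf.coef_[0, 0]
horizon_minutes = 2 ** x50

print(f"Mean success rate across these tasks: {successes.mean():.0%}")
print(f"Fitted 50% time horizon: {horizon_minutes:.0f} minutes")
```

If my reconstruction is right, the two charts can diverge simply because one reports raw success rates on particular suites while the other reports where a fitted curve crosses 50%, so I would still appreciate an explanation of which effect dominates for o3 specifically.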