Josh You comments on METR: Measuring AI Ability to Complete Long Tasks

Josh You 27 Mar 2025 15:39 UTC
I think you measured time horizons for two models, Claude 3 Opus and GPT-4 Turbo, that didn’t make it onto the main figure. Is that right? Figure 5, which shows the time horizon curves for a range of models across the full test suite, includes 13 models, but Figure 1 has only 11 dots.