Note that they also claimed Opus 4 worked for over seven hours, but it only scored 1h20m on the METR task suite. I wouldn’t be surprised to see Sonnet 4.5 get a strong METR score, but 30 hours definitely isn’t likely.
Well, dividing the 30 hours by <7 would give >4 hours, which is on trend ;)
Actually that would be somewhat ahead still.
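For anyone who wants to check the back-of-envelope arithmetic, here is a rough sketch. It assumes (and this is only the naive assumption from the comments above, not anything METR or the lab has stated) that the ratio between the claimed autonomous runtime and the METR score carries over from Opus 4 to Sonnet 4.5. The function and variable names are made up for illustration.

```python
def implied_horizon_hours(claimed_hours: float, ratio: float) -> float:
    """Scale a claimed runtime down by a claimed-to-METR ratio (naive assumption)."""
    return claimed_hours / ratio


# Figures quoted in the thread.
opus4_claimed_hours = 7.0        # "over seven hours", so treat this as a lower bound
opus4_metr_hours = 80 / 60       # 1h20m on the METR task suite
sonnet45_claimed_hours = 30.0

# Ratio of claimed runtime to METR score for Opus 4.
ratio = opus4_claimed_hours / opus4_metr_hours

print(round(ratio, 2))                                                # 5.25
print(round(implied_horizon_hours(sonnet45_claimed_hours, ratio), 2))  # 5.71
print(round(implied_horizon_hours(sonnet45_claimed_hours, 7.0), 2))    # 4.29
```

So a divisor anywhere below 7 puts the implied figure above roughly 4.3 hours, and the Opus 4 ratio itself would put it near 5.7 hours. Whether that counts as on trend or ahead depends on which trend line you compare against.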