I once pointed out that METR’s
“Baselined tasks tend to be easier and Baselining seems (definitely slightly) biased towards making them look easier still, while Estimated tasks tend to be harder and Estimation seems (potentially greatly) biased towards making them look harder still: the combined effect would be to make progress gradients look artificially steep in analyses where Baselined and Estimated tasks both matter.”
but found (to my surprise!) that removing all Estimated tasks didn’t affect headline results, presumably/partly because
“most of the Estimated tasks were really difficult ones where AIs never won, so errors here had negligible effect on the shapes of logistic regression curves.”
and footnoted that with
“Note that this does not mean they will continue to have negligible effects on next year’s agents.”
Well, it’s now next year: one more thing to keep in mind when deciding how much salt to take the Scary Graph with.
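For anyone who wants to poke at the intuition, here is a toy sketch of why an Estimation bias can be invisible for one agent and matter for a stronger one. Everything in it is an assumption made for illustration: the task difficulties, the 20 trials per task, the two hypothetical agents, and the size of the bias are synthetic, and none of it is METR's data or METR's actual fitting code. It fits a plain binomial logistic regression of success against log2(human time) and compares the 50% horizon with and without inflating the difficulty of the "Estimated" tasks.

```python
import numpy as np

TRIALS = 20  # hypothetical number of attempts per task


def log_likelihood(beta, X, successes, trials):
    z = X @ beta
    # log p = -log(1 + e^-z) and log(1 - p) = -log(1 + e^z)
    return float(np.sum(-successes * np.logaddexp(0.0, -z)
                        - (trials - successes) * np.logaddexp(0.0, z)))


def fit_logistic(x, successes, trials, iters=100):
    """Binomial logistic regression via Newton steps with step-halving.

    Returns (intercept, slope) for logit(p) = intercept + slope * x.
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (successes - trials * p)
        weights = trials * p * (1.0 - p)
        hess = X.T @ (X * weights[:, None]) + 1e-9 * np.eye(2)
        step = np.linalg.solve(hess, grad)
        current = log_likelihood(beta, X, successes, trials)
        scale = 1.0
        # Halve the step until the log-likelihood stops getting worse.
        while scale > 1e-8 and log_likelihood(
                beta + scale * step, X, successes, trials) < current:
            scale /= 2.0
        beta = beta + scale * step
    return beta


def horizon(x, successes, trials):
    """log2(human time) at which the fitted success probability is 50%."""
    intercept, slope = fit_logistic(x, successes, trials)
    return -intercept / slope


def horizon_shift(agent_midpoint, bias=2.0):
    """Change in fitted 50% horizon when Estimated tasks look `bias` harder.

    Tasks at log2 difficulty 0..8 play the role of Baselined tasks and
    10..12 play the role of Estimated ones (all synthetic). Success counts
    come from rounding a logistic curve centred on `agent_midpoint`.
    """
    true_x = np.concatenate([np.arange(9.0), [10.0, 11.0, 12.0]])
    is_estimated = true_x >= 10.0
    successes = np.round(TRIALS / (1.0 + np.exp(true_x - agent_midpoint)))
    trials = np.full_like(true_x, TRIALS)
    reported_x = true_x + bias * is_estimated  # inflate Estimated tasks only
    return (horizon(reported_x, successes, trials)
            - horizon(true_x, successes, trials))


# Weak agent: never succeeds on the Estimated tasks.
weak_shift = horizon_shift(agent_midpoint=4.0)
# Stronger agent: sometimes succeeds on the Estimated tasks.
strong_shift = horizon_shift(agent_midpoint=9.0)

print(f"weak agent horizon shift (log2 units):   {weak_shift:+.3f}")
print(f"strong agent horizon shift (log2 units): {strong_shift:+.3f}")
```

In this toy setup the weak agent's fitted horizon barely moves, because tasks it never wins sit far out on the flat tail of the curve, while the stronger agent's horizon moves visibly, because the mislabelled tasks now sit where the curve is actually bending. A shift of 1 in these units would correspond to a doubling of the time horizon. How large the effect is on real data depends on the real task mix and the real size of the bias, neither of which this sketch knows.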