I think the important thing to realise is that the ‘marginal’ approach requires additional steps only when fitting a model that explicitly accounts for the deviation between task-length-for-humans and task-difficulty-for-LLMs; models that don’t explicitly account for this (such as the original METR model) should absorb it naturally into the shape of their logistic curve.
I don’t immediately see this. The marginal idea is roughly about integrating over random effects, and that’s hard to capture without actually doing the integration. My earlier statement that METR’s original approach targets the typical effect was wrong, though.
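To make the disagreement concrete, here is a minimal simulation sketch (my own illustration, not either of our actual models) of what a plain fixed-effects logistic fit does when the data are generated with a per-task random effect: the fitted curve ends up flatter than the conditional one, i.e. the random effect is partly absorbed into the slope rather than reproduced by integration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
beta = 2.0    # conditional (task-specific) logistic slope
sigma = 1.5   # sd of the per-task random effect

# marginal success probability: E_b[sigmoid(beta*x + b)], b ~ N(0, sigma^2)
b = rng.normal(0.0, sigma, size=100_000)
x = np.linspace(-4.0, 4.0, 41)
p_marginal = np.array([sigmoid(beta * xi + b).mean() for xi in x])

# fit a plain logistic (no random effect) to the marginal curve
# via least squares on the logit scale
logit = np.log(p_marginal / (1.0 - p_marginal))
slope_marginal = np.polyfit(x, logit, 1)[0]

print(slope_marginal)  # attenuated: smaller than the conditional slope beta
```

The attenuation is real, but note what the sketch shows: the fixed-effects fit recovers a *rescaled* logistic, not the marginal integral itself, which is (approximately) why one can argue either that the marginal behaviour is "learned into the curve" or that it isn't truly captured without doing the integration.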
Very nice! I can’t comment in much detail since I don’t know the specifics of your model, but can you clarify what you mean by
I have to admit I have worked with the METR data mostly as-is, and have not gone into detail about how the times were estimated. I suppose the problem is that only a subset of the tasks have grounded estimates of human completion times (as I interpreted HCAST?), and the rest are inferred in a more or less ad hoc way? If so, that would explain the 80% marginal times being shorter, because the residuals would plausibly be smaller.