I’m glad METR did this work, and I think their approach is sane and we should keep adding data points to this plot.
It sounds like you also think the current points on the plot are accurate? I would strongly dispute this, for all the reasons discussed here and here. I think you can find sets of tasks where the points fit on an exponential curve, but I don’t think AI can do 1 hour’s worth of thinking on all, or even most, practically relevant questions.
I remember enjoying that post (perhaps I even linked it somewhere?) and I think it’s probably the case that the inefficiency in task length scaling has to do with LLMs having only a subset of cognitive abilities available. I’m not really committed to a view on that here though.
The links don’t seem to prove that the points are “inaccurate.”