I think the METR horizon doubling trend stuff doesn’t stand on its own, and it’s really not many datapoints.
It’s less about the datapoints and more about the methodology.
I also really don’t think, without a huge number of assumptions, that “the point at which an AI could complete a year-long software-engineering/DL research project” is a good proxy for “AI R&D automation”.
Fair, I very much agree. But my point here is that the METR benchmark works as additional technical/empirical evidence for some hypotheses over others, evidence derived independently of one’s intuitions, in a way that more fine-tuned graphs don’t.