I think the actual epistemic process that happened here is something like:
The AI 2027 authors had some high-level arguments that AI might be a very big deal soon
They wrote down a bunch of concrete scenarios that seemed like they would follow from those arguments and checked if they sounded coherent and plausible and consistent with lots of other things they thought about the world
As part of that checking, one thing they checked was whether these scenarios would be some kind of huge break from existing trends, which I do think is a hard thing to do, but is an important thing to pay attention to
The right way to interpret the “timeline forecast” sections is not as “here is a simple extrapolation methodology that generated our whole worldview” but instead as “here is some methodology that sanity-checked that our worldview is not in obvious contradiction to reasonable assumptions about economic growth”
But like, at least for me, it’s clear that the beliefs about takeoff and the exact timelines could not be, and obviously should not be, considered the result of a straightforward and simple extrapolation exercise. I think such an exercise would be pretty doomed, and a claim to objectivity in that space seems misguided. I think it’s plausible that some parts of the Timelines Forecast supplement ended up communicating too much objectivity here, but IDK, I think AI 2027 as a whole communicated this process pretty well.
But like, at least for me, it’s clear that the beliefs about takeoff and the exact timelines could not be, and obviously should not be, considered the result of a straightforward and simple extrapolation exercise
Counterpoint: the METR agency-horizon doubling trend. It has its issues, but I think “the point at which an AI could complete a year-long software-engineering/DL research project” is a reasonable cutoff point for “AI R&D is automated”, and it seems to be the kind of non-overly-fine-tuned model with non-epsilon empirical backing that I’m talking about, in a way AI 2027 graphs are not.
Or maybe the distinction isn’t as stark in others’ minds as in mine, I dunno.
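To make concrete the extrapolation this counterpoint leans on, here is a minimal sketch of the doubling arithmetic. The starting horizon, the doubling time, and the 2000-hour threshold for “a year-long project” are purely illustrative assumptions, not METR’s published estimates.

```python
from math import log2

# Minimal sketch of extrapolating an agency-horizon doubling trend.
# ASSUMPTIONS (illustrative only, not METR's published fit):
current_horizon_hours = 1.0    # assumed 50%-success time horizon today
doubling_time_years = 7 / 12   # assumed doubling time of that horizon
target_horizon_hours = 2000    # ~one work-year of effort (~50 weeks * 40 h)

doublings_needed = log2(target_horizon_hours / current_horizon_hours)
years_until_target = doublings_needed * doubling_time_years

print(f"Doublings needed: {doublings_needed:.1f}")
print(f"Years until a ~one-work-year horizon: {years_until_target:.1f}")
# With these particular numbers: ~11 doublings, i.e. roughly 6-7 years out.
```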
As part of that checking, one thing they checked was whether these scenarios would be some kind of huge break from existing trends, which I do think is a hard thing to do
Is it? See titotal’s six-stories section. If you’re choosing which function to fit, with a bunch of free parameters you set manually, it seems pretty trivial to come up with a “trend” that would fit any model you have.
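As a toy illustration of that concern, here is a sketch in which the same made-up horizon data (none of these numbers are METR’s) are fit with two different functional forms. Both track the points to within a fraction of a doubling, yet they imply noticeably different dates for reaching a ~one-work-year horizon.

```python
import numpy as np

# Made-up log2(time horizon in hours) observations: roughly "minutes in 2019,
# about an hour in 2025". Invented for illustration -- these are not METR's data.
years = np.array([2019.0, 2020.5, 2022.0, 2023.5, 2025.0])
log2_horizon = np.array([-6.6, -5.2, -3.6, -1.9, 0.1])
t = years - 2025.0  # center the time axis for a well-conditioned fit

target = np.log2(2000)  # ~one work-year of task length, in log2(hours)

for name, degree in [("exponential (line in log-space)", 1),
                     ("superexponential (quadratic in log-space)", 2)]:
    coeffs = np.polyfit(t, log2_horizon, degree)
    rmse = np.sqrt(np.mean((np.polyval(coeffs, t) - log2_horizon) ** 2))
    # Find where the fitted curve reaches the target: solve polyval(coeffs, t) == target.
    shifted = coeffs.copy()
    shifted[-1] -= target
    crossings = [2025.0 + r.real for r in np.roots(shifted)
                 if abs(r.imag) < 1e-9 and r.real > 0]
    print(f"{name}: fit RMSE {rmse:.2f} bits, "
          f"crosses one work-year around {min(crossings):.0f}")
# Both fits track the historical points to within a fraction of a doubling; the
# disagreement only shows up in the extrapolated crossing date, which is the
# free-parameter worry above.
```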
Counterpoint: the METR agency-horizon doubling trend. It has its issues, but I think “the point at which an AI could complete a year-long software-engineering/DL research project” is a reasonable cutoff point for “AI R&D is automated”, and it seems to be the kind of non-overly-fine-tuned model with non-epsilon empirical backing that I’m talking about, in a way AI 2027 graphs are not.
I think the METR horizon doubling trend stuff doesn’t stand on its own, and it’s really not many datapoints.
I also really don’t think, without a huge number of assumptions, that “the point at which an AI could complete a year-long software-engineering/DL research project” is a good proxy for “AI R&D automation”, and indeed I want to avoid exactly that kind of sleight of hand. It only makes sense to someone who has a much more complicated worldview about how general AI is likely to be, how much the tasks METR measured are likely to generalize, and many other components. Where it does make sense is as a sanity check on that broader worldview.
I think the METR horizon doubling trend stuff doesn’t stand on its own, and it’s really not many datapoints.
It’s less about the datapoints and more about the methodology.
I also really don’t think, without a huge number of assumptions, that “the point at which an AI could complete a year-long software-engineering/DL research project” is a good proxy for “AI R&D automation”
Fair, I very much agree. But my point here is that the METR benchmark works as some additional technical/empirical evidence towards some hypotheses over others, evidence that’s derived independently from one’s intuitions, in a way that more fine-tuned graphs don’t.
Those two things sound extremely similar to me; I would appreciate some explanation/pointer to why they seem quite different.
Current guess: Is the idea that automation also includes a lot of (a) management and (b) research taste in choosing projects, such that being able to complete a year-long project is only a lower bound, not a central target?
Yeah, I mean, the task distribution is just hugely different. When METR measures software-development tasks, they mean things in the reference class of well-specified tasks with tests basically already written.
As a concrete example, if you just use a random other distribution of tasks for horizon length as your base, like forecasting performance per unit of time, or writing per unit of time, or graphic design per unit of time, you get drastically different time horizon curves.
This doesn’t make METR’s curves unreasonable as a basis, but you really need a lot of assumptions to get you from “these curves cross the one-year mark here” to “the same year we will get ~fully automated AI R&D” (and indeed I would not currently believe the latter).
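To illustrate how much the task distribution matters here, the same doubling arithmetic from the earlier sketch can be run with different assumed starting horizons and doubling times for different hypothetical task distributions. Every number below is invented for illustration; none of these are measured values.

```python
from math import log2

# How the "curves cross one work-year" date moves with the assumed task
# distribution. All numbers are invented for illustration, not measurements.
TARGET_HOURS = 2000.0  # ~one work-year of task length
BASE_YEAR = 2025

hypothetical_distributions = {
    # name: (assumed current 50%-success horizon in hours, assumed doubling time in years)
    "well-specified coding tasks": (1.0, 0.6),
    "messy open-ended research tasks": (0.2, 1.0),
    "long-form writing tasks": (0.5, 1.5),
}

for name, (horizon_hours, doubling_time_years) in hypothetical_distributions.items():
    years_to_target = log2(TARGET_HOURS / horizon_hours) * doubling_time_years
    print(f"{name}: one-work-year horizon around {BASE_YEAR + years_to_target:.0f}")
# With these invented parameters the crossing date spans roughly a decade,
# which is the sense in which the conclusion leans on the choice of task
# distribution rather than falling out of the data on its own.
```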
Preliminary work showing that the METR trend is approximately average:

I don’t know the details of all of these task distributions, but clearly these are not remotely sampled uniformly from the set of all tasks necessary to automate AI R&D?
Yes, in particular the concern about benchmark tasks being well-specified remains. We’ll need both more data (probably collected from AI R&D tasks in the wild) and more modeling to get a forecast for overall speedup.
However, I do think that if we have a wide enough distribution of tasks, AIs outperform humans on all of them at task lengths that should imply humans spend 1/10th the labor, and yet AI R&D has not been automated, then something strange needs to be happening. So looking at different benchmarks is partial progress towards understanding the gap between long time horizons on METR’s task set and actual AI R&D uplift.

(agree, didn’t intend to imply that they were)