Thanks titotal for taking the time to dig deep into our model and write up your thoughts, it’s much appreciated. This comment speaks for Daniel Kokotajlo and me, not necessarily any of the other authors on the timelines forecast or AI 2027. It addresses most but not all of titotal’s post.
Overall view: titotal pointed out a few mistakes and communication issues which we will mostly fix. We are therefore going to give titotal a $500 bounty to represent our appreciation. However, we continue to disagree on the core points regarding whether the model’s takeaways are valid and whether it was reasonable to publish a model with this level of polish. We think titotal’s critiques aren’t strong enough to overturn the core conclusion that superhuman coders by 2027 are a serious possibility, nor to significantly move our overall median (edit: I now think it’s plausible that changes made as a result of titotal’s critique will move our median significantly). Moreover, we continue to think that AI 2027’s timelines forecast is (unfortunately) the world’s state-of-the-art, and challenge others to do better. If instead of surpassing us, people simply want to offer us critiques, that’s helpful too; we hope to surpass ourselves every year in part by incorporating and responding to such critiques.
Clarification regarding the updated model
My apologies for quietly updating the timelines forecast without announcing it; we are aiming to announce the update soon. I’m glad that titotal was able to see it.
A few clarifications:
titotal says “it predicts years longer timescales than the AI2027 short story anyway.” While the medians are indeed 2029 and 2030, the models still give ~25-40% to superhuman coders by the end of 2027.
Other team members (e.g. Daniel K) haven’t reviewed the updated model in depth, and have not integrated it into their overall views. Daniel is planning to do this soon, and will publish a blog post about it when he does.
Most important disagreements
I’ll let titotal correct us if we misrepresent them on any of this.
Whether to estimate and model dynamics for which we don’t have empirical data. For example, titotal says there is “very little empirical validation of the model,” and especially criticizes the modeling of superexponentiality as having no empirical backing. We agree that it would be great to have more empirical validation of more of the model’s components, but unfortunately that isn’t feasible at the moment while still incorporating all of the highly relevant factors.[1]
Whether to adjust our estimates based on factors outside the data. For example, titotal criticizes us for making judgmental forecasts for the date of RE-Bench saturation, rather than plugging in the logistic fit. I’m strongly in favor of allowing intuitive adjustments on top of quantitative modeling when estimating parameters.
[Unsure about level of disagreement] The value of a “least bad” timelines model. While the model is certainly imperfect due to limited time and the inherent difficulties around forecasting AGI timelines, we still think overall it’s the “least bad” timelines model out there and it’s the model that features most prominently in my overall timelines views. I think titotal disagrees, though I’m not sure which one they consider least bad (perhaps METR’s simpler one in their time horizon paper?). But even if titotal agreed that ours was “least bad,” my sense is that they might still be much more negative on it than us. Some reasons I’m excited about publishing a least bad model:
Reasoning transparency. We wanted to justify the timelines in AI 2027, given limited time. We think it’s valuable to be transparent about where our estimates come from even if the modeling is flawed in significant ways. Additionally, it allows others like titotal to critique it.
Advancing the state of the art. Even if a model is flawed, it seems best to publish to inform others’ opinions and to allow others to build on top of it.
The likelihood of time horizon growth being superexponential, before accounting for AI R&D automation. See this section for our arguments in favor of superexponentiality being plausible, and titotal’s responses (I put it at 45% in our original model). This comment thread has further discussion. If you are very confident in no inherent superexponentiality, superhuman coders by end of 2027 become significantly less likely, though they are still >10% if you agree with the rest of our modeling choices (see here for a side-by-side graph generated from my latest model).
How strongly superexponential the progress would be. This section argues that our choice of superexponential function is arbitrary. While we agree that the choice is fairly arbitrary and ideally we would have uncertainty over the best function, my intuition is that titotal’s proposed alternative curve feels less plausible than the one we use in the report, conditional on some level of superexponentiality.
Whether the argument for superexponentiality is stronger at higher time horizons. titotal is confused about why there would sometimes be a delayed superexponential rather than one that starts at the simulation’s starting point. The reasoning here is that the conceptual argument for superexponentiality is much stronger at higher time horizons (e.g. going from 100 to 1,000 years likely feels much easier than going from 1 to 10 days, while it’s less clear for 1 to 10 weeks vs. 1 to 10 days). It’s unclear that the delayed superexponential is the exact right way to model that, but it’s what I came up with for now. (A rough sketch contrasting the exponential and delayed-superexponential shapes follows this list.)
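To make the two curve shapes concrete, here is a minimal sketch. The 1-hour starting horizon, 6-month initial doubling time, 0.85 decay factor, and 8-hour threshold are illustrative assumptions for demonstration only, not the report’s fitted parameters.

```python
import math

def exponential_horizon(t_months, h0_hours=1.0, doubling_months=6.0):
    """Plain exponential: the time horizon doubles every fixed interval."""
    return h0_hours * 2.0 ** (t_months / doubling_months)

def delayed_superexp_horizon(t_months, h0_hours=1.0, doubling_months=6.0,
                             decay=0.85, threshold_hours=8.0):
    """Delayed superexponential: the doubling time stays fixed until the horizon
    passes `threshold_hours`, then shrinks by `decay` with each further doubling."""
    h, d, elapsed = h0_hours, doubling_months, 0.0
    while elapsed + d <= t_months:
        elapsed += d
        h *= 2.0
        if h >= threshold_hours:
            d *= decay
        if d < 1e-6:
            return float("inf")  # the shrinking doublings hit a finite-time singularity
    return h * 2.0 ** ((t_months - elapsed) / d)  # partial doubling still in progress

for t in (12, 24, 36, 48):
    print(f"month {t:2d}: exponential {exponential_horizon(t):8.0f} h, "
          f"delayed superexponential {delayed_superexp_horizon(t):10.0f} h")
```

With these illustrative parameters the two curves stay close for the first year or two and then diverge sharply, which is part of why this disagreement is hard to settle with the data available today.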
Other disagreements
Intermediate speedups: Unfortunately we haven’t had the chance to dig deeply into this section of titotal’s critique, and it’s mostly based on the original version of the model rather than the updated one so we probably will not get to this. The speedup from including AI R&D automation seems pretty reasonable intuitively at the moment (you can see a side-by-side here).
RE-Bench logistic fit (section): We think it’s reasonable to set the ceiling of the logistic at wherever we think the maximum achievable performance would be. We don’t think it makes sense to give weight to a fit whose asymptote is 0.5 when we know reference solutions achieve 1.0 and we have reason to believe it’s possible to get substantially higher. We agree that we are making a guess (or, with more positive connotation, an estimate) about the maximum score, but it seems better than the alternative of doing no fit. (A toy version of this ceiling choice is sketched below.)
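As a toy illustration of the ceiling choice (synthetic scores and round numbers, not the actual RE-Bench data or the report’s fit), here is the difference between letting a logistic fit choose its own ceiling and pinning the ceiling at an assumed maximum achievable score of 1.0:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, ceiling, midpoint, rate):
    return ceiling / (1.0 + np.exp(-rate * (t - midpoint)))

# Hypothetical benchmark scores (fraction of the reference solution) over time in months.
t_obs = np.array([0.0, 3.0, 6.0, 9.0, 12.0, 15.0])
y_obs = np.array([0.05, 0.12, 0.22, 0.32, 0.40, 0.45])

# Free ceiling: the optimizer may settle on a low asymptote just above the last point.
(c_free, m_free, r_free), _ = curve_fit(
    logistic, t_obs, y_obs, p0=[0.6, 12.0, 0.3],
    bounds=([0.3, 0.0, 0.01], [2.0, 60.0, 2.0]))

# Pinned ceiling: assume the maximum achievable score is 1.0 and fit only midpoint/rate.
(m_pin, r_pin), _ = curve_fit(
    lambda t, midpoint, rate: logistic(t, 1.0, midpoint, rate),
    t_obs, y_obs, p0=[18.0, 0.2])

print("free-ceiling fit asymptote:", round(float(c_free), 2))
print("months until 90% of ceiling, pinned fit:", round(float(m_pin + np.log(9) / r_pin), 1))
print("months until 90% of ceiling, free fit:  ", round(float(m_free + np.log(9) / r_free), 1))
```

The substantive question is what the right ceiling is, which early data alone can’t pin down; that’s the judgment call being discussed above.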
Mistakes that titotal pointed out
We agree that the graph we’ve tweeted is not closely representative of the typical trajectory of our timelines model conditional on superhuman coders in March 2027. Sorry about that, we should have prioritized making it more precisely faithful to the model. We will fix this in future communications.
They convinced us to remove the public vs. internal argument as a consideration in favor of superexponentiality (section).
We like the analysis done regarding the inconsistency of the RE-Bench saturation forecasts with an interpolation of the time horizons progression. We agree that it’s plausible that we should just not have RE-Bench in the benchmarks and gaps model; this is partially an artifact of a version of the model that existed before the METR time horizons paper.
In accordance with our bounties program, we will award $500 to titotal for pointing these out.
Communication issues
There were several issues with communication that titotal pointed out which we agree should be clarified, and we will do so. These issues arose from lack of polish rather than malice. Two of the most important ones:
The “exponential” time horizon case still has superexponential growth once you account for automation of AI R&D.
The forecasts for RE-Bench saturation were adjusted based on other factors on top of the logistic fit.
Relatedly, titotal thinks that we made our model too complicated, while I think it’s important to make our best guess for how each relevant factor affects our forecast.
So I’m kind of not very satisfied with this defence.
Not-very-charitably put, my impression now is that all the technical details in the forecast were free parameters fine-tuned to support the authors’ intuitions[1], when they weren’t outright ignored. Now, I also gather that those intuitions were themselves supported by playing around with said technical models, and there’s something to be said about doing the math, then burning the math and going with your gut. I’m not saying the forecast should be completely dismissed because of that.
… But “the authors, who are smart people with a good track record of making AI-related predictions, intuitively feel that this is sort of right, and they were able to come up with functions whose graphs fit those intuitions” is a completely different kind of evidence compared to “here’s a bunch of straightforward extrapolations of existing trends, with non-epsilon empirical support, that the competent authors intuitively think are going to continue”.
Like… I, personally, didn’t put much stock in the technical-analysis part to begin with[2], I only updated on the “these authors have these intuitions” part (to which I don’t give trivial weight!). But if I did interpret the forecast as being based on intuitively chosen but non-tampered straightforward extrapolations of existing trends, I think I would be pretty disappointed right now. You should’ve maybe put a “these graphs are for illustrative purposes only” footnote somewhere, like this one did.
I don’t feel that “this is the least-bad forecast that exists” is a good defence. Whether an analysis is technical or vibes-based is a spectrum, but it isn’t graded on a curve.
I’m kind of split about this critique, since the forecast did end up as good propaganda if nothing else. But I do now feel that the marketing around it was kind of misleading, and we probably care about maintaining good epistemics here or something.
If you’ve picked which function to fit, and it’s very sensitive to small parameter changes, and you pick the parameters that intuitively feel right, I think you might as well draw the graph by hand.
Because I don’t think AGI/researcher-level AIs have been reduced to an engineering problem, I think theoretical insights are missing, which means no straight-line extrapolation is possible and we can’t do better than a memoryless exponential distribution. And whether this premise is true is itself an intuitive judgement call, and even fully rigorous technical analyses premised on an intuitive judgement call are only as rigorous as that judgement call.
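For concreteness, a memoryless exponential/geometric model of the kind described above amounts to a constant per-year hazard rate; the 10%/year figure below is an arbitrary illustrative assumption, not anyone’s stated estimate.

```python
import random

HAZARD = 0.10  # assumed constant per-year chance the missing insights arrive

def arrival_year(rng):
    """Sample the year of arrival under a constant hazard (a memoryless model)."""
    year = 1
    while rng.random() > HAZARD:
        year += 1
    return year

rng = random.Random(0)
samples = [arrival_year(rng) for _ in range(200_000)]
p_within_5 = sum(s <= 5 for s in samples) / len(samples)
print("P(arrival within 5 years) ≈", round(p_within_5, 3))  # ~1 - 0.9**5 ≈ 0.41

# Memorylessness: conditioning on "still hasn't happened after 10 years" leaves the
# probability of arrival within the *next* 5 years unchanged.
survivors = [s - 10 for s in samples if s > 10]
print("P(within next 5 | nothing after 10) ≈",
      round(sum(s <= 5 for s in survivors) / len(survivors), 3))
```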
I think the actual epistemic process that happened here is something like:
The AI 2027 authors had some high-level arguments that AI might be a very big deal soon
They wrote down a bunch of concrete scenarios that seemed like they would follow from those arguments and checked if they sounded coherent and plausible and consistent with lots of other things they thought about the world
As part of that checking, one thing they checked was whether these scenarios would be some kind of huge break from existing trends, which I do think is a hard thing to do, but is an important thing to pay attention to
The right way to interpret the “timeline forecast” sections is not as “here is a simple extrapolation methodology that generated our whole worldview” but instead as “here is some methodology that sanity-checked that our worldview is not in obvious contradiction to reasonable assumptions about economic growth”
But like, at least for me, it’s clear to me that the beliefs about takeoff and the exact timelines, could not be, and obviously should not be, considered the result of a straightforward and simple extrapolation exercise. I think such an exercise would be pretty doomed, and a claim to objectivity in that space seems misguided. I think it’s plausible that some parts of the Timelines Forecast supplement ended up communicating too much objectivity here, but IDK, I think AI 2027 as a whole communicated this process pretty well.
But like, at least for me, it’s clear to me that the beliefs about takeoff and the exact timelines, could not be, and obviously should not be, considered the result of a straightforward and simple extrapolation exercise
Counterpoint: the METR agency-horizon doubling trend. It has its issues, but I think “the point at which an AI could complete a year-long software-engineering/DL research project” is a reasonable cutoff point for “AI R&D is automated”, and it seems to be the kind of non-overly-fine-tuned model with non-epsilon empirical backing that I’m talking about, in a way AI 2027 graphs are not.
Or maybe the distinction isn’t as stark in others’ minds as in mine, I dunno.
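For concreteness, a minimal version of that kind of extrapolation (the 1-hour current horizon and 7-month doubling time below are round illustrative numbers, not METR’s exact published estimates):

```python
import math

current_horizon_hours = 1.0     # assumed current 50%-success time horizon
doubling_time_months = 7.0      # assumed fixed doubling time
work_year_hours = 40 * 52       # ~2080 hours of human labor in a year

doublings_needed = math.log2(work_year_hours / current_horizon_hours)
months_needed = doublings_needed * doubling_time_months
print(f"{doublings_needed:.1f} doublings, i.e. roughly {months_needed / 12:.1f} years out")
```

Two fixed inputs and a straight doubling trend; there is nothing to fine-tune.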
As part of that checking, one thing they checked was whether these scenarios would be some kind of huge break from existing trends, which I do think is a hard thing to do
Is it? See titotal’s six-stories section. If you’re choosing which function to fit, with a bunch of free parameters you set manually, it seems pretty trivial to come up with a “trend” that would fit any model you have.
Counterpoint: the METR agency-horizon doubling trend. It has its issues, but I think “the point at which an AI could complete a year-long software-engineering/DL research project” is a reasonable cutoff point for “AI R&D is automated”, and it seems to be the kind of non-overly-fine-tuned model with non-epsilon empirical backing that I’m talking about, in a way AI 2027 graphs are not.
I think the METR horizon doubling trend stuff doesn’t stand on its own, and it’s really not many datapoints.
I also really don’t think, without a huge number of assumptions, that “the point at which an AI could complete a year-long software-engineering/DL research project” is a good proxy for “AI R&D automation”, and indeed I want to avoid exactly that kind of sleight of hand. It only makes sense to someone who has a much more complicated worldview about how general AI is likely to be, how much the tasks METR measured are likely to generalize, and many other components. What it does make sense for is as a sanity-check on that broader worldview.
I think the METR horizon doubling trend stuff doesn’t stand on its own, and it’s really not many datapoints.
It’s less about the datapoints and more about the methodology.
I also really don’t think, without a huge number of assumptions, that “the point at which an AI could complete a year-long software-engineering/DL research project” is a good proxy for “AI R&D automation”
Fair, I very much agree. But my point here is that the METR benchmark works as some additional technical/empirical evidence towards some hypotheses over others, evidence that’s derived independently from one’s intuitions, in a way that more fine-tuned graphs don’t work.
Those two things sound extremely similar to me; I would appreciate some explanation of, or a pointer to, why they seem quite different.
Current guess: Is the idea that automation includes also a lot of (a) management, and (b) research taste in choosing projects, such that being able to complete a year-long project is only a lower-bound, not a central target?
Yeah, I mean, the task distribution is just hugely different. When METR measures software-developing tasks, they mean things in the reference class of well-specified tasks with tests basically already written.
As a concrete example, if you just use a random other distribution of tasks for horizon length as your base, like forecasting performance per unit of time, or writing per unit of time, or graphic design per unit of time, you get drastically different time horizon curves.
This doesn’t make METR’s curves unreasonable as a basis, but you really need a lot of assumptions to get you from “these curves intersect one year here” to “the same year we will get ~fully automated AI R&D” (and indeed I would not currently believe the latter).
Preliminary work showing that the METR trend is approximately average: [chart omitted]
I don’t know the details of all of these task distributions, but clearly these are not remotely sampled uniformly from the set of all tasks necessary to automate AI R&D?
Yes, in particular the concern about benchmark tasks being well-specified remains. We’ll need both more data (probably collected from AI R&D tasks in the wild) and more modeling to get a forecast for overall speedup.
However, I do think that if we have a wide enough distribution of tasks, AIs outperform humans on all of them at task lengths that should imply humans spend only 1/10th of the labor, and yet AI R&D still has not been automated, then something strange must be happening. So looking at different benchmarks is partial progress towards understanding the gap between long time horizons on METR’s task set and actual AI R&D uplift.
(agree, didn’t intend to imply that they were)
since the forecast did end up as good propaganda if nothing else
Just responding to this local comment you made: I think it’s wrong to make “propaganda” to reach end Y, even if you think end Y is important. If you have real reasons for believing something will happen, you shouldn’t have to lie, exaggerate, or otherwise mislead your audience to make them believe it, too.
So I’m arguing that you shouldn’t have mixed feelings because ~”it was valuable propaganda at least.” Again, not trying to claim that AI 2027 “lied”—just replying to the quoted bit of reasoning.
I phrased that badly/compressed too much. The background feeling there was that my critique may be of an overly nitpicky type that no normal person would care about, but the act-of-critiquing was still an attack on the report if viewed through the lens of a social-status game, which may (on the margins) unfairly bias someone against the report.
Like, by analogy, imagine a math paper involving a valid but hard-to-follow proof of some conjecture that for some reason gets tons of negative attention due to bad formatting. This may incorrectly taint the core message by association, even though it’s completely valid.
I’m kind of split about this critique, since the forecast did end up as good propaganda if nothing else. But I do now feel that the marketing around it was kind of misleading, and we probably care about maintaining good epistemics here or something.
I’m interested in you expanding on which parts of the marketing were misleading. Here are some quick more specific thoughts:
Overall AI 2027 comms
In our website frontpage, I think we were pretty careful not to overclaim. We say that the forecast is our “best guess”, “informed by trend extrapolations, wargames, …” Then in the “How did we write it?” box we basically just say it was written iteratively and informed by wargames and feedback. In “Why is it valuable?” we say “We have set ourselves an impossible task. Trying to predict how superhuman AI in 2027 would go is like trying to predict how World War 3 in 2027 would go, except that it’s an even larger departure from past case studies. Yet it is still valuable to attempt, just as it is valuable for the US military to game out Taiwan scenarios.” I don’t think we said anywhere that it was backed up by straightforward, strongly empirically validated extrapolations.
In our initial tweet, Daniel said it was a “deeply researched” scenario forecast. This still seems accurate to me, we spent quite a lot of time on it (both the scenario and supplements) and I still think our supplementary research is mostly state of the art, though I can see how people could take it too strongly.
In various follow-up discussions, I think Scott and others sometimes pointed to the length of all of the supplementary research as justification for taking the scenario seriously. I still think this mostly holds up but again I think it could be interpreted in the wrong way.
Probably there has been similar discussion in various podcast appearances etc., but I haven’t listened to most of those and don’t remember how this sort of thing was presented in the ones I did listen to.
Timelines forecast specific comms
We do not prominently and explicitly say in the timelines forecast that it relies on a bunch of non-obvious parameter choices rather than just empirical trend extrapolation, so I agree that people could come away with the wrong impression.
Plausibly we should have had / I should add a disclaimer saying something like this.
I have been frustrated with previous forecasts for not communicating this well, so plausibly I’m being hypocritical.
One reason I’m hesitant to add this is that I think it might update non-rationalists too much toward thinking it’s useless, when in fact I think it’s pretty informative. But this might be motivated reasoning toward the choice I made before. I might add a disclaimer.
I didn’t explicitly consider adding a prominent disclaimer previously; perhaps because I was typical minding and thinking it was obvious that any AGI timelines forecast will rely on intuitively estimated parameters.
However, I think that including 3 different people/groups’ forecasts very prominently does implicitly get across the idea that different parameter estimations can lead to very different results. This is especially true for including the FutureSearch aggregate, which has a within-model median of 2032 rather than 2027 or 2028.
There’s a graph at the top of the timelines forecast with all 3 of our distributions, and in my tweet thread about the timelines forecast this was in my top tweet.
As I’ve said, I agree that we messed up to some extent re: the time horizon prediction graph. I might write more about this in response to TurnTrout.
Not-very-charitably put, my impression now is that all the technical details in the forecast were free parameters fine-tuned to support the authors’ intuitions, when they weren’t outright ignored. Now, I also gather that those intuitions were themselves supported by playing around with said technical models, and there’s something to be said about doing the math, then burning the math and going with your gut. I’m not saying the forecast should be completely dismissed because of that.
I tried not to just fine-tune the parameters to support my existing beliefs, though I of course probably implicitly did to some extent. I agree that the level of free parameters is a reason to distrust our forecasts.
FWIW, my and Daniel’s timelines beliefs have both shifted some as a result of our modeling. Mine initially got shorter, then got a bit longer due to the most recent update; Daniel moved his timelines out to 2028 in significant part because of our timelines model.
… But “the authors, who are smart people with a good track record of making AI-related predictions, intuitively feel that this is sort of right, and they were able to come up with functions whose graphs fit those intuitions” is a completely different kind of evidence compared to “here’s a bunch of straightforward extrapolations of existing trends, with non-epsilon empirical support, that the competent authors intuitively think are going to continue”.
Mostly agree. I would say we have more than non-epsilon empirical support though because of METR’s time horizons work and RE-Bench. But I agree that there are a bunch of parameters estimated that don’t have much empirical support to rely on.
But if I did interpret the forecast as being based on intuitively chosen but non-tampered straightforward extrapolations of existing trends, I think I would be pretty disappointed right now.
I don’t agree with the connotation of “non-tampered,” but otherwise agree re: relying on straightforward extrapolations. I don’t think it’s feasible to only rely on straightforward extrapolations when predicting AGI timelines.
You should’ve maybe put a “these graphs are for illustrative purposes only” footnote somewhere, like this one did.
I think “illustrative purposes only” would be too strong. The graphs are the result of an actual model that I think is reasonable to give substantial weight to in one’s timelines estimates (if you’re only referring to the specific graph that I’ve apologized for, then I agree we should have moved more in that direction re: more clear labeling).
I don’t feel that “this is the least-bad forecast that exists” is a good defence. Whether an analysis is technical or vibes-based is a spectrum, but it isn’t graded on a curve.
I’m not sure exactly how to respond to this. I agree that the absolute level of usefulness of the timelines forecast also matters, and I probably think that our timelines model is more useful than you do. But I also think that the relative usefulness matters quite a bit for the decision of whether to release and publicize a model. I think maybe this critique is primarily coupled with your points about communication issues.
[Unlike the top-level comment, Daniel hasn’t endorsed this, this is just Eli.]
I’m interested in you expanding on which parts of the marketing were misleading
Mostly this part, I think:
In various follow-up discussions, I think Scott and others sometimes pointed to the length of all of the supplementary research as justification for taking the scenario seriously. I still think this mostly holds up but again I think it could be interpreted in the wrong way.
Like, yes, the supplementary materials definitely represent a huge amount of legitimate research that went into this. But the forecasts are “informed by” this research, rather than being directly derived from it, and the pointing-at kind of conveys the latter vibe.
I have been frustrated with previous forecasts for not communicating this well
Glad you get where I’m coming from; I wasn’t wholly sure how legitimate my complaints were.
One reason I’m hesitant to add [a disclaimer about non-obvious parameter choices] is that I think it might update non-rationalists too much toward thinking it’s useless, when in fact I think it’s pretty informative
I agree that this part is tricky, hence my being hesitant about fielding this critique at all. Persuasiveness isn’t something we should outright ignore, especially with something as high-profile as this. But also, the lack of such a disclaimer opens you up to takedowns such as titotal’s, and if one of those becomes high-profile (which it already might have?), that’d potentially hurt the persuasiveness more than a clear statement would have.
There’s presumably some sort of way to have your cake and eat it too here; to correctly communicate how the forecast was generated, but in terms that wouldn’t lead to it being dismissed by people at large.
I think “illustrative purposes only” would be too strong.
Yeah, sorry, I was being unnecessarily hyperbolic there.
I’m leaving the same comment here and in reply to Daniel on my blog.
First, thank you for engaging in good faith and rewarding deep critique. Hopefully this dialogue will help people understand the disagreements over AI development and modelling better, so they can make their own judgements.
I think I’ll hold off on replying to most of the points there, and make my judgement after Eli does an in-depth writeup of the new model. However, I did see that there was more argumentation over the superexponential curve, so I’ll try out some more critiques here: not as confident about these, but hopefully it sparks discussion.
The impressive achievements in LLM capabilities since GPT-2 have been driven by many factors, such as drastically increased compute, drastically increased training data, algorithmic innovations such as chain-of-thought, increases in AI workforce, etc. The extent that each contributes is a matter of debate, which we can save for when you properly write up your new model.
Now, let’s look for a second at what happens when the curve goes extreme: using median parameters and starting the superexponential today, the time horizon of AI would improve from one thousand work-years to ten thousand work-years in around five weeks. So you release a model, and it scores 80% on 1,000-work-year tasks, but only around 40% on 10,000-work-year tasks (the current ratio of the 50% to 80% time horizons is something like 4:1). Then five weeks later you release a new model, and now the reliability on the much harder tasks has doubled to 80%.
Why? What causes the reliability to shoot up in five weeks? The change in the amount of available compute, training data, or labor force will not be significant in that time, and algorithmic breakthroughs do not come with regularity. It can’t be due to any algorithmic speedups from AI development, because that’s in a different part of the model: we’re talking about five weeks of normal AI development, as it’s currently done at OpenAI. If the AI is only 30x faster than humans, then the time required for the AI to do the thousand-year task is 33 years! So where does this come from? Will we have developed the perfect algorithm, such that AI no longer needs retraining?
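To spell out the arithmetic behind this worry, here is a hedged sketch; the 4-month initial doubling time, 0.85 decay per doubling, and 1-work-day starting horizon are illustrative stand-ins, not the report’s median parameters or starting point.

```python
import math

WORK_DAY_H = 8.0
WORK_YEAR_H = 2000.0               # rough hours in a work-year
START_HORIZON_H = WORK_DAY_H       # assume the curve starts at a 1-work-day horizon
FIRST_DOUBLING_MONTHS = 4.0        # assumed initial doubling time
DECAY = 0.85                       # assumed shrink factor per doubling

def months_between(h_from, h_to):
    """Calendar months for the horizon to grow from h_from to h_to when the k-th
    doubling takes FIRST_DOUBLING_MONTHS * DECAY**k (treated as continuous in k)."""
    k_from = math.log2(h_from / START_HORIZON_H)
    k_to = math.log2(h_to / START_HORIZON_H)
    return FIRST_DOUBLING_MONTHS * (DECAY ** k_to - DECAY ** k_from) / math.log(DECAY)

print("1 day     -> 10 days    :", round(months_between(WORK_DAY_H, 10 * WORK_DAY_H), 1), "months")
print("1 year    -> 10 years   :", round(months_between(WORK_YEAR_H, 10 * WORK_YEAR_H), 1), "months")
print("1,000 yrs -> 10,000 yrs :",
      round(months_between(1000 * WORK_YEAR_H, 10000 * WORK_YEAR_H) * 4.35, 1), "weeks")
```

With these illustrative numbers, the same 10x of horizon that takes most of a year early in the curve is crossed in a couple of weeks once you are many doublings in, which is the regime the question above is pointing at.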
I think a mistake could be made in trying to transfer intuition about humans to AI here: perhaps the intuition is “hey, a human who is good enough to do a 1-year task well can probably be trusted to do a 10-year task”.
However, if a human is trying to reliably do a “100-year” task (a task that would take a team of a hundred about a year to do), this might involve spending several years getting an extra degree in the subject, reading a ton of literature, improving their productivity, getting mentored by an expert in the subject, etc. While they work on it, they learn new things and their actual neurons get rewired.
But the AI equivalent of this would be getting new algorithms, new data, new computing power, new training: i.e., becoming an entirely new model, which would take significantly more than a few weeks to build. I think there may be some double counting going on between this superexponential and the superexponential from algorithmic speedups.
Re intermediate speedups: a simple fix
You currently have the pace of total progress growing exponentially as AI improves. And this leads to the bad back-predictions that the pace of progress used to be much slower.
I think your back predictions would be fine if you said that total progress = human-driven progress + AI-driven progress, and then had only the AI part grow exponentially.
Then in the back prediction the AI part would rapidly shrink but the human part would remain.
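A minimal sketch of that decomposition (the growth rate and the present-day AI share below are arbitrary illustrative assumptions):

```python
HUMAN_RATE = 1.0      # human-driven progress per month, normalized to 1 today
AI_SHARE_NOW = 0.3    # assumed AI-driven contribution today, in the same units
AI_GROWTH = 1.05      # assumed month-over-month growth of the AI contribution

def pace_additive(months_from_now):
    """Proposed form: a constant human term plus an exponentially growing AI term."""
    return HUMAN_RATE + AI_SHARE_NOW * AI_GROWTH ** months_from_now

def pace_all_exponential(months_from_now):
    """Criticized form: the entire pace grows exponentially."""
    return (HUMAN_RATE + AI_SHARE_NOW) * AI_GROWTH ** months_from_now

for m in (-60, -24, 0, 24):
    print(f"{m:+4d} months: additive {pace_additive(m):5.2f}   "
          f"all-exponential {pace_all_exponential(m):5.2f}")
```

Back-casting five years with these assumptions, the additive form keeps the pace close to today’s (the AI term just fades out), while the all-exponential form implies progress used to be a small fraction of today’s, which is the implausible back-prediction being pointed at.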