How ironic… Four days ago I wrote: “I doubt that I can convince SE Gyges that the AI-2027 forecast wasn’t influenced by OpenAI or other AI companies (italics added today—S.K.).” But one can imagine that the AI-2027 forecast was co-written with an OpenAI propagandist and try to point out inconsistencies between the SCENARIO’s IMPORTANT parts and reality, or within the scenario itself. The part about Thiel getting a flying car is most likely an unimportant joke referring to this quote.
Unfortunately, the only thing that you wrote about the SCENARIO’s IMPORTANT part is the following:
Part of SE Gyges’ comment
For a concrete example of this that I didn’t dig into in my review, from the AI 2027 timelines forecast.
We first show Method 1: time-horizon-extension, a relatively simple model which forecasts when SC will arrive by extending the trend established by METR’s report of AIs accomplishing tasks that take humans increasing amounts of time.
We then present Method 2: benchmarks-and-gaps, a more complex model starting from a forecast saturation of an AI R&D benchmark (RE-Bench), and then how long it will take to go from that system to one that can handle real-world tasks at the best AGI company.
Finally we then provide an “all-things-considered” forecast that takes into account these two models, as well as other possible influences such as geopolitics and macroeconomics.
Are either RE-Bench or the METR time horizon metrics good metrics, as-is? Will they continue to extrapolate? Will a model that saturates them accelerate research a lot?
I think the answer to all of these is maybe. If you’re OpenAI, it is pretty important that benchmarks are good metrics. It is worth a ton of money. So, institutionally, OpenAI has to believe in benchmarks, and vastly prefers if the answer is “yes” to all of these questions. And this is also what AI 2027 is assuming.
I made a point of running this point into the ground in writing it up, but essentially every time we break a “maybe” question in AI 2027, the answer seems to be the one that OpenAI is also likely to prefer. It’s a very specific thing to happen! It doesn’t seem very likely it happened by chance. In total the effect is that “this is a slight dissent from the OpenAI hype pitch”, in my opinion.
I agree that the assumption that the extrapolation continues was one of the weakest aspects of the AI-2027 forecast. But I don’t think that anyone has come up with better ways to predict when AIs will become superhuman. What alternative methods could the AI-2027 authors have used to estimate when AIs become capable of automating coding, and then AI R&D?
The method using the METR time horizon relied on the intuition that real-world coding tasks useful for automating AI research take humans about a working month to complete. If and when AIs become that capable, humans could try to delegate coding to them. What the authors could not have known was that METR would later find major problems in AI-generated code, or that Grok 4 and GPT-5 would demonstrate lower time horizons than the faster trend predicted.
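For concreteness, here is a minimal sketch of the time-horizon-extension calculation. All numbers are illustrative assumptions of mine (the anchor date, starting horizon, and doubling time), not the authors’ actual parameters:

```python
import math
from datetime import datetime, timedelta

# Illustrative parameters (my assumptions, not the AI-2027 authors' values):
anchor_date = datetime(2025, 3, 1)   # date at which the trend is anchored
anchor_horizon_hours = 1.0           # ~1-hour 50%-success time horizon at the anchor
doubling_time_days = 212             # ~7-month doubling time (METR's reported trend)
target_horizon_hours = 167.0         # ~1 working month of human task time

# Extrapolate the exponential trend until the horizon reaches the target.
doublings = math.log2(target_horizon_hours / anchor_horizon_hours)
arrival = anchor_date + timedelta(days=doublings * doubling_time_days)
print(f"{doublings:.1f} doublings needed; horizon reaches 1 month around {arrival:%Y-%m}")
```

With these toy numbers the milestone lands around 2029; shortening the doubling time (the “faster trend” mentioned above), or letting it shrink over time as the forecast’s superexponential variant does, pulls the date into 2027.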
As for RE-Bench, the authors explicitly claimed that a model saturating the benchmark wouldn’t be enough, and then tried to estimate the gaps between models saturating RE-Bench and models that are superhuman at coding.
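The shape of that benchmarks-and-gaps argument can be shown with a toy Monte Carlo. The distributions and parameters below are placeholders of mine, not the authors’ actual forecasts:

```python
import random
import statistics

random.seed(0)

def sample_sc_year() -> float:
    """One draw of the superhuman-coder (SC) arrival year."""
    saturation = random.gauss(2027.0, 1.0)                # assumed RE-Bench saturation forecast
    gap_engineering = random.lognormvariate(0.0, 0.5)     # assumed years to close the
    gap_time_horizon = random.lognormvariate(-0.5, 0.5)   # remaining capability gaps
    return saturation + gap_engineering + gap_time_horizon

samples = sorted(sample_sc_year() for _ in range(10_000))
print(f"median SC arrival: {statistics.median(samples):.1f}")
print(f"~80% interval: {samples[1000]:.1f} to {samples[9000]:.1f}")
```

The point is structural: each gap that the first saturating systems still lack adds an uncertain delay on top of the saturation forecast, and those gap estimates are exactly where the authors’ judgment calls (and SE Gyges’ objection to them) live.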
Why the AI-2027 authors chose RE-Bench
RE-Bench has a few nice properties that are hard to find in other benchmarks and which make it a uniquely good measure of how much AI is speeding up AI research:
Highly relevant to frontier AI R&D.
High performance ceiling. AI agents can achieve significantly above human-level, though in practice it will likely be very difficult to do more than roughly 2x higher than the human baseline solutions (for a score of 1.5). Median human baseline scores are 0.12 for 2 hours of effort and 0.66 for 8 hours. Current state of the art (SOTA) is Claude 3.7 Sonnet with a score of roughly 0.6 using Best-of-K scaffolding in a scaffold called “modular”.
Human baselines which allow for grounded comparisons between AI and human performance.
We expect that “saturation” under this definition will happen before (italics mine—S.K.) the SC milestone is hit. The first systems that saturate RE-Bench will probably be missing a few kinds of capabilities that are needed to hit the SC milestone, as described below.
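To make the score arithmetic in the quoted list concrete: on my reading of METR’s report, RE-Bench scores are normalized linearly so that the provided starting solution maps to 0 and the human expert reference solution maps to 1. The raw values below are hypothetical, chosen only to reproduce the quoted figures:

```python
def normalized_score(raw: float, raw_start: float, raw_reference: float) -> float:
    # Linear RE-Bench-style normalization (my reading of METR's report):
    # the starting solution maps to 0, the human reference solution to 1,
    # and scores above 1 mean the AI beat the reference solution.
    return (raw - raw_start) / (raw_reference - raw_start)

# Hypothetical raw task scores, purely for illustration:
print(normalized_score(raw=4.0, raw_start=1.0, raw_reference=6.0))  # 0.6, near the quoted SOTA
print(normalized_score(raw=8.5, raw_start=1.0, raw_reference=6.0))  # 1.5, the quoted practical ceiling
```

Under this normalization, the quoted human medians (0.12 at 2 hours, 0.66 at 8 hours) are fractions of the reference solution’s score, and the SOTA of roughly 0.6 sits just below the 8-hour human median.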
Another part of SE Gyges’ comment
I think these are blind guesses and relying on the benchmarks is the streetlight effect, as I think we talked about in another thread. I am mostly explaining in as much detail as I can the parts I think are relevant to Neel’s objection, since it is substantively the most common objection, i.e., that paying attention to financial incentives or work history is irrelevant to anything. I am happy that I have addressed the scenario itself in enough detail.