Gary Marcus asked me to critique his 2024 predictions, for which he claimed he got “7/7 correct”. I don’t really know why I did this, but here is my critique:
For convenience, here are the predictions:
7-10 GPT-4 level models
No massive advance (no GPT-5, or disappointing GPT-5)
Price wars
Very little moat for anyone
No robust solution to hallucinations
Modest lasting corporate adoption
Modest profits, split 7-10 ways
I think the best way to evaluate them is to invert every one of them, and then see whether the version you wrote or the inverted version seems more correct in retrospect.
We will see 7-10 GPT-4 level models.
Inversion: We will either see fewer than 7 GPT-4 level models, or more than 10 GPT-4 level models.
Evaluation: Conveniently, Epoch did an evaluation of almost this exact question!
https://epoch.ai/data-insights/models-over-1e25-flop
Training compute is not an ideal proxy for capabilities, but it’s better than most other simple proxies.
Models released in 2024 with GPT-4 level compute according to Epoch:
Inflection-2, GLM-4, Mistral Large, Aramco Metabrain AI, Inflection-2.5, Nemotron-4 340B, Mistral Large 2, GLM-4-Plus, Doubao-pro, Llama 3.1-405B, Grok-2, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.0 Ultra, Gemini 1.5 Pro, Gemini 2.0 Pro, GPT-4o, o1-mini, o1, o3 (o3 was announced and its evaluations published in 2024, but the model was not released until the following year)
They also list 22 models which might be over the 10^25 FLOP threshold that GPT-4 was trained with. Many of those will be at GPT-4 level capabilities, because compute-efficiency has substantially improved.
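(For intuition on what that 10^25 FLOP threshold means in practice, here is a minimal sketch using the standard C ≈ 6·N·D training-compute approximation; the parameter and token counts below are made up for illustration, not Epoch’s estimates for any particular model.)

```python
# Rough illustration of the ~1e25 FLOP threshold via the standard
# C ~= 6 * N * D approximation (N = parameters, D = training tokens).
# These numbers are hypothetical, not Epoch's estimates for any real model.
N = 300e9   # 300B dense parameters (assumed)
D = 6e12    # 6 trillion training tokens (assumed)
compute = 6 * N * D
print(f"{compute:.1e} FLOP")  # ~1.1e+25, i.e. roughly at the GPT-4-era threshold
```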
Counting these models, I get 20+ models at GPT-4 level (from over 10 distinct companies).
Your prediction seems to me to have somewhat underestimated the number of GPT-4 level models that would be released in 2024. I don’t know whether you intended to put more emphasis on the number being low or high, but it definitely isn’t within your range.
No massive advance (no GPT-5, or disappointing GPT-5)
Inversion: There was a massive advance in frontier model AI in 2024.
Evaluation: Given that 2024 was the year of reasoning models, this really seems very straightforwardly false. We saw the biggest advance since roughly the transformer itself, with huge changes in scaling laws. The o3 evaluations were even released in 2024, in line with the final evaluations, suggesting the model had indeed largely finished training.
This was not a prediction about whether an advance would be deployed to consumer models in 2024. And it is very unambiguously the case that we saw a major advance in AI technologies in 2024. This one should be straightforwardly marked as false.
Price wars
Inversion: AI companies do not have to frequently lower their prices to stay competitive with other players trying to undercut them
Yep, seems like there is a lot of aggressive price competition, though not as much as I was honestly expecting. Both Anthropic and OpenAI have models that cost enormous amounts of money, and being at the frontier means you can charge a huge premium.
This is very much not operationalized enough to really judge it, but I think it’s fine.
Very little moat for anyone
Inversion: Being a leading AI company is a robust position that you will be able to extract large amounts of money from without needing to worry too much about competition
Evaluation: So, I have sympathy for this position, but also, the active players in the AI race have not been changing over the years, and churn there is what I would consider a strong sign of actually weak moats. Inasmuch as the moats are weak, so far every AI company has succeeded at defending itself against its competitors.
But that notwithstanding, I would probably resolve this as ambiguous, slightly favoring Gary? If someone wanted to argue that “being at a leading lab now is the most important thing because they will probably continue to be in the lead” I would see some pretty compelling arguments for that, which seems like the opposite of this prediction.
No robust solution to hallucinations
Yep, this seems unambiguously correct, no need to argue much.
Modest lasting corporate adoption
Inversion: AI corporate adoption will either be very close to zero, or among the most rapid adoptions of any technology in US history
Evaluation: We are absolutely, with no ambiguity, in the “most rapid adoptions of any technology in US history” branch. Every single corporation in the world is trying to adopt AI into their products. Even extremely slow-moving industries are rushing to adopt AI.
This one unambiguously resolves as false. I honestly have trouble imagining any world where corporate adoption is even faster. Trying to spin this as a positive prediction seems absolutely hilarious. I literally cannot think of a technology for which this could possibly be falsified, if it cannot be falsified for AI.
Modest profits, split 7-10 ways
Inversion: Companies investing heavily in AI will either see very low returns, or very high returns. There will only be either a very small or very large number of players who make up the majority of profit.
Evaluation: Profits are extremely high industry-wide, especially for companies that are not training frontier models themselves, but are providing inference compute and training compute for frontier model companies. Profitability happens to be masked by extremely high rates of reinvestment, meaning that at the leading AI companies we are not seeing many payouts to shareholders.
Nvidia is making more profit than the operating costs of Anthropic and OpenAI combined, clearly driven by the AI boom. Maybe you can argue that this is fueled by an investment bubble, but clearly profits are flowing extremely aggressively towards AI companies.
One could try to cherry-pick OpenAI or Anthropic here, who are both investing extremely aggressively and so profit margins appear thin, but when you look at the whole industry, it clearly is making huge amounts of profit.
I think settling this is kind of hard, so I feel hesitant to mark this as a totally unambiguous red mark, but it seems pretty close to me. Anyone who would have listened to Gary for the purpose of investment advice, or for the purpose of estimating profit margins for anyone but the companies who are building the next generation of frontier models and have no issue attracting investment to do that, would have been extremely burned (and of course there the market is very much forecasting high future expected returns, that’s why the valuations of those companies are so high).
Ok, so where does this leave us?
7-10 GPT-4 level models: False, since we have many more than 10 GPT-4 level models, from more than 10 distinct companies. But IDK, it’s not like it’s off by a large factor (but false as stated). Let’s say a 0.35 on a 0-1 scale to indicate that it is more false than true, but was pointing at something real.
No massive advance: False, unless I am missing some definition that makes this true. o3 had fully finished training and was evaluated by the end of 2024.
Price wars: Seems true enough.
Very little moat: IDK, I would resolve ambiguous, though slanted towards true. Let’s say 0.8 or so on a 0-1 scale.
No robust solution to hallucinations: True, unambiguously.
Modest lasting corporate adoption: Extremely false.
Modest profits, split 7-10 ways: False, though there are some AI companies that are choosing to re-invest. But we are seeing enough nearby companies make enormous amounts of money (like Nvidia) that this is clearly falsified. Maybe one could give it a 0.1 since the leading labs are still spending more than they are making. Profits are also highly concentrated among fewer than 7 players (it’s basically just OpenAI, Anthropic and Google at the model level, and Nvidia at the hardware level; I am not seeing 3-4 other players).
This overall gives me 3.25/7, as a best guess of what a fair evaluation of this specific set of predictions would arrive at. I think one could quibble over 1-2 points, so arguing a 4/7, or maybe even a 5/7, wouldn’t be completely crazy, though I think the latter would be seriously stretching things.
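(For transparency, here is a minimal tally of how that total follows from the per-item grades above, treating “true” as 1, “false” as 0, and using the intermediate values where given; the 0.1 for profits is the charitable reading mentioned in that item.)

```python
# Tally of the per-prediction grades above (1 = true, 0 = false,
# intermediate values as discussed; 0.1 for profits is the charitable reading).
scores = {
    "7-10 GPT-4 level models": 0.35,
    "No massive advance": 0.0,
    "Price wars": 1.0,
    "Very little moat": 0.8,
    "No robust solution to hallucinations": 1.0,
    "Modest lasting corporate adoption": 0.0,
    "Modest profits, split 7-10 ways": 0.1,
}
print(round(sum(scores.values()), 2), "/", len(scores))  # 3.25 / 7
```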
Huh, I didn’t expect to take Gary Marcus’s side against yours but I do for almost all of these. If we take your two strongest cases:
No massive advance (no GPT-5, or disappointing GPT-5)
There was no GPT-5 in 2024? And there is still no GPT-5? People were talking in late 2023 like GPT-5 might come out in a few months, and they were wrong. The magic of “everything just gets better with scale” really seemed to slow after GPT-4?
On reasoning models: I thought reasoning models were happening internally at Anthropic in 2023 and being distilled into public models, which was why Claude was so good at programming. But I could be wrong or have my timelines messed up.
Modest lasting corporate adoption
I’d say this is true? Read e.g. Dwarkesh talking about how he’s pretty AI-forward but even he has a lot of trouble getting AIs to do something useful. Many corporations are trying to get AIs to be useful in California, fewer elsewhere, and I’m not convinced these will last.
I don’t think I really want to argue about these; it’s more that I find it weird that people can in good faith have such different takes. I remember 2024 as a year I got continuously more bearish on LLM progress[1].
[1] Until DeepSeek in late December.
There was no GPT-5 in 2024? And there is still no GPT-5? People were talking in late 2023 like GPT-5 might come out in a few months, and they were wrong. The magic of “everything just gets better with scale” really seemed to slow after GPT-4?
Eh, reasoning models have replaced everything and seem like a bigger deal than GPT-5 to me. Also, I don’t believe you that anyone was talking in late 2023 that GPT-5 was coming out in a few months, that would have been only like 9 months after the release of GPT-4, and the gap between GPT-3 and GPT-4 was almost 3 full years. End of 2024 would have been a quite aggressive prediction even just on reference class forecasting grounds, and IMO still ended up true with the use of o3 (and previously o1, though I think o3 was a big jump on o1 in itself).
I mean, yes, I think the central thing happening in 2024 is the rise of reasoning models. I agree that if we hadn’t seen those, some bearishness would be appropriate, but alas, such did not happen.
I don’t believe you that anyone was talking in late 2023 that GPT-5 was coming out in a few months
Out of curiosity, I went to check the prediction markets. Best I’ve found:
From March 2023 to January 2024, expectations that GPT-5 would come out/be announced in 2023 never rose above 13% and fell to 2-7% in the last three months (one, two, three).
Based on this series of questions, at the start of 2024, people’s median was September 2024.
I’d say this mostly confirms your beliefs, yes.
(Being able to check out the public’s past epistemic states like this is a pretty nifty feature of prediction-market data I hadn’t appreciated before!)
End of 2024 would have been a quite aggressive prediction even just on reference class forecasting grounds
76% on “GPT-5 before January 2025” in January 2024, for what it’s worth.
reasoning models have replaced everything and seem like a bigger deal than GPT-5 to me.
Ehhh, there are scenarios under which they retroactively turn out not to be a “significant advance” towards AGI. E.g., if it actually proves true that RL training only elicits base models’ capabilities rather than creating them; or if they turn out to scale really poorly; or if their ability to generalize to anything but the most straightforward verifiable domains disappoints[1].
And I do expect something from this cluster to come true, which would mean that they’re only marginal/no progress towards AGI.
That said, I am certainly not confident in this, and they are a nontrivial advance by standard industry metrics (if possibly not by the p(doom) metric). And if we benchmark “a significant advance” as “a GPT-3(.5) to GPT-4 jump”, and then tally up all progress over 2024 from GPT-4 Turbo to Sonnet 3.6 and o1/o3[2], this is probably a comparable advance.[3]
I’d count it as “mostly false”. 0-0.2?
I don’t think we’ve seen much success there yet? I recall Noam Brown pointing to Deep Research as an example, but I don’t buy that.
Models have been steadily getting better across the board, but I think it’s just algorithmic progress/data quality + distillation from bigger models, not the reasoning on/off toggle?
Oh, hm, I guess we can count o3’s lying tendencies as a generalization of its reward-hacking behavior from math/coding to “soft” domains. I am not sure how to count this one, though. I mean, I’d like to make a dunk here, but it does seem to be weak-moderate evidence for the kind of generalization I didn’t want to see.
Though I’m given to understand the o3 announced at the end of 2024 and the o3 available now are completely different models, see here and here. So we don’t actually know how 2024!o3 “felt”, beyond the benchmarks; and so assuming that the modern o3’s capability level was already reached by EOY 2024 is unjustified, I think.
This is the point where I would question whether “GPT-3.5 to GPT-4” was a significant advance towards AGI, and drop a hot take that no it wasn’t. But Gary Marcus’ wording implies that GPT-5 would count as a significant advance by his lights, so whatever.
This all seems pretty reasonable to me. Agree 0.2 seems like a fine call someone could make on this.
reasoning models [...] seem like a bigger deal than GPT-5 to me.
Strong disagree. Reasoning models do not make every other trick work better, the way a better foundation model does. (Also I’m somewhat skeptical that reasoning models are actually importantly better at all; for the sorts of things we’ve tried they seem shit in basically the same ways and to roughly the same extent as non-reasoning models. But not sure how cruxy that is.)
Qualitatively, my own update from OpenAI releasing o1/o3 was (and still is) “Altman realized he couldn’t get a non-disappointing new base model out by December 2024, so he needed something splashy and distracting to keep the investor money fueling his unsustainable spend. So he decided to release the reasoning models, along with the usual talking points of mostly-bullshit evals improving, and hope nobody notices for a while that reasoning models are just not that big a deal in the long run.”
Also, I don’t believe you that anyone was talking in late 2023 that GPT-5 was coming out in a few months [...] End of 2024 would have been a quite aggressive prediction even just on reference class forecasting grounds
When David and I were doing some planning in May 2024, we checked the prediction markets, and at that time the median estimate for GPT5 release was at December 2024.
at that time the median estimate for GPT5 release was at December 2024.
Which was correct ex ante, and mostly correct ex post—that’s when OA had been dropping hints about releasing GPT-4.5, which was clearly supposed to have been GPT-5; they seemingly changed their mind near Dec 2024 and spiked it, before, it seems, the DeepSeek moment in Jan 2025 changed their minds back and they released it in February 2025. (And GPT-4.5 is indeed a lot better than GPT-4 across the board. Just not a reasoning model or dominant over the o1-series.)
I have seen people say this many times, but I don’t understand. What makes it so clear?
GPT-4.5 is roughly a 10x scale-up of GPT-4, right? And full number jumps in GPT have always been ~100x? So GPT-4.5 seems like the natural name for OpenAI to go with.
I do think it’s clear that OpenAI viewed GPT-4.5 as something of a disappointment, I just haven’t seen anything indicating that they at some point planned to break the naming convention in this way.
GPT-4.5 is roughly a 10x scale-up of GPT-4, right? And full number jumps in GPT have always been ~100x? So GPT-4.5 seems like the natural name for OpenAI to go with.
10x is what it was, but it wasn’t what it was supposed to be. That’s just what they finally killed it at, after the innumerable bugs and other issues that they alluded to during the livestream and elsewhere, which is expected given the ‘rocket waiting equation’ for large DL runs—after a certain point, no matter how much you have invested, it’s a sunk cost and you’re better off starting afresh, such as, say, with distilled data from some sort of breakthrough model… (Reading between the lines, I suspect that what would become ‘GPT-4.5’ was one of the several still-unknown projects besides Superalignment which suffered from Sam Altman overpromising compute quotas and gaslighting people about it, leading to an endless deathmarch where they kept thinking ‘we’ll get the compute next month’, and the 10x compute-equivalent comes from a mix of what compute they scraped together from failed runs/iterations and what improvements they could wodge in partway even though that is not as good as doing from scratch, see OA Rerun.)
If GPT-4.5 was supposed to be GPT-5, why would Sam Altman underdeliver on compute for it? Surely GPT-5 would have been a top priority?
Maybe Sam Altman just hoped to get way more compute in total, and then this failed, and OpenAI simply didn’t have enough compute to meet GPT-5’s demands no matter how high of a priority they made it? If so, I would have thought that’s a pretty different story from the situation with superalignment (where my impression was that the complaint was “OpenAI prioritized this too little” rather than “OpenAI overestimated the total compute it would have available, and this was one of many projects that suffered”).
If it’s not obvious at this point why, I would prefer to not go into it here in a shallow superficial way, and refer you to the OA coup discussions.
There was o1-pro in 2024 (December). It might be argued that this came with caveats due to its slowness and high cost, but the difference on science questions (GPQA Diamond), math (AIME 2024), and competition code (Codeforces) compared to the GPT-4 Turbo available at the time of his post was huge. The API wasn’t available in 2024, so we didn’t get any benchmarks besides these from OpenAI. In 2025, I tested o1-pro on NYT Connections and it also improved greatly 1, 2. I would probably also consider regular o1 a massive advancement. I don’t think the naming is what matters.
Many corporations are trying to get AIs to be useful in California, fewer elsewhere, and I’m not convinced these will last.
Lately, I’ve been searching for potential shorting opportunities in the stock market among companies likely to suffer from AI-first competition. But it’s been tougher than I expected, as nearly every company fitting this description emphasizes their own AI products and AI transformations. Of course, for many of these companies, adapting won’t be quite that easy, but the commitment is clearly there.
The data appears to support this:
“Adoption is deepening, too: The average number of use cases in production doubled between October 2023 and December 2024”—Bain Brief—Survey: Generative AI’s Uptake Is Unprecedented Despite Roadblocks.
“On the firm side, the Chamber of Commerce recorded a 73 percent annualized growth rate between 2023 and 2024. The Census BTOS survey shows a 78.4 percent annualized growth rate. Lastly, the American Bar Association reported a 38 percent annualized growth rate. Among individual-level surveys, Pew is the only source showing changes over time, with an annualized growth rate of 145 percent. Overall, these findings suggest that regardless of measurement differences in the levels, adoption is rising very rapidly both at the individual and firm level.”—Measuring AI Uptake in the Workplace.
Echoing robo’s comment:
Has there been such adoption? Your remark
Every single corporation in the world is trying to adopt AI into their products. Even extremely slow-moving industries are rushing to adopt AI.
is about attempts to adopt, not lasting adoption. Of course, we can’t make “lasting adoption” mean “adopted for 5 years” if we’re trying to evaluate the prediction right now. But are you saying that there’s lots of adoption that seems probably/plausibly lasting, just by eyeballing it? My vague impression is no, but I’m curious if the answer is yes or somewhat.
(TBC I don’t have a particularly strong prediction or retrodiction about adoption of AI in general in industry, or LLMs specifically (which is what I think Marcus’s predictions are about). At a guess I’d expect robotics to continue steadily rising in applications; I’d expect LLM use in lots of “grunt information work” contexts; and some niche strong applications like language learning; but not sure what else to expect.)
I interpreted the statement to mean “modest, lasting adoption”. I.e. we will see modest adoption, which will be lasting. It’s plausible Gary meant “modest-lasting adoption”, in which case I think there is a better case to be made!
I still think that case is weak, but of course it’s very hard to evaluate at the end of a year, because how do we know if the adoption is lasting. It seems fine to evaluate that in a year or two and see whether the adoptions that happened in 2024 were lasting. I don’t see any way to call that interpretation already, at least given the current state of evidence.
Well, like, if a company tried out some new robotics thing in one warehouse at a small scale in Q1, then in Q2 and Q3 scaled it up to most of that warehouse, and then in Q4 started work applying the same thing in another warehouse, and announced plans to apply to many warehouses, I think it’d be pretty fair to call this lasting adoption (of robotics, not LLMs, unless the robots use LLMs). On the other hand if they were stuck at the “small scale work trying to make a maybe-scalable PoC”, that doesn’t seem like lasting adoption, yet.
Judging this sort of thing would be a whole bunch of work, but it seems possible to do. (Of course, we can just wait.)
Agree, though I think, in the world we are in, we don’t happen to have that kind of convenient measurement, or at least not unambiguous ones. I might be wrong, people have come up with clever methodologies to measure things like this in the past that compelled me, but I don’t have an obvious dataset or context in mind where you could get a good answer (but also, to be clear, I haven’t thought that much about it).
We are absolutely, with no ambiguity, in the “most rapid adoptions of any technology in US history” branch. Every single corporation in the world is trying to adopt AI into their products.
Disagree with your judgement on this one. Agree that everyone is trying to adopt AI into their products, but that’s extremely and importantly different from actual successful adoption. It’s especially importantly different because part of the core value proposition of general AI is that you’re not supposed to need to retool the environment around it in order to use it.
Agree that the jury is still out on how lasting the adoption will be, but it’s definitely not “modest” (as I mentioned in another thread, it’s plausible to me Gary meant “modest-lasting” adoption instead of “modest and lasting adoption”, i.e. the modest is just modifying the “lasting”, not the “adoption”, which was the interpretation I had. I would still take the under on that, but agree it’s less clear cut and would require a different analysis.)
This is an incredibly uncharitable read, biased and redolent of motivated reasoning.
If you applied the same to almost any other set of predictions I think you could nitpick those too. It also lacks context (e.g. yes, 7-10 was an underestimate, but at a time when there were like two, and people were surprised that I said such models would become widespread). Even @robo here sees that you have been uncharitable.
The one that annoys me the most and makes me not even want to talk about the rest is re GPT-5. Practically everybody thought GPT-5 was imminent; I went out on a limb and said it would not be. I used it as an explicit specific yardstick (which I should be credited for) and that explicit yardstick was not met. Yet you are giving me zero credit.
You are just wrong about profits, inventing your own definition. Re adoption, many corporations have TRIED, but proofs of concept are not adoption. There have been loads of articles and surveys written about companies trying stuff out and not getting the ROI they expected.
I would be happy to respond in more detail to a serious, balanced investigation that evaluated my predictions over time, going back to my 1998 article on distribution shift, but this ain’t it.
One lesson you should maybe take away is that if you want your predictions to be robust to different interpretations (including interpretations that you think are uncharitable), it could be worthwhile to try to make them more precise (in the case of a tweet, this could be in a linked blog post which explains in more detail). E.g., in the case of “No massive advance (no GPT-5, or disappointing GPT-5)” you could have said “Within 2024 no AI system will be publicly released which is as much of a qualitative advance over GPT-4 in broad capabilities as GPT-4 is over GPT-3 and where this increase in capabilities appears to be due to scale-up in LLM pretraining”. This prediction would have been relatively clearly correct (though I think also relatively uncontroversial, at least among people I know, as we probably should only have expected to get to ~GPT-4.65 in terms of compute scaling and algorithmic progress by the end of 2024). You could try to operationalize this further in terms of benchmarks or downstream tasks.
To the extent that you can make predictions in terms of concrete numbers or metrics (which is not always possible, to be clear), this avoids ~any issues due to interpretation. You could also make predictions about Metaculus questions when applicable, as these also have relatively solid and well-understood resolution criteria.
I think Oliver put in a great effort here, and that the two of you have very different information environments, which results in him reading your points (which are underspecified relative to, e.g., Daniel Kokotajlo’s predictions) differently than you may have intended them.
For instance, as someone in a similar environment to Habryka, that there would soon be dozens of GPT-4 level models around was a common belief by mid-2023, based on estimates of the compute used and Nvidia’s manufacturing projections. In your information environment, your 7-10 number looks ambitious, and you want credit for guessing way higher than other people you talked to (and you should in fact demand credit from those who guessed lower!). In our information environment, 7-10 looks conservative. You were directionally correct compared to your peers, but less correct than people I was talking to at the time (and in fact incorrect, since you gave both a lower and upper bound—you’d have just won the points from Oli on that one if you said ‘7+’ and not ‘7-10’).
I’m not trying to turn the screw; I think it’s awesome that you’re around here now, and I want to introduce an alternative hypothesis to ‘Oliver is being uncharitable and doing motivated reasoning.’
Oliver’s detailed breakdown above looks, to me, like an olive branch more than anything (I’m pretty surprised he did it!), and I wish I knew how best to encourage you to see it that way.
I think it would be cool for you and someone in Habryka’s reference class to quickly come up with predictions for mid-2026, and drill down on any perceived ambiguities, to increase your confidence in another review to be conducted in the near-ish future. There’s something to be gained from us all learning how best to talk to each other.
I feel the issue with your GPT-5 prediction is that it specifies both “no massive advance” and “no GPT-5”. When there was a massive advance but no GPT-5, it makes it ambiguous which half of the prediction is more important.
It’s slightly weird to have the correctness of it depend on OpenAI’s branding choices, though. If we decided that the GPT part of the prediction was more important, then in an alternative world that was otherwise identical to our own but where OAI had chosen to call one of their reasoning models GPT-5, the prediction would flip from false to correct. So that makes me lean a bit toward weighting the “no massive advance” part more, though I also wouldn’t think it unreasonable to split the difference and give you half credit for having one part of a two-part prediction correct.
I agree with your point about profits; it seems pretty clear that you were not referring to money made by the people selling the shovels.
But I don’t see the substance in your first two points:
You chose to give a range with both a lower and an upper bound; the success of the prediction was evaluated accordingly. I don’t see what you have to complain about here.
In the linked tweet, you didn’t go out on a limb and say GPT-5 wasn’t imminent! You said it either was not imminent or would be disappointing. And you said this in a parenthetical to the claim “No massive advance”. Clearly the success of the prediction “No massive advance (no GPT-5, or disappointing GPT-5)” does not depend solely on the nonexistence of GPT-5; it can be true if GPT-5 arrives but is bad, and it can be false if GPT-5 doesn’t arrive but another “massive advance” does. (If you meant it only to apply to GPT-5, you surely would have just said that: “No GPT-5 or disappointing GPT-5.”)
Regarding adoption, surely that deserves some fleshing out? Your original prediction was not “corporate adoption has disappointing ROI”; it was “Modest lasting corporate adoption”. The word “lasting” makes this tricky to evaluate, but it’s far from obvious that your prediction was correct.
I thought price wars was false, although I haven’t been paying that much attention to companies’ pricings. GPT was $20/month in 2023 and it’s still $20/month. IIRC Gemini/Claude were available in 2023 but they only had free tiers so I don’t know how to judge them.
GPT was $20/month in 2023 and it’s still $20/month.
Those are buying wildly different things. (They are not even comparable in terms of real dollars. That’s like a 10% difference, solely from inflation!)
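(Taking that ~10% figure at face value rather than doing an independent CPI calculation, a sticker price that stays at a nominal $20/month is already a real-terms price cut:)

```python
# A nominal price that stays flat while prices rise ~10% (the rough figure
# from the comment above, assumed here; not an independent inflation estimate)
# falls in real terms.
nominal_price = 20.00                 # $/month, in 2023 and now
cumulative_inflation = 0.10           # assumed, per the comment above
real_price_in_2023_dollars = nominal_price / (1 + cumulative_inflation)
print(f"${real_price_in_2023_dollars:.2f}/month in 2023 dollars")  # ~$18.18/month
```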
Shouldn’t the inversion simply be “There was a massive advance”?
Sure, edited.