This is an incredibly uncharitable read, biased and redolent of motivated reasoning.
If you applied the same to almost any other set of predictions I think you could nitpick those too. It also lacks context (e.g., yes, 7-10 was an underestimate, but at a time when there were like two, and people were surprised that I said such models would become widespread). Even @robo here sees that you have been uncharitable.
The one that annoys me the most and makes me not even want to talk about the rest is re GPT-5. Practically everybody thought GPT-5 was imminent; I went out on a limb and said it would not be. I used it as an explicit specific yardstick (which I should be credited for) and that explicit yardstick was not met. Yet you are giving me zero credit.
You are just wrong about profits, inventing your own definition. Re adoption, many corporations have TRIED, but proofs of concept are not adoption. There have been loads of articles and surveys written about companies trying stuff out and not getting the ROI they expected.
I would be happy to respond in more detail to a serious, balanced investigation that evaluated my predictions over time, going back to my 1998 article on distribution shift, but this ain’t it.
One lesson you should maybe take away is that if you want your predictions to be robust to different interpretations (including interpretations that you think are uncharitable), it could be worthwhile to try to make them more precise (in the case of a tweet, this could be in a linked blog post which explains in more detail). E.g., in the case of “No massive advance (no GPT-5, or disappointing GPT-5)” you could have said “Within 2024 no AI system will be publicly released which is as much of a qualitative advance over GPT-4 in broad capabilities as GPT-4 is over GPT-3, and where this increase in capabilities appears to be due to scale-up in LLM pretraining”. This prediction would have been relatively clearly correct (though I think also relatively uncontroversial, at least among people I know, as we probably should only have expected to get to ~GPT-4.65 in terms of compute scaling and algorithmic progress by the end of 2024). You could try to operationalize this further in terms of benchmarks or downstream tasks.
To the extent that you can make predictions in terms of concrete numbers or metrics (which, to be clear, is not always possible), this avoids ~any issues due to interpretation. You could also make predictions about Metaculus questions when applicable, as these also have relatively solid and well-understood resolution criteria.
I think Oliver put in a great effort here, and that the two of you have very different information environments, which results in him reading your points (which are underspecified relative to, e.g., Daniel Kokotajlo’s predictions) differently than you may have intended them.
For instance, speaking as someone in a similar environment to Habryka: the belief that there would soon be dozens of GPT-4-level models around was common by mid-2023, based on estimates of the compute used and Nvidia’s manufacturing projections. In your information environment, your 7-10 number looks ambitious, and you want credit for guessing way higher than other people you talked to (and you should in fact demand credit from those who guessed lower!). In our information environment, 7-10 looks conservative. You were directionally correct compared to your peers, but less correct than people I was talking to at the time (and in fact incorrect, since you gave both a lower and an upper bound; you’d have won the points from Oli on that one if you had said ‘7+’ rather than ‘7-10’).
I’m not trying to turn the screw; I think it’s awesome that you’re around here now, and I want to introduce an alternative hypothesis to ‘Oliver is being uncharitable and doing motivated reasoning.’
Oliver’s detailed breakdown above looks, to me, like an olive branch more than anything (I’m pretty surprised he did it!), and I wish I knew how best to encourage you to see it that way.
I think it would be cool for you and someone in Habryka’s reference class to quickly come up with predictions for mid-2026, and drill down on any perceived ambiguities, to increase your confidence in another review to be conducted in the near-ish future. There’s something to be gained from us all learning how best to talk to each other.
I feel the issue with your GPT-5 prediction is that it specifies both “no massive advance” and “no GPT-5”. When there was a massive advance but no GPT-5, it’s ambiguous which half of the prediction should carry more weight.
It’s slightly weird to have the correctness of it depend on OpenAI’s branding choices, though. If we decided that the GPT part of the prediction was more important, then in an alternative world that was otherwise identical to our own but where OAI had chosen to call one of their reasoning models GPT-5, the prediction would flip from false to correct. So that makes me lean a bit toward weighting the “no massive advance” part more, though I also wouldn’t think it unreasonable to split the difference and give you half credit for having one part of a two-part prediction correct.
I agree with your point about profits; it seems pretty clear that you were not referring to money made by the people selling the shovels.
But I don’t see the substance in your first two points:
You chose to give a range with both a lower and an upper bound; the success of the prediction was evaluated accordingly. I don’t see what there is to complain about here.
In the linked tweet, you didn’t go out on a limb and say GPT-5 wasn’t imminent! You said it either was not imminent or would be disappointing. And you said this in a parenthetical to the claim “No massive advance”. Clearly the success of the prediction “No massive advance (no GPT-5, or disappointing GPT-5)” does not depend solely on the nonexistence of GPT-5; it can be true if GPT-5 arrives but is bad, and it can be false if GPT-5 doesn’t arrive but another “massive advance” does. (If you meant it only to apply to GPT-5, you surely would have just said that: “No GPT-5 or disappointing GPT-5.”)
Regarding adoption, surely that deserves some fleshing out? Your original prediction was not “corporate adoption has disappointing ROI”; it was “Modest lasting corporate adoption”. The word “lasting” makes this tricky to evaluate, but it’s far from obvious that your prediction was correct.