I notice that I am confused about the state of internal-only models at places like OpenAI. I wonder if people are trying to aggregate the informal reports and rumors on that.
In particular, I usually assume that internal models are ~6 months ahead of what’s released, but I don’t know if that’s a good estimate.
To make my confusion more concrete: I don’t quite understand how publicly available Claude Code can be useful for internal OpenAI development if internal models from 6 months in the future are available. (Especially when taking into account that using Claude Code potentially gives information to a competitor.) Internal models might be expensive to use, but with only a few thousand employees this should not matter much.
(I can see how Claude Code might be useful for personal projects by OpenAI employees, precisely because they might want to keep those projects private from their employer.)
Anyway, I wonder if there are some “interest groups” where people talk about rumors related to internal-only models. (Events like the IMO gold result do give us a bit of a window into what’s available, in that sense.)
In a race for clout, they could at any time grab six months out of thin air on benchmark graphs by closing the internal/external release gap. No idea whether they have made this one-time play.
That’s probably too expensive and too risky.
It would mean shipping an unsafe, not-well-tested model, and exposing too many of the latest tricks to competitors too early.
I would not expect them to do that (they don’t have enough compute to serve slow, huge models to a large number of users anyway; that’s, in part, why GPT-5 is very different from GPT-4/4.5 in terms of the price/capability trade-off).
It’s a matter of degree. There’s already shrinkage: GPT-4 took nearly a year to release.
Yes, that’s certainly true. (Although, with the original GPT-4, the delay is thought to have been mostly dedicated to safety improvements and, perhaps, better instruction following, with shrinkage mostly occurring after the initial release.)
In any case, they could have boosted capabilities even without relying on future models, just by offering less-shrunk versions of GPT-5 in addition to the ones they did offer, and they chose not to do that.
Some part of this is that capabilities are not linear, and from what I gather the newer internal models may be less polished (if more capable) than the ones they make public. Especially now that more of the value-add is in post-training, I suspect using the work-in-progress models only feels good closer to release.
Yes, and, perhaps, one would usually want to shrink before post-training, both to make post-training more affordable per iteration, and because I am not sure whether post-training-acquired capabilities survive shrinkage as well as pre-training-acquired ones (I wonder what is known about that; I want to understand that aspect better; is it insane to postpone shrinkage until after post-training, or is it something to try?).
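(To make the “shrinkage” terminology concrete, here is a minimal sketch of logit distillation, assuming a PyTorch / Hugging Face-style causal LM whose forward pass returns `.logits`; the ordering question above is just about whether the teacher passed in here is a pre-trained-only model, with post-training applied to the student afterwards, or an already post-trained model. This is an illustrative sketch, not how any lab actually does it.)

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, temperature=2.0):
    """One 'shrinkage' step: match the small student's next-token
    distribution to the large teacher's on the same batch."""
    # Teacher is frozen; it only supplies target distributions.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits

    # Soften both distributions and minimize KL(teacher || student),
    # scaled by T^2 as in standard knowledge distillation.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2
```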
Internal models aren’t 6 months ahead in general.
Sometimes internal models are several months ahead on key benchmarks or capabilities. For example, an internal OpenAI model won gold on the IMO, but it might be a while before a public OpenAI model does as well at the IMO or other math competitions. But you wouldn’t want to use that model for everyday work, and I don’t think OpenAI uses it a lot internally.
Also, Anthropic is probably a few months ahead of OpenAI in coding.