I am happy that this suggests our policy is not primarily ‘try to murder or cripple Anthropic in particular,’ or at least that they will not be too hypocritical around that.
Whether GPT-5.6 Sol (Sol 5.6?) or Fable 5 gets to be explicitly cleared for release first (outside a strict whitelist, and not just via Anthropic implementing KYC) will be the next piece of evidence about this. The export control directive currently applies to Fable 5, but not to Sol 5.6, and this distinction might persist. That seems like a silly possibility, but seeming silly is not obviously a good heuristic for ruling it out. (It’s possible OpenAI adopted the Sol/Terra/Luna naming convention in part to make it more likely that any Mythos-like restrictions get appropriately scoped to Sol 5 models only, rather than to GPT-5.6+ as a whole.)
Section 2 of the post talks about generalization in the classical sense, where an RLVRed model that didn’t train on a situation-class magically does well there, or maybe where a model was magically RLVRed on so many situations that it never encounters something truly novel ever again. That is different from the sense in which I expect automated RLVR to solve generalization, which is by observing failures of generalization (that are themselves sufficiently comprehensible with the current model) and producing RLVR training data that addresses them individually, to be included when building the next model. Thus jaggedness is addressed by slowly working on fixing the gaps as they are encountered, when they happen to be near enough to the edge of the existing skills.
(Incidentally, a lot of what currently contributes to the felt impression of jaggedness, as opposed to this more technical RLVR-and-deep-skills distinction, should also go away as the rising tide of pretraining scale comes in, which after 2026 will happen most prominently by 2028-2029, and to some meaningful further effect by 2031-2032. After that it’s done: no more significant scaling of pretraining for many years.)
Section 1 of the post is some combination of making a distinction between engineering-type skills that can be RLVRed and those that can’t be with current methods, and making a distinction between skills where success is easy to verify and those where it takes too much time, requires experimental feedback, or can’t be packaged in a reproducible enough way to be usable with RLVR. The method has a scope of applicability and doesn’t magically work for everything. This doesn’t mean that the harder engineering-type things can’t be understood as a practice of more modular skills that can be directly RLVRed. You can turn any engineering-type activity into a collection of textbooks with exercises that are individually of manageable scope.
So I take Sections 1 and 2 of the essay as being importantly mistaken (about a priori vs. adaptive generalization; and about modularity of engineering-type skills), while Sections 3 and 4 are more promising, but not predictive of timing.