Thanks again for that (and the hat tip), I will link to your post in the future when this topic comes up again! As a side note, I would have laid out that section as a postscript because that’s what it basically is ;-)
Petropolitan
Thanks a lot (and big thanks to Helen Toner!), do you think you could add AI capable of a takeover, or rather a civilization-ending failed takeover attempt (as a lower and more practical bar) to the post?
they may be smart, agentic, and competent enough to be takeover-capable themselves
Might we just put aside AGI and ASI buzzwords and use “takeover-capable AI” as a go-to term instead? I believe it’s of secondary importance whether a takeover-capable AI system is “general” (whatever that means) and/or “superhuman”
Maybe, but a literal reading would assign Mythos Preview to “other Anthropic models” and thus allow it to be used both by the US Government and by a few non-American Glasswing partners
Doesn’t Glasswing use Mythos Preview, not Mythos 5?
There’s an argument you have likely encountered at least once: that the empirical work on non-superintelligent alignment will be useful for aligning ASI (in the Yudkowskian sense) as well, and since any human coordination is imperfect and we can only delay the development of the latter for a limited amount of time, this is the only realistic way to go.
Also, I’m pretty sure very few people in the field back at the time understood the “no historical track record” part. Seems likely to be a selection effect: the people who did probably abstained from entering AI safety in the first place
On the contrary, rockets are an excellent example confirming my thesis! Actually, the first rockets were built empirically in ancient China, and Early Modern Indians developed them into such an effective weapon that the British copied them, improved them, and literally burned Copenhagen with just ~300 of them in 1807. So all the theoretical insights on rocket ballistics in the early 20th century were built upon the military research of the 19th century.
If you have better counterexamples, please present them
Since at least the 18th century, every single new science and every major branch of the sciences (not to speak of engineering itself and its branches) was built by “trial and error informed by a narrow understanding” (sometimes called “scientific trial and error”) and not by some abstract theoretical insights, which usually came much later.
In retrospect this assumption that AI would somehow be the other way round, in my opinion, looks quite silly
I have not seen any research backing this claim up, have you?
A very interesting paper quantifying the conventional wisdom that a large degree of progress is from inference scaling.
You have got access to GPT-3 and 3.5 from OpenAI, but have you tried to get the same for the now publicly unavailable Claude models, for better coverage of the first half of 2025?
I liked your post, particularly section 5a, a lot, and strongly upvoted it. However, today I read something of a counter-essay by Dwarkesh Patel: https://www.dwarkesh.com/p/the-sample-efficiency-black-hole It’s a good piece that I would recommend to all readers of this blog post.
In particular, the counterargument I found strongest was about scaling:
The way the scaling law equations work is that parameter and data terms are added to the loss independently. If you have a model that is trained compute optimally, and suppose you ask, well what if I just wanna maximize sample efficiency and use less data—and I’ll throw in as many parameters as it takes to make that happen. With the constants from the Chinchilla scaling laws paper (and the nature of the result wouldn’t change even with different constants), even if you increased the number of parameters by infinity, that would only decrease by a factor of ~10 the amount of data you need in order to keep the same loss. Humans are somewhere between thousands to millions of times more sample efficient than these models. Scaling of current models simply can’t make up for that discrepancy. This really does suggest that humans are on a different scaling curve altogether.
I tried to refute that with some Fermi estimates but ran into difficulties converting between bytes and tokens. Mathematically the logic is sound. I wonder what you would answer.
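For anyone who wants to check the "factor of ~10" in the quote, here is a rough sketch of the arithmetic. It assumes the parametric loss form from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, the commonly quoted fitted exponents alpha ≈ 0.34 and beta ≈ 0.28 (recalled, not re-derived here), and the usual C = 6·N·D compute approximation. The function name is made up for illustration. Under those assumptions the constants E, A and B drop out, and only the two exponents matter.

```python
# Parametric loss form assumed here: L(N, D) = E + A / N**alpha + B / D**beta
# The exponents are the commonly quoted Chinchilla fit; treat them as assumptions.
ALPHA = 0.34
BETA = 0.28


def data_saving_factor(alpha: float = ALPHA, beta: float = BETA) -> float:
    """Factor by which data can shrink at fixed loss when N -> infinity,
    starting from a compute-optimal (N, D) pair under C = 6 * N * D."""
    # Minimizing the loss at fixed C = 6 * N * D gives
    #   alpha * A / N**alpha == beta * B / D**beta,
    # so the parameter term equals (beta / alpha) times the data term.
    # Sending N -> infinity removes the parameter term entirely, so the new
    # data term has to cover both:
    #   B / D_new**beta == (1 + beta / alpha) * B / D**beta
    # Solving for D / D_new:
    return (1.0 + beta / alpha) ** (1.0 / beta)


factor = data_saving_factor()
print(f"data reduction factor with unlimited parameters: {factor:.1f}")
```

With these exponents the factor lands between 8 and 9, which is consistent with the "~10" in the quote, and it stays at a single order of magnitude for any exponents in the same ballpark.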
Or it’s a different quant, or the initial pricing included a big surcharge for the lack of anything even roughly comparable at the time (similar to o1, which probably wasn’t different in size from 4o or o3 despite almost an OOM cost difference), or they used to objectively lack hardware to serve it and the price reflected that
I think the practice is technically called code-switching in linguistics, why do you think it doesn’t apply?
If you compare Li’s IKP scores and SimpleBench scores across GPT-5.1 and before vs. 5.2-5.4, you will see random variation and often even regression as the version number increments. Exactly the same thing happens with Opus 4-series models as well.
The reason is the same in both cases: post-training quite commonly displaces obscure facts (for IKP) and spatial reasoning skills (for SB), which occupy valuable “real estate” of weights/experts, in favor of coding skills in commercial demand. The only alternative explanation I could think of is quantization while keeping the param count the same; maybe one can devise more, but the size increase you hypothesize for 5.1>5.2 seems to clearly contradict this phenomenon.
As for 30:1 sparsity for 4o, I have never suggested it. It seems reasonable to assume 4o, o1, o3, 4.1 and 5-5.4 all have around 100A2000B params, that is ~20:1 sparsity. Regarding the margins and the price of H100 hours, which time and which provider do you refer to? It has varied a lot in recent years and even months. Also, I advise treating prices very carefully and not as a source of ultimate truth, as o1 is 7.5x costlier than o3 (or 4.1) but is clearly not very different in size, if at all.
Nice to see you updating away from our disagreement on the sparsity and total param counts!
Mind a timecode for R. Pope? (Side note: I initially thought of Leo XIV when I saw that word =D)
There was plenty of evidence by the 1950s, since birth rates fell below the replacement level in many European countries during the Interbellum, and the process clearly started before the Great Depression. But they recovered in the late 1930s for reasons still arguably unclear, and then skyrocketed during the Baby Boom, so no one was interested in analyzing them. Even now, AFAIK, the academic interest in that phenomenon is concentrated in topics having some application right now, such as which pro-natalist measures worked well, and not the fundamental questions
I initially included three, and now with an edit four, white-collar tasks and only one blue-collar task (after all, white-collar workers are facing a greater risk of automation for obvious reasons). In all the former ones you can’t substitute an LLM judge for ground truth, no matter how complicated the judge is. A general practitioner only finds out if the treatment works for the patient after they prescribe it, etc. Most of the white-collar jobs in the world are like this!
As for washing machines, there are dozens of models in production, likely hundreds in operation, operating conditions vary (different installation, water, usage etc.), the number of possible malfunctions is significant for each machine in each case, the data on these malfunctions is collected very rarely, and when it is, it’s in practice inaccessible to AI companies beyond brief summaries in operating manuals. All this applies to any mechanical equipment in need of maintenance and repair that is not owned by the AI companies, including robots
But aren’t most of the tasks in the world economy unverifiable in silico? How could LLMs pursue RLVR on, for example:
quality control in manufacturing, or
medical tasks like therapeutics, or
effective persuasion (important for world takeover!), or
management of human teams, or
blue-collar tasks like fixing a leaking washing machine?
The overall argument sounds convincing, but what exactly do you mean by “how smart” and “slightly superhuman geniuses”? If the future pretraining data is to be dominated by RLVR tasks, that leads to even more jaggedness of capabilities than we have now.
To illustrate, it’s not implausible that Mythos+3 will be a strongly superhuman hacker but still fail at basic physical tasks. It can be superhuman intelligence in Bostrom’s definition but arguably still not general intelligence (this is why I think it’s better to avoid AGI/ASI terminology)
No, 5.5 can’t be derived from 4.5 because they use wildly different image tokenizers: 4.5 uses exactly that of 4o, while 5.5 has a variation of a newer architecture introduced in 5.1 in October 2025! https://developers.openai.com/api/docs/guides/images-vision
(Also it’s now quite hard for me to imagine 5.1 adapted from 5 with such a difference)