I have not seen any research backing this claim up, have you?
Petropolitan
A very interesting paper quantifying the conventional wisdom that a large degree of progress is from inference scaling.
You got access to GPT-3 and 3.5 from OpenAI, but have you tried to get the same for the Claude models that are no longer publicly available, for better coverage of the first half of 2025?
I liked your post a lot, particularly section 5a, and strongly upvoted it. However, today I read something of a counter-essay by Dwarkesh Patel: https://www.dwarkesh.com/p/the-sample-efficiency-black-hole It’s a good piece that I would recommend to all readers of this blog post.
In particular, the counterargument I found strongest was about scaling:
The way the scaling law equations work is that parameter and data terms are added to the loss independently. If you have a model that is trained compute optimally, and suppose you ask, well what if I just wanna maximize sample efficiency and use less data—and I’ll throw in as many parameters as it takes to make that happen. With the constants from the Chinchilla scaling laws paper (and the nature of the result wouldn’t change even with different constants), even if you increased the number of parameters by infinity, that would only decrease by a factor of ~10 the amount of data you need in order to keep the same loss. Humans are somewhere between thousands to millions of times more sample efficient than these models. Scaling of current models simply can’t make up for that discrepancy. This really does suggest that humans are on a different scaling curve altogether.
I tried to refute that with some Fermi estimates but ran into difficulties converting between bytes and tokens. Mathematically the logic is sound. I wonder what you would answer.
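For anyone who wants to check the "factor of ~10" figure, here is a minimal sketch of the arithmetic. It assumes the parametric loss form L(N, D) = E + A/N^alpha + B/D^beta and the fitted constants as commonly quoted from the Chinchilla paper. Those numbers and all function names are my assumptions for illustration and are not taken from the essay.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# ASSUMPTION: constants as commonly quoted from the Chinchilla paper's
# parametric fit; they are not stated in the essay quoted above.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params, n_tokens):
    """Predicted loss for a model with n_params parameters and n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_needed_with_infinite_params(n_params, n_tokens):
    """Tokens D' such that L(infinity, D') equals L(n_params, n_tokens).

    With infinitely many parameters the A / N**alpha term vanishes, so the
    data term alone must account for the whole reducible loss.
    """
    reducible = A / n_params**ALPHA + B / n_tokens**BETA
    return (B / reducible) ** (1.0 / BETA)


def data_saving_factor(n_params, n_tokens):
    """How many times less data an infinitely large model needs for equal loss."""
    return n_tokens / tokens_needed_with_infinite_params(n_params, n_tokens)


def compute_optimal_saving_factor():
    """Saving factor when starting from a compute-optimal model.

    Minimising the loss under a budget C = 6 * N * D gives
    alpha * A / N**alpha == beta * B / D**beta, so the ratio of the two
    loss terms is beta / alpha and the factor depends only on the exponents.
    """
    return (1.0 + BETA / ALPHA) ** (1.0 / BETA)


if __name__ == "__main__":
    # Hypothetical starting point: 70B parameters trained on 1.4T tokens.
    print("70B / 1.4T model:", data_saving_factor(70e9, 1.4e12))
    print("compute-optimal :", compute_optimal_saving_factor())
```

Under these assumed constants the compute-optimal case comes out at roughly 8 to 9, which is consistent with the quoted "~10". Because the factor depends only on the two exponents, different values of A, B and E would not change the order of magnitude.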
Or it’s a different quant, or the initial pricing included a big premium because nothing even roughly comparable existed at the time (similar to o1, which probably wasn’t different in size from 4o or o3 despite almost an OOM cost difference), or they objectively lacked the hardware to serve it and the price reflected that.
I think the practice is technically called code-switching in linguistics. Why do you think it doesn’t apply?
If you compare Li’s IKP scores and SimpleBench scores across GPT-5.1 and before vs. 5.2-5.4, you will see random variation and often even regression as the version number increments. Exactly the same thing happens with the Opus 4-series models.
The reason is the same in both cases: post-training quite commonly displaces obscure facts (for IKP) and spatial reasoning skills (for SB), which occupy valuable “real estate” in the weights/experts, in favor of coding skills that are in commercial demand. The only alternative explanation I could think of is quantization while keeping the param count the same. Maybe one can devise more, but the size increase you hypothesize from 5.1 to 5.2 seems to clearly contradict this phenomenon.
As for the 30:1 sparsity for 4o, I have never suggested it. It seems reasonable to assume 4o, o1, o3, 4.1 and 5-5.4 all have around 100A2000B params, that is ~20:1 sparsity. Regarding the margins and the price of H100 hours, which time and which provider do you refer to? It has varied a lot over recent years and even months. Also, I advise treating prices very carefully and not as a source of ultimate truth, as o1 is 7.5x costlier than o3 (or 4.1) but is clearly not very different in size, if at all.
Nice to see you updating away from our disagreement on the sparsity and total param counts!
Mind a timecode for R. Pope? (Side note: I initially thought of Leo XIV when I saw that word =D)
There was plenty of evidence by the 1950s, since birth rates had fallen below the replacement level in many European countries during the Interbellum, and the process clearly started before the Great Depression. But they recovered in the late 1930s for reasons still arguably unclear, and then skyrocketed during the Baby Boom, so no one was interested in analyzing them. Even now, AFAIK, academic interest in that phenomenon is concentrated on topics with some immediate application, such as which pro-natalist measures worked well, and not on the fundamental questions.
I initially included three, and now after an edit four, white-collar tasks and only one blue-collar task (after all, white-collar workers face a greater risk of automation for obvious reasons). In all the former you can’t substitute an LLM judge for ground truth, no matter how complicated the judge is. A general practitioner only finds out if the treatment works for the patient after they prescribe it, etc. Most of the white-collar jobs in the world are like this!
As for washing machines, there are dozens of models in production, likely hundreds in operation, operating conditions vary (different installation, water, usage etc.), the number of possible malfunctions is significant for each machine in each case, the data on these malfunctions is collected very rarely, and when it is, it’s in practice inaccessible to AI companies beyond brief summaries in operating manuals. All this applies to any mechanical equipment in need of maintenance and repair that is not owned by the AI companies, including robots.
But aren’t most of the tasks in the world economy unverifiable in silico? How could LLMs pursue RLVR on, for example:
quality control in manufacturing, or
medical tasks like therapeutics, or
effective persuasion (important for world takeover!), or
management of human teams, or
blue-collar tasks like fixing a leaking washing machine?
The overall argument sounds convincing, but what exactly do you mean by “how smart” and “slightly superhuman geniuses”? If the future pretraining data is to be dominated by RLVR tasks, that leads to even more jaggedness of capabilities than we have now.
To illustrate, it’s not implausible that Mythos+3 will be a strongly superhuman hacker but still fail at basic physical tasks. It can be a superhuman intelligence by Bostrom’s definition but arguably still not a general intelligence (this is why I think it’s better to avoid AGI/ASI terminology).
Do you think you could do a 2026 update?
Contracts exist because commitments can’t be directly verified and trust must be manufactured.
How are AIs going to verify commitments and manufacture trust? Any imperfectly aligned self-modifying AI agent which learns how to scam will try to scam as many other AI agents as possible, and that’s orthogonal to the “generality” of AI. The amount of AI fraud is only going to rise as capabilities increase!
Sun-synchronous orbits are already very popular. Have you considered how large a constellation they can safely sustain before everyone placing satellites has to pay some kind of congestion fee?
Where does one get ~90T of non-useless text tokens?
That is a useful thing if implemented well, and indeed it is a thing I use (from OpenAI and Anthropic) more often than I use Google Search. But that thing is not Google Search.
Several hours ago I googled an uncommon steel grade (an alphanumeric designation with the word steel). In the late 2010s Google would have given me search results in milliseconds and at least one of the first two links would have had the specs I needed.
Today I got a page of garbage links which happened to have the same number in different contexts, and then 30 seconds later, after a lot of tool calls and inference, the AI overview provided me the links I actually needed. And this is not an isolated occurrence: it happened several times earlier this week!
I know Google is not actually a web search company, but this is not a sustainable way to run web search, and I sincerely hope that they revert to the old algorithms which used to work so well (BM25, tf-idf etc., maybe with a bit of vector search added).
Strong upvote, the key here is “VR” in RLVR: there are no automatically verifiable rewards for good or convincing writing, only RLHF, the cost of which scales proportionally with the length of the writing evaluated (and if you hire non-Americans as RLHF trainers to save money, the result is unlikely to fit well with the stylistic preferences of Americans). The labs can use engagement as a metric, but that will lead to the “baiting” already very common on social media and will not convince anyone.
Somewhat tangential to the questions of whether this essay was AI-written and whether any human actually writes like an LLM, I think linguists now widely agree that LLMs picked up a lot of traits from the formal register of African varieties of English when OpenAI (and later other American companies) hired Kenyans and Nigerians (possibly a lot of English teachers among them!) to do RLHF.
And for the same copyright reasons the labs will never allow the users to see the pretraining data in any way
Since at least the 18th century, every single new science and every major branch of science (not to speak of engineering itself and its branches) was built by “trial and error informed by a narrow understanding” (sometimes called “scientific trial and error”) and not by abstract theoretical insights, which usually came much later.
In retrospect, the assumption that AI would somehow be the other way round looks quite silly, in my opinion.