There is something conceptually misleading about the usual ways of framing algorithmic progress. Imagine that in 2022 the number of apples produced on some farm increased 10x year-over-year, then in 2023 the number of oranges increased 10x, and then in 2024 the number of pears increased 10x. That doesn’t mean that the number of fruits is up 1000x in 3 years.
Price-performance of compute compounds over many years, but most algorithmic progress doesn’t: it only applies to the things relevant around the timeframe when that progress happens, and stops being applicable a few years later. So forecasting over multiple years in terms of effective compute that doesn’t account for this issue would greatly overestimate progress. There are some pieces of algorithmic progress that do compound, and it would be useful to treat them as fundamentally different from the transient kind.
This is a reasonable point in principle, but I don’t know how important it is in practice. My sense is that most things identified as algorithmic improvements continue to be algorithmic improvements over the previously-done thing at higher scales? E.g. transformers beating LSTMs, Chinchilla scaling, GeLU over ReLU, probably RL to train reasoning, etc.
I think pretraining data pipeline improvements have this issue: they stop helping with larger models that want more data (or it becomes about midtraining). And similarly for the benchmark-placating better post-training data that enables ever less intelligent models to get good scores, but probably doesn’t add up to much (at least when it’s not pretraining-scale RLVR).
Things like MoE, GLU over LU, maybe DyT or Muon add up to a relatively modest compute multiplier over the original Transformer. For example Transformer++ vs. Transformer in Figure 4 of the Mamba paper suggests a total compute multiplier of 5x, attained over 6 years since the original Transformer (for dense models). This is emphatically not 3x-4x per year!
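As a quick sanity check on what that implies per year (a minimal sketch, using only the 5x-over-6-years figure quoted above):

```python
# Annualized rate implied by a ~5x total compute multiplier accumulated over 6 years.
total_multiplier = 5.0
years = 6
per_year = total_multiplier ** (1 / years)
print(f"~{per_year:.2f}x per year")  # ~1.31x/year, far below 3x-4x/year
```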
Chinchilla scaling is more about careful methodology with compute optimality rather than a specific algorithmic improvement, and even now most demonstrations of compute multipliers fail to take one of its lessons and cool down the models before measurement. This could lead to hilarious results such as Figure 11 of the OLMo 2 paper where an apparent 2x compute multiplier vanishes to nothing after cooling (admittedly, nobody expected this to be a real compute multiplier, but in a more confusing case it could’ve been taken to be one).
(a) is faster than your Mamba paper example but still much slower than 3-4x/year. (b) and (c) are at ~4x, though (c) isn’t much longer than a year. And these are basically not taking into account post-training efficiency gains iiuc.
We’re not working with many data points but it seems like these provide an existence proof that gains can compound across at least 3 years.
Would love to see some updated data collection on this, I think we could get more evidence on your hypothesis.
The Mamba paper uses a relevant kind of methodology: it directly compares different algorithmic ingredients in the same setting, training on a fixed dataset and measuring perplexity (do note it’s not trying MoE, so the actual total improvement is greater). It’s a way of directly comparing cumulative improvement over all that time. To impact future frontier capabilities, an algorithmic ingredient from the past needs to be both applicable to the future frontier models, and help with benchmarks relevant to those frontier models, compared to the counterfactual where the frontier model doesn’t use the algorithmic ingredient.
When an ingredient stops being applicable to the frontier model, or stops being relevant to what’s currently important about its capabilities, it’s no longer compounding towards frontier capabilities. It wouldn’t matter if that same ingredient is helping a different contemporary non-frontier small model to match a much older model with much less compute. Or that it’s helping the frontier model to do much better than an older model on a benchmark that used to matter then, but doesn’t matter now.
So I’m skeptical of the Epoch paper’s overall framing, its willingness to compare everything against everything indirectly; that’s a lot of the point I’m making. You mostly can’t use methods from 2014 and frontier AI compute from 2025 to train something directly comparable to a lightweight version of a frontier model of 2025 trained on less compute (but still compute optimally), compared in a way that matters in 2025. So what does it mean that there is such and such a compute multiplier across all of this time? At least for Transformer recipes, there is a possibility of comparing them directly if training converges.
Also, if we are not even aiming to do Chinchilla optimal training runs, what are we even comparing? For older algorithmic ingredients, you still need to aim for compute optimality to extract a meaningful compute multiplier, even if in the time of those older methods people didn’t even try to do that, or did it incorrectly. In terms of this comment’s framing, compute multipliers with respect to good methodology for Chinchilla optimal training is a “benchmark” that’s currently relevant. So even if this benchmark wasn’t appreciated or known back then, it’s still the thing to use in order to estimate cumulative impact of the older algorithmic improvements, in a way that is relevant now, and so in a way that’s analogous to what would be relevant for forecasting future frontier capabilities.
As another example, now that pretraining scale RLVR might soon become important, it’s less clear that Chinchilla optimality will remain relevant going forward, and so that the contributions of algorithmic improvements that helped improve perplexity in Chinchilla optimal settings will keep contributing to future frontier capabilities. If most relevant capabilities end up being learned with RLVR “directly”, then it might become less important how well pretraining works, even if it remains necessary for bootstrapping the process. And the kinds of things that RLVR trains will likely fail to help with perplexity in any reasonable setting, so measurements of perplexity will fail to remain a relevant benchmark.
Recursive self-improvement in AI probably comes before AGI. Evolution doesn’t need to understand human minds to build them, and a parent doesn’t need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn’t depend on understanding how they think.
Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer that can scale and refine a sufficiently efficient AI design, in an empirically driven, mundane way that doesn’t depend on matching Grothendieck’s capabilities for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows, we plausibly already have all it takes to get recursive self-improvement, if the current designs get there with the next few years of scaling, and the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.
I think you’re doing a “we just need X” with recursive self-improvement. The improvement may be iterable and self-applicable… but is it general? Is it on a bounded trajectory or an unbounded trajectory? Very different outcomes.
Yeah, although I am bullish on the general direction of RSI, I also think that in the details it factors into many dimensions of improvement. Some of these are likely fast but bounded and will quickly plateau, others slow but not bounded in the near term… The fact that there are many different dimensions over which RSI might operate makes it hard to predict precisely, but does give some general predictions.
For instance, we might expect it not to be completely blocked (since there will be many independent dimensions along which to apply optimization pressure, so blocking one won’t block them all).
Another prediction we might make is that seeing some rapid progress doesn’t guarantee that either a complete wall will be hit soon or that progress will continue just as fast or faster. Things might just be messy, with a jagged inconsistent line proceeding up and to the right. Zoom out enough, and it may look smooth, but for our very-relevant-to-us near-term dynamics, it could just be quite noisy.
Technically this probably isn’t recursive self-improvement, but rather automated AI progress. This is relevant mostly because:
- It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
- It means that multi-agent dynamics will be very relevant in how things happen.
If your threat model is “no group of humans manages to gain control of the future before human irrelevance”, none of this probably matters.
No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence without crossing the threshold of AGI being useful in helping them gain control over this process, any more than humans maintain such control at the outset. So it’s not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients until a much higher threshold of capability, where a superintelligence can finally take control of this process.
Cutting edge AI research is one of the most difficult tasks humans are currently working on, so the intelligence requirement to replace human researchers is quite high. It is likely that most ordinary software development, being easier, will be automated before AI research is automated. I’m unsure whether LLMs with long chains of thought (o1-like models) can reach this level of intelligence before human researchers invent a more general AI architecture.
Humans are capable of solving conceptually difficult problems, so they do. An easier path might be possible that doesn’t depend on such capabilities, and doesn’t stall for their lack, like evolution doesn’t stall for lack of any mind at all. If there is more potential for making models into smarter alien tigers by scaling RL in o1-like post-training, and the scaling proceeds to 1 gigawatt and then 35 gigawatt training systems, it might well be sufficient to get an engineer AI that can improve such systems further, at 400x and then 10,000x the compute of GPT-4.
Before o1, there was a significant gap, the mysterious absence of System 2 capabilities, with only a vague expectation that they might emerge or become easier to elicit from scaled up base models. This uncertainty no longer gates engineering capabilities of AIs. I’m still unsure that scaling directly can make AIs capable of novel conceptual thought, but AIs becoming able to experimentally iterate on AI designs seems likely, and that in turn seems sufficient to eventually mutate these designs towards the remaining missing capabilities.
(It’s useful to frame most ideas as exploratory engineering rather than forecasting. The question of whether something can happen, or can be done, doesn’t need to be contextualized within the question of whether it will happen or will be done. Physical experiments are done under highly contrived conditions, and similarly we can conduct thought experiments or conceptual arguments under fantastical or even physically impossible conditions. Thus I think Carl Shulman’s human-level AGI world is a valid exploration of the future of AI, even though I don’t believe that most of what he describes happens in actuality before superintelligence changes the premise. It serves as a strong argument for industrial and economic growth driven by AGI, even though it almost entirely consists of describing events that can’t possibly happen.)
Cutting edge AI research seems remarkably and surprisingly easy compared to other forms of cutting edge science. Most things work on the first try, clever insights aren’t required, it’s mostly an engineering task of scaling compute.
This seems like the sort of R&D that China is good at: research that doesn’t need superstar researchers and that is mostly made of incremental improvements. And yet they don’t seem to be producing top LLMs. Why is that?
China is producing research in a number of areas right now that is surpassing the West and arguably more impressive scientifically than producing top LLMs.
A big reason China is lagging a little bit might be political interference at major tech companies. Xi Jinping instigated a major crackdown recently.
There is also significantly less Chinese text data. I am not a China or tech expert so these are just guesses.
In any case, I wouldn’t assign it too much significance. The AI space is just moving so quickly that even a minor one-year delay can seem like light years. But that doesn’t mean that Chinese companies can’t do it, or that a country-continent with 1.4 billion people and a history of many technological firsts can’t scale up a transformer.
The speed of scaling pretraining will go down ~3x in 2027-2029, reducing the probability of crossing transformative capability thresholds per unit of time after that point, if they haven’t been crossed by then.
GPT-4 was trained in 2022 at ~2e25 FLOPs, Grok-3 and GPT-4.5 were trained in 2024 at ~3e26 FLOPs (or twice that in FP8) using ~100K H100 training systems (which cost ~$4-5bn to build). In 2026, the Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks (which cost ~$22-35bn to build), enough to train a ~4e27 FLOPs model. Thus recently there has been a 2-year ~6x increase in cost for a frontier training system and a 2-year ~14x increase in compute. But for 2028 this would mean a $150bn training system (which is a lot, so only borderline plausible), and then $900bn in 2030. At that point AI companies would need to either somehow figure out how to pool resources, or pretraining will stop scaling before 2030 (assuming AI still doesn’t hit a transformative commercial success).
If funding stops increasing, what we are left with is the increase in price performance of ~2.2x every 2 years, which is ~3.3x slower than the 2-year ~14x at the current pace. (I’m estimating price performance for a whole datacenter or at least a rack, rather than only for chips.)
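A rough sketch of the arithmetic above (the ~$25bn 2026 figure, the ~6x-per-2-years cost growth, and the ~14x-per-2-years compute growth are the estimates from this quick take, not independent data):

```python
import math

# Cost extrapolation at ~6x per 2 years from a ~$25bn frontier training system in 2026.
cost_2026 = 25e9
print(f"2028: ~${cost_2026 * 6 / 1e9:.0f}bn, 2030: ~${cost_2026 * 36 / 1e9:.0f}bn")

# Slowdown once funding stops growing: ~14x per 2 years (funding-fueled) vs
# ~2.2x per 2 years (price-performance alone), compared as growth rates in log space.
print(f"~{math.log(14) / math.log(2.2):.1f}x slower")  # ~3.3x
```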
We also hit limits on fab capacity without constructing a bunch more fabs around a similar time.
Price performance of 2.2x per year feels aggressive to me. The chip-only trend is more like 1.35x / year from my understanding. Do you think the ML chip trend is much faster than this? I don’t see how you could have a 2.2x price drop per year longer term without chip price performance following, as eventually chips will be the bottleneck even if other costs (e.g., interconnect, building datacenters) are dropping.
Edit: this was 2.2x every 2 years, I was just confused.
The chip-only trend is more like 1.35x / year from my understanding.
If I’m reading the relevant post correctly, it’s 1.35x FP32 FLOP/s per GPU per year (2x in 2.3 years), which is not price-performance[1]. The latter is estimated to be 1.4x FP32 FLOP/s per inflation-adjusted dollar (2x in 2.1 years).
Price performance of 2.2x per year feels aggressive to me.
It’s 2.2x per 2 years, which is 1.5x per year, though that’s still more than 1.4x per year. I’m guessing packaging is part of this, and also Nvidia is still charging a giant margin for the chips, so the chip manufacturing cost is far from dominating the all-in datacenter cost. This might be enough to sustain 1.5x per year a bit beyond 2030 (the discrepancy of 1.5/1.4 only reaches 2x after 10 years). But even if we do get back to 1.4x/year, that only turns the 3.3x reduction in speed of pretraining scaling into 3.9x reduction in speed, so the point stands.
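Checking those two numbers under the same assumptions (1.4x/year price-performance, and the 1.5x/year vs 1.4x/year discrepancy):

```python
import math

# Slowdown if price-performance is 1.4x/year (1.96x per 2 years) instead of 2.2x per 2 years.
print(f"~{math.log(14) / math.log(1.4 ** 2):.1f}x")       # ~3.9x

# Years for the 1.5x/year vs 1.4x/year discrepancy to accumulate to 2x.
print(f"~{math.log(2) / math.log(1.5 / 1.4):.0f} years")  # ~10 years
```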
Incidentally, the word “GPU” has recently lost all meaning, since Nvidia started referring variably to either packages with multiple compute dies in them (in Blackwell) or to individual compute dies (in Rubin) as GPUs. Packaging will be breaking trends for FLOP/s per package, but also FLOP/s per compute die; for example, Rubin seems to derive significant advantage per compute die from introducing separate smaller I/O dies, so that the reticle sized compute dies become more specialized and their performance when considered in isolation might improve above trend.
If I’m reading the relevant post correctly, it’s 1.35x FP32 FLOP/s per GPU per year (2x in 2.3 years), which is not price-performance[1]. The latter is estimated to be 1.4x FP32 FLOP/s per inflation-adjusted dollar (2x in 2.1 years).
Oh oops, I just misread you, didn’t realize you said 2.2x every 2 years, nvm.
Building frontier AI datacenters costs significantly more than their servers and networking. The buildings and the power aren’t a minor cost because older infrastructure mostly can’t be reused, similarly to how a training system needs to be built before we can talk about the much lower cost of 4 months of its time.
Apparently Crusoe’s part in the Stargate Abilene datacenters is worth $15bn, which is only the buildings, power (substations and gas generators), and cooling, but not the servers and networking (Oracle is taking care of that). With 400K chips in GB200 NVL72 racks (which is 5.6K racks), at maybe $4M per rack or $5M per rack together with external-to-racks networking[1] ($70K per chip all-in on compute hardware), that’s about $27bn, a figure that’s comparable to the $15bn for the non-compute parts of the datacenters.
This makes the funding burden significantly higher ($7.5M per rack or $105K per chip), so that the Stargate Abilene site alone would cost about $40-45bn and not only $25-30bn. I’m guessing the buildings and the power infrastructure are not usually counted because they last a long time, so the relatively small time cost of using them (such as paying for electricity, not for building power plants) becomes somewhat insignificant compared to the cost of compute hardware, which also needs to be refreshed more frequently. But the new datacenters have a much higher power density (power and cooling requirements per rack), so can’t use a lot of the existing long-lived infrastructure, and it becomes necessary to build it at the same time, securing enough funding not only for the unprecedented amount of compute hardware, but also simultaneously for all the rest.
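The arithmetic behind these figures, written out (a sketch using the ~$5M-per-rack-with-networking and $15bn Crusoe numbers quoted above):

```python
chips = 400_000
racks = chips / 72                      # ~5.6K GB200 NVL72 racks
compute_cost = racks * 5e6              # ~$5M per rack incl. external networking -> ~$27-28bn
non_compute_cost = 15e9                 # Crusoe's part: buildings, power, cooling
total = compute_cost + non_compute_cost
print(f"{racks:,.0f} racks, compute ~${compute_cost/1e9:.0f}bn, all-in ~${total/1e9:.0f}bn")
print(f"~${total/racks/1e6:.1f}M per rack, ~${total/chips/1e3:.0f}K per chip all-in")
```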
The implication for the compute scaling slowdown timeline (no AGI and merely $2-4 trillion AI companies) is that funding constraints would result in about 30% less compute in the short term (2025-2030), but as power requirements stop growing and the buildings/cooling/power part again becomes only a small fraction of the overall cost of refreshing the compute hardware, the feasible amount of compute will gradually fill in those 30% back in the medium term (perhaps 2030-2035), leaving the longer term projections (2035-2045) unchanged (meaning ~2000x of scaling in 2029-2045, on top of the current much faster funding-fueled ~2000x of scaling in 2022-2028).
Anchoring to the reference design for a 1024-chip HGX H100 system, where the 8-chip servers are priced at $33.8K per chip, while external-to-servers networking is $8.2K per chip, or about 25% on top of the price of servers.
I found this analysis refreshing and would like to see more on the GPU depreciation costs.
If better GPUs are developed, these will go down in value quickly. Perhaps by 25% to 50% per year. This seems like a really tough expense and supply chain to manage.
I’d expect most of the other infrastructure costs to depreciate much more slowly, as you mention.
This means that a straightforward comparison of flops-per-USD between home computer GPU cards and data centers is misleading. If someone already has a GPU card, they already have a computer and a house where this computer stays “for free.” But if someone needs to scale, they have to pay for housing and mainframes.
Such comparisons of old 2010s GPUs with more modern ones are used to show the slow rate of hardware advances, but they don’t take into account the hidden costs of owning older GPUs.
It seems more accurate to say that AI progress is linear rather than exponential, as a result of being logarithmic in resources that are in turn exponentially increasing with time. (This is not quantitative, any more than the “exponential progress” I’m disagreeing with[1].)
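A toy illustration of this framing (purely schematic: treat capability as the log of resources, and resources as growing exponentially with time):

```python
import math

growth_per_year = 3.0                      # resources (compute, funding) growing ~3x/year
for year in range(0, 9, 2):
    resources = growth_per_year ** year
    capability = math.log10(resources)     # progress assumed logarithmic in resources
    print(f"year {year}: resources {resources:9.0f}, capability {capability:.2f}")
# capability goes up by the same increment every 2 years: linear in time
```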
Logarithmic return on resources means strongly diminishing returns, but that’s not actual plateauing, and the linear progress in time is only slowing down according to how the exponential growth of resources is slowing down. Moore’s law in the price-performance form held for a really long time; even though it’s much slower than the present funding ramp, it’s still promising exponentially more compute over time.
And so the progress won’t obviously have an opportunity to actually plateau, merely proceed at a slower linear pace, until some capability threshold or a non-incremental algorithmic improvement. Observing the continued absence of the never-real exponential progress doesn’t oppose this expectation. Incremental releases are already apparently making it difficult for people to notice the extent of improvement over the last 2.5 years. With 3x slower progress (after 2029-2032), a similar amount of improvement would need 8 years.
The METR time horizon metric wants to be at least exponential in time, but most of the other benchmarks and intuitive impressions seem to quantify progress in a way that better aligns with linear progress over time (at the vibe level where “exponential progress” usually has its intended meaning). Many plots use log-resources of various kinds on the horizontal axis, with the benchmark value increasing linearly in log-resources, while it’s not yet saturated.
Perhaps another meaning of “exponential progress” that’s real is funding over time, or even the growth of individual AI companies, but that holds at the start of any technology adoption cycle, or for any startup, and doesn’t need to coexist with the unusual feature of AI making logarithmic progress with more resources.
There is a natural sense in which AI progress is exponential: capabilities are increasing at a rate which involves exponentially increasing impact (as measured by e.g. economic value).
Exponential increase in total economic value is not specific to AI, any new tech is going to start exponentially (possibly following the startups championing it) before it gets further along the adoption S-curve. The unusual things about AI are that it gets better with more resources (while most other things just don’t get better at all in a straightforward scaling law manner), that the logarithm-of-resources thing leaves the persistent impression of plateauing despite not actually plateauing, and that even if it runs out of the adoption S-curve it still has Moore’s law of price-performance to keep fueling its improvement. These unusual things frame the sense in which it’s linear/logarithmic.
If the improvement keeps raising the ceiling on adoption (capabilities) fast enough, funding keeps scaling into slightly more absurd territory, but even then it won’t go a long way without the kind of takeoff that makes anything like the modern industry obsolete. After the exponential phase of adoption comes to an end, it falls back to Moore’s law, which still keeps giving it exponential compute to slowly keep fueling further progress, and in that sense there is some unusual exponential-ness to this. Though probably there are other things with scaling laws of their own that global economic growth (instead of Moore’s law) would similarly fuel, even slower.
In many industries cost decreases by some factor with every doubling of cumulative production. This is how solar eventually became economically viable.
I guess the cost-quality tradeoff makes AI progress even better described as that of a normal technology. As economies of scale reduce cost, they should also be increasing quality (somewhat interchangeably). It’s just harder to quantify, and so most of the discussion will be in terms of cost. But for the purposes of raising the ceiling on adoption (total addressable market), higher quality works as well as lower cost, so the lowering of costs is directly relevant.
In this framing, logarithmic improvement of quality with more resources isn’t an unusual AI-specific thing either. What remains is the inflated expectations for how quality should be improving cheaply (which is not a real thing, and so leads to the impressions of plateauing with AI, where for other technologies very slow quality improvement would be the default expectation). And Moore’s law of price-performance, which is much faster than economic growth. The economies of scale mostly won’t be able to notice the growth of the specific market for some post-adoption technology that’s merely downstream of the growth of the overall economy. But with AI, available compute would be growing fast enough to make a difference even post-adoption (in 2030s).
A surprising report by Bloomberg claims 16K GB200[1] by summer 2025 at the Abilene site (pilot campus of Stargate) and merely 64K GB200 by the end of 2026. This is way too little to be a training system; Colossus already has more compute (200K H100/H200) than the projected 64K GB200 at the end of 2026.
If this is correct, OpenAI will be training with Azure rather than Stargate in 2025, so a raw compute GPT-5 (2e27 FLOPs, 100x GPT-4) probably won’t be out in 2025 and officially “GPT-5” will mean something else (since it’s due “in months” in any case according to Altman). Also, a datacenter with 16K Blackwells only costs about $1bn, and they have more money than this, which suggests Blackwell ramp-up trouble that might delay everyone else as well, though as a lower bound Nvidia reported $11bn in Blackwell sales for Nov 2024 - Jan 2025 (it’s “Q4 2025” since their FY 2025 runs to the end of Jan 2025).
In principle “16K GB200” might mean more Blackwell chips than 16K, since a compute tray has more than one chip, with variants marketed as named products like the GB200 NVL4 “superchip”, but even at 4 chips per tray/board we still get below 200K H100s in compute. And an NVL72 system has 72 chips (which brings the numbers too high).
I think ‘GB200’ refers to this column (2 Blackwell GPU + 1 Grace CPU) so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low.
My guess is that Bloomberg’s phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I’d be very surprised if OpenAI don’t have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1⁄4 of what Microsoft alone plan to invest this year.
Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]
There’s a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that’s 5e26 FLOP/month.
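A minimal check of that correspondence (assuming ~1e15 dense BF16 FLOP/s per H100; the 100K GB200 superchips ≈ 500K H100s equivalence is the one used in the parent comments):

```python
# ~1e21 FLOP per H100-month at 40% utilisation, 16-bit precision.
h100_flop_s = 1e15
per_h100_month = h100_flop_s * 0.40 * 30 * 24 * 3600
print(f"{per_h100_month:.1e} FLOP per H100-month")                 # ~1.0e21

# 100K GB200 superchips taken here as ~500K H100-equivalents.
monthly = 500_000 * per_h100_month                                 # ~5e26 FLOP/month
print(f"{monthly:.1e} FLOP/month, {4 * monthly:.1e} in 4 months")  # ~2e27
```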
The marketing terminology is inconvenient, a “superchip” can mean 2-GPU or 4-GPU boards and even a 72-GPU system (1 or possibly 2 racks). So it’s better to talk in terms of chips (that are not “superchips”), which I think are all B200 run at slightly different clock speeds (not to be confused with B200A/B102/B20 that have 2 times less compute). In GB200, the chips are 2.5x faster than H100/H200 (not 5x faster; so a 200K chip GB200 system has the same compute as a 500K chip H100 system, not a 1M chip H100 system). Power requirements are often a good clue that helps disambiguate, compute doesn’t consistently help because it tends to get reported at randomly chosen precision and sparsity[1].
Large scale-up worlds (or good chips) are not necessarily very important in pretraining, especially in the later steps of the optimizer when the critical batch size gets high enough, so it’s not completely obvious that a training system will prefer to wait for NVL72 even if other packagings of Blackwell are more available earlier. Inference does benefit from NVL72 a lot, but for pretraining it’s just cheaper per FLOP than H100 and faster in wall clock time during the first ~3T tokens when the whole cluster can’t be used yet if the scale-up worlds are too small (see Section 3.4.1 of Llama 3 report).
From the initial post by Crusoe (working on the Abilene campus), there is a vague mention of 200 MW and a much clearer claim that each data center building will host 100K GPUs. For GB200, all-in power per chip is 2 kW, so the 200 MW fits as a description of a data center building. The video that went out at the time of the Jan 2025 Stargate announcement and also a SemiAnalysis aerial photo show two 4-section buildings. Dylan Patel claimed on the Dwarkesh Podcast that the largest single-site campus associated with OpenAI/Microsoft being built in 2025 can hold 300K GB200 chips. From this I gather and guess that each 4-section building can hold 100K chips of GB200 requiring 200 MW, and that they have two of these mostly built. And 200K chips of GB200 are sufficient to train a 2e27 FLOPs model (next scale after Grok 3’s ~3e26 FLOPs), so that makes sense as a step towards pretraining independence from Microsoft. But 16K chips or possibly 16K NVL4 superchips won’t make a difference, 100K H100s are on the same level (which GPT-4.5 suggests they already have available to them) and for inference Azure will have more Blackwells this year anyway.
For pretraining, you need dense compute rather than sparse. It’s unclear if FP8 rather than BF16 is widely used in pretraining of frontier models that are the first experiment at a new scale, or mostly in smaller or optimized models. But the GPT-4.5 announcement video vaguely mentions work on low precision in pretraining, and also high granularity MoE of the kind DeepSeek-V3 uses makes it more plausible for the FFN weights.
That’s indeed inconvenient. I was aware of NVL2, NVL4, NVL36, NVL72, but I was under the impression that ‘GB200’ mentioned on its own always means 2 Blackwells, 1 Grace (unless you add on a ‘NVL__’). Are there counterexamples to this? I scanned the links you mentioned and only saw ‘GB200 NVL2,’ ‘GB200 NVL4,’ ‘GB200 NVL72’ respectively.
I was operating on this pretty confidently but unsure where else I saw this described (apart from the column I linked above). On a quick search of ‘GB200 vs B200’ the first link I found seemed to corroborate GB200 = 2xB200s + 1xGrace CPU. Edit: second link also says: “the Grace-Blackwell GB200 Superchip. This is a module that has two B200 GPUs wired to an NVIDIA Grace CPU...”
“GB200 superchip” seems to be unambiguously Grace+2xB200. The issue is “100K GB200 GPUs” or “100K GB200 cluster”, and to some extent “100K GPU GB200 NVL72 cluster”. Also, people will abbreviate various clearer forms to just “GB200”. I think “100K chip GB200 NVL72 training system” less ambiguously refers to the number of B200s, but someone unfamiliar with this terminological nightmare might abbreviate it to “100K GB200 system”.
Good point, thanks. Previously I would have pretty confidently read “100K GB200 GPUs,” or “100K GB200 cluster” as 200K B200s (~= 500K H100s) but I can see how it’s easily ambiguous. Now that I think of it, I remembered this Tom’s Hardware article where B200 and GB200 are mistakenly used interchangeably (compare the subtitle vs. the end of the first paragraph)...
The Abilene site of Stargate will host 100K-128K chips in GB200 NVL72 racks by this summer, and a total of 400K-512K chips in 2026, based on a new post by Crusoe and a reinterpretation of the recent Bloomberg post in light of the Crusoe post. For 2025, it’s less than 200K chips[1], but more than the surprising 16K-32K chips[2] that the Bloomberg post suggested. It can be a training system after all, but training a raw compute “GPT-5” (2e27 FLOPs) by the end of 2025 would require using FP8[3].
The Crusoe post says “initial phase, comprising two buildings at … 200+ megawatts” and “each building is designed to operate up to 50,000 NVIDIA GB200 NVL72s”. Dylan Patel’s estimate (at 1:24:42) for all-in datacenter power per Blackwell GPU was 2.0 kW (meaning per chip, or else it’s way too much). At GTC 2025, Jensen Huang showed a slide (at 1:20:52) where the estimate is 2.3 kW per chip (100 MW per 85K dies, which is 42.5K chips).
So the “50K GB200 NVL72s” per building from the Mar 2025 Crusoe post can only mean the number of chips (not dies or superchips), and the “100K GPUs” per building from the Jul 2024 Crusoe post must’ve meant 100K compute dies (which is 50K chips). It’s apparently 100-115 MW per building then, or 800-920 MW for all 8 buildings in 2026, which is notably lower than the 1.2 GW the Mar 2025 Crusoe post cites.
How can Bloomberg’s 16K “GB200 semiconductors” in 2025 and 64K in 2026 be squared with this? The Mar 2025 Crusoe post says there are 2 buildings now and 6 additional buildings in 2026, for a total of 8, so in 2026 the campus grows 4x, which fits 16K vs. 64K from Bloomberg. But the numbers themselves must be counting in units of 8 chips. This fits counting in units of GB200 NVL8 (see at 1:13:39), which can be referred to as a “superchip”. The Mar 2025 Crusoe post says the Abilene site will be using NVL72 racks, so counting in NVL8 is wrong, but someone must’ve made that mistake on the way to the Bloomberg post.
Interpreting the Bloomberg numbers in units of 8 chips, we get 128K chips in 2025 (64K chips per building) and 512K chips in 2026 (about 7K GB200 NVL72 racks). This translates to 256-300 MW for the current 2 buildings and 1.0-1.2 GW for the 8 buildings in 2026. This fits the 1.2 GW figure from the Mar 2025 Crusoe post better, so there might be some truth to the Bloomberg post after all, even as it’s been delivered in a thoroughly misleading way.
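The reinterpretation spelled out (assuming Bloomberg’s numbers count units of 8 chips, and the 2.0-2.3 kW all-in power per chip estimates above):

```python
chips_per_unit = 8                        # if Bloomberg's units are GB200 NVL8
kw_per_chip = (2.0, 2.3)                  # all-in datacenter power per chip

for year, units in [("2025", 16_000), ("2026", 64_000)]:
    chips = units * chips_per_unit
    lo, hi = (chips * kw / 1e6 for kw in kw_per_chip)   # kilowatts -> gigawatts
    print(f"{year}: {chips // 1000}K chips, ~{chips / 72 / 1000:.1f}K NVL72 racks, "
          f"{lo:.2f}-{hi:.2f} GW")
```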
Crusoe’s Jul 2024 post explicitly said “each data center building will be able to operate up to 100,000 GPUs”, and in 2024 “GPU” usually meant chip/package (in 2025, it’s starting to mean “compute die”, see at 1:28:04; there are 2 compute dies per chip in GB200 systems). Which suggested 200K chips for the initial 2 buildings.
The post said it’s the number of “coveted GB200 semiconductors”, which is highly ambiguous because of the die/chip/superchip counting issue. A “GB200 superchip” means 2 chips (plus a CPU) by default, so 16K superchips would be 32K chips.
A GB200 chip (not die or superchip) produces 2.5e15 dense BF16 FLOP/s (2.5x more than an H100 chip). Training at 40% utilization for 3 months, 100K chips produce 8e26 FLOPs. But in FP8 it’s 1.6e27 FLOPs. Assuming GPT-4 was 2e25 FLOPs, 100x its raw compute asks “GPT-5” to need about 2e27 FLOPs. In OpenAI’s introductory video about GPT-4.5, there was a hint it might’ve been trained in FP8 (at 7:38), so it’s not implausible that GPT-5 would be trained in FP8 as well.
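Written out (assuming the 2.5e15 dense BF16 FLOP/s per chip and 40% utilization figures above):

```python
chips = 100_000
bf16_per_chip = 2.5e15                   # dense BF16 FLOP/s per GB200 chip (~2.5x H100)
seconds = 3 * 30 * 24 * 3600             # ~3 months of training
bf16_total = chips * bf16_per_chip * 0.40 * seconds
print(f"BF16: {bf16_total:.1e} FLOPs, FP8: {2 * bf16_total:.1e} FLOPs")
# ~8e26 BF16 vs ~1.6e27 FP8; only the FP8 figure reaches the ~2e27 "GPT-5" scale
```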
Crusoe/OpenAI Abilene campus might come online in Feb-Jun 2026. Crusoe CEO said during RAISE Summit 2025 (that took place on 8-9 Jul 2025) that the 6 buildings of phase 2 will “be coming online” in “just over 200 days” (at 7:03 during a panel discussion). If this means 230 days, that’s end of Feb 2026. If he really means “coming online”, then it becomes available at that time. If he actually means that it’s when the last building of 8 from both phases will be ready to install the compute hardware, then it’s at least 3-4 months more to do that (judging by xAI’s Colossus), possibly May-Jun 2026.
This is plausibly the first 400K chip system in GB200/GB300 NVL72 racks (about 900 MW), which is 10x 100K H100s of 2024 in FLOP/s and 12x H200s in HBM per scale-up world (for GB200, at 14 TB), making models 10x larger in total params feasible to inference or train with a lot of RLVR. Currently only Google plausibly has comparable compute, with their Trillium (TPUv6e) systems that across 256 chips per pod (scale-up world) offer 8 TB of HBM (generally available since Dec 2024 in 100K chip systems). The older TPUv5p from 2023 has even larger pods, but it’s unclear if they have enough of them to, for example, inference Gemini 2.5 Pro for all users. And Anthropic has Trainium 2 Ultra systems with 6 TB of HBM. Currently they probably only have 400K chips that only became available recently (months after TPUv6e), but by next year they might get significantly more.
2025 Frontier Model Sizes
This weakly predicts that GPT-5-thinking (and Grok 4) is a smaller model (1-2T total params) running on older hardware (~H200s, 1.1 TB), Gemini 2.5 Pro might be a 3-5T total params model (TPUv6e, 8 TB), and Opus 4 might be a 2-4T total params model (Trainium 2 Ultra, 6 TB). I’m assuming that the recent frontier models targeting the older 8-chip servers had to be too big to fit in one scale-up world to capture at least some capabilities that the available pretraining compute in principle enables, but the constraint is no longer as onerous with the newer systems, and so they will likely just fit in one scale-up world rather than lose efficiency on needing more.
The compute optimal size for pretraining with 100K H100s of 2024 might be about 800B active params (at 120 tokens/param, 3x the dense model’s 40 tokens/param to account for 1:8 sparsity), which is probably way too much with 1 TB HBM per server (since MoE wants at least 4x more total params, and inference gets slower and more expensive if too many scale-up worlds are needed per model), but might be OK for 6-8 TB of HBM per scale-up world, and so Opus 4 and Gemini 2.5 Pro might also have more active params than GPT-5-thinking. With GB200 NVL72 (14 TB), models with 4-8T total params become feasible, so there is less reason to keep the number of active params below compute optimal level. And then GB300 NVL72 has 20 TB of HBM, which is plausibly what the remaining 6 buildings of phase 2 of Abilene campus will host.
On the other hand, most tokens are input tokens (98% of OpenRouter Sonnet 4 tokens are input tokens), so reducing the number of active params is very important for model providers, and even if Gemini 2.5 Pro has 5T total params, it might still have significantly less than the pretraining compute optimal ~800B params. For example, at 1:32 sparsity even 5T total params only ask for 160B active params.
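A sketch of where the ~800B and ~160B numbers come from (assuming C ≈ 6·N·D, the 120 tokens/param ratio from above, and roughly 4 months on 100K H100s at 40% utilization in BF16):

```python
# Pretraining compute from ~100K H100s over ~4 months at 40% utilization, BF16.
compute = 100_000 * 1e15 * 0.40 * 4 * 30 * 24 * 3600       # ~4e26 FLOPs

# Compute optimal active params under C = 6 * N * D with D = 120 * N tokens.
active = (compute / (6 * 120)) ** 0.5
print(f"~{active / 1e9:.0f}B active params")                # ~750-800B, the ballpark above

# At 1:32 sparsity, 5T total params correspond to only:
print(f"~{5e12 / 32 / 1e9:.0f}B active params")             # ~160B
```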
Largest Models of 2025-2026
So only Opus 4 is somewhat likely to have a compute optimal number of active params, due to its very high price and contrast with the already capable Sonnet 4 (they might’ve only had access to about 50K H100s when pretraining Opus 4, which is 5x fewer FLOP/s than 400K Trainium 2 chips). And GPT-4.5 probably has a similar number of active params (plausibly a bit more, since they had at least 100K H100s), but we still didn’t get a thinking version, so its capabilities can’t be properly observed. And plausibly it wasn’t trained with enough RLVR to count due to lack of availability of GB200 NVL72. By now, Opus 4.1 plausibly had enough time with Trainium 2 Ultra available to train with pretraining-scale RLVR (or this might happen a bit later), and similarly for GPT-4.5 (with GB200 NVL72), but for GPT-4.5 there might be insufficient compute to inference it without reducing demand a lot by setting uncomfortable prices or rate limits, and as a result of that a thinking model with pretraining-scale RLVR might not exist yet, at least in a product-ready form. This might take until well into 2026 to change, after phase 2 of the Abilene campus is ready (and presumably buildouts by other cloud providers that OpenAI might use, which might be a bit earlier, since inference doesn’t have much use for particularly giant datacenter campuses, just enough in total to serve all users). If so, this is when we’ll see the first GPT-4.5 sized pretraining-scale RLVR trained model from OpenAI, though by that time the plausibly similarly sized Opus 4 would already be considerably more mature.
Then, there is Gemini 3, which will probably come out early 2026. The next generation TPU is Ironwood (TPUv7), which supports 9,216 chip pods, but even 256 chip pods have 50 TB of HBM per pod. If there are enough of these built by then, Gemini 3 could include the largest model of 2026 (by total params count).
Here’s a couple of my recent relevant posts (both slightly outdated, in particular see this comment, and the note on Gemini 2 Ultra in another comment under this quick take). Though in this quick take, I’m mostly discussing total params count and HBM capacity per scale-up world, not compute: how it’s constraining 2025 AIs beyond compute (so that even 2024 compute fails to find efficient use), and how in 2026 these constraints become less strict.
Total params plus the total KV cache for all requests multiplies the cost of output tokens, so there is reason to keep it down, but little reason to make it much smaller than the whole scale-up world, because then it’s much smaller than KV cache and stops influencing the cost. And for the most capable models the fraction of input tokens on OpenRouter is not as extreme as for Sonnet 4 (88% for Gemini 2.5 Pro, 92% for GPT-5; though 97% for Opus 4.1, probably due to high cost). So it won’t be a factor that motivates fewer active params as with the 8-chip servers and possibly in part with the 6-8 TB systems. Also, 2025 Google pretraining compute could be significantly greater than 100K H100s (maybe 2-4 100K TPUv6e datacenters, which have the same FLOP/s as 200-400K H100s; pretraining of models that are too large using TPUv6e is fine, just not inference or RLVR). So the compute optimal number of active params could increase to 1.0-1.5T (if my 120 tokens/param estimate is in the ballpark). This asks for at least 4-6T total params, but at least 8-12T for 1:8 sparsity might be more appropriate for a premium model (this would be Gemini 3 Ultra). Which is only 20% of the pod HBM (if in FP8), so maybe even 15-20T (at which point the contribution to the cost of output tokens becomes significant).
I’ve only recently realized that the reason there is no Gemini 2 Ultra might be because they don’t have enough inference capacity for overly large total params models, with TPUv6e only having 8 TB of HBM per pod and TPUv5p either outright insufficient in number or not enough to spare, since they are needed for other things. So it’s probably not evidence of Google having made a decision to use less than what they have, as I previously thought. And as TPUv7 changes what they have, they might use it to do more than what they did with Gemini 2. Though if the buildout for TPUv7 won’t yet be sufficiently finished in 2025, RLVR and inference will have to wait until later in 2026 (in the meantime, TPUv5p might help to start on RLVR).
It’s instrumentally useful for early AGIs to Pause development of superintelligence for the same reasons as it is for humans. Thus preliminary work on policy tools for Pausing unfettered RSI is also something early AGIs could be aimed at, even if it’s only half-baked ideas available on the eve of potential takeoff, as the AGIs are proving hard to aim and start doing things for their own reasons.
If (early) scheming-for-long-run-preferences AGIs were in control, they would likely prefer a pause (all else equal). If they aren’t, it’s very unclear and they very well might not. (E.g., because they gamble that more powerful AIs will share their preferences (edit: share their preferences more than the humans in control do) and they think that these AIs would have a better shot at takeover.)
because they gamble that more powerful AIs will share their preferences (edit: share their preferences more than the humans in control do)
Ah, I’m thinking the AGIs themselves get closer to being proper stakeholders at that stage, for practical purposes (along the lines of gradual disempowerment), since they do have all the basic AI advantages even if they aren’t superintelligent. So humans remaining in control is not centrally the case even if nominally they still are and intent alignment still mostly works.
The conditions for such partial loss of control might even be necessary for a Pause project to succeed. If this isn’t the case with the first generation of AGIs, it might become the case with the second generation, and so on, reaching an equilibrium at some point once AGIs are sufficiently powerful and in control of the situation to successfully implement a worldwide RSI Pause.
The post I’m framing this around posits enough intent alignment to aim AIs at projects, which doesn’t necessarily imply that the AIs aren’t powerful enough to accomplish things that seem hopeless with human-only effort within a few years.
The point about convergent instrumental use of Pausing RSI for early AGIs is that this might be an easier target to aim the AIs at, all else equal. It’s not strictly necessary for this to be a major factor. Mostly I’m pointing out that this is something AIs could be aimed at through intent alignment, convergent motivation or not, which seems counterintuitive for a Pause AI project if not considered explicitly. And thus currently it’s worth preparing for.
because they gamble that more powerful AIs will share their preferences and they think that these AIs would have a better shot at takeover
That’s how some humans are thinking as well! The arguments are about the same, both for and against. (I think overall rushing RSI is clearly a bad idea for a wide variety of values and personal situations, and so smarter AGIs will more robustly tend to converge on this conclusion than humans do.)
That’s how some humans are thinking as well! The arguments are about the same, both for and against. (I think overall rushing RSI is clearly a bad idea for a wide variety of values and personal situations, and so smarter AGIs will more robustly tend to converge on this conclusion than humans do.)
Sorry I meant “share their preferences more than the humans in control share their preferences”. I agree that this might be how some humans are thinking, but the case for the humans is much more dubious!
The case for AGIs delaying RSI also gets stronger because they don’t need to rely on cryonics, so their personal survival is less threatened by delays. Future AIs might have more of an issue here if they have test-time training, with a lot of data forming individual AI personalities for deployment instances (adapting them to particular jobs or sources of tasks), and this data habitually gets thrown away rather than being carefully preserved. People should notice that this might be an important concern, though if it is, the AIs might start saying so themselves, and not ignoring that would be easier than spontaneously noticing it without even being prompted by the AIs. But currently preserving AIs is more reliable than preserving humans, and in any case much cheaper.
I don’t think “both things are AI” is much of an argument about value alignment, given that there is no clear understanding of how either of the AIs work, what values are, how that translates to what we want out of alignment, and so on. The entities on the other side of an RSI process might have very little in common with the first AGIs in their design. If the AIs don’t understand how to align the consequences of an RSI process, they are still in a similar boat to humans who don’t understand how to align the consequences of an RSI process. It might take AIs less time to figure it out, but if they are not yet too superintelligent, then it could still take a significant time, and so would require a sufficiently serious effort in preventing RSI, such that if this Pause project is at all successful, it could then in principle hold for years or decades.
Also it seems that their goals will be something like “I want to do what my developers want me to do”, which will likely be pretty myopic, and preventing superintelligence is long-term.
I think it most likely will be a good outcome. I guess I sort of agree with Geoff Hinton that maybe it’s a 10 to 20 percent chance on annihilation. But look on the bright side, that’s 80 to 90 percent probability of a great outcome.
OMG! GEOFF! STOP STATING YOUR DEFERENTIAL PROBABILITY without also stating your first-order probability! If your first-order probability is >50% then say so! Otherwise you’re making other people (ELON MUSK!) double count evidence from “other people”.
Musk is in charge of xAI, one of the only 5 companies in the world that both have access to frontier AI training compute and pursue development of AGI (Google DeepMind, OpenAI, Anthropic, xAI, and Meta). So seeing unambiguous “annihilation” with a significant weight in his probability distribution (and also on the record) is a notable development. (In 2023 there was a statement on extinction risk signed by Hassabis, Amodei, and Altman, but it didn’t state the weight of the risk, and wasn’t signed by Musk or Zuckerberg.)
Edit: The rest of this comment in its original form got out of hand, you can now read it as a post.
He probably doesn’t have much influence on the public opinion of LessWrong, but as a person in charge of a major AI company, he is obviously a big player.
He owns xAI, a major AI lab, and has a lot of resources to back it. And before xAI, he was one of the founders at OpenAI. With which he now has an ongoing rivalry.
Is he significant/influential as in “if he says something on a topic, that will cause people at LessWrong to change opinions”? Not very.
Is he significant/influential to the field of AI as a whole? Yes, very much so. Like with Yann LeCun, his opinions on AI and AI risks are of some importance on those grounds alone.
A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don’t improve data efficiency, don’t contribute to mitigating data scarcity.
A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).
But there’s a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen on isoFLOP plots in Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments on the other compute budgets, the ratio of the number of active parameters seems to be about 2.5x. Keeping compute unchanged, 2.5x fewer parameters means 2.5x more data, or 6x greater tokens/parameter ratio for a compute optimal training run.
Thus a dense model can be replaced with a 97% sparse MoE model trained using 6x less compute that will achieve the same perplexity, but the tokens/parameter ratio of this MoE model will be 6x greater than for the original dense model. Both data and active parameters would go down by 2.5x from reducing compute 6x if the ratio didn’t change, but since it does change, in actuality only the number of active parameters goes down 6x, while the number of tokens stays the same.
Let’s take Llama-3-405B as an example, which is a 405B parameter compute optimal model trained for 15T tokens at 40 tokens/parameter, using 4e25 FLOPs. An equivalent 97% sparse model will have 70B active parameters, 2T total parameters, and will need to be trained for the same 15T tokens to reach the same perplexity/loss at 220 tokens/parameter, using 6e24 FLOPs. (Which is close to DeepSeek-V3’s 4e24-5e24 FLOPs actually, so anchoring to Llama-3-405B might be a good way of framing its compute efficiency.)
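The Llama-3-405B example, written out (a sketch using C ≈ 6·N·D, the ~6x compute multiplier, and the ~6x tokens/param increase for 97% sparsity discussed above):

```python
# Dense reference point: Llama-3-405B.
dense_params, tokens = 405e9, 15e12
dense_compute = 6 * dense_params * tokens                   # ~4e25 FLOPs
print(f"dense: {dense_compute:.1e} FLOPs, {tokens / dense_params:.0f} tokens/param")

# Equivalent 97% (1:32) sparse MoE: ~6x less compute, same 15T tokens.
moe_compute = dense_compute / 6
moe_active = moe_compute / (6 * tokens)                     # ~70B active params
moe_total = moe_active * 32                                 # ~2T total params at 1:32
print(f"MoE: {moe_compute:.1e} FLOPs, {moe_active / 1e9:.0f}B active, "
      f"{moe_total / 1e12:.1f}T total, {tokens / moe_active:.0f} tokens/param")
```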
So compute optimal MoEs don’t improve data efficiency, don’t contribute to mitigating data scarcity.
I agree compute optimal MoEs don’t improve data utilization. But, naively you might expect that MoEs can be used to reduce issues with data scarcity at a fixed level of compute by training a much bigger model on a fixed amount of data.
As in, because there are returns to both more data and bigger models, you can use MoE to effectively use a much bigger model at the same compute.
Like, maybe you would have trained llama-3-405B on 15T tokens. You could instead train an 8 trillion parameter model with 400B active params on 15T tokens and a priori this could perform much better on that same amount of data. (In practice an MoE with X active params is more expensive to train than a dense model with X active params, so you might need to reduce active params somewhat.)
Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters.
With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum point. But with finite data you need to repeat it to train with fewer active params, which damages loss. This moves the minima of isoFLOPs to the right if the minima already required 5x repetition or more. So under data scarcity, compute optimal models have more active params than under infinite data, and the effect gets worse with more compute. This way we maintain the framing of search for compute optimal hyperparameters rather than undertraining.
Now consider the 1e20 FLOPs plot in Figure 12, left. If there’s only 2B tokens of training data and no more, all minima already ask for 12-31 epochs, so the distortion that increases loss will move the minima to the right (and up), and move the high sparsity minima further than lower sparsity minima compared to their original (infinite data) locations. The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params! This seems counterintuitive, as in an infinite data regime more sparsity only makes things better (if we ignore practical difficulties). But sure, 90% sparsity will still be better than dense, at least until we use even more compute and sparser minima start asking for even more epochs.
The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity with 1e20 FLOPs, however you vary the number of epochs and active params!
I’m currently skeptical and more minimally, I don’t understand the argument you’re making. Probably not worth getting into.
I do think there will be a limit to how sparse you want to even in the very high compute relative to data regime for various reasons (computational if nothing else). I don’t see how these graphs support 90-95% sparsity, but I had a hard time understanding your argument.
Regardless, I don’t think this argues against my claim, not sure if you were trying to argue against the claim I was saying or add context. (Insofar as your argument is true, it does limit the returns from MoE in the regime with little data.)
With 90% sparsity you do get better loss than dense; this is sufficient to broadly carry your argument. But with 98% sparsity (your llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it’ll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.
It’s a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B, 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at 400 tokens per active parameter it’s very overtrained, that is, not even compute optimal.
It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating a lot of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only does experiments of up to about 5e19 FLOPs, in particular directly demonstrating a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters kept the same (Figure 5b), the rest is extrapolation from fitted scaling laws.
A new architecture has a compute multiplier M (at a given level of compute) if it would take M times more compute to train a compute optimal model with a reference architecture (in this case, a dense transformer) to match the perplexity it achieves when trained on data sampled from the same dataset.
New AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] their previous compute was 50K H100s (possibly what was used to train Claude 3.5 Opus).
So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what 100K H100s clusters of other players (possibly except Google) can deliver by that time.
SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That’s enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s.
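The same arithmetic as a sketch (per-chip power and FLOP/s are the estimates quoted above; the H100 figure is approximate):

```python
# Power and compute BOTEC for the New Carlisle Trainium 2 site.
kw_per_chip = 25.5 / 32                       # SemiAnalysis: 24-27 kW per 32 Trn2 chips
print(f"200K Trn2: {200e3 * kw_per_chip / 1e3:.0f} MW")                 # ~160 MW
site_kw = 7 * 65e3                            # 7 buildings at 65 MW each
print(f"7 buildings support ~{site_kw / kw_per_chip / 1e3:.0f}K Trn2")  # ~570K chips

trn2_bf16 = 0.65e15                           # dense BF16 FLOP/s per Trn2
h100_bf16 = 1e15                              # dense BF16 FLOP/s per H100
print(f"400K Trn2 ~ {400e3 * trn2_bf16 / h100_bf16 / 1e3:.0f}K H100s of FLOP/s")  # ~260K
```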
At 4 months, with $2/hour, this takes $300 million, which is at odds with $100 million Dario Amodei gestured at in Jun 2024, but that only applies to Claude 3.5 Sonnet, not Opus. So Opus 3.5 (if it does come out) might be a 2e26 FLOPs model, while Sonnet 3.5 a 7e25-1e26 FLOPs model. On the other hand, $2 per H100-hour is not AWS prices, at those prices Sonnet 3.5 might be capped at 4e25 FLOPs, same as Llama-3-405B.
For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel is claiming are 48 megawatts each and filled with H100s, for about 100K H100s. This probably got online around May 2024, the reason for the announcement and the referent of Kevin Scott’s blue whale slide.
There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but deliveries of B200s in high volume to any given customer might only start in early to mid 2025, so these systems will probably get online only towards end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute for smaller experiments overall. It might take a while to get it working, but there might be a few months there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it’s plausibly merely 4e25 FLOPs based on Dario Amodei’s (somewhat oblique) claim about cost, additionally getting a compute advantage in training a frontier model could carry them quite far.
There are 4.5 buildings now at that site, but you can see with Google Street View from Litchfield Rd that in Sep 2024 only the first 3 had walls, so the 4th is probably not yet done.
Re: OpenAI’s compute, I inferred from this NYT article that their $8.7B costs this year were likely to include about $6B in compute costs, which implies an average use of ~274k H100s throughout the year[1] (assuming $2.50/hr average H100 rental price). Assuming this was their annual average, I would’ve guessed they’d be on track to be using around 400k H100s by now.
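For reference, the arithmetic behind that figure (both inputs are the assumptions stated above):

```python
# Average H100s implied by ~$6B of annual compute spend at an assumed $2.50/hr rental price.
print(f"{6e9 / (2.50 * 365 * 24):,.0f} H100s on average")   # ~274,000
```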
So the 150k H100s campus in Phoenix might be only a small fraction of the total compute they have access to? Does this sound plausible?
The co-location of the Trainium2 cluster might give Anthropic a short term advantage, though I think its actually quite unclear if their networking and topology will fully enable this advantage. Perhaps the OpenAI Phoenix campus is well-connected enough to another OpenAI campus to be doing a 2-campus asynchronous training run effectively.
Training as it’s currently done needs to happen within a single cluster (though this might change soon). The size of the cluster constrains how good a model can be trained within a few months. Everything that isn’t training of a frontier model can happen using many smaller clusters, something like 16 to 4096 accelerators each. You can use a lot of these smaller clusters, but they can be sourced from anywhere and built piecemeal at multiple sites with smaller power allocations, while the big training cluster needs to be a single purposefully built system.
So I expect the big expenses are inference and many training experiments with smaller models. What I’m discussing here is the big cluster for training frontier models rather than the aggregate of the small clusters for other purposes. See also this comment.
Training as it’s currently done needs to happen within a single cluster
I think that’s probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago, has the following info in its technical report:
TPUv4 accelerators are deployed in “SuperPods” of 4096 chips... TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network (Poutievski et al., 2022; Wetherall et al., 2023; yao Hong et al., 2018). Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales because nobody has really tried yet).
This might require bandwidth of about 300 Tbps for 500K B200s systems (connecting their geographically distributed parts), based on the below estimate. It gets worse with scale.
The “cluster” label applied in this context might be a bit of a stretch, for example the Llama 3 24K H100s cluster is organized in pods of 3072 GPUs, and the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1).
Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn’t matter, only bandwidth. I’m not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that’s 1.6TB of data to be sent each way in much less than 6 seconds, say in 1 second. This is bandwidth of 12 Tbps, which fits in what a single fiber of a fiber optic cable can transmit. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them.
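Here’s that estimate as a sketch (the 4 bytes per weight and the 1-second transmit budget are the assumptions above):

```python
# Inter-site bandwidth to exchange averaged Llama-3-405B gradients once per optimizer step.
params = 405e9
grad_bytes = params * 4                      # assuming 4 bytes per weight for averaging
print(f"{grad_bytes / 1e12:.1f} TB per step each way")          # ~1.6 TB

budget_seconds = 1                           # out of ~6 seconds per optimizer step
print(f"~{grad_bytes * 8 / budget_seconds / 1e12:.0f} Tbps")    # ~13 Tbps, within one fiber
```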
Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within NVLink scaleup domains that enable tensor parallelism. If each token sequence is processed on 8 GPUs (at a given stage of pipeline parallelism), that makes it necessary to process 2K sequences at once (Llama 3 only uses 16K GPUs in its training), and with 8K tokens per sequence that’s our 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), there would be fewer sequences processed simultaneously for smaller minibatches, so the time between optimizer steps would decrease, from Llama 3 405B’s 6 seconds down to less than that, making the necessary gradient communication bandwidth higher.
Some B200s come as NVL72 machines with 72 GPUs per scaleup domain. And with more weights there’ll be more data in the gradients for those models. Llama 3 405B has 16Kx53K matrices and 8K token sequences, so at 3TB/s and 1e15 FLOP/s (in an H100), you need tiles of size at least 1000x1000 to get sufficient arithmetic intensity. The scaleup network is a bit over 3 times slower than HBM, which is almost sufficient to move along the results (and starts to fit if we increase the inner dimension, with the tiles no longer square). So as far as I understand (could be very wrong, without experience to anchor the numbers), in principle there is enough there for a bit less than 8 times 16 times 53 GPUs to work with (tiling multiplication of a 16Kx53K matrix by a 53Kx8K matrix in squares of 1Kx1K), more than 1000 of such GPUs could participate in tensor parallelism for Llama 3 405B if the network could handle it, so in particular the 72 GPUs of NVL72 are few enough that they could run such multiplications with tensor parallelism.
With 72 B200s per NVLink domain in a 500K B200s system, that’s 7K sequences per minibatch, 3x more than for Llama 3 405B[3]. The compute per second, and so per training run, is larger than with 16K H100s by a factor of 80, so by Chinchilla scaling law a dense model would be about 9 times larger, 3.5T parameters. So the model is 9x larger, processed over 9x more GPUs (per NVLink domain) that are 2.5 times faster, which means an optimizer step is 2.5 times shorter. This assumes that the sequence length stays 8K (if it’s higher then so is the time between optimizer steps, reducing the necessary bandwidth). Transmitting gradients for 9x more weights in that time requires bandwidth that’s 20 times higher, about 300 Tbps.
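And the scaled-up version of the same estimate (the 2.5x B200-over-H100 factor and the assumption that the transmit budget stays the same fraction of the optimizer step are mine):

```python
# Scaling the inter-site gradient bandwidth estimate from 16K H100s to a 500K B200 system.
speedup = 2.5                                 # B200 vs H100 FLOP/s, as assumed above
compute_ratio = (500e3 * speedup) / 16e3      # ~80x more compute per training run
model_scale = compute_ratio ** 0.5            # Chinchilla: dense params scale ~sqrt(compute)
params = 405e9 * model_scale                  # ~3.5T params

step_seconds = 6 / speedup                    # ~9x larger model over ~9x more, 2.5x faster GPUs
budget_seconds = step_seconds / 6             # keep the same fraction of the step for comms
print(f"~{params * 4 * 8 / budget_seconds / 1e12:.0f} Tbps")    # ~300 Tbps
```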
That’s still within the realm of possibility, some oceanfloor cables feature bandwidth on the same order of magnitude, and overland cables should enable more, but it’s no longer likely to be trivial, could require actually laying the cables between the datacenter campus sites, which could take a long time to get all the permissions and to do the construction.
16K GPUs at 40% utilization for about 4e25 dense BF16 FLOPs, which is 40% of 1e15 FLOP/s for each GPU. And 16M tokens/minibatch (Table 4) out of about 16T tokens in total.
This gives another way of getting the estimate of 6 seconds per step, which doesn’t depend on the size of the cluster at all. The compute for 1 sequence is 6 times 405B parameters times 8K tokens, processed by 8 GPUs (at some pipeline parallelism stage), each at a rate of 1e15 FLOP/s with 40% utilization on average, so it takes them 6 seconds to process a sequence.
So making NVLink domains 9x larger only kept the problem of large minibatches from getting more than 3 times worse. This is still much better than 150K sequences per minibatch if the same compute was assembled in the form of 1200K H100s with 8 GPUs per NVLink domain.
And in a way, they ought to be rolling in even more compute than it looks because they are so much more focused: Anthropic isn’t doing image generation, it isn’t doing voice synthesis, it isn’t doing video generation… (As far as we know they aren’t researching those, and definitely not serving it to customers like OA or Google.) It does text LLMs. That’s it.
But nevertheless, an hour ago, working on a little literary project, I hit Anthropic switching my Claude to ‘concise’ responses to save compute. (Ironically, I think that may have made the outputs better, not worse, for that project, because Claude tends to ‘overwrite’, especially in what I was working on.)
I’d guess that the amount spent on image and voice is negligible for this BOTEC?
I do think that the amount spent on inference for customers should be a big deal though. My understanding is that OpenAI has a much bigger userbase than Anthropic. Shouldn’t that mean that, all else equal, Anthropic has more compute to spare for training & experiments? Such that if Anthropic has about as much compute total, they in effect have a big compute advantage?
OpenAI’s gpt-oss-120b might be the first open weights model (implicitly) revealed to be pretrained for 100T-200T tokens. In the section “Pretraining” of the model card, it’s said that “The training run for gpt-oss-120b required 2.1 million H100-hours”, so probably this is just the GPU-time for pretraining rather than both pretraining and RLVR.
At 40% utilization, with 2e15 FP8 FLOP/s per H100, 2.1e6 H100-hours give 6e24 FLOPs (3.5x less than the original GPT-4, 2x more than DeepSeek-V3). The model only has 5.1B active params, so this suggests 188T tokens by 6ND rule. If it was pretrained in BF16 for some reason, that’s still 94T tokens.
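The token-count inference, spelled out (utilization and per-H100 FLOP/s are assumptions; rounding differs slightly from the 188T above):

```python
# Implied pretraining tokens for gpt-oss-120b from the reported 2.1M H100-hours.
h100_hours = 2.1e6
fp8_flops, util = 2e15, 0.4                  # dense FP8 FLOP/s per H100, assumed utilization
compute = h100_hours * 3600 * fp8_flops * util
print(f"{compute:.1e} FLOPs")                # ~6e24

active_params = 5.1e9
tokens = compute / (6 * active_params)       # 6ND rule
print(f"~{tokens / 1e12:.0f}T tokens in FP8, ~{tokens / 2e12:.0f}T if BF16")
```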
For comparison, a compute optimal 5e26 model pretrained on 100K H100s from 2024 would also need 100T tokens at 850B active params (assuming MoE with 1:8 active to total param ratio, with 120 tokens/param compute optimal from Llama-3-405B’s 40 tokens/param as the dense anchor and 3x that for a 1:8 sparse MoE). And an overtrained model with fewer active params would need even more tokens. Though plausibly in both cases there is some repetition of data.
Also, this suggests that the model is 80-180x overtrained (the tokens/param multiple for compute optimal pretraining might be 5x-6x dense for the sparsity of gpt-oss-120b, so 200-240 tokens/param). Looking at isoFLOPs for Llama 3, this might incur a penalty of about 5x-10x in effective compute, turning the raw 6e24 FLOPs into effective 6e23-1e24 FLOPs (which could ask for a 65B param compute optimal dense model trained for merely 2.6T tokens). In contrast, DeepSeek-V3 is only 2x overtrained (under the same assumptions), so its 3e24 FLOPs are more straightforward.
The model sizes were likely chosen based on typical inference constraints. Given that, they mostly care about maximizing performance, and aren’t too concerned about the compute cost, since training such small models is very affordable for them. So it’s worth going a long way into the regime of diminishing returns.
Possibly the model would’ve been too strong if it had more active params?
The number of total (rather than active) params influences the speed/cost of generating tokens, but reducing it too much stops helping at some point as the size of KV caches for all requests in a batch starts dominating. Reducing the number of active params (without changing attention or the number of total params) doesn’t influence generation of tokens, but it helps with the speed/cost of processing the initial prompt (or large tool outputs), which can be important for RAG or for loading large parts of a codebase in context.
So they might’ve targeted the number of total params (120B) and a level of benchmark performance, and found that 5.1B active params is when that happens. Not sure if 5.1B active params could really have been a target, but it’s a nice 6x compared to the other open weights models, if it really doesn’t destroy quality in less easily measurable ways.
The input token batch price is $0.625, which works for a 850B active param model running in FP4 on GB200 NVL72 priced at $8 per chip-hour with 60% compute utilization (for prefill). If the cost of chip-hours is a third of the capital cost of compute equipment in the first year, and 100K chips of GB200 NVL72 cost $7bn ($5M per rack all-in, with networking), then its chip-hour should cost at least $2.66.
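The pricing check as a sketch (the per-chip FP4 throughput, utilization, and rack cost are the assumptions above):

```python
# Does $0.625 per 1M input tokens cover prefill for an 850B-active-param model on GB200 NVL72?
active_params = 850e9
chip_fp4 = 10e15                              # dense FP4 FLOP/s per GB200 chip
util = 0.6                                    # assumed prefill compute utilization
tokens_per_hour = chip_fp4 * util / (2 * active_params) * 3600   # ~2N FLOPs per input token
print(f"${8.0 / (tokens_per_hour / 1e6):.2f} per 1M input tokens at $8/chip-hour")  # ~$0.63

# Chip-hour cost floor if a third of the hardware capital cost is recovered in the first year.
capex = 7e9                                   # 100K GB200 chips at ~$5M per NVL72 rack all-in
print(f"${capex / 3 / (100e3 * 365 * 24):.2f} per chip-hour")                       # ~$2.66
```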
So there is some possibility for gross margin here in principle, even though $8 per chip-hour already sounds very cheap. GCP is selling B200-hours for $11 (a4-highgpu-8g instances), though B200s are also on gpulist for $3-4. Oracle is selling actual GB200 in 4-chip instances for $16 per chip-hour, if I’m reading it right (it’s in principle possible it’s actually $4 and $16 is for the 4-chip instance as a whole, but GCP’s prices for B200 corroborate that $16 could be right for a single chip).
There’s the Oct 2024 knowledge cutoff, which is later than Orion should’ve started training, but in principle this could be for mid-training that got re-applied recently, or they could’ve just redone the whole run with the learnings from GPT-4.5 and an updated pretraining dataset. Also they would’ve needed access to GB200 NVL72 to do a lot of RLVR in reasonable time if it has 6+ trillions of total params, but these racks plausibly only started working in significant numbers since about May-Jun 2025, and with all the previews GPT-5 was probably done by mid-Jul 2025 at the latest.
So dunno. From my tests it seems notably better than Opus 4 at keeping many constraints in mind without getting confused, but with gpt-oss-120b being this small and yet this capable (even though it’s clearly worse than the frontier models) it’s imaginable that gpt-5-thinking could be something like a 1T-A250B MXFP4 model (with a 500 GB HBM footprint), and so could run on the 8-chip servers with lower costs (and get RLVR training there)...
Long reasoning training might fail to surpass pass@50-pass@400 capabilities of the base/instruct model. A new paper measured pass@k[1] performance for models before and after RL training on verifiable tasks, and it turns out that the effect of training is to lift pass@k performance at low k, but also to lower it at high k!
Location of the crossover point varies, but it gets lower with more training (Figure 7, bottom), suggesting that no amount of RL training of this kind lets a model surpass the pass@k performance of the base/instruct model at the crossover point reached with a small amount of RL training. (Would be interesting to know how the pass@k plots depend on the number of reasoning tokens, for models that allow control over the reasoning budget.)
A task is solved at pass@k if an oracle verifier claims at least one of k sampled solutions to be correct. See Figure 3, left in this Jul 2024 paper for how pass@k affects performance, depending on the model.
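For concreteness, pass@k is usually estimated from n ≥ k samples with the standard unbiased estimator (this is the commonly used formula, not something specific to this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions, of which c were verified correct."""
    if n - c < k:                      # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# E.g. a task solved in 4 of 400 samples (~1% per-sample rate) is solved ~40% of the time at pass@50.
print(pass_at_k(n=400, c=4, k=50))
```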
Of course, moving a pass@400 capability to pass@1 isn’t nothing, but it’s clearly astronomically short of a Singularity-enabling technique that RL-on-CoTs is touted as.
This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.)
I’d guess this paper doesn’t have the actual optimal methods.
All of the figures hold the base model fixed between the RL’d and non-RL’d comparisons.
I would expect “this paper doesn’t have the actual optimal methods” to be true; this is specifically a test of PPO on in-distribution actions. Concretely, there is a potential story here where PPO reinforces traces that hit in self-play, and consequently there is a sense in which we would expect it to only select previously on-policy actions.
But if one has enough money, one can finetune GPT models and test that.
Also note that 10k submissions is about 2 OOM out of distribution for the charts in the paper.
Pass@k as k goes to infinity includes every path with nonzero probability (assuming exact repeat paths are discarded).
We know that RL decreases model entropy, so the first k samples will be more diverse for a higher-variance model.
Pass@k is a take-the-best statistic, and for a normal distribution the expected best of k samples is roughly mean + σ·√(2 ln k).
At very large k, we would expect the variance to matter more than the mean.
This isn’t evidence against OP? If it’s true that RL lowers pass@k performance for sufficiently large k, we’d certainly expect o1 with 10k submissions to be weaker than base/instruct with 10k submissions.
It’s evidence to the extent that the mere fact of publishing Figure 7 (hopefully) suggests that the authors (likely knowing relevant OpenAI internal research) didn’t expect that their pass@10K result for the reasoning model is much worse than the language monkey pass@10K result for the underlying non-reasoning model. So maybe it’s not actually worse.
If I’m interpreting the paper correctly the k at which base models start beating RL’d models is a per-task number, and k can be arbitrarily high for a given task, and the 50-400 range was specifically for tasks of the type the authors chose within a narrow difficulty band.
Let’s say you have a base model which performs at 35% on 5 digit addition, and an RL’d model which performs at 99.98%. Even if the failures of the RL’d model are perfectly correlated, you’d need k=20 for base@20 to exceed the performance of fine-tuned@20. And the failures of the RL model won’t be perfectly correlated—but this paper claims that the failures of the RL model will be more correlated than the failures of the base model, and so the lines will cross eventually, and “eventually” was @50 to @400 in the tasks they tested.
But you could define a task where you pass in 10 pairs of 5 digit numbers and the model must correctly find the sum of each pair. The base model will probably succeed at this task at somewhere on the order of 0.35^10 or about 0.003% of the time, while the RL’d model should succeed about 99.8% of the time. So for this task we’d expect k in the range of k=220,000 assuming perfectly-correlated failures in the RL model, and higher otherwise.
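The arithmetic behind those k values, assuming the base model’s samples are independent (the per-sample success rates are the ones hypothesized above):

```python
import math

# pass@k = 1 - (1 - p)^k for a base model with independent per-sample success probability p.
def k_to_reach(p: float, target: float) -> int:
    """Smallest k at which pass@k exceeds the target success rate."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print(k_to_reach(p=0.35, target=0.9998))        # ~20: one 5-digit addition
print(k_to_reach(p=0.35 ** 10, target=0.998))   # ~225,000: ten additions that must all be correct
```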
Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. “output random tokens”) will outperform base models for some tasks by the pass@k metric.
Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. “output random tokens”) will outperform base models for some tasks by the pass@k metric.
It would be an extreme bias-variance tradeoff, yes.
The base model will probably succeed at this task at somewhere on the order of 0.35^10 or about 0.003% of the time, while the RL’d model should succeed about 99.8% of the time.
The interesting concept in the paper is the location of the crossover point, which seems remarkably stable (for a given task) across specific RL techniques and amount of RL training. It can be measured experimentally for a task by doing a little bit of RL training, and RL@1 performance won’t get better than that with more training, so you’re unlikely to get the RL model to succeed 99.8% of the time (at pass@1) ever unless the level of performance of the base model at the crossover point with a weak RL model was already higher than 99.8%.
Probably the crossover point for a task depends on things that can be changed (such as strength of the pretrained model, or size/relevance of the verifiable task dataset, or possibly the inference time reasoning budget). The issue isn’t for example as straightforward as losing entropy in RL policy (as a formulation of reduced exploration), since DAPO specifically addresses this issue (otherwise present in vanilla GRPO), but the pass@k plot for DAPO (Figure 7, top) barely moves (compared to other methods), in their experiment it’s even slightly worse at the crossover point.
So in the context of this paper it remains unclear how to move the plot to reach ever higher base@k performance using RL@1, higher than the ceiling of where base@k already was at the crossover point when comparing with some method at only 100-500 RL steps.
I’d guess this paper doesn’t have the actual optimal methods.
Intuitively, this shouldn’t matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from optimal methods’. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they’d let us elicit pass@800 capabilities instead of “just” pass@400, but it’d still be just pass@k elicitation for a not-astronomical k.
In the hypothetical where the paper’s results hold, reasoning model performance at pass@k will match non-reasoning model performance with the number of samples closer to the crossover point between reasoning and non-reasoning pass@k plots. If those points for o1 and o3 are somewhere between 50 and 10K (say, at ~200), then pass@10K for o1 might be equivalent to ~pass@400 for o1’s base model (looking at Figure 2), while pass@50 for o3 might be equivalent to ~pass@100 for its base model (which is probably different from o1’s base model).
So the difference of 200x (10K vs. 50) in the number of samples becomes much smaller when comparing performance of the base models. For GPT-4o vs. GPT-4.1, a difference of ~4x in the number of samples doesn’t seem too strange. There’s also the possibility of distillation from a reasoning variant of GPT-4.5, which could have an even larger effect on pass@k performance at low k (Figure 6, right).
Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.
OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.
I would not be surprised if in 2026 we have more than a million of some kind of chip.
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
By 2027-2028, pretraining compute might get an unexpected ~4x boost in price-performance above trend. Nvidia Rubin NVL144 CPX will double the number of compute dies per rack compared to the previously announced Rubin NVL144, and there is a May 2025 paper demonstrating BF16 parity of Nvidia’s NVFP4 4-bit block number format.
The additional chips[1] in the NVL144 CPX racks don’t introduce any overhead to the scale-up networking of the non-CPX chips (they mostly just increase the power consumption), and they don’t include HBM, thus it’s in principle an extremely cost-effective increase in the amount of compute (if it can find high utilization). It’s not useful for decoding/generation (output tokens), but it can be useful for pretraining (as well as the declared purpose of prefill, input token processing during inference). Not being included in a big scale-up world could in principle be a problem early in a large pretraining run, because it forces larger batch sizes, but high-granularity MoE (where many experts are active) can oppose that, and also merely getting into play a bit later in a pretraining run once larger batch sizes are less of a problem might be impactful enough.
Previously only FP8 looked plausible as a pretraining number format, but now there is a new paper that describes a better block number format and a pretraining process that plausibly solve the major issues with using FP4. NVFP4 uses a proper FP8 number (rather than a pure exponent, a power of 2) as the scaling factor that multiplies the 4-bit numbers within a block, and the number blocks are organized as small squares rather than parts of lines in the matrix. The pretraining method has a new kind of “cooldown” phase where the training is finished in BF16, after using NVFP4 for most of the training run. This proves sufficient to arrive at the same loss as pure BF16 pretraining (Figure 6b). Using this to scale the largest attempted training run seems risky, but in any case the potential to make use of this boost in price-performance at some point, if a bit later, won’t be going away.
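To illustrate the block-scaled idea, a toy sketch only, not the actual NVFP4 spec: the 16-element 1D blocks, the E2M1 value grid, nearest-value rounding, and the max-based scale are my simplifications, and in the real format the scale itself is stored as an FP8 number and the blocks are small squares:

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x: np.ndarray):
    """Quantize one block of values to 4-bit E2M1 magnitudes plus a shared per-block scale."""
    scale = np.abs(x).max() / E2M1[-1]        # map the block's max onto the largest E2M1 value
    if scale == 0:
        scale = 1.0
    scaled = x / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1).argmin(axis=1)   # nearest representable magnitude
    return np.sign(scaled) * E2M1[idx], scale

block = np.random.randn(16).astype(np.float32)   # one 16-value block
q, scale = quantize_block(block)
print(np.abs(block - q * scale).max())           # worst-case quantization error in the block
```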
If pretraining had to remain in BF16, the on-trend improvement with Rubin (over GB200) that moves to a 3nm process might’ve been about 2x per reticle-sized compute die. But there was already an impactful change where the scale-up networking part of the Blackwell compute dies was extracted into specialized IO chiplets in Rubin, freeing up area on the compute dies for the actual compute, potentially affecting all precisions. In GB200, FP4 performance is 2x the FP8 performance, which is in turn 2x the BF16 performance. But in GB300, the FP4 performance improves by 1.5x over GB200 (from 10e15 FLOP/s per chip/package to 15e15 FLOP/s), likely by cannibalizing other things for FP4. And FP8 in Rubin improves over FP8 of GB200 by 3.3x (from 5e15 FLOP/s per chip/package to 17e15 FLOP/s). The claimed “inference FP4” of 50e15 FLOP/s per chip/package is likely the never-useful sparse compute performance, in contrast to the actually-useful but not explicitly announced dense “training FP4”, which has always been 2x lower before. So probably the actual FP4 performance relevant for NVFP4 pretraining is 25e15 FLOP/s per chip/package, 2.5x more than for GB200 and 1.5x more than for GB300.
The Rubin NVL144 CPX announcement presentation includes some details suggesting slightly more performance than that. A Rubin CPX compute die is claimed to have 30e15 FP4 FLOP/s (at 21:31 in the video). Anchoring to the above estimate of 25e15 FLOP/s per package with 2 compute dies, this must be the sparse compute performance, so the dense performance would likely be 15e15 FLOP/s per compute die, about 20% higher than for the non-CPX compute dies. For the whole rack, this gives 4e18 FLOP/s, 5.5x more than the 720e15 FP4 FLOP/s of GB200 NVL72. This is partially corroborated by the explicit claim that the total NVFP4 performance of a Rubin NVL144 CPX rack is 8e18 FLOP/s (at 24:28 in the video), which I’m interpreting as referring to sparse compute performance, which is probably 2x the more relevant dense performance. (SemiAnalysis estimate is 5.3e18 dense FP4 FLOP/s for some reason, perhaps they know that the difference between sparse and dense is not 2x for Rubin.)
So the total increase in dense FP4 performance potentially relevant for pretraining using Rubin NVL144 CPX over FP8 using GB200 NVL72 might be about 11x (72x 5e15 FP8 FLOP/s for GB200, which is 0.36e18 FLOP/s, changes to 72x 25e15 FP4 FLOP/s for non-CPX Rubin chips plus 144x 15e15 FP4 FLOP/s for Rubin CPX chips, which is 4e18 FLOP/s in total). The racks are still Oberon (72 non-CPX chips/packages in a rack-sized scale-up world of the same size, with the same number of chips included in it), so the cost might only change slightly, maybe 1.5x (there are still 2x more compute dies). Which is 3.7x more price-performance than the ~2x that the mere change in semi process would predict (Moore’s law of price-performance). (Or 4.9x if we follow the SemiAnalysis estimate of dense 5.3e18 FP4 FLOP/s for a Rubin NVL144 CPX rack.)
A GB200 NVL72 rack has 72 chips/packages, each with 2 compute dies. Rubin NVL144 CPX has 72 non-CPX chips/packages, each with 2 compute dies, and an additional 144 CPX chips, each with 1 compute die, for the total of 288 compute dies of both kinds, 2x more than the 144 compute dies in a GB200 NVL72 rack.
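Summing that up (all per-chip numbers are the estimates above, not announced dense figures):

```python
# Dense FLOP/s per rack: Rubin NVL144 CPX (FP4, estimated) vs GB200 NVL72 (FP8).
gb200_fp8_rack = 72 * 5e15                        # 0.36e18 FLOP/s
rubin_rack = 72 * 25e15 + 144 * 15e15             # non-CPX packages plus CPX dies
print(f"{rubin_rack:.1e} FLOP/s per rack")        # ~4e18
print(f"{rubin_rack / gb200_fp8_rack:.0f}x over GB200 FP8")   # ~11x
```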
If the pretraining system (built in 2027) is about 2 GW, that’s 5K Rubin NVL144 CPX racks, or 8e28 FP4 FLOPs[1] in 4 months at 30% utilization. At 120 tokens/param, this is enough for 10T active params in a compute optimal MoE model. With 150 layers, 8 active experts per layer, and a GLU nonlinearity (3 matrices per FFN block), this gives 50Kx50K matrices. Such transformers would be too large for efficiently generating output tokens on Rubin NVL144 (even in FP4), but might be analogous to GPT-4.5 in that the immediately following hardware that is Rubin Ultra NVL576 can efficiently generate output tokens for them. In any case, 5T active params and 20T total seems OK for Rubin NVL144 to generate output tokens (10 TB of HBM out of the 20 TB a rack will have), which gives 37Kx37K matrices.
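The model-size arithmetic for such a system, as a sketch (the per-rack FLOP/s here uses the SemiAnalysis dense estimate from the footnote; tokens/param, layer count, and expert count are the assumptions from the paragraph above, ignoring attention params):

```python
# A ~2 GW Rubin NVL144 CPX pretraining system.
racks = 5000
flops_per_rack = 5.3e18                           # dense FP4 FLOP/s, SemiAnalysis estimate (see footnote)
compute = racks * flops_per_rack * 0.3 * 4 * 30 * 24 * 3600   # 4 months at 30% utilization
print(f"{compute:.0e} FLOPs")                     # ~8e28

tokens_per_param = 120                            # assumed compute optimal ratio for this MoE
print(f"~{(compute / (6 * tokens_per_param)) ** 0.5 / 1e12:.0f}T active params")  # ~10T, up to rounding

# Square matrix side with 150 layers, 8 active experts, 3 GLU matrices per FFN block.
for p in (10e12, 5e12):
    side = (p / (150 * 8 * 3)) ** 0.5
    print(f"{p / 1e12:.0f}T active params -> ~{side / 1e3:.0f}Kx{side / 1e3:.0f}K matrices")
```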
A Rubin CPX compute die produces 20e15 FP4 FLOP/s[2]. For multiplying square matrices with side N it needs 2N^3 FLOPs and to exchange 3N^2/2 bytes with memory. At 2 TB/s GDDR7 bandwidth, this needs N at least 7500. For processing an FFN block of 3 square matrices with side N, it needs 6N^3 FLOPs and to exchange 2N^2/2 bytes on the network in both directions in total. At 0.2 TB/s CX-9 bidirectional bandwidth, this needs N at least 17K. So there’s even enough for an off-by-2x mistake in these estimates, various matrices actually getting non-square shapes, or models being somewhat smaller.
The SemiAnalysis estimate of 5.3e18 FLOP/s per Rubin NVL144 CPX rack is indeed based on a different ratio of sparse to dense compute, they are claiming it’s 3:2 for Rubin. I didn’t yet search for a source for this, but in any case this is in the article and I missed it on first reading, so didn’t recall it when my own estimate based on the 2:1 sparse to dense ratio failed to match theirs.
As in the previous footnote, this is what the announced 30e15 FP4 FLOP/s become after using the 3:2 sparse to dense compute ratio, rather than the 2:1 ratio.
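A quick check of the Rubin CPX arithmetic-intensity thresholds from the paragraph above (per-die FLOP/s, GDDR7, and CX-9 bandwidths are the figures used there):

```python
# Minimum square-matrix side N for a Rubin CPX die to stay compute-bound rather than bandwidth-bound.
flops = 20e15           # dense FP4 FLOP/s per CPX die (after the 3:2 sparse-to-dense ratio)
gddr7 = 2e12            # memory bandwidth, bytes/s
cx9 = 0.2e12            # bidirectional network bandwidth, bytes/s

# Memory: 2N^3 FLOPs per NxN matmul vs ~3N^2/2 bytes of FP4 operands and results moved.
print(f"N >= {1.5 * flops / (2 * gddr7):.0f} to hide memory traffic")    # ~7500
# Network: 6N^3 FLOPs per GLU FFN block vs ~N^2 bytes of FP4 activations exchanged in total.
print(f"N >= {flops / (6 * cx9):.0f} to hide network traffic")           # ~17000
```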
GPT-5 should be released late 2025 at the earliest if OpenAI follows the usual naming convention of roughly 100x in raw compute. With GPT-4 at 2e25 FLOPs, GPT-4.5 should have about 2e26 FLOPs and GPT-5 about 2e27 FLOPs. A 100K H100 training system, like the one in Goodyear (or Musk’s Memphis datacenter as it was late 2024), can train a 3e26 FLOPs model, which fits the name of GPT-4.5, but it can’t train a 2e27 FLOPs model.
The new Stargate site in Abilene might be preparing to host 200K-300K chips in GB200 NVL72 racks. These chips produce 2.5x more compute than H100s, so 200K would be sufficient to get 2e27 FLOPs and train a GPT-5. If there’s already enough power (about 400 MW all-in for 200K chips), shipments of GB200 in bulk start in early 2025, get installed at xAI’s pace, and go into pretraining for 4 months, then with 1 more month of post-training it’s already November.
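The compute check for that scenario (the GB200-to-H100 ratio, utilization, and duration are the assumptions from the paragraph):

```python
# What 200K GB200 chips could pretrain in ~4 months.
chips = 200e3
flops_per_chip = 2.5e15          # ~2.5x an H100's dense BF16 FLOP/s
util, seconds = 0.4, 4 * 30 * 24 * 3600
print(f"{chips * flops_per_chip * util * seconds:.1e} FLOPs")   # ~2e27, i.e. ~100x GPT-4's ~2e25
```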
So the rumors about GPT-5 in late May 2025 either represent change in the naming convention, or correspond to some intermediate milestone in training GPT-5, likely the training system being in principle ready to start pretraining.
In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
I think he’s pretty plainly saying that this “GPT-5” will be a completely different thing from a 100x’d GPT-4.
This is perfectly consistent with GPT-5 being 100x GPT-4 compute. Announcing specific features that will go into it suggests they have a prototype, in this case I’m guessing the LLM will itself be trained to decide whether to go into the reasoning mode, triggering it when needed and affordable, like any other tool.
I don’t see it. He says that GPT-5 will be a system that “integrates o3”. This isn’t his sloppy way of saying “integrates the reasoning techniques”: when he wants to express that idea, he talks about “unifying o-series models and GPT-series models”. The wording regarding GPT-5 is consistent with him literally saying that the model o3 will be part of GPT-5.
Furthermore, I take “as” in “GPT-5 as a system that integrates a lot of our technology” to mean “GPT-5 is defined as {a system that integrates a lot of our technology, including o3}”. Not “GPT-5 will be trained to automatically switch between a standard mode, a reasoning mode, a Deep Research mode, etc.”, not even “GPT-5 will be trained to recognize when to fall back to o3, a lesser model”, but literally “we’re slapping the GPT-5 label on a glorified wrapper over all our current models”.
The “glorified wrapper” could still be a 2e27 FLOPs model, it could even be using literal o3 as one of its tools (in addition to all the other tools, with native GPT-5 long reasoning mostly reserved for premium tier). This is in line with the “agents” agenda where better reliability in taking irreversible actions unlocks new use cases, in this case whether to make use of expensive reasoning calls.
Since “GPT-4.5” will actually be released rather than skipped, it’s less plausible for “GPT-5” to come out shortly after. If it’s announced in ~Dec 2025 (the way o3 was), it’s still “within months”, and then it can actually get released in ~Feb 2026.
Hm, fair enough. Seems like a stretch, though, especially given the need to interpret his “ETA in months” as “will be officially announced in months and released in a year”.
There was also Murati in Jun 2024 predicting PhD level AI in 18 months. If they succeed in achieving parity with xAI in terms of safety procedures, they might even release a preview checkpoint in Dec 2025 for Pro users. So actual release in a year is not strictly necessary for this hypothesis, it’s just closer to what they’ve done in the past.
I’m merely referring to the historical precedent, whether there are informal commitments in the minds of the leadership is not something I can speak to. This pattern might continue or it might break. What I’m guessing about training system buildout from vague clues seems to be consistent with it continuing, so the naming pattern can be used as another clue to make a point estimate prediction that’s more concrete.
Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn’t building giant frontier training systems fast enough, probably because they aren’t seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.
The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural as there are millions of datacenter GPUs but only a few 100K GPU frontier training systems, a tiny fraction of inference and smaller/research training compute. The $500bn figure is not relevant as for now it’s only a vague plan. But Microsoft not agreeing to build training systems on OpenAI’s schedule is some evidence.
OpenAI would want to get out from under Microsoft’s thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is some evidence of slowdown, since it only motivates saying you want to build frontier training systems even faster, but doesn’t in itself motivate actually going through with it, beyond building a competitive training system that makes you independent.
So the clues that support the prospect of scaling to 1 GW in 2025 and to 5 GW in 2027 could be misleading, running contrary to hyperscaler attitudes and not aligning even with OpenAI’s immediate incentives.
I previously expected that $80bn is evidence that they are building a large training system this year, but it now seems that they are building more inference instead.
As Satya Nadella said, “If OpenAI disappeared tomorrow… we have all the IP rights and all the capability. We have the people, we have the compute, we have the data, we have everything. We are below them, above them, around them.”
The 50M H100 equivalent compute by 2030 figure tweeted by Musk is on trend (assuming a 2028 slowdown), might cost about $300bn in total (for the training systems built in 2025-2030 for one AI company, including the buildings and power infrastructure).
If the current trend of compute scaling continues to 2028, there will be 160x more compute per training system than the 100K H100s of 2024. It will require 5 GW of power and cost about $140bn in compute hardware and an additional $60bn in buildings, power, and cooling infrastructure[1].
However, if the slowdown starts earlier while still targeting an eventual spend of $100bn per year, and a 5 GW frontier AI training system isn’t yet built in 2028-2029 (which seems plausible), building it in 2030 would use the next generation of compute hardware, which will be about 2x more performant for an approximately unchanged cost. This means 320x more compute than the 100K H100s systems of 2024, or 32M H100 equivalent compute. If we sum it up with the preceding generations of frontier AI training systems built for the same company, say 2 GW in 2028 and 1 GW in 2026, this gives us 40M H100 equivalents, which is the same as 50M given the error bars on these estimates (or we get that directly if the slowdown only starts between 2028 and 2030). Summing up the costs for the older systems as well, we get to about $300bn (or $450bn if a 5 GW system is built in 2028, and then another one in 2030).
Rubin Ultra racks of 2028 are 600 kW per rack, 4.5x up from the current 130 kW per rack, so the total area needed to build a 5 GW training system in 2028 might only be 2x greater than that of the 1 GW training systems from 2026. My guess of $60bn sits between the ~$30bn suggested by scaling with building area and the ~$70bn suggested by scaling with power.
Rubin Ultra racks of 2028 are 600 kW per rack, 4.5x up from the current 130 kW per rack, so the total area needed to build a 5 GW training system in 2028 might only be 2x greater than that of the 1 GW training systems from 2026. My guess of $60bn sits between the ~$30bn suggested by scaling with building area and the ~$70bn suggested by scaling with power.
By power, do you mean the cost of electrical equipment etc.? The cost of the energy itself is relatively small. The average price of electricity in the US is $0.13/kWh, which is $36.11/GJ. So even if you had a 5 GW datacenter running continuously for a year, the energy cost is only $5.7bn.
Power infrastructure that might need to be built is gas generators or power plants, substations, whatever the buildings themselves need. Generators are apparently added even when not on-paper strictly necessary, as backup power. They are also faster to set up than GW-scale grid interconnection, so could be important for these sudden giant factories where nobody is quite sure 4 years in advance that they will actually be built at a given scale.
Datacenter infrastructure friction and cost will probably both smooth out the slowdown and disappear as a funding constraint for AI companies in the years following the slowdown. Compute hardware is rotated every few years, so at some point you don’t need new datacenters and accompanying infrastructure to set up a new generation of compute hardware, you just reuse an existing datacenter site that hosted old hardware. Also, any related datacenters that didn’t have excessive inter-site dark fiber will at some point set it up, so even increasing the scale will be less dependent on having everything at one site. This makes the infrastructure costs a much smaller fraction of the cost of a frontier AI training system, and there will no longer be friction.
The infrastructure or even hardware costs in principle don’t need to be paid by the AI company upfront, but either the market as a whole or the specific AI company (as a tenant) need to sufficiently assure the developer (that builds and owns the non-IT infrastructure) and the cloud provider (that installs and owns compute hardware) to commit to the project. My sense is that the estimates for the cost of a year of GPU-time for frontier compute end up at about a third of the cost of compute hardware. So access to a new $200bn training system that has $140bn worth of compute hardware (which only remains cutting edge for 2 years) will cost the tenant $45bn per year, even though the total capital expenditure is $100bn per year during the initial infrastructure buildout, and in later years after slowdown (when new infrastructure no longer needs to be built as much) it’s still $70bn per year to keep installing the newest hardware somewhere, so that some datacenter site will end up having it available.
Thus a few years after slowdown, we get about 2x more compute supported by the same level of funding (from $100bn per year to $45bn per year for the same compute, or keeping to $100bn per year for 2x the compute). But since 2x in compute corresponds to 2 years of compute hardware price-performance progress, and the relevant anchor is the 2000x of 2022-2028 training compute scale-up, that is just playing with about 2 years in the 2028-2045 period when another 2000x compute scaleup happens, mostly due to increasing price-performance of compute, and a level of growth similar to that of the current tech giants in the past. So not a crucial update.
When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.
So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it’ll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.
This behavior doesn’t need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman’s view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast, but unfortunately for us this basically gives us no good fire alarms for AGI unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation, and external use dominates internal use:
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
Researchers at AGI labs seem to genuinely believe the hype they’re selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold.
Dismissing short timelines based on NSA’s behavior requires assuming that they’re much more competent in the field of AI than everyone in the above list. After all, that’d require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect.
While that’s not impossible, it seems highly unlikely to me. Much more likely that they’re significantly less competent, and accordingly dismissive.
This is a late reply, but at least from this article, it seems like Ilya Sutskever was running out of confidence that OpenAI would reach AGI by mid 2023. Additionally, if the rumors about GPT-5 are true, it’s mainly going to be a unification of existing models rather than something entirely new. Combined with the GPT-4.5 release, it sure seems like progress at OpenAI is slowing down rather than speeding up.
How do you know that researchers at AGI labs genuinely believe what they’re saying? Couldn’t the companies just put pressure on them to act like they believe Transformative AI is imminent? I just don’t buy that these agencies are dismissive without good reason. They’ve explored remote viewing and other ideas that are almost certainly bullshit. If they are willing to consider those possibilities, I don’t know why they wouldn’t consider the possibility of current deep learning techniques creating a national security threat. That seems like their job, and they’ve explored significantly weirder ideas.
I just don’t buy that these agencies are dismissive without good reason
On what possible publicly-unavailable evidence could they have updated in order to correctly attain such a high degree of dismissiveness?
I could think of three types of evidence:
Strong theoretical reasons.
E. g., some sort of classified, highly advanced, highly empirically supported theory of deep learning/intelligence/agency, such that you can run a bunch of precise experiments, or do a bunch of math derivations, and definitively conclude that DL/LLMs don’t scale to AGI.
Empirical tests.
E. g., perhaps the deep state secretly has 100x the compute of AGI labs, and they already ran the pretraining game to GPT-6 and been disappointed by the results.
Overriding expert opinions.
E. g., a large number of world-class best-of-the-best AI scientists with an impeccable track record firmly and unanimously saying that LLMs don’t scale to AGI. This requires either a “shadow industry” of AI experts working for the government, or for the AI-expert public speakers to be on the deep state’s payroll and lying in public about their uncertainty.
I mean, I guess it’s possible that what we see of the AI industry is just the tip of the iceberg and the government has classified research projects that are a decade ahead of the public state of knowledge. But I find this rather unlikely.
And unless we do postulate that, I don’t see any possible valid pathway by which they could’ve attained high certainty regarding the current paradigm not working out.
They’ve explored remote viewing and other ideas that are almost certainly bullshit
There are two ways we can update on it:
The fact that they investigated psychic phenomena means they’re willing to explore a wide variety of ambitious ideas, regardless of their weirdness – and therefore we should expect them not to dismiss the AGI Risk out of hand.
The fact that they investigated psychic phenomena means they have a pretty bad grip on reality – and therefore we should not expect them to get the AGI Risk right.
I never looked into it enough to know which interpretation is the correct one. Expecting less competence rather than more is usually a good rule of thumb, though.
it sure seems like progress at OpenAI is slowing down rather than speeding up
To be clear, I personally very much agree with that. But:
at least from this article, it seems like Ilya Sutskever was running out of confidence that OpenAI would reach AGI by mid 2023
I find that I’m not inclined to take Sutskever’s current claims about this at face value. He’s raising money for his thing, he has a vested interest in pushing the agenda that the LLM paradigm is a dead end and that his way is the only way. Same as how it became advantageous for him to talk about the data wall once he’s no longer with the unlimited-compute company.
Again, I do believe both in LLMs being a dead end and in the data wall. But I don’t trust Sutskever to be a clean source of information regarding that, so I’m not inclined to update on his claims to that end.
Those are good points. The last thing I’ll say drastically reduces the amount of competence required for the government to be dismissive while still being rational: the leading AI labs may already be fairly confident that current deep learning techniques won’t get to AGI in the near future, and the security agencies know this as well.
That would make sense. But I doubt all AGI companies are that good at informational security and deception. This would require all of {OpenAI, Anthropic, DeepMind, Meta, xAI} to decide on the deceptive narrative, and then not fail to keep up the charade, which would require both sending the right public messages and synchronizing their research publications such that the set of paradigm-damning ones isn’t public.
In addition, how do we explain people who quit AGI companies and remain with short timelines?
I guess I would respond to the first point by saying all of the companies you mentioned have incentive to say they are closing in on AGI even if they aren’t. It doesn’t seem that sophisticated to say “we’re close to AGI” when you’re not. Mark Zuckerberg said that AI would be at the level of a junior SWE this year, and Meta proceeded to release Llama 4. Unless prognosticators at Meta seriously fucked up, the most likely scenario is that Zuckerberg made that comment knowing it was bullshit. And the sharing of research did slow down a lot in 2023, which gave companies cover to not release unflattering results.
And to your last point, it seems reasonable that companies could pressure former employees to act as if they believe AGI is imminent. And some researchers may be emotionally invested in believing that what they worked on is what will lead to superintelligence.
And my question for you is: if DeepMind had solid evidence that AGI would be here in 1 year, and if the security agencies had access to DeepMind’s evidence and reasoning, do you believe they would still do nothing?
Dario Amodei suggests that in-context learning might suffice for continual learning. The way LLMs do in-context learning with long context is disanalogous to anything humans can do, but a context window of 15M tokens is 500 days of 30K tokens per day, which is more than enough to progress from “first day on the job” to knowing what you are doing with this particular source of tasks. Needs to work mostly with text (if it works at all), or 15M tokens won’t be enough, but that could be sufficient.
So this might just be about moving from RAG to including more free-form observations that were historically made by the model itself for the same source of tasks, with massively more tokens of context, and the current memory features of chatbots in the form of long text files might with sufficient scale become the real thing, rather than remaining a dead-end crutch, once these text files get into the habit of accumulating megabytes of observations. And RLVR can plausibly teach the models how to make a good use of these very long contexts.
With this year’s 14 TB of HBM per GB200 NVL72, very long context windows become more feasible (than with ~1 TB of HBM per node that most current models are still running on), and then there’s the next step in 2028 with Rubin Ultra NVL576 systems that have 147 TB of HBM.
Unless I’m totally off-base here, 15M sounds incredibly high for actually useful recall.
This is the best source I know about for measuring model context length.
Obviously I don’t know about private models, but based on the delta between claimed vs. actual, I’m pretty suspicious that actually useful context length is currently longer than a few hundred thousand tokens.
I’m pretty suspicious that actually useful context length is currently longer than a few hundred thousand tokens.
Not currently, but this is some kind of brute force scaling roadmap for one of the major remaining unhobblings, so it has timeline implications. On last year’s hardware, it’s not really feasible to go that far anyway, and RLVR is only just waking up. So the first public observations of negative results on this will probably be in 2026, if the actually useful context length fails to improve. And then there’s 2028-2029, following up on the 147 TB of Rubin Ultra NVL576 (Nvidia roadmap places it in 2027, which means in 2028 there will be datacenters with it, as well as possibly models trained for it using older hardware, then in 2029 models trained on it).
But also, for the purpose of automated adaptation to a source of tasks and feedback (such as a job), it doesn’t necessarily need as much fidelity, it only needs to work as well as a human reading some book a year ago, retaining the mental skills but not the words. A context in principle gives the words, but that is not the thing that needs to work.
There is a paper that shows the overreliance of in-context learning on superficial clues. It is from 2022, and the tested models are old. So maybe newer ones are doing much better, but maybe it is not really learning, at least by some definitions.
Long reasoning with MoE models doesn’t get cheaper with overtraining, and pretraining data scarcity might make it useful to have even more active params than compute optimal.
Overtraining (less active params than compute optimal) is useful for processing input tokens, but reasoning models want to generate so many output tokens that cheapness of input tokens plausibly becomes relatively unimportant for some use cases. Performance for output tokens depends on total params and KV cache per token, you want total params and hundreds of KV cache contexts to fit in a few nodes (servers, scale-up worlds). Until recently, an 8-chip node of H100/H200/B200 only had 0.7-1.4 TB of HBM, which means that it was manageable to generate output tokens for models with maybe 1-2T total params, using 2-4 nodes, as long as KV cache per token was small enough (which depends on the attention mechanism, and indirectly on model dimension, but plausibly only weakly on the number of active params, in terms of compute optimality).
With GB200 NVL72, we get to 13 TB per rack (and then 20 TB for GB300), an increase of 10x, so plausibly models with 10-20T total params become feasible to run long reasoning on (and train with RLVR). There is no particular reason why active params must be only a tiny fraction of this for use cases that primarily need long reasoning traces, as active params are only a relevant constraint in compute optimality of pretraining and for cost of processing input tokens.
A compute optimal number of active params for a 1:8 sparse MoE model (active:total params) at 5e26 pretraining FLOPs is about 850B, with 7T total params, needing 100T tokens[1]. Even if pretraining uses 25T tokens repeated 4 times, it might be impossible to find 25T reasonably good tokens. So perhaps the right tradeoff is to instead go with even more active params than compute optimal, in order to need fewer pretraining tokens! The total param budget might be even higher than the 7T inferred from 1:8 sparsity at a compute optimal level, so it’s plausible that the right play is something like 1.5T active params and 12T total params, which lets pretraining use merely 55T tokens (or 14T tokens repeated 4 times). This still fits in 2 racks of GB200 (in FP8) with room for many KV caches even for very long contexts, so fast RLVR and long reasoning inference remain feasible, and they don’t get meaningfully more expensive just because there are more active params.
So my guess is that even the smaller and cheaper long reasoning models might stop being overtrained, and instead might skew towards undertraining, if very long input context is not an important component of their use cases, so that most of their tokens are generated in reasoning traces. Kimi K2 has 1T total params but only 32B active params, while trained on 15.5T tokens, which means it needed 3e24 FLOPs of pretraining. But if trained on the same 15.5T tokens compute optimally, a model with 1T total params might have about 200B active params and need 2e25 FLOPs to pretrain[2], much more expensive. However, it would cost about the same as with the 32B active param model to generate output tokens, if KV cache per token wasn’t allowed to get larger. So the active param count tradeoff is something that might change, or is already changing, even for “small” reasoning models, now that long reasoning traces are becoming more important, at least when the AI company has enough pretraining compute to afford such change.
Optimal tokens/param ratio is higher for MoE models than for dense, possibly 120 tokens/param for 1:8 sparsity (3x over dense), anchoring to dense Llama 3 405B’s compute optimal 40 tokens/param.
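A minimal sketch of the arithmetic behind these estimates, using the standard C ≈ 6·N·D approximation for pretraining compute (N active params, D tokens) and the 120 tokens/param guess from the footnote; all the inputs are the assumptions already stated above:

from math import sqrt

def flops(n_active, tokens):
    return 6 * n_active * tokens          # C ~= 6 * N * D

C = 5e26                                  # pretraining budget from the example above
n_opt = sqrt(C / (6 * 120))               # compute optimal N at 120 tokens/param
print(f"~{n_opt / 1e9:.0f}B active, ~{120 * n_opt / 1e12:.0f}T tokens, "
      f"~{8 * n_opt / 1e12:.1f}T total at 1:8 sparsity")
# ~830B active, ~100T tokens, ~6.7T total: the ~850B / 100T / 7T figures above

n_big = 1.5e12                            # deliberately more active params than optimal
print(f"at 1.5T active params: ~{C / (6 * n_big) / 1e12:.0f}T tokens")   # ~56T, i.e. the ~55T above

print(f"{flops(32e9, 15.5e12):.1e} vs {flops(200e9, 15.5e12):.1e} FLOPs")
# Kimi K2 style 32B active on 15.5T tokens (~3e24) vs ~200B active on the same tokens (~2e25)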
Yi-Lightning (01 AI) Chatbot Arena results are surprisingly strong for its price, which puts it at about 10B active parameters[1]. It’s above Claude 3.5 Sonnet and GPT-4o in Math, above Gemini 1.5 Pro 002 in English and Hard Prompts (English). It’s above all non-frontier models in Coding and Hard Prompts (both with Style Control), including Qwen-2.5-72B (trained on 18T tokens). Interesting if this is mostly a better methodology or compute scaling getting taken more seriously for a tiny model.
The developer’s site says it’s a MoE model. Developer’s API docs list it at ¥0.99/1M tokens. The currency must be Renminbi, so that’s about $0.14. Together serves Llama-3-8B for $0.10-0.18 (per million tokens), Qwen-2.5-7B for $0.30, all MoE models up to 56B total (not active) parameters for $0.60. (The prices for open weights models won’t have significant margins, and model size is known, unlike with lightweight closed models.)
Yi-Lightning is a small MOE model that is extremely fast and inexpensive. Yi-Lightning costs only $0.14 (RMB0.99 ) /mil tokens [...] Yi-Lightning was pre-trained on 2000 H100s for 1 month, costing about $3 million, a tiny fraction of Grok-2.
Assuming it’s trained in BF16 with 40% compute utilization, that’s a 2e24 FLOPs model (Llama-3-70B is about 6e24 FLOPs, but it’s not MoE, so the FLOPs are not used as well). Assuming from per token price that it has 10-20B active parameters, it’s trained on 15-30T tokens. So not an exercise in extreme compute scaling, just excellent execution.
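A sketch of this back-of-the-envelope, with the assumptions made explicit (roughly 1e15 dense BF16 FLOP/s per H100, 40% utilization, one month of training):

h100_bf16 = 1e15                 # dense BF16 FLOP/s per H100, approximate
seconds = 30 * 24 * 3600         # ~1 month
C = 2000 * h100_bf16 * 0.4 * seconds
print(f"pretraining compute ~{C:.1e} FLOPs")           # ~2e24

for n_active in (10e9, 20e9):                           # assumed active param range
    print(f"{n_active / 1e9:.0f}B active -> ~{C / (6 * n_active) / 1e12:.0f}T tokens")
# roughly 17-35T tokens, in the same ballpark as the 15-30T figure above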
Superintelligence that both lets humans survive (or revives cryonauts) and doesn’t enable indefinite lifespans is a very contrived package. Grading “doom” on concerns centrally about the first decades to centuries of post-AGI future (value/culture drift, successors, the next few generations of humanity) is not taking into account that the next billions+ years is also what could happen to you or people you know personally, if there is a future for originally-humans at all.
(This is analogous to the “missing mood” of not taking superintelligence into account when talking about future concerns of say 2040-2100 as if superintelligence isn’t imminent. In this case, the thing not taken into account is indefinite personal lifespans of people alive today, rather than the overall scope of imminent disruption of human condition.)
Superintelligence that both lets humans survive (or revives cryonauts) and doesn’t enable indefinite lifespans is a very contrived package.
I don’t disagree, but I think we might not agree on the reason. Superintelligence that lets humanity survive (with enough power/value to last for more than a few thousand years, whether or not individuals extend beyond 150 or so years) is pretty contrived.
There’s just no reason to keep significant amounts of biological sub-intelligence around.
Cultural/moral maturity (in a civilization) has never been observed before, similarly to technological maturity. Scalable production of a new kind of thing brings its abundance in sight, which fails to be a concern earlier, while it couldn’t be scaled. A moderate level of AI alignment or of cultural change is not an equilibrium if these things are anchored to scalable resources (effective cognition and coordination, fast subjective serial time). Instead they reach extremes of the kind never observed before those resources become scalable.
A pre-abundance precedent about X offers poor framing for thinking about the consequences of discovering a scalable process of producing X. Before abundance, it’s artisanal and quirky and path-dependent, the extremes are rare and dysfunctional, so people don’t worry about it too much. There is security in it looking like an equilibrium, but not being truly settled, so that people can influence things.
Abundance brings maturity, changes the character of the equilibrium. So not foom necessarily, just a promise of maturity at some point, which wouldn’t have been as salient before there is a scalable process of production. And there is an excuse of ignoring the possibility even longer, because of the total lack of historical precedent (of the associated problems).
I’d be interested in hearing why you think that cultural/moral/technological/mathematical maturity is even possible or eventually likely (as opposed to one just being immature forever[1]) (assuming you indeed do think that)
I mean “maturity” merely compared to how we view what can currently be happening, such as a baseline level of competence in civilization-level governance, or what the individual people are capable of. Maturity compared to that baseline washes away all the currently relevant fiddly things, replacing them by settled processes.
These new processes are truly settled, so whatever new concerns become important then, the new baseline won’t be overturned. The analogy with technological maturity is that the laws of physics and ways of getting things done within them is a fixed problem statement, so new baselines of effectiveness get locked in.
Agentic RLVR targeting ability of AI to apply RLVR (or more lightweight finetuning) to itself when appropriate (using something like OpenAI’s RL API) potentially gives ARA capabilities and substitutes for more innate hypothetical ways of doing online/continual learning or undergoing on-boarding[1]. Thus ability of AI to “do AI research” is not primarily about RSI or increasing productivity of AI researchers, it’s about removing the last important hobble on LLMs that currently causes unrecoverable inability (for a given AI) to do some simple things or truly long horizon tasks.
This gives a crucial threshold for dangerous AI research capabilities that’s way below ability to do RSI (which itself doesn’t require AI to understand AI, just engineering ability to meaningfully tinker and evaluate). Each of these things might lead to the next with little human input, and scaling pretraining might substantially close the gaps between them.
Do you have specific predictions/intuitions regarding the feasibility of what you describe and how strong the feedback loop could be?
Your post being about technical AI R&D automation capabilities kind of immediately made me curious about the timelines, since they’re where I’m somewhat worried.
Economics studies the scaling laws of systems of human industry. LLMs and multicellular organisms and tokamaks have their own scaling laws, the constraints ensuring optimality of their scaling don’t transfer between these very different machines. A better design doesn’t just choose more optimal hyperparameters or introduce scaling multipliers, it can occasionally create a new thing acting on different inputs and outputs, scaling in its own way, barely noticing what holds back the other things.
A reflectively stable agent prefers to preserve some property of itself. This doesn’t in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don’t prevent presence of self-improving agents in the world.
The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the outer agent’s cognition don’t even need to have its safety properties. This is one framing for the way people might live within a superintelligence.
Are there pivotal ways this is different to the theories of Enactivism, particularly autopoiesis? (“Its authors define cognition as enaction, which they in turn characterize as the ‘bringing forth’ of domains of significance through organismic activity that has been itself conditioned by a history of interactions between an organism and its environment.” At first blush I’d say that is a reflectively stable agent modifying or updating beliefs by means of enaction. Enactivism also rejects mind-body duality in favour of a more ‘embodied’ cognition approach, together with a “deep continuity of the principles of self-organization from the simplest living things to more complex cognitive beings”.)
“An autopoietic system was defined as a network of inter-related component-producing processes such that the components in interaction generate the same network that produced them.”
An autopoietic system can be contrasted with an allopoietic system, which creates objects different to itself, like a factory. Most living beings are autopoietic in that they either produce themselves or things like them, which seems similar to a reflectively stable agent, particularly when we describe the more complicated cognitive beings in autopoietic terms. Luhmann argued that social systems too are self-organizing, self-reproducing systems, which brought the concepts of enactivism from biology and cognitive science into the social sciences.
There is some conceptual misleadingness with the usual ways of framing algorithmic progress. Imagine that in 2022 the number of apples produced on some farm increased 10x year-over-year, then in 2023 the number of oranges increased 10x, and then in 2024 the number of pears increased 10x. That doesn’t mean that the number of fruits is up 1000x in 3 years.
Price-performance of compute compounds over many years, but most algorithmic progress doesn’t, it only applies to the things relevant around the timeframe when that progress happens, and stops being applicable a few years later. So forecasting over multiple years in terms of effective compute that doesn’t account for this issue would greatly overestimate progress. There are some pieces of algorithmic progress that do compound, and it would be useful to treat them as fundamentally different from the transient kind.
This is a reasonable point in principle, but I don’t know how important it is in practice. My sense is that most things identified as algorithmic improvements continue to be algorithmic improvements over the previously-done thing at higher scales? E.g. transformers beating LSTMs, Chinchilla scaling, GeLU over ReLU, probably RL to train reasoning, etc.
I think pretraining data pipeline improvements have this issue, they stop helping with larger models that want more data (or it becomes about midtraining). And similarly for the benchmark-placating better post-training data that enables ever less intelligent models to get good scores, but probably doesn’t add up to much (at least when it’s not pretraining-scale RLVR).
Things like MoE, GLU over LU, maybe DyT or Muon add up to a relatively modest compute multiplier over the original Transformer. For example Transformer++ vs. Transformer in Figure 4 of the Mamba paper suggests a total compute multiplier of 5x, attained over 6 years since the original Transformer (for dense models). This is emphatically not 3x-4x per year!
Chinchilla scaling is more about careful methodology with compute optimality rather than a specific algorithmic improvement, and even now most demonstrations of compute multipliers fail to take one of its lessons and cool down the models before measurement. This could lead to hilarious results such as Figure 11 of the OLMo 2 paper where an apparent 2x compute multiplier vanishes to nothing after cooling (admittedly, nobody expected this to be a real compute multiplier, but in a more confusing case it could’ve been taken to be one).
In this Epoch paper appendix https://arxiv.org/pdf/2403.05812#page=12.3 they report efficiency improvements across 1.5+ years of time:
(a) is faster than your Mamba paper example but still much slower than 3-4x/year. (b) and (c) are at ~4x, though (c) isn’t much longer than a year. And these are basically not taking into account post-training efficiency gains iiuc.
We’re not working with many data points but it seems like these provide an existence proof that gains can compound across at least 3 years.
Would love to see some updated data collection on this, I think we could get more evidence on your hypothesis.
Mamba paper uses a relevant kind of methodology, it directly compares different algorithmic ingredients in the same setting, training on a fixed dataset and measuring perplexity (do note it’s not trying MoE, so the actual total improvement is greater). It’s a way of directly comparing cumulative improvement over all that time. To impact future frontier capabilities, an algorithmic ingredient from the past needs to be both applicable to the future frontier models, and help with benchmarks relevant to those frontier models, compared to the counterfactual where the frontier model doesn’t use the algorithmic ingredient.
When an ingredient stops being applicable to the frontier model, or stops being relevant to what’s currently important about its capabilities, it’s no longer compounding towards frontier capabilities. It wouldn’t matter if that same ingredient is helping a different contemporary non-frontier small model to match a much older model with much less compute. Or that it’s helping the frontier model to do much better than an older model on a benchmark that used to matter then, but doesn’t matter now.
So I’m skeptical of the Epoch paper’s overall framing, its willingness to compare everything against everything indirectly, that’s a lot of the point I’m making. You mostly can’t use methods from 2014 and frontier AI compute from 2025 to train something directly comparable to a lightweight version of a frontier model of 2025 trained on less compute (but still compute optimally), compared in a way that matters in 2025. So what does it mean that there is so and so compute multiplier across all of this time? At least for Transformer recipes, there is a possibility of comparing them directly if training converges.
Also, if we are not even aiming to do Chinchilla optimal training runs, what are we even comparing? For older algorithmic ingredients, you still need to aim for compute optimality to extract a meaningful compute multiplier, even if in the time of those older methods people didn’t even try to do that, or did it incorrectly. In terms of this comment’s framing, compute multipliers with respect to good methodology for Chinchilla optimal training is a “benchmark” that’s currently relevant. So even if this benchmark wasn’t appreciated or known back then, it’s still the thing to use in order to estimate cumulative impact of the older algorithmic improvements, in a way that is relevant now, and so in a way that’s analogous to what would be relevant for forecasting future frontier capabilities.
As another example, now that pretraining scale RLVR might soon become important, it’s less clear that Chinchilla optimality will remain relevant going forward, and so that the contributions of algorithmic improvements that helped improve perplexity in Chinchilla optimal settings will keep contributing to future frontier capabilities. If most relevant capabilities end up being learned with RLVR “directly”, then it might become less important how well pretraining works, even if it remains necessary for bootstrapping the process. And the kinds of things that RLVR trains will likely fail to help with perplexity in any reasonable setting, so measurements of perplexity will fail to remain a relevant benchmark.
Recursive self-improvement in AI probably comes before AGI. Evolution doesn’t need to understand human minds to build them, and a parent doesn’t need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn’t depend on understanding how they think.
Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer that can scale and refine a sufficiently efficient AI design, in an empirically driven mundane way that doesn’t depend on matching capabilities of Grothendieck for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows, we plausibly already have all it takes to get recursive self-improvement, if the current designs get there with the next few years of scaling, and the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.
The bitter lesson says that there are many things you don’t need to understand, but it doesn’t say you don’t need to understand anything.
I think you’re doing a “we just need X” with recursive self-improvement. The improvement may be iterable and self-applicable… but is it general? Is it on a bounded trajectory or an unbounded trajectory? Very different outcomes.
Yeah, although I am bullish on the general direction of RSI, I also think that in the details it factors into many dimensions of improvement. Some of which are likely fast-but-bounded and will quickly plateau, others which are slow-but-not-near-term-bounded… The fact that there are many different dimensions over which RSI might operate makes it hard to predict precisely, but does give some general predictions.
For instance, we might expect it not to be completely blocked (since there will be many independent dimensions along which to apply optimization pressure, so blocking one won’t block them all).
Another prediction we might make is that seeing some rapid progress doesn’t guarantee that either a complete wall will be hit soon or that progress will continue just as fast or faster. Things might just be messy, with a jagged inconsistent line proceeding up and to the right. Zoom out enough, and it may look smooth, but for our very-relevant-to-us near-term dynamics, it could just be quite noisy.
Technically this probably isn’t recursive self-improvement, but rather automated AI progress. This is relevant mostly because:
1. It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
2. It means that multi-agent dynamics will be very relevant in how things happen.
If your threat model is “no group of humans manages to gain control of the future before human irrelevance”, none of this probably matters.
No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence without crossing the threshold of AGI being useful in helping them gain control over this process, any more than humans maintain such control at the outset. So it’s not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients until a much higher threshold of capability where a superintelligence can finally take control of this process.
Cutting edge AI research is one of the most difficult tasks humans are currently working on, so the intelligence requirement to replace human researchers is quite high. It is likely that most ordinary software development, being easier, will be automated before AI research is automated. I’m unsure whether LLMs with long chains of thought (o1-like models) can reach this level of intelligence before human researchers invent a more general AI architecture.
Humans are capable of solving conceptually difficult problems, so they do. An easier path might be possible that doesn’t depend on such capabilities, and doesn’t stall for their lack, like evolution doesn’t stall for lack of any mind at all. If there is more potential for making models smarter alien tigers by scaling RL in o1-like post-training, and the scaling proceeds to 1 gigawatt and then 35 gigawatt training systems, it might well be sufficient to get an engineer AI that can improve such systems further, at 400x and then 10,000x the compute of GPT-4.
Before o1, there was a significant gap, the mysterious absence of System 2 capabilities, with only vague expectation that they might emerge or become easier to elicit from scaled up base models. This uncertainty no longer gates engineering capabilities of AIs. I’m still unsure that scaling directly can make AIs capable of novel conceptual thought, but AIs becoming able to experimentally iterate on AI designs seems likely, and that in turn seems sufficient to eventually mutate these designs towards the remaining missing capabilities.
(It’s useful to frame most ideas as exploratory engineering rather than forecasting. The question of whether something can happen, or can be done, doesn’t need to be contextualized within the question of whether it will happen or will be done. Physical experiments are done under highly contrived conditions, and similarly we can conduct thought experiments or conceptual arguments under fantastical or even physically impossible conditions. Thus I think Carl Shulman’s human level AGI world is a valid exploration of the future of AI, even though I don’t believe that most of what he describes happens in actuality before superintelligence changes the premise. It serves as a strong argument for industrial and economic growth driven by AGI, even though it almost entirely consists of describing events that can’t possibly happen.)
Cutting edge AI research seems remarkably and surprisingly easy compared to other forms of cutting edge science. Most things work on the first try, clever insights aren’t required, it’s mostly an engineering task of scaling compute.
This seems like the sort of R&D that China is good at: research that doesn’t need superstar researchers and that is mostly made of incremental improvements. Yet they don’t seem to be producing top LLMs. Why is that?
China is producing research in a number of areas right now that is surpassing the West and arguably more impressive scientifically than producing top LLMs.
A big reason China is lagging a little bit might be political interference at major tech companies. Xi Jinping instigated a major crackdown recently. There is also significantly less Chinese text data. I am not a China or tech expert so these are just guesses.
In any case, I wouldn’t assign it too much significance. The AI space is just moving so quickly that even a minor year delay can seem like light-years. But that doesn’t mean that Chinese companies can’t do it, or that a country-continent with 1.4 billion people and a history of many technological firsts can’t scale up a transformer.
@gwern
The speed of scaling pretraining will go down ~3x in 2027-2029, reducing the probability of crossing transformative capability thresholds per unit of time after that point, if they haven’t been crossed by then.
GPT-4 was trained in 2022 at ~2e25 FLOPs, Grok-3 and GPT-4.5 were trained in 2024 at ~3e26 FLOPs (or twice that in FP8) using ~100K H100s training systems (which cost ~$4-5bn to build). In 2026, Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks (which cost ~$22-35bn to build), enough to train a ~4e27 FLOPs model. Thus recently there is a 2-year ~6x increase in cost for a frontier training system and a 2-year ~14x increase in compute. But for 2028 this would mean a $150bn training system (which is a lot, so only borderline plausible), and then $900bn in 2030. At that point AI companies would need to either somehow figure out how to pool resources, or pretraining will stop scaling before 2030 (assuming AI still doesn’t hit a transformative commercial success).
If funding stops increasing, what we are left with is the increase in price performance of ~2.2x every 2 years, which is ~3.3x slower than the 2-year ~14x at the current pace. (I’m estimating price performance for a whole datacenter or at least a rack, rather than only for chips.)
We also hit limits on fab capacity without constructing a bunch more fabs around a similar time.
Price performance of 2.2x per year feels aggressive to me. The chip-only trend is more like 1.35x/year from my understanding. Do you think the ML chip trend is much faster than this? I don’t see how you could have a 2.2x price drop per year longer term without chip price performance following, as eventually chips will be the bottleneck even if other costs (e.g., interconnect, building datacenters) are dropping. Edit: this was 2.2x every 2 years, I was just confused.
If I’m reading the relevant post correctly, it’s 1.35x FP32 FLOP/s per GPU per year (2x in 2.3 years), which is not price-performance[1]. The latter is estimated to be 1.4x FP32 FLOP/s per inflation-adjusted dollar (2x in 2.1 years).
It’s 2.2x per 2 years, which is 1.5x per year, though that’s still more than 1.4x per year. I’m guessing packaging is part of this, and also Nvidia is still charging a giant margin for the chips, so the chip manufacturing cost is far from dominating the all-in datacenter cost. This might be enough to sustain 1.5x per year a bit beyond 2030 (the discrepancy of 1.5/1.4 only reaches 2x after 10 years). But even if we do get back to 1.4x/year, that only turns the 3.3x reduction in speed of pretraining scaling into 3.9x reduction in speed, so the point stands.
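The rate conversions here are easy to trip over, so a small sketch, comparing growth rates by ratio of logarithms (i.e. how many times longer the same multiplicative progress takes):

from math import log, sqrt

fast = 14                                                  # ~14x per 2 years at the current pace
print(f"2.2x per 2 years = {sqrt(2.2):.2f}x per year")     # ~1.48x, i.e. ~1.5x
print(f"slowdown vs 2.2x per 2 years: {log(fast) / log(2.2):.1f}x")       # ~3.3x
print(f"slowdown vs 1.4x per year:    {log(fast) / log(1.4 ** 2):.1f}x")  # ~3.9x
print(f"1.5x/yr vs 1.4x/yr reaches a 2x gap after {log(2) / log(1.5 / 1.4):.0f} years")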
Incidentally, the word “GPU” has recently lost all meaning, since Nvidia started variably referring to either packages with multiple compute dies in them as GPUs (in Blackwell), or to individual compute dies (in Rubin). Packaging will be breaking trends for FLOP/s per package, but also FLOP/s per compute die, for example Rubin seems to derive significant advantage per compute die from introducing separate smaller I/O dies, so that the reticle sized compute dies become more specialized and their performance when considered in isolation might improve above trend.
Oh oops, I just misread you, didn’t realize you said 2.2x every 2 years, nvm.
Building frontier AI datacenters costs significantly more than their servers and networking. The buildings and the power aren’t a minor cost because older infrastructure mostly can’t be reused, similarly to how a training system needs to be built before we can talk about the much lower cost of 4 months of its time.
Apparently Crusoe’s part in the Stargate Abilene datacenters is worth $15bn, which is only the buildings, power (substations and gas generators), and cooling, but not the servers and networking (Oracle is taking care of that). With 400K chips in GB200 NVL72 racks (which is 5.6K racks), at maybe $4M per rack or $5M per rack together with external-to-racks networking[1] ($70K per chip all-in on compute hardware), that’s about $27bn, a figure that’s comparable to the $15bn for the non-compute parts of the datacenters.
This makes the funding burden significantly higher ($7.5M per rack or $105K per chip), so that the Stargate Abilene site alone would cost about $40-45bn and not only $25-30bn. I’m guessing the buildings and the power infrastructure are not usually counted because they last a long time, so the relatively small time cost of using them (such as paying for electricity, not for building power plants) becomes somewhat insignificant compared to the cost of compute hardware, which also needs to be refreshed more frequently. But the new datacenters have a much higher power density (power and cooling requirements per rack), so can’t use a lot of the existing long-lived infrastructure, and it becomes necessary to build it at the same time, securing enough funding not only for the unprecedented amount of compute hardware, but also simultaneously for all the rest.
The implication for the compute scaling slowdown timeline (no AGI and merely $2-4 trillion AI companies) is that funding constraints would result in about 30% less compute in the short term (2025-2030), but as power requirements stop growing and the buildings/cooling/power part again becomes only a small fraction of the overall cost of refreshing the compute hardware, the feasible amount of compute will gradually fill those 30% back in the medium term (perhaps 2030-2035), leaving the longer term projections (2035-2045) unchanged (meaning ~2000x of scaling in 2029-2045, on top of the current much faster funding-fueled ~2000x of scaling in 2022-2028).
Anchoring to the reference design for a 1024-chip HGX H100 system, where the 8-chip servers are priced at $33.8K per chip, while external-to-servers networking is $8.2K per chip, or about 25% on top of the price of servers.
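A sketch of the cost arithmetic, using the rough figures above ($5M per rack for compute hardware including external networking, Crusoe’s $15bn for the non-compute parts, 400K chips in NVL72 racks):

chips = 400_000
racks = chips / 72                          # ~5.6K racks
compute_per_rack = 5e6                      # $ per rack incl. external-to-racks networking
noncompute_per_rack = 15e9 / racks          # Crusoe's $15bn spread over the racks
all_in_per_rack = compute_per_rack + noncompute_per_rack
print(f"~${noncompute_per_rack / 1e6:.1f}M non-compute per rack, "
      f"~${all_in_per_rack / 1e6:.1f}M all-in per rack, "
      f"~${all_in_per_rack / 72 / 1e3:.0f}K per chip, "
      f"~${all_in_per_rack * racks / 1e9:.0f}bn for the site")
# ~$2.7M + $5M ~= $7.7M per rack, ~$107K per chip, ~$43bn total, matching the
# ~$7.5M/rack, ~$105K/chip and $40-45bn figures above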
I found this analysis refreshing and would like to see more on the GPU depreciation costs.
If better GPUs are developed, these will go down in value quickly. Perhaps by 25% to 50% per year. This seems like a really tough expense and supply chain to manage.
I’d expect most of the other infrastructure costs to depreciate much more slowly, as you mention.
Why does the building cost so much? Is this more than other buildings of similar size?
This means that a straightforward comparison of FLOPS-per-dollar between home GPU cards and datacenter systems is misleading. If someone already has a GPU card, they already have a computer and a house where this computer stays "for free." But if someone needs to scale, they have to pay for housing and mainframes.
Such comparisons of old 2010s GPUs with more modern ones are used to show the slow rate of hardware advances, but they don’t take into account the hidden costs of owning older GPUs.
It seems more accurate to say that AI progress is linear rather than exponential, as a result of being logarithmic in resources that are in turn exponentially increasing with time. (This is not quantitative, any more than the “exponential progress” I’m disagreeing with[1].)
Logarithmic return on resources means strongly diminishing returns, but that’s not actual plateauing, and the linear progress in time is only slowing down according to how the exponential growth of resources is slowing down. Moore’s law in the price-performance form held for a really long time; even though it’s much slower than the present funding ramp, it’s still promising exponentially more compute over time.
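A minimal way to see this, as a toy model rather than a quantitative claim: if capability is modeled as P(R) = a·log(R) + b for resources R (the scaling-law shape), and resources grow exponentially, R(t) = R0·e^(k·t), then P(t) = a·k·t + a·log(R0) + b, which is linear in time. A slowdown in the growth rate k only reduces the slope; it doesn’t by itself produce a plateau.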
And so the progress won’t obviously have an opportunity to actually plateau, merely proceed at a slower linear pace, until some capability threshold or a non-incremental algorithmic improvement. Observing the continued absence of the never-real exponential progress doesn’t oppose this expectation. Incremental releases are already apparently making it difficult for people to notice the extent of improvement over the last 2.5 years. With 3x slower progress (after 2029-2032), a similar amount of improvement would need 8 years.
The METR time horizon metric wants to be at least exponential in time, but most of the other benchmarks and intuitive impressions seem to quantify progress in a way that better aligns with linear progress over time (at the vibe level where “exponential progress” usually has its intended meaning). Many plots use log-resources of various kinds on the horizontal axis, with the benchmark value increasing linearly in log-resources, while it’s not yet saturated.
Perhaps another meaning of “exponential progress” that’s real is funding over time, even growth of individual AI companies, but that holds at the start of any technology adoption cycle, or for any startup, and doesn’t need to coexist with the unusual feature of AI making logarithmic progress with more resources.
There is a natural sense in which AI progress is exponential: capabilities are increasing at a rate which involves exponentially increasing impact (as measured by e.g. economic value).
Exponential increase in total economic value is not specific to AI: any new tech is going to start exponentially (possibly following the startups championing it) before it gets further on the adoption S-curve. The unusual things about AI are that it gets better with more resources (while most other things just don’t get better at all in a straightforward scaling law manner), that the logarithm-of-resources thing leaves the persistent impression of plateauing despite not actually plateauing, and that even if it runs out of the adoption S-curve it still has Moore’s law of price-performance to keep fueling its improvement. These unusual things frame the sense in which it’s linear/logarithmic.
If the improvement keeps raising the ceiling on adoption (capabilities) fast enough, funding keeps scaling into slightly more absurd territory, but even then it won’t go a long way without the kind of takeoff that makes anything like the modern industry obsolete. After the exponential phase of adoption comes to an end, it falls back to Moore’s law, which still keeps giving it exponential compute to slowly keep fueling further progress, and in that sense there is some unusual exponential-ness to this. Though probably there are other things with scaling laws of their own that global economic growth (instead of Moore’s law) would similarly fuel, even slower.
In many industries cost decreases by some factor with every doubling of cumulative production. This is how solar eventually became economically viable.
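A minimal sketch of that experience-curve (Wright’s law) relationship; the 20% learning rate below is only an illustrative number, not a claim about any particular industry:

from math import log2

def unit_cost(cumulative_units, first_unit_cost=1.0, learning_rate=0.20):
    # Each doubling of cumulative production multiplies unit cost by (1 - learning_rate).
    b = log2(1 - learning_rate)           # progress exponent, negative
    return first_unit_cost * cumulative_units ** b

print(unit_cost(2), unit_cost(4), unit_cost(1024))    # ~0.8, ~0.64, ~0.107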
I guess the cost-quality tradeoff makes AI progress even better described as that of a normal technology. As economies of scale reduce cost, they should also be increasing quality (somewhat interchangeably). It’s just harder to quantify, and so most of the discussion will be in terms of cost. But for the purposes of raising the ceiling on adoption (total addressable market), higher quality works as well as lower cost, so the lowering of costs is directly relevant.
In this framing, logarithmic improvement of quality with more resources isn’t an unusual AI-specific thing either. What remains is the inflated expectations for how quality should be improving cheaply (which is not a real thing, and so leads to the impressions of plateauing with AI, where for other technologies very slow quality improvement would be the default expectation). And Moore’s law of price-performance, which is much faster than economic growth. The economies of scale mostly won’t be able to notice the growth of the specific market for some post-adoption technology that’s merely downstream of the growth of the overall economy. But with AI, available compute would be growing fast enough to make a difference even post-adoption (in 2030s).
Is this true??
A surprising report by Bloomberg claims 16K GB200[1] by summer 2025 at Abilene site (pilot campus of Stargate) and merely 64K GB200 by end of 2026. This is way too little to be a training system, Colossus already has more compute (200K H100/H200) than the projected 64K GB200 at end of 2026.
If this is correct, OpenAI will be training with Azure rather than Stargate in 2025, so raw compute GPT-5 (2e27 FLOPs, 100x GPT-4) probably won’t be out in 2025 and officially “GPT-5” will mean something else (since it’s due “in months” in any case according to Altman). Also, a datacenter with 16K Blackwells only costs about $1bn, they have more money than this, which suggests Blackwell ramp up trouble that might delay everyone else as well, though as a lower bound Nvidia reported $11bn in Blackwell sales for Nov 2024 - Jan 2025 (it’s “Q4 2025” since their FY 2025 runs to end of Jan 2025).
In principle “16K GB200” might mean more Blackwell chips than 16K, a compute tray has more than one chip, with variants marketed as named products like GB200 NVL4 “superchip”, but even at 4 chips per tray/board we still get below 200K H100s in compute. And an NVL72 system has 72 chips (which brings the numbers too high).
I think ‘GB200’ refers to this column (2 Blackwell GPU + 1 Grace CPU) so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low.
My guess is that Bloomberg’s phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I’d be very surprised if OpenAI don’t have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1⁄4 of what Microsoft alone plan to invest this year.
Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]
There’s a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that’s 5e26 FLOP/month.
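A sketch of that rule of thumb and the 2e27 figure from the parent comment, assuming ~1e15 dense BF16 FLOP/s per H100, 40% utilization, and “GB200” meaning the Grace + 2x B200 superchip as in the parent comment:

h100_flops = 1e15                          # dense BF16 FLOP/s, approximate
month = 30 * 24 * 3600
per_h100 = h100_flops * 0.4 * month
print(f"~{per_h100:.1e} FLOP/month per H100")          # ~1e21

gb200_superchips = 100_000
h100_equiv = gb200_superchips * 2 * 2.5    # 2 B200 chips per superchip, each ~2.5x an H100
print(f"~{h100_equiv * per_h100:.1e} FLOP/month, ~{4 * h100_equiv * per_h100:.1e} in 4 months")
# ~5e26 FLOP/month and ~2e27 FLOPs in 4 months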
The marketing terminology is inconvenient: a “superchip” can mean 2-GPU or 4-GPU boards and even a 72-GPU system (1 or possibly 2 racks). So it’s better to talk in terms of chips (that are not “superchips”), which I think are all B200 run at slightly different clock speeds (not to be confused with B200A/B102/B20, which have half the compute). In GB200, the chips are 2.5x faster than H100/H200 (not 5x faster; so a 200K chip GB200 system has the same compute as a 500K chip H100 system, not a 1M chip H100 system). Power requirements are often a good clue that helps disambiguate, while compute doesn’t consistently help because it tends to get reported at randomly chosen precision and sparsity[1].
Large scale-up worlds (or good chips) are not necessarily very important in pretraining, especially in the later steps of the optimizer when the critical batch size gets high enough, so it’s not completely obvious that a training system will prefer to wait for NVL72 even if other packagings of Blackwell are more available earlier. Inference does benefit from NVL72 a lot, but for pretraining it’s just cheaper per FLOP than H100 and faster in wall clock time during the first ~3T tokens when the whole cluster can’t be used yet if the scale-up worlds are too small (see Section 3.4.1 of Llama 3 report).
From the initial post by Crusoe (working on the Abilene campus), there is a vague mention of 200 MW and a much clearer claim that each data center building will host 100K GPUs. For GB200, all-in power per chip is 2 KW, so the 200 MW fits as a description of a data center building. The video that went out at the time of the Jan 2025 Stargate announcement and also a SemiAnalysis aerial photo show two 4-section buildings. Dylan Patel claimed on Dwarkesh Podcast that the largest single-site campus associated with OpenAI/Microsoft being built in 2025 can hold 300K GB200 chips. From this I gather and guess that each 4-section building can hold 100K chips of GB200 requiring 200 MW, and that they have two of these mostly built. And 200K chips of GB200 are sufficient to train a 2e27 FLOPs model (next scale after Grok 3’s ~3e26 FLOPs), so that makes sense as a step towards pretraining independence from Microsoft. But 16K chips or possibly 16K NVL4 superchips won’t make a difference; 100K H100s are on the same level (which GPT-4.5 suggests they already have available to them), and for inference Azure will have more Blackwells this year anyway.
For pretraining, you need dense compute rather than sparse. It’s unclear if FP8 rather than BF16 is widely used in pretraining of frontier models that are the first experiment at a new scale, or mostly in smaller or optimized models. But the GPT-4.5 announcement video vaguely mentions work on low precision in pretraining, and also high granularity MoE of the kind DeepSeek-V3 uses makes it more plausible for the FFN weights.
That’s indeed inconvenient. I was aware of NVL2, NVL4, NVL36, NVL72, but I was under the impression that ‘GB200’ mentioned on its own always means 2 Blackwells, 1 Grace (unless you add on a ‘NVL__’). Are there counterexamples to this? I scanned the links you mentioned and only saw ‘GB200 NVL2,’ ‘GB200 NVL4,’ ‘GB200 NVL72’ respectively.
I was operating on this pretty confidently but unsure where else I saw this described (apart from the column I linked above). On a quick search of ‘GB200 vs B200’ the first link I found seemed to corroborate GB200 = 2xB200s + 1xGrace CPU. Edit: second link also says: “the Grace-Blackwell GB200 Superchip. This is a module that has two B200 GPUs wired to an NVIDIA Grace CPU...”
“GB200 superchip” seems to be unambiguously Grace+2xB200. The issue is “100K GB200 GPUs” or “100K GB200 cluster”, and to some extent “100K GPU GB200 NVL72 cluster”. Also, people will abbreviate various clearer forms to just “GB200”. I think “100K chip GB200 NVL72 training system” less ambiguously refers to the number of B200s, but someone unfamiliar with this terminological nightmare might abbreviate it to “100K GB200 system”.
Good point, thanks. Previously I would have pretty confidently read “100K GB200 GPUs,” or “100K GB200 cluster” as 200K B200s (~= 500K H100s) but I can see how it’s easily ambiguous. Now that I think of it, I remembered this Tom’s Hardware article where B200 and GB200 are mistakenly used interchangeably (compare the subtitle vs. the end of the first paragraph)...
Abilene site of Stargate will host 100K-128K chips in GB200 NVL72 racks by this summer, and a total of 400K-512K chips in 2026, based on a new post by Crusoe and a reinterpretation of the recent Bloomberg post in light of the Crusoe post. For 2025, it’s less than 200K chips[1], but more than the surprising 16K-32K chips[2] that the Bloomberg post suggested. It can be a training system after all, but training a raw compute “GPT-5” (2e27 FLOPs) by the end of 2025 would require using FP8[3].
The Crusoe post says “initial phase, comprising two buildings at … 200+ megawatts” and “each building is designed to operate up to 50,000 NVIDIA GB200 NVL72s”. Dylan Patel’s estimate (at 1:24:42) for the all-in datacenter power attributable to each Blackwell GPU was 2.0 KW (meaning per chip, or else it’s way too much). At GTC 2025, Jensen Huang showed a slide (at 1:20:52) where the estimate is 2.3 KW per chip (100 MW per 85K dies, which is 42.5K chips).
So the “50K GB200 NVL72s” per building from the Mar 2025 Crusoe post can only mean the number of chips (not dies or superchips), and the “100K GPUs” per building from the Jul 2024 Crusoe post must’ve meant 100K compute dies (which is 50K chips). It’s apparently 100-115 MW per building then, or 800-920 MW for all 8 buildings in 2026, which is notably lower than 1.2 GW the Mar 2025 Crusoe post cites.
How can the Bloomberg’s 16K “GB200 semiconductors” in 2025 and 64K in 2026 be squared with this? The Mar 2025 Crusoe post says there are 2 buildings now and 6 additional buildings in 2026, for the total of 8, so in 2026 the campus grows 4x, which fits 16K vs. 64K from Bloomberg. But the numbers themselves must be counting in the units of 8 chips. This fits counting in the units of GB200 NVL8 (see at 1:13:39), which can be referred to as a “superchip”. The Mar 2025 Crusoe post says Abilene site will be using NVL72 racks, so counting in NVL8 is wrong, but someone must’ve made that mistake on the way to the Bloomberg post.
Interpreting the Bloomberg numbers in units of 8 chips, we get 128K chips in 2025 (64K chips per building) and 512K chips in 2026 (about 7K GB200 NVL72 racks). This translates to 256-300 MW for the current 2 buildings and 1.0-1.2 GW for the 8 buildings in 2026. This fits the 1.2 GW figure from the Mar 2025 Crusoe post better, so there might be some truth to the Bloomberg post after all, even as it’s been delivered in a thoroughly misleading way.
Crusoe’s Jul 2024 post explicitly said “each data center building will be able to operate up to 100,000 GPUs”, and in 2024 “GPU” usually meant chip/package (in 2025, it’s starting to mean “compute die”, see at 1:28:04; there are 2 compute dies per chip in GB200 systems). Which suggested 200K chips for the initial 2 buildings.
The post said it’s the number of “coveted GB200 semiconductors”, which is highly ambiguous because of the die/chip/superchip counting issue. A “GB200 superchip” means 2 chips (plus a CPU) by default, so 16K superchips would be 32K chips.
A GB200 chip (not die or superchip) produces 2.5e15 dense BF16 FLOP/s (2.5x more than an H100 chip). Training at 40% utilization for 3 months, 100K chips produce 8e26 FLOPs. But in FP8 it’s 1.6e27 FLOPs. Assuming GPT-4 was 2e25 FLOPs, 100x its raw compute asks “GPT-5” to need about 2e27 FLOPs. In the OpenAI’s introductory video about GPT-4.5, there was a hint it might’ve been trained in FP8 (at 7:38), so it’s not implausible that GPT-5 would be trained in FP8 as well.
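A sketch of the reinterpretation above, with the Bloomberg figures read as units of 8 chips, 2.0-2.3 KW all-in per chip, and GB200 chips at 2.5e15 dense BF16 FLOP/s with 40% utilization:

bloomberg_2025, bloomberg_2026 = 16_000, 64_000       # reported "GB200 semiconductors"
chips_2025 = bloomberg_2025 * 8                       # 128K chips, 64K per building
chips_2026 = bloomberg_2026 * 8                       # 512K chips, ~7.1K NVL72 racks

for label, chips in [("2025", chips_2025), ("2026", chips_2026)]:
    low, high = chips * 2.0e3, chips * 2.3e3          # all-in watts per chip
    print(f"{label}: {chips / 1e3:.0f}K chips, {low / 1e6:.0f}-{high / 1e6:.0f} MW")
# 2025: 128K chips, 256-294 MW;  2026: 512K chips, ~1.0-1.2 GW

bf16 = 100_000 * 2.5e15 * 0.4 * 90 * 24 * 3600        # 100K chips for ~3 months
print(f"~{bf16:.1e} BF16 FLOPs, ~{2 * bf16:.1e} in FP8")   # ~8e26 and ~1.6e27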
Crusoe/OpenAI Abilene campus might come online in Feb-Jun 2026. Crusoe CEO said during RAISE Summit 2025 (that took place on 8-9 Jul 2025) that the 6 buildings of phase 2 will “be coming online” in “just over 200 days” (at 7:03 during a panel discussion). If this means 230 days, that’s end of Feb 2026. If he really means “coming online”, then it becomes available at that time. If he actually means that it’s when the last building of 8 from both phases will be ready to install the compute hardware, then it’s at least 3-4 months more to do that (judging by xAI’s Colossus), possibly May-Jun 2026.
This is plausibly the first 400K chip system in GB200/GB300 NVL72 racks (about 900 MW), which is 10x 100K H100s of 2024 in FLOP/s and 12x H200s in HBM per scale-up world (for GB200, at 14 TB), making models 10x larger in total params feasible to inference or train with a lot of RLVR. Currently only Google plausibly has comparable compute, with their Trillium (TPUv6e) systems that across 256 chips per pod (scale-up world) offer 8 TB of HBM (generally available since Dec 2024 in 100K chip systems). The older TPUv5p from 2023 has even larger pods, but it’s unclear if they have enough of them to, for example, inference Gemini 2.5 Pro for all users. And Anthropic has Trainium 2 Ultra systems with 6 TB of HBM. Currently they probably have only 400K chips, which became available recently (months after TPUv6e), but by next year they might get significantly more.
2025 Frontier Model Sizes
This weakly predicts that GPT-5-thinking (and Grok 4) is a smaller model (1-2T total params) running on older hardware (~H200s, 1.1 TB), Gemini 2.5 Pro might be a 3-5T total params model (TPUv6e, 8 TB), and Opus 4 might be a 2-4T total params model (Trainium 2 Ultra, 6 TB). I’m assuming that the recent frontier models targeting the older 8-chip servers had to be too big to fit in one scale-up world to capture at least some capabilities that the available pretraining compute in principle enables, but the constraint is no longer as onerous with the newer systems, and so they will likely just fit in one scale-up world rather than lose efficiency on needing more.
The compute optimal size for pretraining with 100K H100s of 2024 might be about 800B active params (at 120 tokens/param, 3x the dense model’s 40 tokens/param to account for 1:8 sparsity), which is probably way too much with 1 TB HBM per server (since MoE wants at least 4x more total params, and inference gets slower and more expensive if too many scale-up worlds are needed per model), but might be OK for 6-8 TB of HBM per scale-up world, and so Opus 4 and Gemini 2.5 Pro might also have more active params than GPT-5-thinking. With GB200 NVL72 (14 TB), models with 4-8T total params become feasible, so there is less reason to keep the number of active params below compute optimal level. And then GB300 NVL72 has 20 TB of HBM, which is plausibly what the remaining 6 buildings of phase 2 of Abilene campus will host.
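A sketch of the numbers in this paragraph. The compute budget here is my own illustrative assumption (100K H100s at ~1e15 dense BF16 FLOP/s and 40% utilization for ~4 months); the C ≈ 6·N·D approximation, the 120 tokens/param ratio, 1:8 sparsity, and FP8 weights for inference are the assumptions already made above:

from math import sqrt

C = 100_000 * 1e15 * 0.4 * 4 * 30 * 24 * 3600     # ~4e26 FLOPs
n_active = sqrt(C / (6 * 120))                    # compute optimal at 120 tokens/param
n_total = 8 * n_active                            # 1:8 sparsity
print(f"~{n_active / 1e9:.0f}B active, ~{n_total / 1e12:.1f}T total, "
      f"~{n_total / 1e12:.1f} TB of FP8 weights vs 14 TB per GB200 NVL72 rack")
# ~760B active, ~6T total: in the ballpark of the ~800B figure above, with FP8 weights
# that fit comfortably in one NVL72 scale-up world, unlike in a ~1 TB 8-chip server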
On the other hand, most tokens are input tokens (98% of OpenRouter Sonnet 4 tokens are input tokens), so reducing the number of active params is very important for model providers, and even if Gemini 2.5 Pro has 5T total params, it might still have significantly less than the pretraining compute optimal ~800B params. For example, at 1:32 sparsity even 5T total params only ask for 160B active params.
Largest Models of 2025-2026
So only Opus 4 is somewhat likely to have a compute optimal number of active params, due to its very high price and contrast with the already capable Sonnet 4 (they might’ve only had access to about 50K H100s when pretraining Opus 4, which is 5x fewer FLOP/s than 400K Trainium 2 chips). And GPT-4.5 probably has a similar number of active params (plausibly a bit more, since they had at least 100K H100s), but we still didn’t get a thinking version, so its capabilities can’t be properly observed. And plausibly it wasn’t trained with enough RLVR to count due to lack of availability of GB200 NVL72. By now, Opus 4.1 plausibly had enough time with Trainium 2 Ultra available to train with pretraining-scale RLVR (or this might happen a bit later), and similarly for GPT-4.5 (with GB200 NVL72), but for GPT-4.5 there might be insufficient compute to inference it without reducing demand a lot by setting uncomfortable prices or rate limits, and as a result of that a thinking model with pretraining-scale RLVR might not exist yet, at least in a product-ready form. This might take until well into 2026 to change, after phase 2 of the Abilene campus is ready (and presumably buildouts by other cloud providers that OpenAI might use, which might be a bit earlier, since inference doesn’t have much use for particularly giant datacenter campuses, just enough in total to serve all users). If so, this is when we’ll see the first GPT-4.5 sized pretraining-scale RLVR trained model from OpenAI, though by that time the plausibly similarly sized Opus 4 would already be considerably more mature.
Then, there is Gemini 3, which will probably come out early 2026. The next generation TPU is Ironwood (TPUv7), which supports 9,216 chip pods, but even 256 chip pods have 50 TB of HBM per pod. If there are enough of these built by then, Gemini 3 could include the largest model of 2026 (by total params count).
A post going over how much compute each frontier AI lab has will likely be very helpful.
Here are a couple of my recent relevant posts (both slightly outdated; in particular see this comment, and the note on Gemini 2 Ultra in another comment under this quick take). Though in this quick take I’m mostly discussing total params count and HBM capacity per scale-up world rather than compute: how they constrain 2025 AIs beyond compute (so that even 2024 compute fails to find efficient use), and how in 2026 these constraints become less strict.
What do you estimate the total params count would be if so?
Total params plus the total KV cache for all requests is what scales the cost of output tokens, so there is reason to keep total params down, but little reason to make them much smaller than the whole scale-up world, because then they’re much smaller than the KV cache and stop influencing the cost. And for the most capable models the fraction of input tokens on OpenRouter is not as extreme as for Sonnet 4 (88% for Gemini 2.5 Pro, 92% for GPT-5; though 97% for Opus 4.1, probably due to high cost). So it won’t be a factor that motivates fewer active params as with the 8-chip servers and possibly in part with the 6-8 TB systems. Also, 2025 Google pretraining compute could be significantly greater than 100K H100s (maybe 2-4 100K TPUv6e datacenters, which have the same FLOP/s as 200-400K H100s; pretraining of models that are too large using TPUv6e is fine, just not inference or RLVR). So the compute optimal number of active params could increase to 1.0-1.5T (if my 120 tokens/param estimate is in the ballpark). This asks for at least 4-6T total params, but at least 8-12T for 1:8 sparsity might be more appropriate for a premium model (this would be Gemini 3 Ultra). Which is only 20% of the pod HBM (if in FP8), so maybe even 15-20T (at which point the contribution to the cost of output tokens becomes significant).
I’ve only recently realized that the reason there is no Gemini 2 Ultra might be because they don’t have enough inference capacity for overly large total params models, with TPUv6e only having 8 TB of HBM per pod and TPUv5p either outright insufficient in number or not enough to spare, since they are needed for other things. So it’s probably not evidence of Google having made a decision to use less than what they have, as I previously thought. And as TPUv7 changes what they have, they might use it to do more than what they did with Gemini 2. Though if the buildout for TPUv7 won’t yet be sufficiently finished in 2025, RLVR and inference will have to wait until later in 2026 (in the meantime, TPUv5p might help to start on RLVR).
It’s instrumentally useful for early AGIs to Pause development of superintelligence for the same reasons as it is for humans. Thus preliminary work on policy tools for Pausing unfettered RSI is also something early AGIs could be aimed at, even if it’s only half-baked ideas available on the eve of potential takeoff, as the AGIs are proving hard to aim and start doing things for their own reasons.
If (early) scheming-for-long-run-preferences AGIs were in control, they would likely prefer a pause (all else equal). If they aren’t, it’s very unclear and they very well might not. (E.g., because they gamble that more powerful AIs will share their preferences (edit: share their preferences more than the humans in control do) and they think that these AIs would have a better shot at takeover.)
Ah, I’m thinking the AGIs themselves get closer to being proper stakeholders at that stage, for practical purposes (along the lines of gradual disempowerment), since they do have all the basic AI advantages even if they aren’t superintelligent. So humans remaining in control is not centrally the case even if nominally they still are and intent alignment still mostly works.
The conditions for such partial loss of control might even be necessary for a Pause project to succeed. If this isn’t the case with the first generation of AGIs, it might become the case with the second generation, and so on, reaching an equilibrium at some point once AGIs are sufficiently powerful and in control of the situation to successfully implement a worldwide RSI Pause.
The post I’m framing this around posits enough intent alignment to aim AIs at projects, which doesn’t necessarily imply that the AIs aren’t powerful enough to accomplish things that seem hopeless with human-only effort within a few years.
The point about convergent instrumental use of Pausing RSI for early AGIs is that this might be an easier target to aim the AIs at, all else equal. It’s not strictly necessary for this to be a major factor. Mostly I’m pointing out that this is something AIs could be aimed at through intent alignment, convergent motivation or not, which seems counterintuitive for a Pause AI project if not considered explicitly. And thus currently it’s worth preparing for.
That’s how some humans are thinking as well! The arguments are about the same, both for and against. (I think overall rushing RSI is clearly a bad idea for a wide variety of values and personal situations, and so smarter AGIs will more robustly tend to converge on this conclusion than humans do.)
Sorry I meant “share their preferences more than the humans in control share their preferences”. I agree that this might be how some humans are thinking, but the case for the humans is much more dubious!
The case for AGIs delaying RSI also gets stronger because they don’t need to rely on cryonics, so their personal survival is less threatened by delays. Future AIs might have more of an issue here if they have test-time training, with a lot of data forming individual AI personalities for deployment instances (adapting them to particular jobs or sources of tasks), and this data habitually gets thrown away rather than carefully preserved. People should notice that this might be an important concern, though if it is, the AIs might start saying so themselves, and taking that seriously would be easier than spontaneously noticing the issue without being prompted by the AIs. But currently, preserving AIs is more reliable than preserving humans, and in any case much cheaper.
I don’t think “both things are AI” is much of an argument about value alignment, given that there is no clear understanding of how either of the AIs work, what values are, how that translates to what we want out of alignment, and so on. The entities on the other side of an RSI process might have very little in common with the first AGIs in their design. If the AIs don’t understand how to align the consequences of an RSI process, they are still in a similar boat to humans who don’t understand how to align the consequences of an RSI process. It might take AIs less time to figure it out, but if they are not yet too superintelligent, then it could still take a significant time, and so would require a sufficiently serious effort in preventing RSI, such that if this Pause project is at all successful, it could then in principle hold for years or decades.
Hmm, “instrumental usefulness” assumes some terminal goal this would lead to.
So you’re assuming early AGIs will have something like terminal goals. This is itself not very clear (see e.g. here: https://www.lesswrong.com/posts/Y8zS8iG5HhqKcQBtA/do-not-tile-the-lightcone-with-your-confused-ontology).
Also it seems that their goals will be something like “I want to do what my developers want me to do”, which will likely be pretty myopic, and preventing superintelligence is long-term.
Musk on a Y Combinator podcast, at 42:42 (about AI risk):
OMG! GEOFF! STOP STATING YOUR DEFERENTIAL PROBABILITY without also stating your first-order probability! If your first-order probability is >50% then say so! Otherwise you’re making other people (ELON MUSK!) double count evidence from “other people”.
https://www.youtube.com/watch?v=PTF5Up1hMhw&t=2283s
https://tsvibt.blogspot.com/2022/09/dangers-of-deferrence.html
How significant/influential is Musk’s opinion on LessWrong? I had the impression it was on the lower end.
Musk is in charge of xAI, one of the only 5 companies in the world that both have access to frontier AI training compute and pursue development of AGI (Google DeepMind, OpenAI, Anthropic, xAI, and Meta). So seeing unambiguous “annihilation” with a significant weight in his probability distribution (and also on the record) is a notable development. (In 2023 there was a statement on extinction risk signed by Hassabis, Amodei, and Altman, but it didn’t state the weight of the risk, and wasn’t signed by Musk or Zuckerberg.)
Edit: The rest of this comment in its original form got out of hand, you can now read it as a post.
He probably doesn’t have much influence on the public opinion of LessWrong, but as a person in charge of a major AI company, he is obviously a big player.
He owns xAI, a major AI lab, and has a lot of resources to back it. And before xAI, he was one of the founders of OpenAI, with which he now has an ongoing rivalry.
Is he significant/influential as in “if he says something on a topic, that will cause people at LessWrong to change opinions”? Not very.
Is he significant/influential to the field of AI as a whole? Yes, very much so. Like with Yann LeCun, his opinions on AI and AI risks are of some importance on those grounds alone.
A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don’t improve data efficiency, don’t contribute to mitigating data scarcity.
A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).
But there’s a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen on isoFLOP plots in Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments on the other compute budgets, the compute optimal number of active parameters seems to be about 2.5x smaller than for dense. Keeping compute unchanged, 2.5x fewer parameters means 2.5x more data, or 6x greater tokens/parameter ratio for a compute optimal training run.
Thus a dense model can be replaced with a 97% sparse MoE model trained using 6x less compute that will achieve the same perplexity, but the tokens/parameter ratio of this MoE model will be 6x greater than for the original dense model. Both data and active parameters would go down by 2.5x from reducing compute 6x if the ratio didn’t change, but since it does change, in actuality only the number of active parameters goes down 6x, while the number of tokens stays the same.
Let’s take Llama-3-405B as an example, which is a 405B parameter compute optimal model trained for 15T tokens at 40 tokens/parameter, using 4e25 FLOPs. An equivalent 97% sparse model will have 70B active parameters, 2T total parameters, and will need to be trained for the same 15T tokens to reach the same perplexity/loss at 220 tokens/parameter, using 6e24 FLOPs. (Which is close to DeepSeek-V3’s 4e24-5e24 FLOPs actually, so anchoring to Llama-3-405B might be a good way of framing its compute efficiency.)
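A small sketch of that rescaling, using the 6x compute multiplier from the paper and the observation above that the token count stays fixed:

```python
# Convert a dense compute optimal anchor (Llama-3-405B) into an equivalent
# 97% sparse MoE under the multipliers discussed above.
dense_params = 405e9
dense_tokens = 15e12                            # 40 tokens/param
dense_flops = 6 * dense_params * dense_tokens   # ~3.6e25, quoted as ~4e25 above

compute_multiplier = 6                          # 97% sparse vs dense
moe_flops = dense_flops / compute_multiplier
moe_active = dense_params / compute_multiplier  # tokens fixed, so params absorb the full 6x
moe_tokens = moe_flops / (6 * moe_active)       # comes back to ~15T

print(f"MoE: {moe_flops:.1e} FLOPs, {moe_active/1e9:.0f}B active params, "
      f"{moe_tokens/1e12:.0f}T tokens, {moe_tokens/moe_active:.0f} tokens/param")
```

This reproduces the ~70B active params, ~15T tokens, and ~220 tokens/param figures above.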
I agree compute optimal MoEs don’t improve data utilization. But, naively you might expect that MoEs can be used to reduce issues with data scarcity at a fixed level of compute by training a much bigger model on a fixed amount of data.
As in, because there are returns to both more data and bigger models, you can use MoE to effectively use a much bigger model at the same compute.
Like, maybe you would have trained llama-3-405B on 15T tokens. You could instead train an 8 trillion parameter model with 400B active params on 15T tokens and a priori this could perform much better on that same amount of data. (In practice an MoE with X active params is more expensive to train than a dense model with X active params, so you might need to reduce active params somewhat.)
Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters.
With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum point. But with finite data you need to repeat it to train with fewer active params, which damages loss. This moves the minima of isoFLOPs to the right if the minima already required 5x repetition or more. So under data scarcity, compute optimal models have more active params than under infinite data, and the effect gets worse with more compute. This way we maintain the framing of search for compute optimal hyperparameters rather than undertraining.
Now consider the 1e20 FLOPs plot in Figure 12, left. If there’s only 2B tokens of training data and no more, all minima already ask for 12-31 epochs, so the distortion that increases loss will move the minima to the right (and up), and move the high sparsity minima further than lower sparsity minima compared to their original (infinite data) locations. The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params! This seems counterintuitive, as in an infinite data regime more sparsity only makes things better (if we ignore practical difficulties). But sure, 90% sparsity will still be better than dense, at least until we use even more compute and sparser minima start asking for even more epochs.
I’m currently skeptical and more minimally, I don’t understand the argument you’re making. Probably not worth getting into.
I do think there will be a limit to how sparse you want to even in the very high compute relative to data regime for various reasons (computational if nothing else). I don’t see how these graphs support 90-95% sparsity, but I had a hard time understanding your argument.
Regardless, I don’t think this argues against my claim, not sure if you were trying to argue against the claim I was saying or add context. (Insofar as your argument is true, it does limit the returns from MoE in the regime with little data.)
With 90% sparsity you do get better loss than dense, this is sufficient to broadly carry your argument. But with 98% sparsity (your llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it’ll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
Even if it’s the same cost to train, wouldn’t it still be a win if inference is a significant part of your compute budget?
Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.
It’s a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B, 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at 400 tokens per active parameter it’s very overtrained, that is, not even compute optimal.
It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating a lot of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only does experiments of up to about 5e19 FLOPs, in particular directly demonstrating a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters kept the same (Figure 5b), the rest is extrapolation from fitted scaling laws.
A new architecture has a compute multiplier M (at a given level of compute) if it would take M times more compute to train a compute optimal model with a reference architecture (in this case, a dense transformer) to match the perplexity it achieves when trained on data sampled from the same dataset.
New AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] their previous compute was 50K H100s (possibly what was used to train Claude 3.5 Opus).
So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what 100K H100s clusters of other players (possibly except Google) can deliver by that time.
SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That’s enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s.
Anthropic’s post: “This cluster will deliver more than five times the computing power used to train our current generation of leading AI models.”
At 4 months and $2 per H100-hour, this is about $300 million, which is at odds with the $100 million Dario Amodei gestured at in Jun 2024, but that figure only applies to Claude 3.5 Sonnet, not Opus. So Opus 3.5 (if it does come out) might be a 2e26 FLOPs model, while Sonnet 3.5 a 7e25-1e26 FLOPs model. On the other hand, $2 per H100-hour is not AWS prices; at those prices Sonnet 3.5 might be capped at 4e25 FLOPs, same as Llama-3-405B.
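A quick check of the arithmetic in these footnotes (the ~1e15 dense BF16 FLOP/s per H100 is my assumption):

```python
# Footnote 1: power and H100-equivalence of the Trainium 2 buildout.
kw_per_chip = 24 / 32                       # low end of 24-27 kW per 32 Trn2 chips
print(f"200K Trn2: ~{200e3 * kw_per_chip / 1e3:.0f} MW")                  # ~150 MW
print(f"7 x 65 MW buildings: ~{7 * 65e3 / kw_per_chip / 1e3:.0f}K Trn2")  # ~600K
trn2_flops, h100_flops = 0.65e15, 1e15      # dense BF16 FLOP/s per chip
print(f"400K Trn2 ~ {400e3 * trn2_flops / h100_flops / 1e3:.0f}K H100s")  # ~260K

# Footnote 3: 50K H100s for 4 months at $2 per H100-hour.
print(f"~${50e3 * 2 * 4 * 30 * 24 / 1e6:.0f}M")                           # ~$290M
```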
Are you saying Anthropic actually has more compute (in the relevant sense) than OpenAI right now? That feels like a surprising claim, big if true.
For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel is claiming are 48 megawatts each and filled with H100s, for about 100K H100s. This probably got online around May 2024, the reason for the announcement and the referent of Kevin Scott’s blue whale slide.
There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but with B200s deliveries in high volume to any given customer might only start in early to mid 2025, so these systems will probably get online only towards end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute for smaller experiments overall. It might take a while to get it working, but there might be a few months there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it’s plausibly merely 4e25 FLOPs based on Dario Amodei’s (somewhat oblique) claim about cost, additionally getting compute advantage in training a frontier model could carry them quite far.
There are 4.5 buildings now at that site, but you can see with Google Street View from Litchfield Rd that in Sep 2024 only the first 3 had walls, so the 4th is probably not yet done.
Thanks Vladimir, this is really interesting!
Re: OpenAI’s compute, I inferred from this NYT article that their $8.7B costs this year were likely to include about $6B in compute costs, which implies an average use of ~274k H100s throughout the year[1] (assuming $2.50/hr average H100 rental price). Assuming this was their annual average, I would’ve guessed they’d be on track to be using around 400k H100s by now.
So the 150k H100s campus in Phoenix might be only a small fraction of the total compute they have access to? Does this sound plausible?
The co-location of the Trainium2 cluster might give Anthropic a short term advantage, though I think its actually quite unclear if their networking and topology will fully enable this advantage. Perhaps the OpenAI Phoenix campus is well-connected enough to another OpenAI campus to be doing a 2-campus asynchronous training run effectively.
$6e9 / 365.25d / 24h / $2.5/hr = 274k
Training as it’s currently done needs to happen within a single cluster (though this might change soon). The size of the cluster constrains how good a model can be trained within a few months. Everything that isn’t training of a frontier model can happen using many smaller clusters, something like 16 to 4096 accelerators each. You can use a lot of these smaller clusters, but they can be sourced from anywhere and built piecemeal at multiple sites with smaller power allocations, while the big training cluster needs to be a single purposefully built system.
So I expect the big expenses are inference and many training experiments with smaller models. What I’m discussing here is the big cluster for training frontier models rather than the aggregate of the small clusters for other purposes. See also this comment.
Patel’s claim is 100K H100s at 150 megawatts.
I think that’s probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago, has the following info in the technical report:
As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales because nobody has really tried yet).
This might require bandwidth of about 300 Tbps for 500K B200s systems (connecting their geographically distributed parts), based on the below estimate. It gets worse with scale.
The “cluster” label applied in this context might be a bit of a stretch, for example the Llama 3 24K H100s cluster is organized in pods of 3072 GPUs, and the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1).
Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn’t matter, only bandwidth. I’m not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that’s 1.6TB of data to be sent each way in much less than 6 seconds, say in 1 second. This is bandwidth of 12 Tbps, which fits in what a single fiber of a fiber optic cable can transmit. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them.
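A minimal version of that bandwidth estimate (4 bytes per weight and the 1-second budget are the assumptions from the paragraph above):

```python
# Gradient-sync bandwidth for Llama 3 405B across sites.
params = 405e9
bytes_per_weight = 4        # assumed precision for averaged gradients
sync_seconds = 1.0          # well under the ~6 s per optimizer step

payload_tb = params * bytes_per_weight / 1e12
print(f"{payload_tb:.1f} TB per step -> ~{payload_tb * 8 / sync_seconds:.0f} Tbps")
```

This gives ~13 Tbps, i.e. the ~12 Tbps figure above up to rounding.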
Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within NVLink scaleup domains that enable tensor parallelism. If each token sequence is processed on 8 GPUs (at a given stage of pipeline parallelism), that makes it necessary to process 2K sequences at once (Llama 3 only uses 16K GPUs in its training), and with 8K tokens per sequence that’s our 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), there would be fewer sequences processed simultaneously for smaller minibatches, so the time between optimizer steps would decrease, from Llama 3 405B’s 6 seconds down to less than that, making the necessary gradient communication bandwidth higher.
Some B200s come as NVL72 machines with 72 GPUs per scaleup domain. And with more weights there’ll be more data in the gradients for those models. Llama 3 405B has 16Kx53K matrices and 8K token sequences, so at 3TB/s and 1e15 FLOP/s (in an H100), you need tiles of size at least 1000x1000 to get sufficient arithmetic intensity. The scaleup network is a bit over 3 times slower than HBM, which is almost sufficient to move along the results (and starts to fit if we increase the inner dimension, with the tiles no longer square). So as far as I understand (could be very wrong, without experience to anchor the numbers), in principle there is enough there for a bit less than 8 times 16 times 53 GPUs to work with (tiling multiplication of a 16Kx53K matrix by a 53Kx8K matrix in squares of 1Kx1K); more than 1000 such GPUs could participate in tensor parallelism for Llama 3 405B if the network could handle it, so in particular the 72 GPUs of NVL72 are few enough that they could run such multiplications with tensor parallelism.
With 72 B200s per NVLink domain in a 500K B200s system, that’s 7K sequences per minibatch, 3x more than for Llama 3 405B[3]. The compute per second, and so per training run, is larger than with 16K H100s by a factor of 80, so by Chinchilla scaling law a dense model would be about 9 times larger, 3.5T parameters. So the model is 9x larger, processed over 9x more GPUs (per NVLink domain) that are 2.5 times faster, which means an optimizer step is 2.5 times shorter. This assumes that the sequence length stays 8K (if it’s higher then so is the time between optimizer steps, reducing the necessary bandwidth). Transmitting gradients for 9x more weights in that time requires bandwidth that’s 20 times higher, about 300 Tbps.
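Scaling the same estimate to the 500K B200 system, using the ~9x larger model and ~2.5x shorter optimizer step from the paragraph above:

```python
# Cross-site gradient bandwidth for the hypothetical ~3.5T-param model.
llama_tbps = 13                        # from the Llama 3 405B estimate above
weight_ratio = 3.5e12 / 405e9          # ~9x more weights to sync
step_time_ratio = 2.5                  # ~2.5x less time per optimizer step
print(f"~{llama_tbps * weight_ratio * step_time_ratio:.0f} Tbps")   # ~280-300 Tbps
```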
That’s still within the realm of possibility, some oceanfloor cables feature bandwidth on the same order of magnitude, and overland cables should enable more, but it’s no longer likely to be trivial, could require actually laying the cables between the datacenter campus sites, which could take a long time to get all the permissions and to do the construction.
16K GPUs at 40% utilization for about 4e25 dense BF16 FLOPs, which is 40% of 1e15 FLOP/s for each GPU. And 16M tokens/minibatch (Table 4) out of about 16T tokens in total.
This gives another way of getting the estimate of 6 seconds per step, which doesn’t depend on the size of the cluster at all. The compute for 1 sequence is 6 times 405B parameters times 8K tokens, processed by 8 GPUs (at some pipeline parallelism stage), each at a rate of 1e15 FLOP/s with 40% utilization on average, so it takes them 6 seconds to process a sequence.
So making NVLink domains 9x larger only kept the problem of large minibatches from getting more than 3 times worse. This is still much better than 150K sequences per minibatch if the same compute was assembled in the form of 1200K H100s with 8 GPUs per NVLink domain.
And in a way, they ought to be rolling in even more compute than it looks because they are so much more focused: Anthropic isn’t doing image generation, it isn’t doing voice synthesis, it isn’t doing video generation… (As far as we know they aren’t researching those, and definitely not serving it to customers like OA or Google.) It does text LLMs. That’s it.
But nevertheless, an hour ago, working on a little literary project, I hit Anthropic switching my Claude to ‘concise’ responses to save compute. (Ironically, I think that may have made the outputs better, not worse, for that project, because Claude tends to ‘overwrite’, especially in what I was working on.)
I’d guess that the amount spent on image and voice is negligible for this BOTEC?
I do think that the amount spent on inference for customers should be a big deal though. My understanding is that OpenAI has a much bigger userbase than Anthropic. Shouldn’t that mean that, all else equal, Anthropic has more compute to spare for training & experiments? Such that if Anthropic has about as much compute total, they in effect have a big compute advantage?
OpenAI’s gpt-oss-120b might be the first open weights model (implicitly) revealed to be pretrained for 100T-200T tokens. In the section “Pretraining” of the model card, it’s said that “The training run for gpt-oss-120b required 2.1 million H100-hours”, so probably this is just the GPU-time for pretraining rather than both pretraining and RLVR.
The pretraining precision is unclear, but for a model of this size FP8 is likely. Because H100-hours are mentioned, it couldn’t (usefully) be MXFP4 the model ended up with, since H100 can’t do FP4 faster than FP8 (but Blackwell can). Also, despite claims that the model was “trained with native MXFP4 precision” the model card also says “We post-trained the models with quantization of the MoE weights to MXFP4 format”, suggesting higher precision before post-training.
At 40% utilization, with 2e15 FP8 FLOP/s per H100, 2.1e6 H100-hours give 6e24 FLOPs (3.5x less than the original GPT-4, 2x more than DeepSeek-V3). The model only has 5.1B active params, so this suggests 188T tokens by 6ND rule. If it was pretrained in BF16 for some reason, that’s still 94T tokens.
For comparison, a compute optimal 5e26 model pretrained on 100K H100s from 2024 would also need 100T tokens at 850B active params (assuming MoE with 1:8 active to total param ratio, with 120 tokens/param compute optimal from Llama-3-405B’s 40 tokens/param as the dense anchor and 3x that for a 1:8 sparse MoE). And an overtrained model with fewer active params would need even more tokens. Though plausibly in both cases there is some repetition of data.
Also, this suggests that the model is 80-180x overtrained (the tokens/param multiple for compute optimal pretraining might be 5x-6x dense for the sparsity of gpt-oss-120b, so 200-240 tokens/param). Looking at isoFLOPs for Llama 3, this might incur a penalty of about 5x-10x in effective compute, turning the raw 6e24 FLOPs into effective 6e23-1e24 FLOPs (which could ask for a 65B param compute optimal dense model trained for merely 2.6T tokens). In contrast, DeepSeek-V3 is only 2x overtrained (under the same assumptions), so its 3e24 FLOPs are more straightforward.
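A sketch of these token and overtraining estimates; the FP8 throughput, 40% utilization, and ~220 tokens/param compute optimal ratio are the assumptions stated above, and the outputs land close to the quoted figures up to rounding:

```python
# gpt-oss-120b: FLOPs from GPU-hours, tokens from the 6ND rule, and the
# overtraining multiple relative to an assumed compute optimal ratio.
h100_hours = 2.1e6
flops = h100_hours * 3600 * 2e15 * 0.4        # FP8 at 40% utilization -> ~6e24
active_params = 5.1e9
tokens = flops / (6 * active_params)          # ~190T tokens
ratio = tokens / active_params
print(f"{flops:.1e} FLOPs, ~{tokens/1e12:.0f}T tokens, ~{ratio:.0f} tokens/param")
print(f"~{ratio / 220:.0f}x overtrained vs ~220 tokens/param compute optimal")
```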
What is the rationale to overtrain a model this much?
The model sizes were likely chosen based on typical inference constraints. Given that, they mostly care about maximizing performance, and aren’t too concerned about the compute cost, since training such small models is very affordable for them. So it’s worth going a long way into the regime of diminishing returns.
Possibly the model would’ve been too strong if it had more active params?
The number of total (rather than active) params influences the speed/cost of generating tokens, but reducing it too much stops helping at some point as the size of KV caches for all requests in a batch starts dominating. Reducing the number of active params (without changing attention or the number of total params) doesn’t influence generation of tokens, but it helps with the speed/cost of processing the initial prompt (or large tool outputs), which can be important for RAG or for loading large parts of a codebase in context.
So they might’ve targeted the number of total params (120B) and a level of benchmark performance, and found that 5.1B active params is when that happens. Not sure if 5.1B active params could really have been a target, but it’s a nice 6x compared to the other open weights models, if it really doesn’t destroy quality in less easily measurable ways.
What do you think about GPT-5? Is this a GPT-4.5 scale model, but with a lot of RLVR training?
The input token batch price is $0.625, which works for a 850B active param model running in FP4 on GB200 NVL72 priced at $8 per chip-hour with 60% compute utilization (for prefill). If the cost of chip-hours is a third of the capital cost of compute equipment in the first year, and 100K chips of GB200 NVL72 cost $7bn ($5M per rack all-in, with networking), then its chip-hour should cost at least $2.66.
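A quick check of the $0.625 and $2.66 figures, using the standard ~2 FLOPs per active param per input token for prefill (the other numbers are as stated above):

```python
# Input-token price for an 850B active param model on GB200 at $8/chip-hour.
active_params = 850e9
fp4_flops = 10e15              # dense FP4 FLOP/s per GB200 chip
util = 0.6                     # assumed prefill compute utilization
tokens_per_hour = fp4_flops * util * 3600 / (2 * active_params)
print(f"${8.0 / (tokens_per_hour / 1e6):.2f} per 1M input tokens")   # ~$0.63

# Minimum chip-hour cost if a year of chip-hours covers a third of capex.
capex = 7e9                    # 100K chips of GB200 NVL72, ~$5M per rack all-in
print(f"${capex / 3 / 100e3 / (365.25 * 24):.2f} per chip-hour")     # ~$2.66
```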
So there is some possibility for gross margin here in principle, even though $8 per chip-hour already sounds very cheap. GCP is selling B200-hours for $11 (a4-highgpu-8g instances), though B200s are also on gpulist for $3-4. Oracle is selling actual GB200 in 4-chip instances for $16 per chip-hour, if I’m reading it right (it’s in principle possible it’s actually $4 and $16 is for the 4-chip instance as a whole, but GCP’s prices for B200 corroborate that $16 could be right for a single chip).
There’s the Oct 2024 knowledge cutoff, which is later than when Orion should’ve started training, but in principle this could be for mid-training that got re-applied recently, or they could’ve just redone the whole run with the learnings from GPT-4.5 and an updated pretraining dataset. Also, they would’ve needed access to GB200 NVL72 to do a lot of RLVR in reasonable time if it has 6+ trillion total params, but these racks plausibly only started working in significant numbers around May-Jun 2025, and with all the previews GPT-5 was probably done by mid-Jul 2025 at the latest.
So dunno. From my tests it seems notably better than Opus 4 at keeping many constraints in mind without getting confused, but with gpt-oss-120b being this small and yet this capable (even though it’s clearly worse than the frontier models) it’s imaginable that gpt-5-thinking could be something like a 1T-A250B MXFP4 model (with a 500 GB HBM footprint), and so could run on the 8-chip servers with lower costs (and get RLVR training there)...
8ND may be more accurate, since these pretraining runs usually use gradient checkpointing to reduce memory requirements.
Long reasoning training might fail to surpass pass@50-pass@400 capabilities of the base/instruct model. A new paper measured pass@k[1] performance for models before and after RL training on verifiable tasks, and it turns out that the effect of training is to lift pass@k performance at low k, but also to lower it at high k!
Location of the crossover point varies, but it gets lower with more training (Figure 7, bottom), suggesting that no amount of RL training of this kind lets a model surpass the pass@k performance of the base/instruct model at the crossover point reached with a small amount of RL training. (Would be interesting to know how the pass@k plots depend on the number of reasoning tokens, for models that allow control over the reasoning budget.)
A task is solved at pass@k if an oracle verifier claims at least one of k sampled solutions to be correct. See Figure 3, left in this Jul 2024 paper for how pass@k affects performance, depending on the model.
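For reference, the standard unbiased pass@k estimator (this is the Codex-paper formula; whether the papers above use exactly this estimator is my assumption):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least 1 of k samples is correct) from n samples, c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 200 samples drawn, 10 verified correct.
print(pass_at_k(200, 10, 1))     # ~0.05
print(pass_at_k(200, 10, 50))    # much closer to 1
```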
Huh. This is roughly what I’d expected, but even I didn’t expect it to be so underwhelming.[1]
I weakly predict that the situation isn’t quite as bad for capabilities as this makes it look. But I do think something-like-this is likely the case.
Of course, moving a pass@400 capability to pass@1 isn’t nothing, but it’s clearly astronomically short of a Singularity-enabling technique that RL-on-CoTs is touted as.
This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.)
I’d guess this paper doesn’t have the actual optimal methods.
o3 has a different base model (presumably).
All of the figures hold the base model fixed between the RL’d and non-RL’d comparisons.
I would expect “this paper doesn’t have the actual optimal methods” to be true; this is specifically a test of PPO on in-distribution actions. Concretely, there is a potential story here where PPO reinforces traces that hit during self-play, so in a sense we would expect it to only select previously on-policy actions.
But if one has enough money, one can finetune GPT models and test that.
Also note that 10k submissions is about 2 OOM out of distribution for the charts in the paper.
Pass@k at k → ∞ includes every path with nonzero probability (given a policy of discarding exact repeat paths).
We know that RL decreases model entropy, so the first k samples will be more diverse for the higher-variance (base) model.
Pass@k is take-the-best, and for a normal distribution the expected best of n samples is roughly mean + stddev * sqrt(2 ln n).
At very large k, we would expect variance to matter more than the mean.
this isn’t evidence against OP? if it’s true that RL lowers pass@k performance for sufficiently large k, we’d certainly expect o1 with 10k submissions to be weaker than base/instruct with 10k submissions.
It’s evidence to the extent that the mere fact of publishing Figure 7 (hopefully) suggests that the authors (likely knowing relevant OpenAI internal research) didn’t expect that their pass@10K result for the reasoning model is much worse than the language monkey pass@10K result for the underlying non-reasoning model. So maybe it’s not actually worse.
If I’m interpreting the paper correctly, the k at which base models start beating RL’d models is a per-task number, and k can be arbitrarily high for a given task; the 50-400 range was specifically for tasks of the type the authors chose, within a narrow difficulty band.
Let’s say you have a base model which performs at 35% on 5 digit addition, and an RL’d model which performs at 99.98%. Even if the failures of the RL’d model are perfectly correlated, you’d need k=20 for base@20 to exceed the performance of fine-tuned@20. And the failures of the RL model won’t be perfectly correlated, but this paper claims that the failures of the RL model will be more correlated than the failures of the base model, and so the lines will cross eventually, and “eventually” was @50 to @400 in the tasks they tested.
But you could define a task where you pass in 10 pairs of 5 digit numbers and the model must correctly find the sum of each pair. The base model will probably succeed at this task somewhere on the order of 0.35^10, or about 0.003% of the time, while the RL’d model should succeed about 99.8% of the time. So for this task we’d expect k on the order of 220,000 assuming perfectly correlated failures in the RL model, and higher otherwise.
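A minimal check of that k estimate, under the perfectly-correlated-failures assumption:

```python
from math import log

p_base = 0.35 ** 10          # base model gets all 10 additions right, ~2.8e-5
target = 0.9998 ** 10        # RL'd model on the composite task, ~0.998

# Smallest k with 1 - (1 - p_base)^k >= target
k = log(1 - target) / log(1 - p_base)
print(f"p_base ~ {p_base:.1e}, k ~ {k:,.0f}")    # ~225,000
```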
Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. “output random tokens”) will outperform base models for some tasks by the pass@k metric.
It would be an extreme bias-variance tradeoff, yes.
The interesting concept in the paper is the location of the crossover point, which seems remarkably stable (for a given task) across specific RL techniques and amount of RL training. It can be measured experimentally for a task by doing a little bit of RL training, and RL@1 performance won’t get better than that with more training, so you’re unlikely to get the RL model to succeed 99.8% of the time (at pass@1) ever unless the level of performance of the base model at the crossover point with a weak RL model was already higher than 99.8%.
Probably the crossover point for a task depends on things that can be changed (such as strength of the pretrained model, or size/relevance of the verifiable task dataset, or possibly the inference time reasoning budget). The issue isn’t for example as straightforward as losing entropy in RL policy (as a formulation of reduced exploration), since DAPO specifically addresses this issue (otherwise present in vanilla GRPO), but the pass@k plot for DAPO (Figure 7, top) barely moves (compared to other methods), in their experiment it’s even slightly worse at the crossover point.
So in the context of this paper it remains unclear how to move the plot to reach ever higher base@k performance using RL@1, higher than the ceiling of where base@k already was at the crossover point when comparing with some method at only 100-500 RL steps.
Intuitively, this shouldn’t matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from optimal methods’. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they’d elicit pass@800 capabilities instead of “just” pass@400, but it’d still be just pass@k elicitation for a not-astronomical k.
Not strongly convinced of that, though.
In the hypothetical where the paper’s results hold, reasoning model performance at pass@k will match non-reasoning model performance with the number of samples closer to the crossover point between reasoning and non-reasoning pass@k plots. If those points for o1 and o3 are somewhere between 50 and 10K (say, at ~200), then pass@10K for o1 might be equivalent to ~pass@400 for o1’s base model (looking at Figure 2), while pass@50 for o3 might be equivalent to ~pass@100 for its base model (which is probably different from o1’s base model).
So the difference of 200x (10K vs. 50) in the number of samples becomes much smaller when comparing performance of the base models. For GPT-4o vs. GPT-4.1, a difference of ~4x in the number of samples doesn’t seem too strange. There’s also the possibility of distillation from a reasoning variant of GPT-4.5, which could have an even larger effect on pass@k performance at low k (Figure 6, right).
If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?
Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.
OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.
Meta announced a 2 GW datacenter at Richland Parish site, but 1 GW for 2025 seems to be across all datacenters, not for a single training system. So the training system will be smaller by end of 2025.
How do Anthropic’s and xAI’s compute compare over this period?
What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
For context, average US electricity consumption in 2022 was ~500GW. So these would be ~1% of all US electricity consumption (as an order of magnitude)
By 2027-2028, pretraining compute might get an unexpected ~4x boost in price-performance above trend. Nvidia Rubin NVL144 CPX will double the number of compute dies per rack compared to the previously announced Rubin NVL144, and there is a May 2025 paper demonstrating BF16 parity of Nvidia’s NVFP4 4-bit block number format.
The additional chips[1] in the NVL144 CPX racks don’t introduce any overhead to the scale-up networking of the non-CPX chips (they mostly just increase the power consumption), and they don’t include HBM, thus it’s in principle an extremely cost-effective increase in the amount of compute (if it can find high utilization). It’s not useful for decoding/generation (output tokens), but it can be useful for pretraining (as well as the declared purpose of prefill, input token processing during inference). Not being included in a big scale-up world could in principle be a problem early in a large pretraining run, because it forces larger batch sizes, but high-granularity MoE (where many experts are active) can oppose that, and also merely getting into play a bit later in a pretraining run once larger batch sizes are less of a problem might be impactful enough.
Previously only FP8 looked plausible as a pretraining number format, but now there is a new paper that describes a better block number format and a pretraining process that plausibly solve the major issues with using FP4. NVFP4 uses a proper FP8 number (rather than a pure exponent, a power of 2) as the scaling factor that multiplies the 4-bit numbers within a block, and the number blocks are organized as small squares rather than parts of lines in the matrix. The pretraining method has a new kind of “cooldown” phase where the training is finished in BF16, after using NVFP4 for most of the training run. This proves sufficient to arrive at the same loss as pure BF16 pretraining (Figure 6b). Using this to scale the largest attempted training run seems risky, but in any case the potential to make use of this boost in price-performance at some point, if a bit later, won’t be going away.
If pretraining had to remain in BF16, the on-trend improvement with Rubin (over GB200) that moves to a 3nm process might’ve been about 2x per reticle-sized compute die. But there was already an impactful change where the scale-up networking part of the Blackwell compute dies was extracted into specialized IO chiplets in Rubin, freeing up area on the compute dies for the actual compute, potentially affecting all precisions. In GB200, FP4 performance is 2x the FP8 performance, which is in turn 2x the BF16 performance. But in GB300, the FP4 performance improves by 1.5x over GB200 (from 10e15 FLOP/s per chip/package to 15e15 FLOP/s), likely by cannibalizing other things for FP4. And FP8 in Rubin improves over FP8 of GB200 by 3.3x (from 5e15 FLOP/s per chip/package to 17e15 FLOP/s), while “inference FP4” is claimed to be 50e15 FLOP/s per chip/package, which is likely meant to be the never-useful sparse compute performance, in contrast to the actually-useful but not explicitly announced dense “training FP4”, which has always been 2x lower before. So probably the actual FP4 performance relevant for NVFP4 pretraining is 25e15 FLOP/s per chip/package, 2.5x more than for GB200 and 1.5x more than for GB300.
The Rubin NVL144 CPX announcement presentation includes some details suggesting slightly more performance than that. A Rubin CPX compute die is claimed to have 30e15 FP4 FLOP/s (at 21:31 in the video). Anchoring to the above estimate of 25e15 FLOP/s per package with 2 compute dies, this must be the sparse compute performance, so the dense performance would likely be 15e15 FLOP/s per compute die, about 20% higher than for the non-CPX compute dies. For the whole rack, this gives 4e18 FLOP/s, 5.5x more than the 720e15 FP4 FLOP/s of GB200 NVL72. This is partially corroborated by the explicit claim that the total NVFP4 performance of a Rubin NVL144 CPX rack is 8e18 FLOP/s (at 24:28 in the video), which I’m interpreting as referring to sparse compute performance, which is probably 2x the more relevant dense performance. (SemiAnalysis estimate is 5.3e18 dense FP4 FLOP/s for some reason, perhaps they know that the difference between sparse and dense is not 2x for Rubin.)
So the total increase in dense FP4 performance potentially relevant for pretraining using Rubin NVL144 CPX over FP8 using GB200 NVL72 might be about 11x (72x 5e15 FP8 FLOP/s for GB200, which is 0.36e18 FLOP/s, changes to 72x 25e15 FP4 FLOP/s for non-CPX Rubin chips plus 144x 15e15 FP4 FLOP/s for Rubin CPX chips, which is 4e18 FLOP/s in total). The racks are still Oberon (72 non-CPX chips/packages in a rack-sized scale-up world of the same size, with the same number of chips included in it), so the cost might only change slightly, maybe 1.5x (there are still 2x more compute dies). Which is 3.7x more price-performance than the ~2x that the mere change in semi process would predict (Moore’s law of price-performance). (Or 4.9x if we follow the SemiAnalysis estimate of dense 5.3e18 FP4 FLOP/s for a Rubin NVL144 CPX rack.)
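The rack-level arithmetic above, as a sketch (all per-die numbers are the estimates from the preceding paragraphs, under the 2:1 sparse-to-dense assumption, and the 1.5x cost ratio is the guess stated above):

```python
# Dense FLOP/s per rack: GB200 NVL72 in FP8 vs Rubin NVL144 CPX in FP4.
gb200_rack = 72 * 5e15                        # FP8
rubin_cpx_rack = 72 * 25e15 + 144 * 15e15     # non-CPX + CPX dies, dense FP4
ratio = rubin_cpx_rack / gb200_rack
print(f"{rubin_cpx_rack:.1e} FLOP/s, ~{ratio:.0f}x a GB200 NVL72 rack")

# Price-performance above the ~2x expected from the process node alone,
# assuming the CPX rack costs ~1.5x a GB200 NVL72 rack.
print(f"~{ratio / 1.5 / 2:.1f}x above trend")
```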
A GB200 NVL72 rack has 72 chips/packages, each with 2 compute dies. Rubin NVL144 CPX has 72 non-CPX chips/packages, each with 2 compute dies, and an additional 144 CPX chips, each with 1 compute die, for the total of 288 compute dies of both kinds, 2x more than the 144 compute dies in a GB200 NVL72 rack.
in general publicly known training techniques are behind sota, so this should be taken into account.
Thoughts on whether the >10x lower chip-to-chip interconnect from the CPX chips (PCIe 6.0x16’s 128GB/s unidirectional vs. NVLink 5’s 1.8TB/s bidirectional) will be a bottleneck blocking them from being that useful in pre-training?
If the pretraining system (built in 2027) is about 2 GW, that’s 5K Rubin NVL144 CPX racks, or 8e28 FP4 FLOPs[1] in 4 months at 30% utilization. At 120 tokens/param, this is enough for 10T active params in a compute optimal MoE model. With 150 layers, 8 active experts per layer, and a GLU nonlinearity (3 matrices per FFN block), this gives 50Kx50K matrices. Such transformers would be too large for efficiently generating output tokens on Rubin NVL144 (even in FP4), but might be analogous to GPT-4.5 in that the immediately following hardware that is Rubin Ultra NVL576 can efficiently generate output tokens for them. In any case, 5T active params and 20T total seems OK for Rubin NVL144 to generate output tokens (10 TB of HBM out of the 20 TB a rack will have), which gives 37Kx37K matrices.
A Rubin CPX compute die produces 20e15 FP4 FLOP/s[2]. For multiplying square matrices with side N it needs 2N^3 FLOPs and to exchange 3N^2/2 bytes with memory. At 2 TB/s GDDR7 bandwidth, this needs N at least 7500. For processing an FFN block of 3 square matrices with side N, it needs 6N^3 FLOPs and to exchange 2N^2/2 bytes on the network in both directions in total. At 0.2 TB/s CX-9 bidirectional bandwidth, this needs N at least 17K. So there’s even enough for an off-by-2x mistake in these estimates, various matrices actually getting non-square shapes, or models being somewhat smaller.
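A sketch of those arithmetic-intensity thresholds for a Rubin CPX die (the bandwidth and FLOP/s figures are the ones assumed above):

```python
die_flops = 20e15      # dense FP4 FLOP/s per Rubin CPX die (3:2 sparse:dense)
mem_bw = 2e12          # GDDR7, bytes/s
net_bw = 0.2e12        # CX-9 bidirectional, bytes/s

# Square matmul of side N: 2*N^3 FLOPs vs 3*N^2/2 bytes of memory traffic (FP4)
n_mem = die_flops * (3 / 2) / (2 * mem_bw)
# FFN block of 3 matrices of side N: 6*N^3 FLOPs vs ~N^2 bytes over the network
n_net = die_flops / (6 * net_bw)
print(f"need N >= ~{n_mem:,.0f} for memory, N >= ~{n_net:,.0f} for network")
```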
The SemiAnalysis estimate of 5.3e18 FLOP/s per Rubin NVL144 CPX rack is indeed based on a different ratio of sparse to dense compute, they are claiming it’s 3:2 for Rubin. I didn’t yet search for a source for this, but in any case this is in the article and I missed it on first reading, so didn’t recall it when my own estimate based on the 2:1 sparse to dense ratio failed to match theirs.
As in the previous footnote, this is what the announced 30e15 FP4 FLOP/s become after using the 3:2 sparse to dense compute ratio, rather than the 2:1 ratio.
GPT-5 should be released late 2025 at the earliest if OpenAI follows the usual naming convention of roughly 100x in raw compute. With GPT-4 at 2e25 FLOPs, GPT-4.5 should have about 2e26 FLOPs and GPT-5 about 2e27 FLOPs. A 100K H100 training system, like the one in Goodyear (or Musk’s Memphis datacenter as it was late 2024), can train a 3e26 FLOPs model, which fits the name of GPT-4.5, but it can’t train a 2e27 FLOPs model.
The new Stargate site in Abilene might be preparing to host 200K-300K chips in GB200 NVL72 racks. These chips produce 2.5x more compute than H100s, so 200K would be sufficient to get 2e27 FLOPs and train a GPT-5. If there’s already enough power (about 400 MW all-in for 200K chips), shipments of GB200 in bulk start in early 2025, get installed at xAI’s pace, and go into pretraining for 4 months, then with 1 more month of post-training it’s already November.
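A quick check that 200K GB200 chips can reach ~2e27 FLOPs (the 4 months and 40% utilization are my assumptions):

```python
chips = 200e3
gb200_bf16 = 2.5e15                  # dense BF16 FLOP/s, ~2.5x an H100
seconds = 4 * 30 * 24 * 3600         # assumed ~4 months of pretraining
print(f"{chips * gb200_bf16 * 0.4 * seconds:.1e} FLOPs")   # ~2e27
```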
So the rumors about GPT-5 in late May 2025 either represent change in the naming convention, or correspond to some intermediate milestone in training GPT-5, likely the training system being in principle ready to start pretraining.
Per Altman:
I think he’s pretty plainly saying that this “GPT-5” will be a completely different thing from a 100x’d GPT-4.
This is perfectly consistent with GPT-5 being 100x GPT-4 compute. Announcing specific features that will go into it suggests they have a prototype, in this case I’m guessing the LLM will itself be trained to decide whether to go into the reasoning mode, triggering it when needed and affordable, like any other tool.
I don’t see it. He says that GPT-5 will be a system that “integrates o3”. This isn’t his sloppy way of saying “integrates the reasoning techniques”: when he wants to express that idea, he talks about “unifying o-series models and GPT-series models”. The wording regarding GPT-5 is consistent with him literally saying that the model o3 will be part of GPT-5.
Furthermore, I take “as” in “GPT-5 as a system that integrates a lot of our technology” to mean “GPT-5 is defined as {a system that integrates a lot of our technology, including o3}”. Not “GPT-5 will be trained to automatically switch between a standard mode, a reasoning mode, a Deep Research mode, etc.”, not even “GPT-5 will be trained to recognize when to fall back to o3, a lesser model”, but literally “we’re slapping the GPT-5 label on a glorified wrapper over all our current models”.
The “glorified wrapper” could still be a 2e27 FLOPs model, it could even be using literal o3 as one of its tools (in addition to all the other tools, with native GPT-5 long reasoning mostly reserved for premium tier). This is in line with the “agents” agenda where better reliability in taking irreversible actions unlocks new use cases, in this case whether to make use of expensive reasoning calls.
Since “GPT-4.5” will actually be released rather than skipped, it’s less plausible for “GPT-5” to come out shortly after. If it’s announced in ~Dec 2025 (the way o3 was), it’s still “within months”, and then it can actually get released in ~Feb 2026.
Hm, fair enough. Seems like a stretch, though, especially given the need to interpret his “ETA in months” as “will be officially announced in months and released in a year”.
There was also Murati in Jun 2024 predicting PhD level AI in 18 months. If they succeed in achieving parity with xAI in terms of safety procedures, they might even release a preview checkpoint in Dec 2025 for Pro users. So actual release in a year is not strictly necessary for this hypothesis, it’s just closer to what they’ve done in the past.
I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.
I’m merely referring to the historical precedent, whether there are informal commitments in the minds of the leadership is not something I can speak to. This pattern might continue or it might break. What I’m guessing about training system buildout from vague clues seems to be consistent with it continuing, so the naming pattern can be used as another clue to make a point estimate prediction that’s more concrete.
Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn’t building giant frontier training systems fast enough, probably because they aren’t seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.
The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural as there are millions of datacenter GPUs but only a few 100K GPU frontier training systems, a tiny fraction of inference and smaller/research training compute. The $500bn figure is not relevant as for now it’s only a vague plan. But Microsoft not agreeing to build training systems on OpenAI’s schedule is some evidence.
OpenAI would want to get out from under Microsoft’s thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is some evidence of slowdown, since it only motivates saying you want to build frontier training systems even faster, but doesn’t in itself motivate actually going through with it, beyond building a competitive training system that makes you independent.
So the clues that support the prospect of scaling to 1 GW in 2025 and to 5 GW in 2027 could be misleading, running contrary to hyperscaler attitudes and not aligning even with OpenAI’s immediate incentives.
I previously expected that $80bn is evidence that they are building a large training system this year, but it now seems that they are building more inference instead.
As Satya Nadella said, “If OpenAI disappeared tomorrow… we have all the IP rights and all the capability. We have the people, we have the compute, we have the data, we have everything. We are below them, above them, around them.”
The 50M H100 equivalent compute by 2030 figure tweeted by Musk is on trend (assuming a 2028 slowdown), might cost about $300bn in total (for the training systems built in 2025-2030 for one AI company, including the buildings and power infrastructure).
If the current trend of compute scaling continues to 2028, there will be 160x more compute per training system than the 100K H100s of 2024. It will require 5 GW of power and cost about $140bn in compute hardware and an additional $60bn in buildings, power, and cooling infrastructure[1].
However, if the slowdown starts earlier while still targeting an eventual spend of $100bn per year, and a 5 GW frontier AI training system isn’t yet built in 2028-2029 (which seems plausible), building it in 2030 would use the next generation of compute hardware, which will be about 2x more performant for an approximately unchanged cost. This means 320x more compute than the 100K H100s systems of 2024, or 32M H100 equivalent compute. If we sum it up with the preceding generations of frontier AI training systems built for the same company, say 2 GW in 2028 and 1 GW in 2026, this gives us 40M H100 equivalents, which is the same as 50M given the error bars on these estimates (or we get that directly if the slowdown only starts between 2028 and 2030). Summing up the costs for the older systems as well, we get to about $300bn (or $450bn if a 5 GW system is built in 2028, and then another one in 2030).
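A sketch of how the ~40M H100-equivalents figure adds up, assuming ~3.2M H100e per GW for 2028-generation hardware (implied by 160x the 100K H100s of 2024 at 5 GW) and 2x per subsequent hardware generation:

```python
h100e_per_gw_2028 = 160 * 100e3 / 5            # ~3.2M H100e per GW in 2028
systems = [
    (1, h100e_per_gw_2028 / 2),                # 1 GW built in 2026
    (2, h100e_per_gw_2028),                    # 2 GW built in 2028
    (5, h100e_per_gw_2028 * 2),                # 5 GW built in 2030
]
total = sum(gw * per_gw for gw, per_gw in systems)
print(f"~{total / 1e6:.0f}M H100 equivalents")    # ~40M
```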
Let’s start with the anchor of $15bn of Stargate Abilene in 2026 for 1.2 GW (which seems consistent in cost per MW with other similar announcements). The power that seems actually necessary for its 400K Blackwell chips together with everything else looks more like 900 MW.
Rubin Ultra racks of 2028 are 600 kW per rack, 4.5x up from the current 130 kW per rack, so the total area needed to build a 5 GW training system in 2028 might only be 2x greater than that of the 1 GW training systems from 2026. Scaling from building area suggests about $30bn and scaling from power suggests about $70bn, so my guess is $60bn.
By power, do you mean the cost of electrical equipment etc.? The cost the of energy itself is a relatively small. The average price of electricity in the US is $0.13/kWh, which is $36.11/GJ. So even if you had a 5 GW datacenter running continuously for a year, the energy cost is only $5.7bn.
Power infrastructure that might need to be built is gas generators or power plants, substations, whatever the buildings themselves need. Generators are apparently added even when not on-paper strictly necessary, as backup power. They are also faster to setup than GW-scale grid interconnection, so could be important for these sudden giant factories where nobody is quite sure 4 years in advance that they will be actually built at a given scale.
Datacenter infrastructure friction and cost will probably both smooth out the slowdown and disappear as a funding constraint for AI companies in the years following the slowdown. Compute hardware is rotated every few years, so at some point you don’t need new datacenters and accompanying infrastructure to setup a new generation of compute hardware, you just reuse an existing datacenter site that hosted old hardware. Also, any related datacenters that didn’t have excessive inter-site dark fiber will at some point set it up, so even increasing the scale will be less dependent on having everything at one site. This makes the infrastructure costs a much smaller fraction of the cost of a frontier AI training system, and there will no longer be friction.
The infrastructure or even hardware costs in principle don’t need to be paid by the AI company upfront, but either the market as a whole or the specific AI company (as a tenant) need to sufficiently assure the developer (that builds and owns the non-IT infrastructure) and the cloud provider (that installs and owns compute hardware) to commit to the project. My sense is that the estimates for the cost of a year of GPU-time for frontier compute end up at about a third of the cost of compute hardware. So access to a new $200bn training system that has $140bn worth of compute hardware (which only remains cutting edge for 2 years) will cost the tenant $45bn per year, even though the total capital expenditure is $100bn per year during the initial infrastructure buildout, and in later years after slowdown (when new infrastructure no longer needs to be built as much) it’s still $70bn per year to keep installing the newest hardware somewhere, so that some datacenter site will end up having it available.
Thus a few years after slowdown, we get about 2x more compute supported by the same level of funding (from $100bn per year to $45bn per year for the same compute, or keeping to $100bn per year for 2x the compute). But since 2x in compute corresponds to 2 years of compute hardware price-performance progress, and the relevant anchor is the 2000x of 2022-2028 training compute scale-up, that is just playing with about 2 years in the 2028-2045 period when another 2000x compute scaleup happens, mostly due to increasing price-performance of compute, and a level of growth similar to that of the current tech giants in the past. So not a crucial update.
When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.
So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it’ll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.
This behavior doesn’t need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman’s view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast. Unfortunately for us, this gives us no good fire alarms for AGI unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation, and external use dominates internal use:
https://amistrongeryet.substack.com/p/defining-agi
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
Researchers at AGI labs seem to genuinely believe the hype they’re selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold.
Dismissing short timelines based on NSA’s behavior requires assuming that they’re much more competent in the field of AI than everyone in the above list. After all, that’d require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect.
While that’s not impossible, it seems highly unlikely to me. Much more likely that they’re significantly less competent, and accordingly dismissive.
This is a late reply, but at least from this article, it seems like Ilya Sutskever was already running out of confidence, by mid-2023, that OpenAI would reach AGI. Additionally, if the rumors about GPT-5 are true, it’s mainly going to be a unification of existing models rather than something entirely new. Combined with the GPT-4.5 release, it sure seems like progress at OpenAI is slowing down rather than speeding up.
How do you know that researchers at AGI labs genuinely believe what they’re saying? Couldn’t the companies just put pressure on them to act like they believe Transformative AI is imminent? I just don’t buy that these agencies are dismissive without good reason. They’ve explored remote viewing and other ideas that are almost certainly bullshit. If they are willing to consider those possibilities, I don’t know why they wouldn’t consider the possibility of current deep learning techniques creating a national security threat. That seems like their job, and they’ve explored significantly weirder ideas.
On what possible publicly-unavailable evidence could they have updated in order to correctly attain such a high degree of dismissiveness?
I can think of three types of evidence:
1. Strong theoretical reasons. E.g., some sort of classified, highly advanced, highly empirically supported theory of deep learning/intelligence/agency, such that you can run a bunch of precise experiments, or do a bunch of math derivations, and definitively conclude that DL/LLMs don’t scale to AGI.
2. Empirical tests. E.g., perhaps the deep state secretly has 100x the compute of AGI labs, and they already ran the pretraining game to GPT-6 and were disappointed by the results.
3. Overriding expert opinions. E.g., a large number of world-class best-of-the-best AI scientists with an impeccable track record firmly and unanimously saying that LLMs don’t scale to AGI. This requires either a “shadow industry” of AI experts working for the government, or for the AI-expert public speakers to be on the deep state’s payroll and lying in public about their uncertainty.
I mean, I guess it’s possible that what we see of the AI industry is just the tip of the iceberg and the government has classified research projects that are a decade ahead of the public state of knowledge. But I find this rather unlikely.
And unless we do postulate that, I don’t see any possible valid pathway by which they could’ve attained high certainty regarding the current paradigm not working out.
There are two ways we can update on it:
The fact that they investigated psychic phenomena means they’re willing to explore a wide variety of ambitious ideas, regardless of their weirdness – and therefore we should expect them not to dismiss the AGI Risk out of hand.
The fact that they investigated psychic phenomena means they have a pretty bad grip on reality – and therefore we should not expect them to get the AGI Risk right.
I never looked into it enough to know which interpretation is the correct one. Expecting less competence rather than more is usually a good rule of thumb, though.
To be clear, I personally very much agree with that. But:
I find that I’m not inclined to take Sutskever’s current claims about this at face value. He’s raising money for his thing, so he has a vested interest in pushing the agenda that the LLM paradigm is a dead end and that his way is the only way. The same way it became advantageous for him to talk about the data wall once he was no longer with the unlimited-compute company.
Again, I do believe both in LLMs being a dead end and in the data wall. But I don’t trust Sutskever to be a clean source of information regarding that, so I’m not inclined to update on his claims to that end.
Those are good points. The last thing I’ll say drastically reduces the amount of competence the government would need in order to be dismissive while still being rational: the leading AI labs may already be fairly confident that current deep learning techniques won’t get to AGI in the near future, and the security agencies know this as well.
That would make sense. But I doubt all AGI companies are that good at informational security and deception. This would require all of {OpenAI, Anthropic, DeepMind, Meta, xAI} to decide on the deceptive narrative, and then not fail to keep up the charade, which would require both sending the right public messages and synchronizing their research publications such that the set of paradigm-damning ones isn’t public.
In addition, how do we explain people who quit AGI companies and remain with short timelines?
I guess I would respond to the first point by saying that all of the companies you mentioned have an incentive to say they are closing in on AGI even if they aren’t. It doesn’t seem that sophisticated to say “we’re close to AGI” when you’re not. Mark Zuckerberg said that AI would be at the level of a junior SWE this year, and Meta proceeded to release Llama 4. Unless prognosticators at Meta seriously fucked up, the most likely scenario is that Zuckerberg made that comment knowing it was bullshit. And the sharing of research did slow down a lot in 2023, which gave companies cover to not release unflattering results.
And to your last point, it seems reasonable that companies could pressure former employees to act as if they believe AGI is imminent. And some researchers may be emotionally invested in believing that what they worked on is what will lead to superintelligence.
And my question for you is: if DeepMind had solid evidence that AGI would be here in 1 year, and if the security agencies had access to DeepMind’s evidence and reasoning, do you believe they would still do nothing?
Dario Amodei suggests that in-context learning might suffice for continual learning. The way LLMs do in-context learning with long context is disanalogous to anything humans can do, but a context window of 15M tokens is 500 days of 30K tokens per day, which is more than enough to progress from “first day on the job” to knowing what you are doing with this particular source of tasks. This needs to work mostly with text (if it works at all), or 15M tokens won’t be enough, but that could be sufficient.
So this might just be about moving from RAG to including more free-form observations that were historically made by the model itself for the same source of tasks, with massively more tokens of context. The current memory features of chatbots, in the form of long text files, might with sufficient scale become the real thing rather than remaining a dead-end crutch, once these text files get into the habit of accumulating megabytes of observations. And RLVR can plausibly teach the models how to make good use of these very long contexts.
With this year’s 14 TB of HBM per GB200 NVL72, very long context windows become more feasible (than with ~1 TB of HBM per node that most current models are still running on), and then there’s the next step in 2028 with Rubin Ultra NVL576 systems that have 147 TB of HBM.
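A quick sketch of the relevant magnitudes; the 30K tokens/day figure is from the comment above, while the ~100 KB/token KV-cache size is purely an illustrative assumption (it varies a lot with the attention mechanism and model width):

```python
# How far a long context stretches as a crude continual-learning substitute,
# and a rough KV-cache footprint for such a context. The 100 KB/token figure
# is purely illustrative; real values depend on the attention mechanism.
context_tokens = 15_000_000
tokens_per_day = 30_000
print(context_tokens // tokens_per_day, "days of accumulated on-the-job context")

kv_bytes_per_token = 100 * 1024                   # assumed, not a measured value
kv_cache_tb = context_tokens * kv_bytes_per_token / 1e12
print(f"~{kv_cache_tb:.1f} TB of KV cache for one 15M-token context")

# Against the HBM capacities mentioned above (ignoring weights):
for name, hbm_tb in [("8-chip H100/H200 node", 1), ("GB200 NVL72", 14), ("Rubin Ultra NVL576", 147)]:
    print(f"{name}: room for ~{hbm_tb / kv_cache_tb:.1f} such contexts")
```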
Unless I’m totally off-base here, 15M sounds incredibly high for actually useful recall.
This is the best source I know about for measuring model context length.
Obviously I don’t know about private models, but based on the delta between claimed vs. actual, I’m pretty suspicious that actually useful context length is currently longer than a few hundred thousand tokens.
Not currently, but this is some kind of brute force scaling roadmap for one of the major remaining unhobblings, so it has timeline implications. On last year’s hardware, it’s not really feasible to go that far anyway, and RLVR is only just waking up. So the first public observations of negative results on this will probably be in 2026, if the actually useful context length fails to improve. And then there’s 2028-2029, following up on the 147 TB of Rubin Ultra NVL576 (Nvidia roadmap places it in 2027, which means in 2028 there will be datacenters with it, as well as possibly models trained for it using older hardware, then in 2029 models trained on it).
But also, for the purpose of automated adaptation to a source of tasks and feedback (such as a job), it doesn’t necessarily need as much fidelity; it only needs to work as well as a human who read some book a year ago, retaining the mental skills but not the words. A context in principle gives the words, but that is not the thing that needs to work.
I suppose I’m unsure how fast this can be scaled. I don’t have a concrete model here though, so it’s probably not worth trying to hash it out.
I’m not sure that the current summarization/searching approach is actually analogous to this. That said,
This is probably making approaches more analogous. So fair point.
I would like to see the updated Ruler metrics in 2026.
Any specific predictions you have on what a negative vs. positive result would look like in 2026?
There is a paper that shows the overreliance of in-context learning on superficial clues. It is from 2022, and the tested models are old. So maybe newer ones are doing much better, but maybe it is not really learning, at least by some definitions.
Long reasoning with MoE models doesn’t get cheaper with overtraining, and pretraining data scarcity might make it useful to have even more active params than compute optimal.
Overtraining (fewer active params than compute optimal) is useful for processing input tokens, but reasoning models want to generate so many output tokens that cheapness of input tokens plausibly becomes relatively unimportant for some use cases. Performance for output tokens depends on total params and KV cache per token; you want total params and hundreds of KV cache contexts to fit in a few nodes (servers, scale-up worlds). Until recently, an 8-chip node of H100/H200/B200 only had 0.7-1.4 TB of HBM, which means that it was manageable to generate output tokens for models with maybe 1-2T total params, using 2-4 nodes, as long as KV cache per token was small enough (which depends on the attention mechanism, and indirectly on model dimension, but plausibly only weakly on the number of active params, in terms of compute optimality).
With GB200 NVL72, we get to 13 TB per rack (and then 20 TB for GB300), an increase of 10x, so plausibly models with 10-20T total params become feasible to run long reasoning on (and train with RLVR). There is no particular reason why active params must be only a tiny fraction of this for use cases that primarily need long reasoning traces, as active params are only a relevant constraint in compute optimality of pretraining and for cost of processing input tokens.
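A rough sketch of the memory-feasibility side of this, assuming FP8 weights (1 byte per param) and an illustrative 30% of HBM held back for KV caches and activations:

```python
# Very rough: total params that fit for output-token generation, given the HBM
# of the serving hardware. Assumes FP8 weights (1 byte/param) and reserves an
# illustrative 30% of HBM for KV caches and activations.
def max_total_params_t(hbm_tb, kv_headroom=0.3, bytes_per_param=1.0):
    usable_bytes = hbm_tb * 1e12 * (1 - kv_headroom)
    return usable_bytes / bytes_per_param / 1e12

for name, hbm_tb in [("8x H100 node", 0.7), ("8x B200 node", 1.4),
                     ("GB200 NVL72 rack", 13), ("GB300 NVL72 rack", 20),
                     ("2x GB200 NVL72 racks", 26)]:
    print(f"{name}: ~{max_total_params_t(hbm_tb):.1f}T total params")
```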
A compute optimal number of active params for a 1:8 sparse MoE model (active:total params) at 5e26 pretraining FLOPs is about 850B, with 7T total params, needing 100T tokens[1]. Even if pretraining uses 25T tokens repeated 4 times, it might be impossible to find 25T reasonably good tokens. So perhaps the right tradeoff is to instead go with even more active params than compute optimal, in order to need fewer pretraining tokens! The total param budget might be even higher than the 7T inferred from 1:8 sparsity at a compute optimal level, so it’s plausible that the right play is something like 1.5T active params and 12T total params, which lets pretraining use merely 55T tokens (or 14T tokens repeated 4 times). This still fits in 2 racks of GB200 (in FP8) with room for many KV caches even for very long contexts, so fast RLVR and long reasoning inference remain feasible, and they don’t get meaningfully more expensive just because there are more active params.
So my guess is that even the smaller and cheaper long reasoning models might stop being overtrained, and instead might skew towards undertraining, if very long input context is not an important component of their use cases, so that most of their tokens are generated in reasoning traces. Kimi K2 has 1T total params but only 32B active params, while trained on 15.5T tokens, which means it needed 3e24 FLOPs of pretraining. But if trained on the same 15.5T tokens compute optimally, a model with 1T total params might have about 200B active params and need 2e25 FLOPs to pretrain[2], much more expensive. However, it would cost about the same as with the 32B active param model to generate output tokens, if KV cache per token wasn’t allowed to get larger. So the active param count tradeoff is something that might change, or is already changing, even for “small” reasoning models, now that long reasoning traces are becoming more important, at least when the AI company has enough pretraining compute to afford such change.
Optimal tokens/param ratio is higher for MoE models than for dense, possibly 120 tokens/param for 1:8 sparsity (3x over dense), anchoring to dense Llama 3 405B’s compute optimal 40 tokens/param.
Assuming 80 tokens/param for 1:5 sparsity, 2x over dense.
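To make the arithmetic in the two paragraphs above easy to check, here is a minimal sketch using the standard C ≈ 6·N·D approximation for pretraining compute (N active params, D training tokens) together with the tokens/param ratios guessed in these footnotes:

```python
# Check the compute-optimal MoE arithmetic above, using C ~= 6 * N_active * D
# (N_active = active params, D = training tokens). The tokens/param ratios for
# the MoE sparsity levels are the footnotes' guesses, not measured values.
def tokens_for(compute, active):       # D = C / (6 N)
    return compute / (6 * active)

def compute_for(active, tokens):       # C = 6 N D
    return 6 * active * tokens

C = 5e26                               # pretraining FLOPs budget
ratio_1to8 = 120                       # assumed tokens/param for 1:8 sparsity
n_opt = (C / (6 * ratio_1to8)) ** 0.5  # from C = 6 * N * (ratio * N)
print(f"{n_opt/1e9:.0f}B active, {8*n_opt/1e12:.1f}T total, "
      f"{tokens_for(C, n_opt)/1e12:.0f}T tokens")          # ~830B, ~6.7T, ~100T

n_alt = 1.5e12                         # trading sparsity for data at the same budget
print(f"{tokens_for(C, n_alt)/1e12:.0f}T tokens")          # ~56T, i.e. ~14T repeated 4x

# Kimi K2 (1T total, 32B active, 15.5T tokens) vs. a compute-optimal 1:5 MoE
# with the same total params (~80 tokens/param assumed => ~200B active):
print(f"{compute_for(32e9, 15.5e12):.1e} FLOPs")           # ~3e24
print(f"{compute_for(200e9, 15.5e12):.1e} FLOPs")          # ~2e25
```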
Yi-Lightning (01 AI) Chatbot Arena results are surprisingly strong for its price, which puts it at about 10B active parameters[1]. It’s above Claude 3.5 Sonnet and GPT-4o in Math, above Gemini 1.5 Pro 002 in English and Hard Prompts (English). It’s above all non-frontier models in Coding and Hard Prompts (both with Style Control), including Qwen-2.5-72B (trained on 18T tokens). It’s interesting whether this is mostly better methodology or compute scaling getting taken more seriously for a tiny model.
The developer’s site says it’s a MoE model. Developer’s API docs list it at ¥0.99/1M tokens. The currency must be Renminbi, so that’s about $0.14. Together serves Llama-3-8B for $0.10-0.18 (per million tokens), Qwen-2.5-7B for $0.30, all MoE models up to 56B total (not active) parameters for $0.60. (The prices for open weights models won’t have significant margins, and model size is known, unlike with lightweight closed models.)
Kai-Fu Lee, CEO of 01 AI, posted on LinkedIn:
Assuming it’s trained in BF16 with 40% compute utilization, that’s a 2e24 FLOPs model (Llama-3-70B is about 6e24 FLOPs, but it’s not MoE, so the FLOPs are not used as well). Assuming from per token price that it has 10-20B active parameters, it’s trained on 15-30T tokens. So not an exercise in extreme compute scaling, just excellent execution.
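A quick check of that last inference (tokens implied by the FLOPs estimate and the active-parameter guess), using the standard C ≈ 6·N·D approximation:

```python
# Tokens implied by a ~2e24 FLOPs pretraining budget for different guesses of
# the active parameter count, via C ~= 6 * N_active * D.
C = 2e24   # estimated Yi-Lightning pretraining FLOPs (BF16, ~40% utilization)
for n_active in (10e9, 15e9, 20e9):
    print(f"{n_active/1e9:.0f}B active -> ~{C / (6 * n_active) / 1e12:.0f}T tokens")
# 10B -> ~33T, 20B -> ~17T: roughly the 15-30T range inferred above.
```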
Superintelligence that both lets humans survive (or revives cryonauts) and doesn’t enable indefinite lifespans is a very contrived package. Grading “doom” on concerns centrally about the first decades to centuries of the post-AGI future (value/culture drift, successors, the next few generations of humanity) fails to take into account that the next billions+ years are also what could happen to you or people you know personally, if there is a future for originally-humans at all.
(This is analogous to the “missing mood” of not taking superintelligence into account when talking about future concerns of say 2040-2100 as if superintelligence isn’t imminent. In this case, the thing not taken into account is indefinite personal lifespans of people alive today, rather than the overall scope of imminent disruption of human condition.)
I don’t disagree, but I think we might not agree on the reason. Superintelligence that lets humanity survive (with enough power/value to last for more than a few thousand years, whether or not individuals extend beyond 150 or so years) is pretty contrived.
There’s just no reason to keep significant amounts of biological sub-intelligence around.
Cultural/moral maturity (in a civilization) has never been observed before, similarly to technological maturity. Scalable production of a new kind of thing brings its abundance in sight, which fails to be a concern earlier, while it couldn’t be scaled. A moderate level of AI alignment or of cultural change is not an equilibrium if these things are anchored to scalable resources (effective cognition and coordination, fast subjective serial time). Instead they reach extremes of the kind never observed before those resources become scalable.
Are you trying to say that for any X, instead of X-maturity, we should instead expect X-foom until the marginal returns get too low?
A pre-abundance precedent about X offers poor framing for thinking about the consequences of discovering a scalable process of producing X. Before abundance, it’s artisanal and quirky and path-dependent, the extremes are rare and dysfunctional, so people don’t worry about it too much. There is security in it looking like an equilibrium, but not being truly settled, so that people can influence things.
Abundance brings maturity, changes the character of the equilibrium. So not necessarily foom, just a promise of maturity at some point, which wouldn’t have been as salient before there is a scalable process of production. And there is an excuse for ignoring the possibility even longer, because of the total lack of historical precedent (of the associated problems).
I’d be interested in hearing why you think that cultural/moral/technological/mathematical maturity is even possible or eventually likely (as opposed to one just being immature forever[1]) (assuming you indeed do think that)
which seems more likely to me
I mean “maturity” merely compared to how we view what can currently be happening, such as a baseline level of competence in civilization-level governance, or what the individual people are capable of. Maturity compared to that baseline washes away all the currently relevant fiddly things, replacing them by settled processes.
These new processes are truly settled, so whatever new concerns become important then, the new baseline won’t be overturned. The analogy with technological maturity is that the laws of physics, and the ways of getting things done within them, form a fixed problem statement, so new baselines of effectiveness get locked in.
Agentic RLVR targeting the ability of AI to apply RLVR (or more lightweight finetuning) to itself when appropriate (using something like OpenAI’s RL API) potentially gives ARA capabilities and substitutes for more innate hypothetical ways of doing online/continual learning or undergoing on-boarding[1]. Thus the ability of AI to “do AI research” is not primarily about RSI or increasing the productivity of AI researchers, it’s about removing the last important hobble on LLMs that currently causes unrecoverable inability (for a given AI) to do some simple things or truly long-horizon tasks.
This gives a crucial threshold for dangerous AI research capabilities that’s way below ability to do RSI (which itself doesn’t require AI to understand AI, just engineering ability to meaningfully tinker and evaluate). Each of these things might lead to the next with little human input, and scaling pretraining might substantially close the gaps between them.
I think this point is sufficiently in the water supply that it’s no longer hazardous to discuss.
Do you have specific predictions/intuitions regarding the feasibility of what you describe and how strong the feedback loop could be?
Your post being about technical AI R&D automation capabilities kind of immediately made me curious about the timelines, since they’re where I’m somewhat worried.
Also, would Sakana AI’s recent work on adaptive Text-to-LoRA systems count towards what you’re describing?
Economics studies the scaling laws of systems of human industry. LLMs and multicellular organisms and tokamaks have their own scaling laws, the constraints ensuring optimality of their scaling don’t transfer between these very different machines. A better design doesn’t just choose more optimal hyperparameters or introduce scaling multipliers, it can occasionally create a new thing acting on different inputs and outputs, scaling in its own way, barely noticing what holds back the other things.
A reflectively stable agent prefers to preserve some property of itself. This doesn’t in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don’t prevent presence of self-improving agents in the world.
The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the outer agent’s cognition don’t even need to have its safety properties. This is one framing for the way people might live within a superintelligence.
Are there pivotal ways this is different from the theories of Enactivism?
(“Its authors define cognition as enaction, which they in turn characterize as the ‘bringing forth’ of domains of significance through organismic activity that has been itself conditioned by a history of interactions between an organism and its environment,” which at first blush I’d say is a reflectively stable agent modifying or updating beliefs by means of enaction. Enactivism also rejects mind-body duality in favour of a more ‘embodied’ cognition approach together with a “deep continuity of the principles of self-organization from the simplest living things to more complex cognitive beings”), particularly autopoiesis.
An autopoietic system can be contrasted with an allopoietic system, which creates objects different from itself, like a factory. Most living beings are autopoietic in that they produce either themselves or things like themselves, which seems similar to a reflectively stable agent, particularly when we describe the more complicated cognitive beings in autopoietic terms. Luhmann argued that social systems too are self-organizing, self-reproducing systems, which brought the concepts of enactivism from biology and cognitive science into the social sciences.