Total params plus the total KV cache across all concurrent requests is what the cost of output tokens scales with, so there is reason to keep total params down, but little reason to make them much smaller than the HBM of the whole scale-up world, because then they are much smaller than KV cache and stop influencing the cost. And for the most capable models the fraction of input tokens on OpenRouter is not as extreme as for Sonnet 4 (88% for Gemini 2.5 Pro, 92% for GPT-5; though 97% for Opus 4.1, probably due to high cost). So this won’t be a factor that motivates fewer active params, as it is with the 8-chip servers and possibly in part with the 6-8 TB systems. Also, 2025 Google pretraining compute could be significantly greater than 100K H100s (maybe 2-4 datacenters of 100K TPUv6e each, which have the same FLOP/s as 200-400K H100s; pretraining models that are too large for TPUv6e is fine, it’s only inference and RLVR that don’t fit there). So the compute optimal number of active params could increase to 1.0-1.5T (if my 120 tokens/param estimate is in the ballpark). This asks for at least 4-6T total params, but at least 8-12T at 1:8 sparsity might be more appropriate for a premium model (this would be Gemini 3 Ultra). That’s only about 20% of a pod’s HBM (if in FP8), so maybe even 15-20T (at which point the contribution to the cost of output tokens becomes significant).
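A minimal back-of-envelope sketch of the arithmetic above. The 120 tokens/param ratio, the 200-400K H100-equivalents, and the 1:8 sparsity are from the estimate above; the ~1e15 dense BF16 FLOP/s per H100-equivalent, ~40% utilization, ~100 days of pretraining, and the 256-chip TPUv7 pod at 192 GB HBM per chip are my own rough assumptions for what "pod HBM" means here.

```python
h100_flops = 1e15          # ~dense BF16 FLOP/s per H100-equivalent (rough assumption)
mfu = 0.4                  # assumed compute utilization
seconds = 100 * 86400      # ~100 days of pretraining (assumption)
tokens_per_param = 120     # tokens per active param, from the estimate above

pod_hbm_bytes = 256 * 192e9    # assumed 256-chip TPUv7 pod, 192 GB HBM per chip

for n_chips in (200_000, 400_000):   # "same FLOP/s as 200-400K H100s"
    C = n_chips * h100_flops * mfu * seconds          # total pretraining FLOPs
    n_active = (C / (6 * tokens_per_param)) ** 0.5    # solve C = 6 * N * (120 * N)
    total_params = 8 * n_active                       # total params at 1:8 sparsity
    weight_bytes = total_params                       # FP8: 1 byte per param
    print(f"{n_chips // 1000}K H100-equiv: active ~{n_active / 1e12:.1f}T, "
          f"total (1:8) ~{total_params / 1e12:.0f}T, "
          f"weights ~{weight_bytes / 1e12:.0f} TB "
          f"~{weight_bytes / pod_hbm_bytes:.0%} of pod HBM")
```

With these assumptions this prints roughly 1.0-1.4T active params, 8-11T total params at 1:8 sparsity, and 16-23% of a pod’s HBM, consistent with the ranges above.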
I’ve only recently realized that the reason there is no Gemini 2 Ultra might be that they don’t have enough inference capacity for models with very large total params, with TPUv6e only having 8 TB of HBM per pod and TPUv5p either outright insufficient in number or not available to spare, since those are needed for other things. So it’s probably not evidence of Google having decided to use less than what they have, as I previously thought. And as TPUv7 changes what they have, they might use it to do more than what they did with Gemini 2. Though if the TPUv7 buildout isn’t sufficiently far along in 2025, RLVR and inference will have to wait until later in 2026 (in the meantime, TPUv5p might help to start on RLVR).
Here are a couple of my recent relevant posts (both slightly outdated; in particular see this comment, and the note on Gemini 2 Ultra in another comment under this quick take). Though in this quick take I’m mostly discussing total params count and HBM capacity per scale-up world rather than compute: how they constrain 2025 AIs beyond compute (so that even 2024 levels of compute fail to find efficient use), and how in 2026 these constraints become less strict.