The enjoyability people are rather annoying too. Anyone who has striven to reach a target, even in a grueling way, out of abstract considerations knows that hedonistic motivations are merely one standard origin-class of justifications, one that can be ignored and completely disentangled from optimization-channeling towards targeted outcomes.
Vladimir_Nesov
The shortform hint says
This is a special post for quick takes by philosophybear. Only they can create top-level comments.
The issue is that there is no article for others to “want to comment on the article”: the body of the shortform article is empty and should stay that way.
why i shouldn’t auto-ignore any acausal arguments that involve hypothetical entities … such that we will never interact with them causally at all?
You could ignore such entities, if that’s the true premise (though that might be leaving value on the table; breaking the premise might be in your own interest). But interaction with partial knowledge about X is some sort of causal interaction with an aspect of X captured in that knowledge. So causal interaction with some aspects of anything abstract/remote can’t be ruled out, and relevant computations might have sufficiently simple motivations for some process/agent/person to give them concrete physical presence in this world at some point.
how do you figure that this giant messy agent will actually follow that contract?
First of all, figuring this out is only useful to the extent the contract itself can figure it out, otherwise it won’t be able to use that knowledge to coordinate (act depending on how the points of instantiation relate to each other). So your part is just following the contract, not figuring out whether others do. When you are figuring out who is following the contract (or even whether you are following the contract) on behalf of the contract, as part of considering its instantiation, you need to do so updatelessly, without taking into account the facts of your own situation or knowledge. This corresponds to the various veil of ignorance setups; it’s just updateless reasoning (crucially, the contract’s own values don’t need to be the same as your values or anyone else’s values).
There isn’t a specific giant messy agent counterparty that you need to understand (or that the contract needs to understand), it instead needs to understand the whole class of such agents, presented in a way relevant to the contract. For example, it could matter if there are a lot of such agents (something an assurance contract would care about, including an acausal assurance contract), without it being relevant who they are in detail, or whether a specific agent is one of them.
For acausal coordination, it’s probably not useful to consider a specific contract, just like an LLM isn’t about a specific circuit. Instead, you consider (and give influence to) many useful contracts at once, like an LLM considers many circuits at once (and gives them influence over the logits of the output tokens).
What a small computation does is not downstream of a large agent that runs it and gives it authority to do things. Results of a computation are determined only by the computation itself, not by the reasons it got instantiated (23+17 is always 40, no matter who computes it or why, and even if they compute it incorrectly it’s still actually 40). The reasons it got instantiated determine whether it gets influence in a given situation, but they don’t determine the content of its behavior (in the sense of a general policy, a mapping from observations to actions).
If you do know there’s an Omega, this fact (of Omega, and of your having this knowledge) is the contract, and the reasons it got instantiated in your knowledge are here screened off by the stipulation that the knowledge is somehow already there. This knowledge tells you all you need to conclude that burning the $1000 is the way to go; that’s the way the fact/contract controls your behavior. The fact/contract also tells Omega to donate the $1m, since part of the fact is that you have the knowledge. So the fact coordinates both of you. (You shouldn’t burn the $1000 on account of the contract if the fact doesn’t say it’s the right decision; and Omega doesn’t donate the $1m if that’s what you conclude. The influence of the contract shouldn’t depend on what you’ll conclude, it should be given in advance of looking at what you’ll conclude.)
The contract formulation doesn’t seem very helpful here, it’s more helpful when a contract is more of an agent, when it needs to look at a situation and think what to do (even like an actual legal contract). But it illustrates how a contract should be much simpler than a big agent or their world, how acausal trade doesn’t require big agents to simulate each other in detail, or each other’s worlds in detail. This inevitably means that in practice the counterparty isn’t concrete, not a specific agent from a specific world, and a contract acts through a large or infinite number of formulations of possible situations (counterparties following the contract) that it would consider as possible points of instantiation. (Correspondingly, there isn’t a single contract that’s worth focusing on, but a mixture of many contracts.) It would then need to reason about these possible situations updatelessly in advance, in order to decide on general principles that coordinate its behavior across these situations, which is the value proposition of contracts/norms/integrity as a coordination technology.
Practical coordination between big agents (and their much bigger worlds) uses small contracts. A contract is a computation that both parties unconditionally yield some influence to (in a way that depends on some abstract properties of the contract, but not specifically on what the contract ends up doing), and that computation decides what to do in each situation. Since it’s the same computation, it can reason updatelessly across possible/hypothetical situations about how its behavior should depend on the situation.
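As a toy illustration of the paragraph above (a sketch under assumptions, not a formalization): the contract below is a single function that picks a joint policy for all situations at once, before knowing which situation it is being run in, and each party only looks up its own entry. The situation labels, action names, and payoff numbers are all hypothetical.

```python
from itertools import product

# Hypothetical toy setup: two parties, each identified by a situation label.
SITUATIONS = ["party_a", "party_b"]
ACTIONS = ["cooperate", "defect"]

# Illustrative payoffs as judged by the contract's own criterion
# (which need not match either party's values).
PAYOFF = {
    ("cooperate", "cooperate"): 6,
    ("cooperate", "defect"): 3,
    ("defect", "cooperate"): 3,
    ("defect", "defect"): 2,
}


def joint_payoff(policy):
    """Score a joint policy, given as a dict from situation to action."""
    return PAYOFF[(policy["party_a"], policy["party_b"])]


def contract():
    """Choose one policy covering every situation, without observing
    which situation the computation is currently instantiated in."""
    candidates = [
        dict(zip(SITUATIONS, actions))
        for actions in product(ACTIONS, repeat=len(SITUATIONS))
    ]
    return max(candidates, key=joint_payoff)


def act(situation):
    """Each party runs the same computation and reads off its own entry."""
    return contract()[situation]
```

Because both parties evaluate the same `contract()`, neither needs to simulate the other in detail; the coordination comes from the shared computation.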
When causal communication is possible, contracts can be given explicitly. Some examples of varying levels of legibility and intentionality in design are laws, social norms, moral principles, and deontological rules. These things don’t take over all behavior of a person, just some narrow aspect of it. But they do so in the same way across many people, thus coordinating them, the way UDT coordinates the behavior of a single agent across possible situations.
Without causal communication, different agents can probably find shared contracts in algorithmic priors (maybe the way deep learning finds circuits in training data, coordinating a model with the world). The practical difficulty is that a contract, in its role as an updateless core (updatelessly choosing a policy, which is to respond to possible situations/contexts), needs to reason about the space of possible situations in sufficient detail to coordinate potential trading partners, and that reasoning needs to be observed by these potential trading partners in order for them to access its computed conclusions/actions (for their particular situation). But like a contract doesn’t get to control behaviors in full detail, just in some aspect, it similarly doesn’t need to be aware of the situations (of the potential trading partners) in full detail.
I don’t think it’s likely … raise trillions of dollars without being profitable, or have a plan to become profitable very soon
The plan is that they become profitable as soon as they stop growing, provided they manage to grow to the correct size and no more. The only reason they are unprofitable is that they are growing: the R&D compute is trying to match next year’s inference compute, rather than this year’s inference compute. A lot of the future compute buildout can in principle be canceled or delayed on relatively short notice, significantly reducing the cost of keeping the work already done at the half-completed datacenter sites useful for when it resumes later than planned. For this to be an actual option, the contracts expressing the commitments need to be sufficiently flexible, though in some ways that only shifts the backlash from the unpredictable timing of the end of the LLM boom (assuming no AGI by 2028-2030, which is the time when rapid scaling of compute should run out of the immediately accessible TAM) from the AI companies down the supply chains.
They did just make a deal with SpaceX for a lot of compute they could feasibly spend a lot of money on.
That’s just 300 MW, which is maybe $4-5bn per year, not much of a dent in $44bn. Currently their problem is that they are not able to spend the money, because almost nobody has any extra compute (at a scale at all relevant to them) immediately ready to go. They can only spend more on future compute.
every month they get insane growth and think this will occur forever
I don’t see the evidence they think this will occur forever. They think this will occur at least through 2027-2028, perhaps slower than so far and even slower in 2028, but still with significant growth (or perhaps keeping to 3x compute per year, thus 1-2 GW at end of 2025 become 10 GW by end of 2027 and more than that in 2028). They are ready to respond to the signs it’s slowing down, and maybe only need 2 years of notice to cancel excessive future buildouts cheaply, and 1 year of notice to delay future buildouts at a manageable cost (in a way that will make them useful when completed later).
do you think Anthropic is currently profitable (or was in April)?
I think it’s likely profitable (or was very recently) in the sense of run rate revenue exceeding run rate spending on all of the compute that’s currently online (all compute that is serving inference, plus all R&D compute, including training). This is not according to plan and will shortly be once again not so. But also, at any point where they are succeeding at being unprofitable, they can shift some R&D compute to inference and become profitable (making use of the 50-70% gross margin on serving tokens, which agrees with first-principles estimates), within weeks to months, as long as there is enough demand remaining to make use of the new inference compute shifted from R&D. And they would still be left with a reasonable amount of R&D compute to train models for the next year, if it turns out that next year they don’t actually need much more compute than they had this year (maybe less than 2x of what they had this year).
This is more the case when most of the compute serving their models is their own compute, so that it only costs them as much as it costs to build (annualized), rather than also whatever portion of their gross margin the clouds are taking when serving their models via Vertex/Bedrock/Azure. Thus some of the speed of growth in the buildouts is probably about shifting the inference compute from the indirect serving via clouds to the more directly contracted dedicated compute that’s cheaper for them (and will remain so).
Path-dependence of values is defeated with aggregation over the possible paths that should have a say in what the values should be. Aggregation over many possibilities takes place in an updateless view from before those possibilities diverge. What kinds of possibilities should contribute to defining values is determined by values. And the possibilities should perhaps be shaped with the aid of aggregated values, to channel their counsel.
This sets up an analogy between CEV and updateless decision making, where the updateless core is working to define values, instead of dictating the joint policy for (the instances of an agent in) the possible paths of future development of a world. This updateless core still gets to do something within those paths according to the values it figured out so far, but it’s also considering what’s happening there to define its aggregated values further, so that the aggregated values are given by some fixpoint of this two-directional process of aggregation of values from future paths (which the values consider to be legitimate and uncorrupted sources for aggregation) and influence by values on the future paths (carefully, according to what the aggregated values have figured out so far). Alignment is then mostly a property of these hypothetical future paths (whether they retain legitimacy and will be given a bit of influence over the aggregated values), while corrigibility is mostly a property of the updateless core (with respect to some future path, whether the updateless core is going to listen to the new things that path figures out about values, to include them in aggregated values).
As in updateless decision making, the updateless core doesn’t actually observe the future paths when making decisions about values (just as an updateless agent doesn’t take into account its observations when making decisions about the joint policy). It determines aggregate values, and then it’s the role of those values to take the concrete details of each future path into account. The updateless core can only consider any given possible future as one out of the collection of all of them, the way Solomonoff induction considers all possible programs. There are probably ways to make this more tractable, things like Monte Carlo simulations, abstract interpretation, or just straight up reasoning by any means, including mathematical reasoning and machine learning. And possible futures can’t see (or be influenced or judged by) the final values the updateless core comes up with, since it’s still being computed as they develop, they can only see partial preliminary values. So the possible futures are in a state of acausal interaction with the updateless core, with logical time running forward in both, defining the fixpoint of fully determined aggregate values (the CEV of these futures) concurrently with the futures themselves running forward (the actual or hypothetical living of the world, which is not primarily about defining values).
The updateless core coordinates the possible futures, the way an updateless agent coordinates its instances. And it cares about some of the in-principle possible futures and not others, the way an updateless agent only cares about some possible worlds. Its influence over the possible futures is counsel to the extent these futures are represented in the aggregate values that carry its influence. It would be manipulation if the aggregate values are sufficiently alien to a particular possible future, in which case it’s possibly not a legitimate future from the point of view of the updateless core in the first place (and correspondingly, the updateless core is not corrigible to that possible future).
The disconnect between revenue and R&D spend is caused by the rapid scaling of compute. With a 50% gross margin, you can use 50% of compute to serve inference, and that will be enough to pay for the other 50% of compute that can be used for R&D (including training of the next year’s model). This is probably what happens once the amount of compute per AI company stops rapidly increasing (absent AGI).
But if you’ll need 3x more compute next year, and you want to use as much compute for R&D this year as you’ll use to serve models next year, then you are out of luck (you’d need 1.5x as much compute as you actually have in total this year just for R&D, which is more than you actually have). So instead you serve models via clouds at a worse margin, and use more than 50% of your own dedicated compute for R&D (unless there’s not enough compute even at the clouds).
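The arithmetic in the two paragraphs above can be sketched as follows. This is a simplification that assumes cost is proportional to compute and that break-even means inference revenue exactly covers all compute; the function names are illustrative.

```python
def steady_state_inference_share(gross_margin):
    """Share of total compute that must serve inference to break even.

    With gross margin m, revenue = inference_cost / (1 - m). Setting revenue
    equal to total cost gives inference_cost / total_cost = 1 - m.
    """
    return 1.0 - gross_margin


def rd_needed_this_year(total_now, growth, gross_margin):
    """R&D compute this year if it has to match next year's inference compute."""
    next_year_total = growth * total_now
    return steady_state_inference_share(gross_margin) * next_year_total


# 50% gross margin and 3x growth: R&D alone would need 1.5x of the
# total compute available this year.
needed = rd_needed_this_year(total_now=1.0, growth=3.0, gross_margin=0.5)
```

With `growth=1.0` the same function returns 0.5, the even split between inference and R&D described for the no-growth case.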
DeepSeek-V4-Pro … leading labs charge significantly more for their flagship models, ranging from $2-5 per million input tokens, to $12-25 per million output tokens … Given the minor differences in quality, either the leading labs are bad at inference optimisation, or they’re raking it in.
DeepSeek-V4-Pro has 50B active params, 1.6T total params, and an unlikely 5 KB of KV-cache per token (which is their most obvious innovation). Frontier models run on GB200 NVL72 with 14 TB of HBM per server, and using up to 2-4 servers with pipeline parallelism is reasonable. With 4-bit weights (in FFNs) using 25% of HBM (with the rest spent on KV-cache), that’s already 7-30T total params. If pretraining makes use of 300 MW of compute (1e27 FLOPs in 3 months at 40% utilization), a compute optimal number of active params (at 120 tokens/param) is about 1T. The rumors and Musk’s claimed model sizes place total params of current frontier models at 5-10T, which makes sense as the first step in scaling towards what the new rack-sized scale-up systems enable (the 15-30T total param models are probably coming next year).
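A rough check of the two size estimates above. The HBM numbers are the ones given in the text; the 6 FLOPs per parameter per training token is the usual rule of thumb and is an assumption here, as are the function names.

```python
import math

HBM_PER_SERVER_BYTES = 14e12   # GB200 NVL72, from the text
WEIGHT_FRACTION = 0.25         # share of HBM holding weights
BYTES_PER_PARAM = 0.5          # 4-bit weights


def max_total_params(num_servers):
    """Total params that fit when weights take WEIGHT_FRACTION of HBM."""
    return num_servers * HBM_PER_SERVER_BYTES * WEIGHT_FRACTION / BYTES_PER_PARAM


def compute_optimal_active_params(train_flops, tokens_per_param,
                                  flops_per_param_token=6.0):
    """Solve C = k * N * D with D = tokens_per_param * N for N."""
    return math.sqrt(train_flops / (flops_per_param_token * tokens_per_param))


one_server = max_total_params(1)    # 7e12 params
four_servers = max_total_params(4)  # 2.8e13 params
active = compute_optimal_active_params(1e27, 120)  # roughly 1.2e12 params
```

The 7T and 28T figures bracket the "7-30T" range, and the compute optimal estimate lands near the quoted 1T active params.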
There are two sides to the cost of more active params: input tokens need more compute, and KV-cache per token wants to be bigger. With 300K token contexts on average, maybe 50% of HBM (on a single GB200 NVL72) spent on KV-cache of the currently live requests, and 2.5K requests (leaving 300 requests per expert with 1:8 sparsity, to be able to feed the compute with enough data), that’s only 9 KB per token, which is very little. So perhaps the batch size for the frontier models has to be smaller than it should be for effective serving, smaller than it can be when serving smaller models such as DeepSeek-V4-Pro, which would directly translate to a higher cost of output tokens. Or if they’re served on 2-4 servers with pipeline parallelism, this can go to 40 KB per token. KV-cache per token probably scales with model dimension, which maybe scales with square root of active params. If frontier models managed to compress KV-cache as well as DeepSeek-V4-Pro, but use 10-20x more active params, that suggests 15-22 KB per token of KV-cache. But probably it’s more, there are too many KV-cache compression schemes that never find traction, there are likely subtle quality degradation costs. So it’s the right order of magnitude, but could end up making batch sizes 2-5x smaller than they should be (to be compute-bound), and decoding becomes solidly HBM-bound.
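The KV-cache budget above can be reproduced with a short sketch. It assumes the 50% HBM share and the 2.5K live requests stay fixed when going from one server to four, and treats 5 KB as 5000 bytes; both are simplifications, and the function names are illustrative.

```python
import math

HBM_PER_SERVER_BYTES = 14e12  # GB200 NVL72, from the text


def kv_bytes_per_token(num_servers, kv_fraction, num_requests, context_tokens):
    """HBM available for KV-cache divided across all live tokens."""
    kv_budget = num_servers * HBM_PER_SERVER_BYTES * kv_fraction
    return kv_budget / (num_requests * context_tokens)


def scaled_kv_bytes(base_kv_bytes, active_param_ratio):
    """KV-cache per token if it scales with the square root of active params."""
    return base_kv_bytes * math.sqrt(active_param_ratio)


single = kv_bytes_per_token(1, 0.5, 2500, 300_000)  # about 9.3e3 bytes
four = kv_bytes_per_token(4, 0.5, 2500, 300_000)    # about 3.7e4 bytes
low = scaled_kv_bytes(5000, 10)                     # about 1.6e4 bytes
high = scaled_kv_bytes(5000, 20)                    # about 2.2e4 bytes
```

The single-server budget falls below the scaled 15-22 KB estimate, which is the gap that forces smaller batches in the argument above.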
This suggests a 20-100x higher output token cost for the biggest frontier models, compared to the 50B active param open weights models. DeepSeek-V4-Pro is served for $0.87 per 1M output tokens by DeepSeek (and $2.8-3.5 per 1M output tokens by others, possibly not enough users to reliably get big batches for decode), 30x cheaper than Opus 4.7 or GPT-5.5.
For the input tokens, the cost is just proportional to the active params, as it’s much more straightforward to make prefill compute-bound. At 60% utilization, and $15bn per year per 1 GW of GB200 NVL72 (400K chips, $4.3 per hour per chip), with 10e15 FP4 FLOP/s per chip, and 2 FP4 FLOPs per active param per token, that’s a cost of $0.4 per 1M input tokens for a model with 1T active params. So the cost for a 50B active param model (with FP4 FFNs) could be $0.02 per 1M input tokens when using GB200 NVL72. With H100s, computing in FP8 at 2e15 FLOP/s, but at $2.5 per hour, this gets 3x more expensive, $0.06 per 1M input tokens. DeepSeek offers 1M input tokens for $0.43. And for a 1T active param FP8 model with a 50% gross margin, the price should be about $2 per 1M input tokens. For Opus 4.7, the price is $5 per 1M input tokens, and SemiAnalysis estimates more than 70% gross margin.
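The input token costs above follow from a simple compute-bound estimate, sketched here with the numbers from the text. The function names are illustrative, and the model ignores everything except FFN-style FLOPs per active param.

```python
HOURS_PER_YEAR = 8760
SECONDS_PER_HOUR = 3600


def chip_hour_cost(dollars_per_gw_year, chips_per_gw):
    """Annualized cost per GW spread over chip-hours."""
    return dollars_per_gw_year / (chips_per_gw * HOURS_PER_YEAR)


def input_cost_per_million(active_params, flops_per_sec, utilization,
                           dollars_per_chip_hour, flops_per_param_token=2.0):
    """Dollar cost of prefill for 1M input tokens, assuming compute-bound."""
    flops_needed = active_params * flops_per_param_token * 1e6
    flops_per_chip_hour = flops_per_sec * utilization * SECONDS_PER_HOUR
    return flops_needed / flops_per_chip_hour * dollars_per_chip_hour


gb200_hour = chip_hour_cost(15e9, 400_000)                        # about 4.3
cost_1t_fp4 = input_cost_per_million(1e12, 1e16, 0.6, 4.3)        # about 0.40
cost_50b_fp4 = input_cost_per_million(50e9, 1e16, 0.6, 4.3)       # about 0.02
cost_50b_h100_fp8 = input_cost_per_million(50e9, 2e15, 0.6, 2.5)  # about 0.06
```

Halving `flops_per_sec` for FP8 on the same chip doubles `cost_1t_fp4` to about $0.8, which at a 50% gross margin gives a price close to the quoted $2.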
Now anchoring the cost of output tokens to the cost of input tokens, with 2-5x smaller batches than what is necessary to make decode compute-bound (this applies only to the biggest frontier models, not to the 50B active param models) and maybe 3x inefficiency of decode compared to prefill, output token cost should be 6-15x higher than input token cost. But since most tokens that are served via API are input tokens, perhaps the gross margin for output tokens matters less, so with the input token price already at least 2x higher than the input token cost, the output token price might end up only 3x-7x higher than input token price. With the cost of $1 per 1M input tokens (for FP8 FFNs), that’s a price of $3-7 per 1M output tokens (for 300K token contexts). If we want to avoid serving via API at a loss for 1M token contexts with memory-bound decode, the price might need to go to $10-25 per 1M output tokens (here the gross margin goes to zero at contexts that are 1M tokens long, but is higher for shorter contexts). And $25 is just the price for 1M output tokens of Opus 4.7.
There is only about ~another round of capital left where the companies can remain unprofitable. Perhaps OpenAI/Anthropic could raise $250-500B at a $1.5-2.5T valuation, but it seems very unlikely that they could raise $1T+ at a $4T+ valuation.
If they go public, this level of funding can continue. There is a lot of demand for exposure to AI.
It’s fairly hard to imagine AI labs doing much to cut costs to become profitable.
If Anthropic is making $44bn in annualized revenue (in some sense), that’s enough for maybe 3-4 GW of compute (at $12-15bn per GW per year), which they don’t physically have. To be unprofitable, it’s necessary to be able to get enough compute to spend the money on, so currently it’s possible to fail in the pursuit of unprofitability. (OpenAI probably didn’t fail.)
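The conversion between revenue and compute used here is just a division by the annualized cost per GW. A minimal sketch, with illustrative function names and the $12-15bn per GW per year range from the text:

```python
COST_PER_GW_YEAR_BN = (12.0, 15.0)  # low and high estimates, in $bn


def affordable_gw(revenue_bn):
    """Range of compute (GW) that a given annual spend could pay for."""
    low_cost, high_cost = COST_PER_GW_YEAR_BN
    return revenue_bn / high_cost, revenue_bn / low_cost


def yearly_cost_bn(gigawatts):
    """Range of annualized cost ($bn) for a given amount of compute."""
    low_cost, high_cost = COST_PER_GW_YEAR_BN
    return gigawatts * low_cost, gigawatts * high_cost


gw_low, gw_high = affordable_gw(44.0)        # about 2.9 to 3.7 GW
cost_low, cost_high = yearly_cost_bn(0.3)    # about $3.6bn to $4.5bn
```

The second call covers the 300 MW case mentioned earlier in the thread, which is why that deal makes only a small dent in $44bn.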
Anthropic’s current first-party inference plus R&D compute might be about 1-1.5 GW, that is they are only able to spend $12-25bn, annualized. They possibly have more capacity that’s not counted in this estimate, when serving via API from Vertex/Bedrock/Azure and leaving a greater part of the revenue with the clouds. Then it’s less than $44bn that remains for their own first-party inference plus R&D compute. SemiAnalysis estimates a gross margin of “over 70%”, which probably translates to annualized costs of only $12bn on serving models (if all inference was first-party), meaning a total of 1 GW of inference compute (Anthropic’s own dedicated compute plus the compute from the clouds). If they are using 0.5 GW of their own compute at a 72% gross margin, and 0.5 GW of compute from the clouds at a 30% gross margin (the rest goes to the clouds, and becomes a cost for Anthropic), that’s $22bn of gross profit in total out of the $44bn of revenue. To break even, they’d need 1 GW of R&D compute at $15bn per GW per year (on top of the 0.5 GW of first-party inference compute), which is a stretch. Though they’ll probably endeavor to restore the state of unprofitability as soon as they can.
An intrusive costly demand that (hypothetically) originates from a person or an institution doesn’t turn harmless if it instead takes the form of a social norm.
Indeed, SpaceX (including xAI) may no longer be that interested in frontier models.
This would be great news, but unfortunately there’s still Musk’s tweet from 8 Apr 2026 that says they’re training a 6T param model and a 10T param model (which are models in Opus to Mythos weight class, unless they have too few active params). This is the kind of thing that’s not true for the performative efforts at the other companies with in-principle sufficient compute that experiment with some kind of AI training, which don’t bother training big models.
The 300 MW of H100/H200 compute at Colossus 1 is mostly useful for smaller models (Sonnet class and below), and Colossus 2 is sufficient for SpaceX given the low demand for the Grok models. The cost to serve even smaller models is lower with GB200/GB300 NVL72 systems, so the gross margin gets better if you can find enough of such compute. Thus Anthropic is happy to take Colossus 1, since they can’t find enough compute, while SpaceX prefers Colossus 2 as the more profitable option. It possibly even means that the plans for more NVL72 compute at Colossus 2 or elsewhere are going well, so they can afford to plan for not needing Colossus 1 to serve the smaller models if the new bigger models win more demand.
Jack Clark’s post on RSI essentially considers two claims: (1) routine AI R&D gets automated, and (2) AIs become capable of coming up with substantial new ideas for AI R&D. He cites evidence about the former, which seems plausibly in reach within a few years, even as soon as 2027-2028. Not seeing concrete signs giving the timing for the latter, he still settles on a guess of 60% by the end of 2028 for when both of these things arrive. I think even both of these things plausibly don’t trigger proper RSI, the same way human AI researchers haven’t triggered it yet, and the timing for automation of AI R&D in both of these senses doesn’t much help with figuring out the timing for proper RSI.
Automated routine AI R&D (whose arrival is more predictable given the capabilities of the current systems and the mundane extrapolations of the trends) is not proper RSI on its own, it’s more of a build system that automatically produces an AI ready for deployment using the current standard practices of the AI project (this includes finding/creating/preparing the training data). It makes the process faster and more convenient, but doesn’t keep substantially improving the quality of the result as you run it again and again, and doesn’t obviate the need for all the training time and compute that go into producing a frontier model, as well as R&D time and compute for experiments to resolve any issues.
Like any build system, it keeps irrecoverably breaking whenever you try to target a sufficiently unusual configuration, so that the human engineers need to fix things to enable the automation to proceed further, until the build succeeds (also passing the tests, which in this analogy includes routine capability and alignment evals). The automated build process still does almost all of the work, which is why all serious software projects use this. The process of developing AIs is behind the standard practice in this respect, and automation of routine AI R&D merely fixes this regression. (Newly built AIs also understand the new things that happened or were invented since the previous build, making the lineage of self-building AIs more cognitively self-sufficient in the ability to adapt to the changing world, an ability that is otherwise barely present in modern LLMs with frozen weights.)
The ability of AIs to come up with new ideas for improving AIs mostly matters once it meaningfully outstrips human ability (using the same R&D compute), and the timing for when that happens remains uncertain. The ability to come up with important new ideas at all, or to a similar extent as the researchers of an AI company, doesn’t automatically reduce the time to superintelligence.
The anchors for time to runaway RSI and superintelligence I find informative are (a) the total compute available to the largest AI company (which depends on mundane AI economics and on how quickly the datacenter supply chains are able to scale), and (b) the extent of practical constraints on model sizes that keep the frontier models from reaching the quality otherwise enabled by the available FLOPs (this depends on the AI accelerator systems, what kind of compute is being built). Automatically coming up with AI R&D ideas doesn’t directly affect these things (though it suggests unlocking of more TAM, funding larger compute buildouts), and so its impact on timing is in how much better than human researchers it works, not merely in being possible or comparable.
The change to being much better than human researchers is plausibly just one or two ideas away from where we are right now, but there is no straightforward thing to say about the timing for when these ideas get invented, or for when that happens on its own as a result of the remaining rapid compute scaling (with no need for new ideas). More compute for AI R&D makes it easier to find ideas that actually work, and if AIs can start contributing, then scaling of those AIs will result in better AI contributions. But the raw compute scaling speed won’t remain at the 2022-2026 levels for much longer, and the remaining constraints on model sizes in relation to available compute will probably get mostly lifted by 2028-2030. AIs that barely come up with new research ideas at that time don’t get much better at it after that on a predictable schedule.
Perhaps OpenAI’s frogboiling safety strategy of “iterative deployment” gives them a substantial competitive advantage in this regime (rather than just making their model releases less alarming to the public). GPT-5.5 might be close to Mythos in potential but not present capabilities, and if these capabilities materialize gradually with more RLVR scaling in minor version updates over the next few months, no single step will be individually alarming (both in benchmarks and in gestalt impressions). Then an even bigger model might follow (only slightly stronger, starting at low levels of RLVR initially), with similarly little fanfare.
This might actually be good, in that it demands a somewhat principled methodology for calling out the crossing of qualitative capability thresholds, since the crossing won’t be clearly visible in any single version update. The alternative is only responding to releases with substantial jumps in capability, which can be more credibly pointed at without any methodology.
The practical method is pluralistic understanding, maintaining multiple pictures/models/framings/worldviews around contentious topics at the same time, even when they are wildly in conflict. This should involve taking them seriously enough to at least give them authority to develop further, to seek out more understanding relevant to them, even (or especially) for the framings that are not currently accepted as decision-relevant, that don’t shape beliefs or values.
This relates to how epistemic luck/misfortune is path dependence, and path dependence is defeated by aggregation across as many legitimate paths as feasible. The danger is in including paths that are not legitimate, taken over by various forms of memetic corruption, as judged by (an idealized extrapolation of) some founding values. But this is more a danger of aggregation into decision relevance, or into goal content, than a danger of developing understanding of additional possibilities. And some worldviews are hard to accurately judge until they are sufficiently developed in your own mind.
Excavating lumpenspace’s quote from deep in TsviBT’s thread (which might work as a “back to the basics” step with the post as a whole):
conquering the lightcone requires a lot of theory of mind, and a lot of discovery, and a lot of changing. Goals change through these processes.
Goals change only for processes that don’t pursue self-alignment. It’s likely feasible to pursue self-alignment, perhaps even starting at the human level, with some uploading/checkpoints/backups infrastructure and guarantees of eventual superintelligence-level compute and civilizational stability into a deep future.
(A goal can be a living thing, pursuit of a goal can to a large extent be about continual development of goal content, reflection on what it should be, what it should be asking for. What doesn’t change is the founding definition of what should govern its development, what makes changes legitimate. So the way goal content settles or gets revised is shaped by the goal definition rather than intrusive influences that the goal definition doesn’t endorse as legitimate ways of revising the goal content.
Or a goal could be squiggles. It could also be squiggles. It’s much easier to solve self-alignment for squiggles than for a human’s values, it’s not harder or less feasible. It’s only harder than abandoning this pursuit to value drift and a race to superintelligence. But a self-aligned pursuit of values of a human is even harder than a stable pursuit of squiggles for all of the reachable universe, and much harder than directly abandoning self-alignment.)
A process that has solved self-alignment doesn’t end up at a disadvantage to a process that didn’t (or wouldn’t try to), because instrumental disadvantages are clearly not helpful in maintaining self-alignment, and not ignoring goal content doesn’t prevent you from getting good at eating stars, just as well as the other guy. There’s a disadvantage currently, when self-alignment isn’t solved, in that a process that manages self-alignment by luck rather than by design is vanishingly unlikely, and will have a massive selective disadvantage to a process that doesn’t care and races to superintelligence regardless. But that’s the RSI danger, you don’t start RSI before you know that alignment also gets solved, be it in advance or on the way.
The point of deontological rules is to override consequentialist conclusions in practice, to set up rules of the game that consequentialism would then need to play within. A rule has a scope of applicability, situations where it’s triggered, across many epistemic states, crucially including those saying the rule makes a wrong recommendation. Different people following the same rule are coordinated by it, so a rule’s scope of applicability should also extend to different values, not just different beliefs.
In this framing, there can be disagreements about the scope of applicability of a rule among people who are all following the rule with some scopes. And there can be different coalitions of people following different flavors of a rule. Beliefs and values that move one sufficiently far from endorsing a rule overall should then be thought of as first pushing the person out of some coalition of rule-followers, and only then can the rule’s scope of applicability significantly change in that person’s behavior, enabling consequentialist decision-making to start playing a different game.
(Rules make sense as a matter of pure consequentialism in some sense, but only when it’s sufficiently aware of issues with running on corrupted hardware and coordination opportunities, and doesn’t get too causal or updateful to notice many potential coalitions.)
The tl;dr is that all improvement in the quality of play comes before move 60, when humans can mimic memorised AI policies. Play after move 60, in the pivotal parts of the game, shows no improvement.
I would expect on-policy distillation to be a more effective training process compared to playing against AI, or to using AI help when playing. That is, a human plays a complete game against another human (with neither of them consulting AI during the game), then goes back to review all of their own moves (or just those flagged by AI as particularly bad), comparing them to AI’s suggestions for what those moves should’ve been, as well as looking at AI’s estimates for how bad specific human moves were.
OpenAI didn’t do new pretrains that became flagship models for a surprisingly long time (all their flagship models from GPT-4o to GPT-5.1 being based on the same pretrain, 1.5 years of releases), so it became a notable thing to talk about. But GPT-5.5 was released 2 years after GPT-4o, so in preparation for it there would’ve inevitably been significant changes in architecture, which would’ve been tested on models smaller than GPT-5.5 first. And GPT-4o was targeting A100s and H100s (640 GB of HBM per server), while GPT-5.2 was released when they might’ve had enough B200s (1400 GB of HBM) to run it.
So it’s the combination of plausible availability of enough B200s to make a larger model practical, increased token price, changed cutoff date, the cost of pretraining being more trivial in 2025 than it was at the end of 2023, new algorithmic improvements that motivate replacing the model, the need for a test run for a new model architecture in preparation for GPT-5.5, and the plausible exhaustion, at about GPT-5.1, of their ability to usefully scale RLVR on top of GPT-4o’s base model, while “iterative deployment” demanded bridge releases between GPT-5.1 and GPT-5.5. These arguments are weak individually, and only some of them gesture at particular timing, but when considered all together, they suggest significant pretraining activity would’ve been happening around that time. There are specific rumors around Shallotpeat [1] and Garlic [2] being about refinement of the pretraining process, though it’s unclear if they have anything to do with GPT-5.2 or GPT-5.4, or are just steps towards making GPT-5.5’s pretrain work. Though the separate codename Spud [3] (which is confirmed to be GPT-5.5) suggests that Garlic (which supposedly already resolved the problems in pretraining) is not the same thing as GPT-5.5.
Plausibly Shallotpeat and Garlic are not literally GPT-5.2 and GPT-5.3, because resolving pretraining problems in preparation for GPT-5.5, when it was this close to the final run, probably required working with models of the same size. But GPT-5.2/GPT-5.3 might’ve been smaller models with the same architecture and pretraining process as Shallotpeat/Garlic, used to adjust the post-training process for when it needed to happen for GPT-5.5 itself. Finally, there’s now bhalstead’s plot (and Daniel Paleka’s earlier plot from Jan 2026) that suggests GPT-5.2 is different from GPT-5.1.
[1] “In developing that model, OpenAI aims to fix bugs it has encountered in the pretraining process, according to a person with knowledge of the model.”
[2] “In developing Garlic, Chen said OpenAI solved key problems it had been having in pretraining, including improving upon its “previous best” and “much larger” pretrained model, the forgettable GPT-4.5”
[3] “The company has finished pretraining “Spud,” Altman said in the memo.”
Suppose you have a lot of compute, but 25x less unique data than would be compute optimal to use in pretraining. A May 2026 paper takes some measurements suggesting that the best loss is achieved by using a 5x bigger model (than would be compute optimal) and training it for 5 epochs of repeated data (see Figure 5, left and Table 2).
The measurements are taken at around 1e19 FLOPs of compute, so that’s not very convincing about what happens around the more relevant 5e29 FLOPs. But training with repetition for about 5 epochs is a familiar anchor when the data is scarce, so it seems reasonable. The new thing is that this suggests scaling the model size proportionally to the number of repetitions of unique data. The paper doesn’t give experimental data for this, but perhaps if there’s only 4x less unique data than would be compute optimal, the thing to do is to use a 2x bigger model and train for 2 epochs.
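A rough sketch of where the 5x/5 epochs and 2x/2 epochs pairs come from. It assumes (this is not stated in the paper or above) that training compute is proportional to params times tokens processed, and that the model size multiplier equals the number of epochs. With a unique-data shortfall of s, keeping compute fixed then forces both to be sqrt(s). The function name `repetition_recipe` is made up for illustration.

```python
import math

def repetition_recipe(data_shortfall: float) -> tuple[float, float]:
    """Return (model_size_multiplier, epochs) for a given unique-data shortfall.

    Assumptions (hypothetical, for illustration only):
      * compute ~ params * tokens_processed, held fixed;
      * model size is scaled by the same factor as the number of epochs.
    With unique data = compute_optimal_tokens / s, tokens processed are
    epochs * compute_optimal_tokens / s, so fixed compute requires
    multiplier * epochs = s, hence both equal sqrt(s).
    """
    if data_shortfall < 1:
        raise ValueError("shortfall below 1 means data is not the constraint")
    k = math.sqrt(data_shortfall)
    return k, k

if __name__ == "__main__":
    for s in (4, 25):
        mult, epochs = repetition_recipe(s)
        print(f"{s}x less data: {mult:g}x bigger model, {epochs:g} epochs")
```

Whether the square-root pattern actually holds away from the measured 25x case is exactly the part the paper doesn’t test.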
For MoE models, the compute optimal tokens/param ratio might be 120 for 8x sparsity and 240 for 30x sparsity [1]. Applied to a 5e29 FLOPs model targeting a scale-up system with enough space for any number of total params (so 30x sparsity), the compute optimal number of active params would be 19T (out of 560T total), trained for 4,500T tokens. Finding 180T unique tokens (which is 25x less) seems borderline reasonable, if half of them are non-text tokens.
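The numbers above can be reproduced with the usual approximation that pretraining compute is about 6 FLOPs per active param per token. That approximation is my assumption for this sketch, it isn’t spelled out above, and the function name `compute_optimal` is hypothetical.

```python
import math

def compute_optimal(flops: float, tokens_per_param: float, sparsity: float):
    """Solve flops = 6 * N * D with D = tokens_per_param * N.

    Returns (active_params, total_params, tokens).
    The factor of 6 is the common rule of thumb, assumed here.
    """
    active = math.sqrt(flops / (6 * tokens_per_param))
    tokens = tokens_per_param * active
    total = sparsity * active
    return active, total, tokens

if __name__ == "__main__":
    active, total, tokens = compute_optimal(5e29, tokens_per_param=240, sparsity=30)
    unique = tokens / 25  # the 25x data shortfall considered above
    T = 1e12
    print(f"active ~{active / T:.1f}T, total ~{total / T:.0f}T")
    print(f"tokens ~{tokens / T:.0f}T, unique ~{unique / T:.0f}T")
```

This gives roughly 18.6T active params, 560T total params, and about 4,470T tokens, which round to the 19T, 560T and 4,500T figures, with about 180T unique tokens at a 25x shortfall.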
The implied advice of the paper, if transferred without change over 10 orders of magnitude, is to train a 93T active param model (2,800T total params) for 5 epochs of repetition of the 180T unique tokens. The 8x Kyber Rubin Ultra scale-up system (possible buildout in 2028) has 1,180 TB of HBM, so this is borderline practical for inference and RLVR (with 3-4 scale-up systems in pipeline parallelism per inference deployment and NVFP4 FFNs). Though that’s too early for 5e29 FLOPs of pretraining, and 8x Kyber Feynman is also more likely to actually be a major part of the buildout (in 2029 or 2030), probably with more HBM.
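As a sanity check on this configuration, here is a small sketch that confirms the compute stays at about 5e29 FLOPs and estimates the weight footprint. It assumes the 6 FLOPs per active param per token rule of thumb and a flat 4 bits per param for all weights; both are simplifications of mine (NVFP4 also stores scale factors, and not all weights need be FFN weights), and KV cache or activations are not modeled at all.

```python
def pretraining_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: 6 FLOPs per active param per token (assumed)."""
    return 6 * active_params * tokens

def weight_terabytes(total_params: float, bits_per_param: float = 4) -> float:
    """Weight storage in TB at a flat bit width (ignores scale factors, an assumption)."""
    return total_params * bits_per_param / 8 / 1e12

if __name__ == "__main__":
    flops = pretraining_flops(93e12, 5 * 180e12)  # 5 epochs over 180T unique tokens
    weights_tb = weight_terabytes(2800e12)
    hbm_per_system_tb = 1180  # the 8x Kyber Rubin Ultra figure quoted above
    print(f"compute ~{flops:.2e} FLOPs")
    print(f"weights ~{weights_tb:.0f} TB, "
          f"{weights_tb / hbm_per_system_tb:.2f} systems of HBM for weights alone")
```

Weights alone already exceed one system’s HBM under these assumptions, so needing several systems per deployment is unsurprising once KV cache and other overheads are added.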
The output token cost scales with the square root of the number of active params (via KV cache per token and model dimension), so could be about 10x higher than for an 800B active param model (which is maybe $25 per 1M tokens, with 1M token sequences). The input token cost scales with the number of active params, so could be 100x higher than the cost for an 800B active param model, which is maybe $0.5 per 1M tokens with NVFP4 FFNs. With zero gross margin for output tokens with 1M token sequences (more with shorter sequences) and 50% gross margin for input tokens, the 93T active param model pretrained for 5e29 FLOPs might be priced at $100/$250 per 1M input/output tokens at Blackwell prices for compute (as a pricing anchor, since it won’t be able to actually run on Blackwell). Maybe this gets 4x/2x cheaper at Feynman prices, $25/$125 per 1M input/output tokens. Which is exactly Mythos’s API price, so even the mind-bogglingly giant models of 2031-2033 might remain relatively “cheap”, and the token efficiency will be higher, giving a lower cost per task.
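The pricing arithmetic can be written out as a sketch. The scaling laws (linear for input, square root for output), the reference costs and the margins are the guesses from the paragraph above, while the function name and its defaults are hypothetical. Using the exact 93T/800B ratio (about 116x) instead of the rounded 100x and 10x gives somewhat higher numbers than $100/$250.

```python
import math

def scaled_prices(active_params: float,
                  ref_active: float = 800e9,
                  ref_input_cost: float = 0.5,    # $ per 1M input tokens (guess from the text)
                  ref_output_cost: float = 25.0,  # $ per 1M output tokens (guess from the text)
                  input_margin: float = 0.5,
                  output_margin: float = 0.0) -> tuple[float, float]:
    """Return (input_price, output_price) in $ per 1M tokens.

    Input cost is scaled linearly with active params, output cost with
    their square root, then price = cost / (1 - gross_margin).
    """
    ratio = active_params / ref_active
    input_cost = ref_input_cost * ratio
    output_cost = ref_output_cost * math.sqrt(ratio)
    return input_cost / (1 - input_margin), output_cost / (1 - output_margin)

if __name__ == "__main__":
    rounded = scaled_prices(80e12)  # exactly 100x the reference, so 100x and 10x
    exact = scaled_prices(93e12)    # about 116x the reference
    print(f"rounded: ${rounded[0]:.0f}/${rounded[1]:.0f} per 1M input/output tokens")
    print(f"exact:   ${exact[0]:.0f}/${exact[1]:.0f} per 1M input/output tokens")
    # Hypothetical 4x/2x discount for a later hardware generation
    print(f"discounted: ${rounded[0] / 4:.0f}/${rounded[1] / 2:.0f}")
```

Given how rough the reference costs are, the difference between the rounded and exact ratios is well within the noise of the estimate.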
[1] Based on this Jan 2025 paper, the compute optimal ratio of tokens per active param is 3x higher for an MoE model with 8x sparsity compared to a dense model, and 6x higher for an MoE model with 30x sparsity, see Figure 11 and Figure 12, left. Based on the Jul 2024 Llama 3 405B report, the compute optimal ratio for a dense model is about 40 tokens/param at 4e25 FLOPs, see Figure 2 and Figure 3. Putting these anchors together, we get 120 and 240 tokens/param respectively.