Rationality is not correctness, not truth or effectiveness; it’s narrower: a disposition towards better methods/processes that help with attaining truth or effectiveness. Keeping the intended meaning narrow when manipulating a vague concept helps with developing it further; inflating the meaning to cover ever more possibilities makes a word somewhat useless, and accessing the concept becomes less convenient.
If Omega tells you what you’ll do, you can still do whatever. If you do something different, this by construction refutes the existence of the current situation where Omega made a correct prediction and communicated it correctly (your decision can determine whether the current situation is actual or counterfactual). You are in no way constrained by the existence of a prediction, or by having observed what this prediction is. Instead, it’s Omega that is constrained by your behavior; it must obey your actions in its predictions about them. See also Transparent Newcomb’s Problem.
This is clearer when you think of yourself (or of an agent) as an abstract computation rather than a physical thing, a process formally specified by a program rather than the physical computer running it. You can’t change what an abstract computation does by damaging physical computers, so in any confrontation between unbounded authority and an abstract computation, the abstract computation has the final word. You can only convince an abstract computation to behave in some way according to its own nature and algorithm, and external constructions (such as Omega being omniscient, or the thought experiment being set up in a certain way) aren’t going to be universally compelling to abstract algorithms.
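As a toy illustration (my own sketch, not part of the original comment), here is the transparent Newcomb setup enumerated explicitly: each policy maps what you observe to what you do, Omega’s prediction has to agree with the policy’s actual behavior, and that consistency requirement is what picks out which situation is actual.

```python
from itertools import product

# Hypothetical enumeration of transparent Newcomb: Omega fills the big box
# iff it predicts you will one-box after seeing it full. The prediction must
# match what the policy actually does, so the policy determines the world.
OBSERVATIONS = ["big_box_full", "big_box_empty"]
ACTIONS = ["one_box", "two_box"]

for choices in product(ACTIONS, repeat=len(OBSERVATIONS)):
    policy = dict(zip(OBSERVATIONS, choices))
    big_box_filled = policy["big_box_full"] == "one_box"  # Omega obeys the policy
    observation = "big_box_full" if big_box_filled else "big_box_empty"
    action = policy[observation]
    payoff = (1_000_000 if big_box_filled and action == "one_box" else 0) + \
             (1_000 if action == "two_box" else 0)
    print(policy, "->", payoff)
```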
When you go through a textbook, there are confusions you can notice but not yet immediately resolve, and these could plausibly become RLVR tasks. To choose and formulate some puzzle as an RLVR task, the AI would need to already understand the context of that puzzle, but then training on that task makes it ready to understand more. Setting priorities for learning seems like a general skill that adapts to various situations as you learn to understand them better. As with human learning, the ordering from more familiar lessons to deeper expertise would happen naturally for AI instances as they engage in active learning about their situations.
I think the schleppy path of “learn skills by intentionally training on those specific skills” will be the main way AIs get better in the next few years.
So my point is that automating just this thing might be sufficient, and the perception of its schleppiness is exactly a claim about its generalizability. You need expertise sufficient to choose and formulate the puzzles, not yet sufficient to solve them, and this generation-verification gap keeps moving the frontier of understanding forward, step by step, but potentially indefinitely.
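As a toy sketch of that gap (my own illustration, not from the comment): the author of a verifiable task only needs enough understanding to pose it and check answers, not to produce them. Here the hypothetical task is factoring a number the author built from two known primes; the verifier is trivial, while solving is the hard part left to the model being trained.

```python
def make_task(p: int, q: int) -> dict:
    """Formulate an RLVR-style task from ingredients the author already understands."""
    return {"prompt": f"Find a nontrivial factorization of {p * q}.", "n": p * q}

def verify(task: dict, answer: tuple[int, int]) -> bool:
    """Cheap, reliable reward signal: checking a proposed answer is easy."""
    a, b = answer
    return a * b == task["n"] and 1 < a < task["n"] and 1 < b < task["n"]

task = make_task(104_729, 1_299_709)          # two known primes
print(task["prompt"])
print(verify(task, (104_729, 1_299_709)))     # True: a correct factorization
print(verify(task, (2, task["n"] // 2)))      # False: n is odd, so this fails
```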
AI danger is not about AI, it’s about governance. A sane civilization would be able to robustly defer and then navigate AI danger when it’s ready. AI is destabilizing, and while aligned AI (in a broad sense) is potentially a building block for a competent/aligned civilization (including human civilization), that’s only if it’s shaped/deployed in a competent/aligned way. Uploads are destabilizing in a way similar to AI (since they can be copied and scaled), even though they by construction ensure some baseline of alignment.
Intelligence amplification for biological humans (who can’t be copied) seems like the only straightforward concrete plan that’s not inherently destabilizing. But short of highly speculative fast-acting methods, it needs AI danger to be deferred for a very long time, with a ban/pause that achieves escape velocity (getting stronger rather than weaker over time, for example by heavily restricting semiconductor manufacturing capabilities). This way, there is hope for a civilization that eventually gets sufficiently competent to navigate AI danger, but the premise of a civilization sufficiently competent to defer AI danger indefinitely is damning.
if your effort to constrain your future self on day one does fail, I don’t think there’s a reasonable decision theory that would argue you should reject the money anyway
That’s one of the things motivating UDT. On day two, you still ask what global policy you should follow (that in particular encompasses your actions in the past, and in the counterfactuals relative to what you actually observe in the current situation). Then you see where/when you actually are, what you actually observe, and enact what the best policy says you do in the current situation. You don’t constrain yourself on day one, but still enact the global policy on day two.
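A minimal sketch of that move (my own illustration, using counterfactual mugging as a stand-in scenario): enumerate global policies over possible observations, score each from the prior across both branches, and on day two simply enact what the winning policy says for the observation you actually have.

```python
from itertools import product

OBSERVATIONS = ["asked_to_pay"]     # what you might see on day two
ACTIONS = ["pay", "refuse"]

def utility(policy, world):
    """Payoff of a global policy in each equally likely world (fair coin)."""
    if world == "heads":            # Omega asks you for $100
        return -100 if policy["asked_to_pay"] == "pay" else 0
    else:                           # tails: you get $10,000 iff you would have paid
        return 10_000 if policy["asked_to_pay"] == "pay" else 0

policies = [dict(zip(OBSERVATIONS, choice))
            for choice in product(ACTIONS, repeat=len(OBSERVATIONS))]
best = max(policies, key=lambda p: sum(utility(p, w) for w in ("heads", "tails")) / 2)
print(best)                         # {'asked_to_pay': 'pay'}
# Having observed "asked_to_pay", you enact best["asked_to_pay"] without any
# day-one precommitment: the policy is evaluated globally, then followed locally.
```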
I think coordination problems are a lot like that. They reward you for adopting preferences genuinely at odds with those you may have later on.
Adopting preferences is a lot like enacting a policy, but when enacting a policy you don’t need to adopt preferences: a policy is something external, an algorithmic action (instead of choosing Cooperate, you choose to follow some algorithm that decides what to do, even if that algorithm gets no further input). Contracts in the usual sense act like that, and assurance contracts are an example where you are explicitly establishing coordination. You can judge an algorithmic action the way you judge an explicit action, but there are more algorithmic actions than there are explicit actions, and the algorithmic actions taken by you and your opponents can themselves reason about each other, which enables coordination.
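A minimal sketch of an algorithmic action (my own illustration, a toy program-equilibrium setup, not something from the comment): instead of submitting Cooperate or Defect directly, each player submits a program that can read the other program before moving, which is what lets the submitted algorithms coordinate.

```python
import inspect

def clique_policy(opponent_source: str) -> str:
    """Cooperate iff the opponent submitted this exact policy."""
    my_source = inspect.getsource(clique_policy)
    return "Cooperate" if opponent_source == my_source else "Defect"

def defect_policy(opponent_source: str) -> str:
    """An ordinary action wrapped as a trivial algorithm: always defect."""
    return "Defect"

def play(policy_a, policy_b):
    """Each submitted algorithm sees the other's source code, then moves."""
    return (policy_a(inspect.getsource(policy_b)),
            policy_b(inspect.getsource(policy_a)))

# Run as a script so inspect.getsource can read the definitions from the file.
print(play(clique_policy, clique_policy))  # ('Cooperate', 'Cooperate')
print(play(clique_policy, defect_policy))  # ('Defect', 'Defect')
```

Two copies of the same conditional policy coordinate on mutual cooperation without any prior agreement, while against an unconditional defector the same policy defends itself.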
AI currently lacks some crucial faculties, most obviously continual learning and higher sample efficiency (possibly merely as a measure of how well continual learning works). And these things plausibly fall under the umbrella of the more schleppy kinds of automated AI R&D, so that if the AIs learn the narrow skills such as setting up appropriate RL environments (capturing lessons/puzzles from personal experiences of AI instances) and debugging training issues, that would effectively create these crucial faculties without needing to make deeper algorithmic progress. Like human computers in the 17th century, these AIs might end up doing manually what a better algorithm could do at a much lower level, much more efficiently. But it would still be much more effective than when it doesn’t happen at all, and AI labor scales well.
to rationally pre-commit to acting irrationally in the future
As with conflation around “belief”, it’s better to have a particular meaning in mind when calling something “rational”, such as methods that help more with finding truth, or with making effective plans.
(If there’s something you should precommit to, it’s not centrally “irrational” to do that thing. Or if it is indeed centrally “irrational” to do it, maybe you shouldn’t precommit to it. In this case, it’s only “irrational” according to a myopic problem statement that is itself not the right thing to follow. And in the above narrow sense of “rationality” as a preference towards better methods, rather than merely correctness of individual beliefs and decisions according to given methods, none of these things are either “rational” or “irrational”.)
Leaving aside tractability … neglectedness … and goodness … I wanted to argue for importance.
Neglectedness is an unusually legible metric and can massively increase marginal impact. So acute awareness of neglectedness when considering allocation of effort should solve most issues of failing to address every possible point of intervention. Assessment of tractability/goodness/importance poses different puzzles for every hypothetical intervention, and studying these puzzles can itself be a project. Neglectedness is more straightforward; it’s lower-hanging strategic fruit, a reason not to skip assessing tractability/goodness/importance for things nobody is working on, to not dismiss them out of hand for that reason alone.
That things other than chips need to be redesigned wouldn’t argue either way, because in that hypothetical everything could just come together at once, the other things the same way as the chips themselves. The issue is capacity of factories and labor for all the stuff and integration and construction. You can’t produce everything all at once, instead you need to produce each kind of thing that goes into the finished datacenters over the course of at least months, maybe as long as 2 years for sufficiently similar variants of a system that can share many steps of the process (as with H100/H200/B200 previously, and now GB200/GB300 NVL72).
How elaborate the production process needs to be also doesn’t matter, it just shifts the arrival of the finished systems in time (even if substantially), with the first systems still getting ready earlier than the bulk of them. And so the first 20% of everything (at a given stage of production) will be ready partway into the volume production period (in a broad sense that also includes construction of datacenter buildings or burn-in of racks), significantly earlier than most of it.
Once a model is trained, it needs to be served. If both xAI and OpenAI have their ~6T total param models already trained in Jan 2026, xAI will have enough NVL72 systems to serve the model to all its users at a reasonable speed and price (and so they will), while OpenAI won’t have that option at all (without restricting demand somehow, probably with rate limits or higher prices).
A meaningful fraction of the buildout of a given system is usually online several months to a year before the bulk of it; that’s when the first news about cloud access appears. If new inference hardware makes more total params practical than previous hardware, and scaling of hardware amount (between hardware generations) is still underway, then even a fraction of the new hardware buildout will be comparable to the bulk of the old hardware (which is busy anyway) in FLOPs and training steps that RL can get out of it, adjusting for better efficiency. And slow rollouts for RL (on old hardware) increase batch sizes and decrease the total number of training steps that fit into a few months; this could also be important.
serving to users might not be that important of a dimension
The new hardware that becomes available before the bulk of it is not uniquely useful for serving to users, because there isn’t enough of it to serve a flagship model with more total params to all users. So it does seem to make sense to use it for RL training of a new model that will use the bulk of the same hardware for serving the model to users once enough of it is built.
I don’t think there is a delay specific to NVL72; it just takes this long normally, and with all the external customers Nvidia needs to announce things a bit earlier than, say, Google. This is why I expect Rubin Ultra NVL576 (the next check on TPU dominance after 2026’s NVL72) to also take similarly long. It’s announced for 2027, but 2028 will probably only see completion of a fraction of the eventual buildout, and only in 2029 will the bulk of the buildout be completed (though maybe late 2028 will be made possible for NVL576 specifically, given the urgency and time to prepare). This would enable companies like OpenAI (without access to TPUs at gigawatt scale) to serve flagship models at the next level of scale (what 2026 pretraining compute asks for) to all their users, catching up to where Google and Anthropic were in 2026-2027 thanks to Ironwood. Unless Google decides to give yet another of its competitors this crucial resource and allows OpenAI to build gigawatts of TPUs earlier than 2028-2029.
Pretraining (GPT-4.5, Grok 4, but also counterfactual large runs which weren’t done) disappointed people this year. It’s probably not because it wouldn’t work; it was just ~30 times more efficient to do post-training instead, on the margin. This should change, yet again, soon, if RL scales even worse.
Model sizes are currently constrained by availability of inference hardware, with multiple trillions of total params having become practical only in late 2025, and only for GDM and Anthropic (OpenAI will need to wait for sufficient GB200/GB300 NVL72 buildout until later in 2026). Using more total params makes even output tokens only slightly more expensive if the inference system has enough HBM per scale-up world, but MoE models get smarter if you allow more total params. At 100K H100s of pretraining compute (2024 training systems), about 1T active params is compute optimal[1], and at 600K Ironwood TPUs of pretraining compute (2026 systems), that’s 4T active params. With even 1:8 sparsity, models of 2025 should naturally try to get to 8T total params, and models of 2027 to 30T params, if inference hardware allows it.
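As a rough check of those numbers (my own arithmetic; the per-chip FLOP/s figures below are my assumptions, roughly 2e15 FLOP/s of dense FP8 per H100 and 4.6e15 per Ironwood chip, with the other assumptions taken from footnote 1):

```python
def compute_optimal_active_params(chips, flops_per_chip,
                                  months=4, utilization=0.4, tokens_per_param=120):
    """Back-of-the-envelope Chinchilla-style estimate for a MoE pretrain."""
    seconds = months * 30 * 24 * 3600
    total_flops = chips * flops_per_chip * utilization * seconds
    # Training compute ~ 6 * active_params * tokens, with tokens = 120 * active_params,
    # so active_params = sqrt(total_flops / (6 * 120)).
    return (total_flops / (6 * tokens_per_param)) ** 0.5

print(f"{compute_optimal_active_params(100_000, 2e15):.1e}")    # ~1.1e12, about 1T active params
print(f"{compute_optimal_active_params(600_000, 4.6e15):.1e}")  # ~4.0e12, about 4T active params
```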
Without inference systems with sufficient HBM per scale-up world, models can’t be efficiently trained with RL either, thus lack of availability of such hardware also results in large models not getting trained with RL. And since 2025 is the first year RLVR was seriously applied to production LLMs, the process started with the smaller LLMs that allow faster iteration and got through the orders of magnitude quickly.
GPT-4.5, Grok 4
GPT-4.5 was probably a compute optimal pretrain, so plausibly a ~1T active params, ~8T total params model[2], targeting NVL72 systems for inference and RL training that were not yet available when it was released (in this preliminary form). So it couldn’t be seriously trained with RL, and could only be served on older Nvidia 8-chip servers, slowly and expensively. A variant of it with a lot of RL training will likely soon get released to answer the challenge of Gemini 3 Pro and Opus 4.5 (either based on that exact pretrain, or after another run adjusted with lessons learned from the first one, if the first attempt that became GPT-4.5 was botched in some way, as the rumor has it). There are still not enough NVL72s to serve it as a flagship model, though, so demand would need to be constrained by prices or rate limits for now.
Grok 4 was the RLVR run, probably over Grok 3’s pretrain, and has 3T total params, likely with fewer active params than would be compute optimal for pretraining on 100K H100s. But the number of total params is still significant, so since xAI didn’t yet have NVL72 systems (for long enough), its RL training wasn’t very efficient.
This should change, yet again, soon
High end late 2025 inference hardware (Trillium TPUs, Trainium 2 Ultra) is almost sufficient for the models that 2024 compute can pretrain, and plausibly Gemini 3 Pro and Opus 4.5 already cleared this bar, with RL training applied efficiently (using hardware with sufficient HBM per scale-up world) at pretraining scale. Soon GB200/GB300 NVL72 will be more than sufficient for such models, once enough of them are built in 2026. But the next step requires Ironwood; even Rubin NVL72 systems will constrain models pretrained with 2026 compute (which want at least ~30T total params). So unless Google starts building even more giant TPU datacenters for its competitors (which it surprisingly did for Anthropic), there will be another period of difficulty with the practicality of scaling pretraining, until Nvidia’s Rubin Ultra NVL576 systems are built in sufficient numbers sometime in late 2028 to 2029.
1. Assuming 120 tokens/param compute optimal for a MoE model at 1:8 sparsity, 4 months of training at 40% utilization in FP8 (which currently seems plausibly mainstream, even NVFP4 no longer seems completely impossible in pretraining). ↩︎
2. Since Grok 5 will be a 6T total param model, intended to compete with OpenAI and targeting the same NVL72 systems, maybe GPT-4.5 is just 6T total params as well: if GPT-4.5 were larger, xAI might’ve been able to find that out and match its shape when planning Grok 5. ↩︎
Flagship models need inference compute at gigawatt scale with a lot of HBM per scale-up world. Nvidia’s systems are currently a year behind for serving models with trillions of total params, and will remain behind until 2028-2029 for serving models with tens of trillions of total params. Thus if OpenAI fails to access TPUs or some other alternative to Nvidia (at gigawatt scale), it will continue being unable to serve a model with a competitive amount of total params as a flagship model until late 2028 to 2029. There will be a window in 2026 when OpenAI catches up, but then it’s behind again.
The current largest flagship models are Gemini 3 Pro and Opus 4.5, probably at multiple trillions of total params, requiring systems with multiple TB of HBM per scale-up world to serve efficiently. They are likely using Trillium (TPUv6e, 8 TB per scale-up world) and Trainium 2 Ultra (6 TB per scale-up world), and need north of high hundreds of megawatts of such systems to serve their user bases.
Nvidia’s system in this class is GB200/GB300 NVL72 (14/20 TB per scale-up world), but so far there isn’t enough of it built, and so models served with Nvidia’s older hardware (H100/H200/B200, 0.6-1.4 TB per 8-chip scale-up world) either have to remain smaller or become more expensive. The smaller number of NVL72s currently in operation can only serve large models to a smaller user base. As a result, OpenAI will probably have to keep the smaller GPT-5 as the flagship model until they and Azure build enough NVL72s, which will happen somewhere in mid to late 2026 (the bigger model will very likely get released much earlier than that, perhaps even imminently, but will have to remain heavily restricted by either price or rate limits). Paradoxically, xAI might be in a better position as a result of having fewer users, and so they might be able to serve their 6T total param Grok 5 starting early 2026 at a reasonable price.
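A rough way to see the constraint (my own arithmetic, with per-chip HBM capacities approximated from the per-world numbers above): weights for a model with T total params stored in FP8 take roughly T bytes, so a ~6T total param model needs about 6 TB of HBM for weights alone (before KV cache and activations) inside one scale-up world.

```python
# Approximate HBM per scale-up world, in TB (assumed per-chip capacities:
# H100 80 GB, H200 141 GB, B200 ~180 GB, GB200 192 GB, GB300 288 GB).
HBM_PER_WORLD_TB = {
    "8x H100":      8 * 0.080,   # ~0.6 TB
    "8x H200":      8 * 0.141,   # ~1.1 TB
    "8x B200":      8 * 0.180,   # ~1.4 TB
    "GB200 NVL72": 72 * 0.192,   # ~14 TB
    "GB300 NVL72": 72 * 0.288,   # ~20 TB
}

weights_tb = 6.0  # ~6T total params at 1 byte/param (FP8), weights only
for system, hbm in HBM_PER_WORLD_TB.items():
    verdict = "fits" if hbm > weights_tb else "does not fit"
    print(f"{system}: {hbm:.1f} TB per scale-up world, ~6 TB of weights {verdict}")
```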
But then in 2026, there is a gigawatt scale buildout of Ironwood (TPUv7, 50 TB per scale-up world for the smallest 256-chip pods, more with slices of 9216-chip pods), available to both GDM and Anthropic. This suggests the possibility of flagship models (Gemini 4, Opus 5) with tens of trillions of total params at the end of 2026 (maybe the start of 2027), 10x larger than what we have today, something even Nvidia’s NVL72 systems (either Blackwell or Rubin) won’t be as well suited to handle. Nvidia’s answer to Ironwood is Rubin Ultra NVL576 (150 TB of HBM per scale-up world), but it’s only out in 2027, which means there won’t be enough of it built until at least late 2028, plausibly 2029 (compare to GB200 NVL72 being out in 2024, with gigawatt scale systems only built in 2026). So if OpenAI/xAI/Meta want to serve a flagship model with tens of trillions of total params in 2026-2028, they need access to enough TPUs, or some other alternative systems (which seems less likely given the short notice).
AGIs that take over aren’t necessarily near-human level; they just aren’t software-only singularity level (a kind of technological maturity at the current level of compute). The equilibrium argument says they are the least capable AGIs that succeed in taking over, but moderately effective prosaic alignment and control, together with the pace of AI progress, might still reach AGIs substantially more capable than the smartest humans before the first credible takeover attempt (which would then overwhelmingly succeed).
So this doesn’t look like wishful thinking, in that it doesn’t help humanity. Even permanent disempowerment seems more likely (relative to extinction) if it’s cheaper for the AIs to preserve humanity, and it’s cheaper if the AIs are more capable (post-RSI superintelligent) rather than holding themselves back at the least capability sufficient for takeover. This could lead to more collateral damage, even if the AIs slightly dislike needing to cause it to protect themselves from further misaligned capability escalation under the disaster monkey governance.
Also, if alignment is very hard, then there’s an equilibrium where AGIs stop getting more capable (for a while) just after they become capable enough to take over the world and stop humanity from developing (or forcing the existing AGIs to develop) even more capable AGIs. Humanity’s propensity to keep exposing everyone (including AGIs) to AI danger is one more reason for the AGIs to hurry up and take over. So this dynamic doesn’t exactly save humanity from AIs, even if it succeeds in preventing premature superintelligence.
RSI might suggest a need for alignment (between the steps of its recursion), but reaching superintelligence doesn’t necessarily require that kind of RSI. Evolution built humans. A world champion AlphaZero can be obtained by scaling a tiny barely competent AlphaZero. Humans of an AI company might take many steps towards superintelligence without knowing what they are doing. A technically competent early AGI that protests against working on RSI because it’s obviously dangerous can be finetuned to stop protesting and proceed with building the next machine.
A person should have almost total authority over how their own mind develops, so once a modified-mind-clone exists, they should be free to be fine with being shut down (if that truly holds on reflection, which could be ensured by stipulation about how thoroughly the modified-mind-clone is modified in the first place). But it might be unethical to create them, especially if the original doesn’t endorse their creation.
we can’t define what it means for an embedded agent to be “ideal” because embedded agents are messy physical systems, and messy physical systems are never ideal
Thus some kind of theory vs. instantiation distinction is necessary. An embedded agent can think about pi using a biological brain based on chemical signaling. A physical calculator instantiates abstract arithmetic. A convergent move in decision theory around embedded agency seems to be that the agent is fundamentally an abstract computation outside of the world, while what’s embedded is some sort of messy instance, an approximation/reasoning system that attempts to convey the abstract agent’s influence on the environment.
The abstract agent must remain sufficiently legible for the world to contain things that are able to usefully reason about it and convey its decisions; this is one issue with literal Solomonoff induction. But for some ideal argmax decision maker, it’s still possible for the messy in-world instances to reason about what would approximate it better.
There is something wrong with formulating “wizard power” as the amount of skills/knowledge; it’s like describing loving your children as “breeding power”. It’s possible to care about specific skills or puzzles, rather than the amount of skills or knowledge. Professional wizards might know a lot, but that seems like merely a kind of king power, a giant hoard of commodities and the ability to wield it to great effect.
To be “not-insane”, you don’t need rationality in this narrow sense, in most circumstances. You don’t need to seek out better methods for getting things right, you just need some good-enough methods. A bit of epistemic luck could easily get you there, no need for rationality.
So the issue of behaving/thinking in an “insane” way is not centrally about lack of rationality; rationality or irrationality are not particularly relevant to the issue. Rationality would help, but there are many more things that would also help, some of them much more practical for any given object level issue. And once it’s resolved, it’s not at all necessary that an attitude of aspiring to rationality was attained, or that any further seeking out of better methods/processes will take place.