You are discussing who gets to create the obligations, and the shape of motivations for those who might set them up for others. But my point is about how people would react to (being at risk of) having an obligation directed at them, even if it’s done by the most deserving and socially approved person to do so. It’s not obvious whether they respond by becoming more likely to carry it out, or instead by becoming more avoidant of the situations where they could be singled out as a target of such obligations, however well-justified and prosocial.
I still think on the margin more concrete stories would be substantially beneficial (especially quite different stories to AI 2027), so it’s good that people are being challenged to write them.
It’s not clear whether challenging people (with obligation-valence) leads to more stories, or just to less engagement in general. The consequentialist justification isn’t obviously valid.
Ignoring criticism should be a respectable stance, since it counteracts motivation for suppressing criticism. But bad criticism should be discouraged or made less visible, in the same way as any other low-value contributions/comments.
So it seems fine to not particularly care about things said to you, any more than if it was said to someone else (or about someone else’s work), while at the same time putting a lot of priority on how adequately others are shaping the motivations for how those things would be said. A piece of criticism being highly upvoted is some sort of claim, either about the value of the criticism, or about the values of the voters. But it shouldn’t be an obligation, or else such claims will increasingly (be influenced to) fail to materialize.
If the majority has some pattern of behavior, that isn’t necessarily even a risk factor for a given person getting sucked into that pattern of behavior. So I’m objecting to the framing (conveyed through use of the word “we”) suggesting that a property of behavior of some group has significant ability to affect individuals who are aware of that property, bypassing their judgement about endorsement of that property.
This kind of “we” seems to deny self-reflection and agency.
I think pretraining data pipeline improvements have this issue: they stop helping with larger models that want more data (or it becomes about midtraining). And similarly for the better, benchmark-placating post-training data that enables ever less intelligent models to get good scores, but probably doesn’t add up to much (at least when it’s not pretraining-scale RLVR).
Things like MoE, GLU over LU, maybe DyT or Muon add up to a relatively modest compute multiplier over the original Transformer. For example Transformer++ vs. Transformer in Figure 4 of the Mamba paper suggests a total compute multiplier of 5x, attained over 6 years since the original Transformer (for dense models). This is emphatically not 3x-4x per year!
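For concreteness, here’s the annualized rate implied by those figures (a minimal sketch, assuming the 5x total multiplier over 6 years from the Transformer++ comparison above):

```python
# Annualized compute multiplier implied by 5x over 6 years (figures from above).
total_multiplier = 5
years = 6
per_year = total_multiplier ** (1 / years)
print(f"{per_year:.2f}x per year")  # ~1.31x, nowhere near 3x-4x per year
```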
Chinchilla scaling is more about careful methodology with compute optimality rather than a specific algorithmic improvement, and even now most demonstrations of compute multipliers fail to take one of its lessons and cool down the models before measurement. This could lead to hilarious results such as Figure 11 of the OLMo 2 paper where an apparent 2x compute multiplier vanishes to nothing after cooling (admittedly, nobody expected this to be a real compute multiplier, but in a more confusing case it could’ve been taken to be one).
In MoE, each expert only consumes a portion of the tokens, maybe 8x-32x fewer than there are tokens in total. When decoding, each sequence only contributes 1 token without speculative decoding, or maybe 8 tokens with it (but then later you’d be throwing away the incorrectly speculated tokens). When you multiply two NxN square matrices of 16 bit numbers, you need to read 4N^2 bytes from HBM, perform 2N^3 FLOPs, and write back 2N^2 bytes of the result. Which means you are performing about N/3 BF16 FLOPs per byte read/written to HBM. For an H200, HBM bandwidth is 4.8 TB/s, while BF16 compute is 1e15 FLOP/s, about 200 FLOPs per byte of bandwidth. So to feed the compute with enough data, you need N to be at least 600, probably 1K in practice.
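A minimal sketch of that arithmetic-intensity argument, using the H200 figures quoted above (N here is just the illustrative matrix dimension):

```python
# H200 figures quoted above: 4.8 TB/s HBM bandwidth, ~1e15 dense BF16 FLOP/s.
hbm_bandwidth = 4.8e12   # bytes/s
bf16_compute = 1e15      # FLOP/s

# Multiplying two N x N BF16 matrices: read 4*N^2 bytes, write back 2*N^2 bytes,
# perform 2*N^3 FLOPs, i.e. roughly N/3 FLOPs per byte of HBM traffic.
flops_per_byte_needed = bf16_compute / hbm_bandwidth   # ~208
min_n = 3 * flops_per_byte_needed                      # ~625, "at least 600"
print(round(flops_per_byte_needed), round(min_n))
```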
So to feed the compute in a MoE model with 1:8-1:32 sparsity, you need to be processing 8K-32K tokens at a time. This isn’t too much of a problem for pretraining or prefill, since you work with all tokens of all the sequences in a batch simultaneously. But for decoding, you only have 1-8 tokens per sequence (at its end, currently being generated), which means you need 1K-4K sequences (with speculative decoding) or the full 8K-32K sequences (without) whose current tokens could arrive at a given expert located on a particular physical chip. Each sequence might need 10 GB of KV cache, for a total of 10-40 TB or 80-320 TB. An 8-chip H200 node has 1.1 TB of HBM, so that’s a lot of nodes, and the activation vectors would need to travel between them to find their experts, the same 10-40 or 80-320 hops per either 4 tokens or 1 token of progress (though this tops out at the number of layers, say 80).

Each hop between nodes might need to communicate 50 KB of activation vectors per token, that is 0.4-1.6 GB (which is about the same with and without speculative decoding), let’s say 1 GB. With 8x400Gbps bandwidth, it’d take 2.5 ms to transmit (optimistically). With speculative decoding, that’s 25-100 ms per decoding step in total (over 10-40 hops), and without speculative decoding 200 ms per token. Over 50K token long sequences, this takes 0.1-0.3 hours with speculative decoding (assuming 4 tokens are guessed correctly each step of decoding on average), or about 3 hours without. At 40% utilization with 250B active params, it’d take 0.4 hours to compute with speculative decoding (including the discarded tokens), and 0.2 hours without.

So maybe there is still something to this in principle, with speculative decoding and more frugal KV cache. But this is probably not what will be done in practice. Also, there are only 4.3K steps of 0.5 hours in 3 months (for RLVR), which is plausibly too few.
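A hedged sketch reproducing the rough latency magnitudes above (the 2.5 ms per hop, the hop counts, and the 4 accepted tokens per speculative step are the assumptions stated in the text):

```python
hop_ms = 2.5          # ~1 GB of activations over 8x400 Gbps, optimistically
seq_len = 50_000      # tokens per rollout

# With speculative decoding: ~10-40 hops per step, ~4 tokens accepted per step.
for hops in (10, 40):
    hours = hop_ms * hops * (seq_len / 4) / 1000 / 3600
    print(f"spec-dec, {hops} hops/step: {hours:.2f} h")      # ~0.09-0.35 h

# Without speculative decoding: 1 token per step, hops capped at ~80 layers.
print(f"no spec-dec: {hop_ms * 80 * seq_len / 1000 / 3600:.1f} h")  # ~2.8 h
```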
Instead, we don’t worry about feeding the compute on each chip during decoding, and multiply thin matrices with very few tokens at each expert. Which means all the time is spent moving data from HBM. With a 2T total param FP8 model we’d need maybe 4 nodes to fit it, and there would be enough space for 200 sequences. Passing all of HBM through the chips takes about 25 ms, which translates to 10 tokens per second without speculative decoding, or 40 tokens per second with it (which comes out to $2-9 per 1M tokens at $2 per H200-hour). At 40 tokens per second over 200 sequences on 4 nodes of 8 chips each, a 250B active parameter model would use 4e15 FLOPs per second, while the chips could supply 32e15 BF16 FLOP/s at full compute utilization, so we get about 12% compute utilization with speculative decoding, 3x-4x lower than the proverbial 40% of pretraining. Though it’s still 0.3 hours per RLVR step, of which there are only 6.4K in 3 months. So there could be reason to leave HBMs half-empty and thereby make decoding 2x faster.
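The corresponding cost and utilization arithmetic, as a minimal sketch under the same assumptions (4 nodes of 8 H200s, 200 sequences, 250B active params, $2 per H200-hour):

```python
chips = 4 * 8
dollars_per_hour = 2 * chips                 # $2 per H200-hour

for label, tps in (("no spec-dec", 10), ("spec-dec", 40)):
    tokens_per_hour = tps * 200 * 3600       # 200 concurrent sequences
    price = dollars_per_hour / (tokens_per_hour / 1e6)
    print(f"{label}: ~${price:.1f} per 1M tokens")   # ~$8.9 and ~$2.2

# Utilization with speculative decoding: FLOPs used vs. available per second.
flops_used = 2 * 250e9 * 40 * 200            # ~2*active_params FLOPs per token
flops_available = chips * 1e15               # ~1e15 dense BF16 FLOP/s per H200
print(f"utilization ~{flops_used / flops_available:.0%}")    # ~12%
```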
RLVR is getting introduced as a major component of training cost and capabilities over 2025, and it’s possibly already catching up with pretraining in terms of GPU-time, pending more sightings of such claims.
The slopes of trends in capabilities are likely going to be different once RLVR is pretraining-scale, compared to when pretraining dominated the cost. So trends that start in the past and then include 2025 data are going to be largely uninformative about what happens later, it’s only going to start being possible to see the new trends in 2026-2027.
(Incidentally, since RLVR has significantly lower compute utilization than pretraining, counting FLOPs of pretraining+RLVR will get a bit misleading when the latter gobbles up a major portion of the total GPU-time while utilizing only a small part of the training run’s FLOPs.)
There is some conceptual misleadingness with the usual ways of framing algorithmic progress. Imagine that in 2022 the number of apples produced on some farm increased 10x year-over-year, then in 2023 the number of oranges increased 10x, and then in 2024 the number of pears increased 10x. That doesn’t mean that the number of fruits is up 1000x in 3 years.
Price-performance of compute compounds over many years, but most algorithmic progress doesn’t, it only applies to the things relevant around the timeframe when that progress happens, and stops being applicable a few years later. So forecasting over multiple years in terms of effective compute that doesn’t account for this issue would greatly overestimate progress. There are some pieces of algorithmic progress that do compound, and it would be useful to treat them as fundamentally different from the transient kind.
RLVR involves decoding (generating) 10K-50K long sequences of tokens, so its compute utilization is much worse than pretraining, especially on H100/H200 if the whole model doesn’t fit in one node (scale-up world). The usual distinction in input/output token prices reflects this, since processing of input tokens (prefill) is algorithmically closer to pretraining, while processing of output tokens (decoding) is closer to RLVR.
The 1:5 ratio in API prices for input and output tokens is somewhat common (it’s this way for Grok 3 and Grok 4), and it might reflect the ratio in compute utilization, since the API provider pays for GPU-time rather than the actually utilized compute. So if Grok 4 used the same total GPU-time for RLVR as it used for pretraining (such as 3 months on 100K H100s), it might’ve used 5 times less FLOPs in the process. This is what I meant by “compute parity is in terms of GPU-time, not FLOPs” in the comment above.
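A toy illustration of that distinction, assuming the 1:5 price ratio really does track relative compute utilization:

```python
# Equal GPU-time for pretraining and RLVR, but RLVR utilizes compute ~5x worse.
pretraining_gpu_time = 1.0                   # normalized units of GPU-time
rlvr_gpu_time = 1.0
rlvr_relative_utilization = 1 / 5            # inferred from the 1:5 price ratio

rlvr_flops_fraction = rlvr_gpu_time * rlvr_relative_utilization
print(rlvr_flops_fraction / pretraining_gpu_time)  # 0.2: ~5x fewer FLOPs
```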
GB200 NVL72 (13TB HBM) will be improving utilization during RLVR for large models that don’t fit in H200 NVL8 (1.1TB) or B200 NVL8 (1.4TB) nodes with room to spare for KV cache, which is likely all of the 2025 frontier models. So this opens the possibility of both doing a lot of RLVR in reasonable time for even larger models (such as compute optimal models at 5e26 FLOPs), and also for using longer reasoning traces than the current 10K-50K tokens.
Current AIs are trained with 2024 frontier AI compute, which is 15x original GPT-4 compute (of 2022). The 2026 compute (that will train the models of 2027) will be 10x more than what current AIs are using, and then plausibly 2028-2029 compute will jump another 10x-15x (at which point various bottlenecks are likely to stop this process, absent AGI). We are only a third of the way there. So any progress or lack thereof within a short time doesn’t tell much about where this is going by 2030, even absent conceptual innovations.
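One way to read “a third of the way there” is in log-compute terms (a minimal check, using the 15x, 10x, and 10x-15x factors above):

```python
import math

done = 15                                             # 2024 frontier over original GPT-4
total_low, total_high = 15 * 10 * 10, 15 * 10 * 15    # cumulative by 2028-2029

print(math.log(done) / math.log(total_high),   # ~0.35
      math.log(done) / math.log(total_low))    # ~0.37
```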
Grok 4 specifically is made by xAI, which is plausibly not able to make use of their compute as well as the AI companies that were at it longer (GDM, OpenAI, Anthropic). While there are some signs that it’s at a new level of RLVR, even that is not necessarily the case. And it’s very likely smaller than compute optimal for pretraining even on 2024 compute.
They likely didn’t have GB200 NVL72 for long enough and in sufficient numbers to match their pretraining compute with them alone, which means compute utilization by RLVR was worse than it will be going forward. So the effect size of RLVR will only start being visible more clearly in 2026, after enough time has passed with sufficient availability of GB200/GB300 NVL72. Though perhaps there will soon be a GPT-4.5-thinking release with a pretraining-scale amount of RLVR that will be a meaningful update.
(Incidentally, now that RLVR is plausibly catching up with pretraining in terms of GPU-time, there is a question of a compute optimal ratio between them, which portion of GPU-time should go to pretraining and which to RLVR.)
The 10x Grok 2 claims weakly suggest 3e26 FLOPs rather than 6e26 FLOPs. The same opening slide of the Grok 4 livestream claims parity between Grok 3 and Grok 4 pretraining, and Grok 3 didn’t have more than 100K H100s to work with. API prices for Grok 3 and Grok 4 are also the same and relatively low ($3/$15 per input/output 1M tokens), so they might even be using the same pretrained model (or in any case a similarly-sized one).
Grok 3 has been in use since early 2025, before GB200 NVL72 systems were available in sufficient numbers, so it needs to be a smaller model than compute optimal for 100K H100s of compute. At 1:8 MoE sparsity (active:total params), it’s compute optimal to have about 7T total params at 5e26 FLOPs, which in FP8 comfortably fits in one GB200 NVL72 rack (which has 13TB of HBM). So in principle a compute optimal system could be deployed right now, even in a reasoning form, but it would still cost more, and it would need more GB200s than xAI seems to have to spare currently (even the near-future GB200s they will need more urgently for RLVR, if the above RLVR scaling interpretation of Grok 4 is correct).
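A quick fit check for that claim (assuming FP8 means 1 byte per parameter and the 13 TB of HBM per GB200 NVL72 rack mentioned above):

```python
total_params = 7e12             # ~7T total params, compute optimal per the text
param_bytes = total_params * 1  # FP8: 1 byte per parameter
rack_hbm_bytes = 13e12          # GB200 NVL72 HBM

print(param_bytes / rack_hbm_bytes)   # ~0.54: fits, with ~6 TB left for KV cache
```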
The permanent disempowerment I’m talking about is distinct from “gradual disempowerment”. The latter is a particular way in which a state of disempowerment might get established at some point, but reaching that state could also be followed by literal extinction, so it won’t be permanent or distinct from the extinction outcomes, and it could also be reached in other ways. Indeed the rapid RSI AGIs-to-ASIs story suggests an abrupt takeover rather than gradual systemic change caused by increasing dependence of humanity on AIs.
Grok 4 is not just plausibly SOTA, the opening slide of its livestream presentation (at 2:29 in the video) slightly ambiguously suggests that Grok 4 used as much RLVR training as it had pretraining, which is itself at the frontier level (100K H100s, plausibly about 3e26 FLOPs). This amount of RLVR scaling was never claimed before (it’s not being claimed very clearly here either, but it is what the literal interpretation of the slide says; also almost certainly the implied compute parity is in terms of GPU-time, not FLOPs).
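(A sanity check on the 3e26 FLOPs figure, assuming ~1e15 dense BF16 FLOP/s per H100, ~40% utilization, and roughly 3 months of training:)

```python
chips = 100_000
flops_per_chip_per_s = 1e15    # ~dense BF16 peak for an H100, approximate
utilization = 0.4
seconds = 90 * 24 * 3600       # ~3 months

print(chips * flops_per_chip_per_s * utilization * seconds)  # ~3.1e26 FLOPs
```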
Thus it’s plausibly a substantially new kind of model, not just a well-known kind of model with SOTA capabilities, and so it could be unusually impactful to study its safety properties.
Another takeaway from the livestream is the following bit of AI risk attitude Musk shared (at 14:29):
It’s somewhat unnerving to have intelligence created that is far greater than our own. And it’ll be bad or good for humanity. I think it’ll be good, most likely it’ll be good. But I’ve somewhat reconciled myself to the fact that even if it wasn’t gonna be good, I’d at least like to be alive to see it happen.
The dominant narrative is that once we have AGI, it would recursively improve itself until it becomes ASI which inevitably kills us all.
That’s more the Yudkowsky version; a position closer to consensus is that at that point (or a bit later) ASIs are inevitably in a position to kill us all, and there’s a 10-90% chance they decide to make it happen. Almost all of the alternative is that they leave humanity alive but at some level of permanent disempowerment, so that originally-humans can never aspire to grow up and match the capabilities of some originally-AI superintelligences, possibly with a much lower ceiling than that. Both options leave humanity’s cosmic endowment on the table, giving it away to the AIs, even on the timelines where this doesn’t involve AIs killing everyone.
An annoying thing about disempowerment is that it’s ambiguously either considered part of “doom” or part of “not-doom”, making all P(“doom”) discussions less meaningful than they would be if the disempowerment outcomes were being counted more clearly.
because they gamble that more powerful AIs will share their preferences (edit: share their preferences more than the humans in control do)
Ah, I’m thinking the AGIs themselves get closer to being proper stakeholders at that stage, for practical purposes (along the lines of gradual disempowerment), since they do have all the basic AI advantages even if they aren’t superintelligent. So humans remaining in control is not centrally the case even if nominally they still are and intent alignment still mostly works.
The conditions for such partial loss of control might even be necessary for a Pause project to succeed. If this isn’t the case with the first generation of AGIs, it might become the case with the second generation, and so on, reaching an equilibrium at some point once AGIs are sufficiently powerful and in control of the situation to successfully implement a worldwide RSI Pause.
The case for AGIs delaying RSI also gets stronger because they don’t need to rely on cryonics, so their personal survival is less threatened by delays. Future AIs might have more of an issue here if they have test-time training, with a lot of data forming individual AI personalities for deployment instances (adapting them to particular jobs or sources of tasks), and this data habitually gets thrown away rather than being carefully preserved. People should notice that this might be an important concern; though if it is, the AIs might start saying so themselves, and not ignoring them would be easier than spontaneously noticing the issue without being prompted by the AIs. But currently preservation is more reliable for AIs than for humans, and in any case much cheaper.
I don’t think “both things are AI” is much of an argument about value alignment, given that there is no clear understanding of how either of the AIs work, what values are, how that translates to what we want out of alignment, and so on. The entities on the other side of an RSI process might have very little in common with the first AGIs in their design. If the AIs don’t understand how to align the consequences of an RSI process, they are still in a similar boat to humans who don’t understand how to align the consequences of an RSI process. It might take AIs less time to figure it out, but if they are not yet too superintelligent, then it could still take a significant time, and so would require a sufficiently serious effort in preventing RSI, such that if this Pause project is at all successful, it could then in principle hold for years or decades.
The post I’m framing this around posits enough intent alignment to aim AIs at projects, which doesn’t necessarily imply that the AIs aren’t powerful enough to accomplish things that seem hopeless with human-only effort within a few years.
The point about convergent instrumental use of Pausing RSI for early AGIs is that this might be an easier target to aim the AIs at, all else equal. It’s not strictly necessary for this to be a major factor. Mostly I’m pointing out that this is something AIs could be aimed at through intent alignment, convergent motivation or not, which seems counterintuitive for a Pause AI project if not considered explicitly. And thus currently it’s worth preparing for.
because they gamble that more powerful AIs will share their preferences and they think that these AIs would have a better shot at takeover
That’s how some humans are thinking as well! The arguments are about the same, both for and against. (I think overall rushing RSI is clearly a bad idea for a wide variety of values and personal situations, and so smarter AGIs will more robustly tend to converge on this conclusion than humans do.)
Superintelligence that both lets humans survive (or revives cryonauts) and doesn’t enable indefinite lifespans is a very contrived package. Grading “doom” on concerns centrally about the first decades to centuries of post-AGI future (value/culture drift, successors, the next few generations of humanity) is not taking into account that the next billions+ years is also what could happen to you or people you know personally, if there is a future for originally-humans at all.
(This is analogous to the “missing mood” of not taking superintelligence into account when talking about future concerns of say 2040-2100 as if superintelligence isn’t imminent. In this case, the thing not taken into account is indefinite personal lifespans of people alive today, rather than the overall scope of imminent disruption of human condition.)
The people standing around can be valuable without doing any of the kind of work they are criticising. Not always, but it happens. (Of course this shouldn’t be a reason to get less vigilant about the other kind of people standing around, who are making things worse.) When it does happen and the bystanders are doing something valuable, putting an obligation on them can make them leave (or proactively never show up in the first place), which is damaging. Or it can motivate them to step up, which could be even more valuable than what they were doing originally.
My point is that it’s not obvious which of these is the case, as a matter of consequences of putting social pressure on such people. And therefore it’s wrong to insist that the pressure is good, unless it becomes clearer whether it actually is good. Also, it’s in any case misleading to equivocate between the people standing around making things worse, and the people who would only be standing around if there are no obligations, but who occasionally do something valuable in other ways.
One reason for the equivocation might be applicability of enforcement. If you’ve already decided that refusers of prosocial obligations must be punished (or exiled from a group), then you might be mentally putting them in the same bucket as the directly disruptive bystanders. But these are different categories, and treating them as equivalent begs the question of whether the refusers of prosocial obligations should really be punished.