iiuc, xAI claims Grok 4 is SOTA and that’s plausibly true, but xAI didn’t do any dangerous capability evals, doesn’t have a safety plan (their draft Risk Management Framework is unusually light on detail relative to other companies’ similar policies and isn’t a real safety plan, and it said “We plan to release an updated version of this policy within three months” but it was published on Feb 10, over five months ago), and has done nothing else on x-risk.
That’s bad. I write very little criticism of xAI (and Meta) because there’s much less to write about than there is for OpenAI, Anthropic, and Google DeepMind — but that’s because xAI doesn’t do things for me to write about, which is downstream of it being worse! So this is a reminder that xAI is doing nothing on safety afaict and that’s bad/shameful/blameworthy.[1]
This does not mean safety people should refuse to work at xAI. On the contrary, I think it’s great to work on safety at companies that are very bad on safety but likely to be among the first to develop very powerful AI, especially for certain kinds of people. Obviously this isn’t always true and this story failed for many OpenAI safety staff; I don’t want to argue about this now.
Grok 4 is not just plausibly SOTA: the opening slide of its livestream presentation (at 2:29 in the video) slightly ambiguously suggests that Grok 4 used as much RLVR training as it had pretraining, which is itself at the frontier level (100K H100s, plausibly about 3e26 FLOPs). This amount of RLVR scaling was never claimed before (it’s not being claimed very clearly here either, but it is what the literal interpretation of the slide says; also almost certainly the implied compute parity is in terms of GPU-time, not FLOPs).
Thus it’s plausibly a substantially new kind of model, not just a well-known kind of model with SOTA capabilities, and so it could be unusually impactful to study its safety properties.
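As a rough sanity check on the “plausibly about 3e26 FLOPs” figure, here is a back-of-envelope sketch; the training duration, per-GPU throughput, and utilization are assumptions chosen for illustration, not numbers from the livestream:

```python
# Back-of-envelope estimate of frontier pretraining compute on 100K H100s.
# Assumed values (not from the livestream): ~3 months of training,
# ~1e15 FLOP/s dense BF16 peak per H100, ~40% utilization.
num_gpus = 100_000
peak_flops_per_gpu = 1e15      # roughly H100 dense BF16 peak
utilization = 0.40             # assumed model FLOPs utilization
seconds = 90 * 24 * 3600       # ~3 months

pretrain_flops = num_gpus * peak_flops_per_gpu * utilization * seconds
print(f"{pretrain_flops:.1e}")  # ~3.1e26 FLOPs, consistent with the ~3e26 estimate
```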
Another takeaway from the livestream is the following bit of AI risk attitude Musk shared (at 14:29):

It’s somewhat unnerving to have intelligence created that is far greater than our own. And it’ll be bad or good for humanity. I think it’ll be good, most likely it’ll be good. But I’ve somewhat reconciled myself to the fact that even if it wasn’t gonna be good, I’d at least like to be alive to see it happen.
What do you think of this argument that Grok 4 used only ~1/5th as much RLVR training as pretraining (~3e26 pretraining + ~6e25 RLVR)? https://x.com/tmychow/status/1943460487565578534

RLVR involves decoding (generating) long sequences of 10K-50K tokens, so its compute utilization is much worse than pretraining’s, especially on H100/H200 if the whole model doesn’t fit in one node (scale-up world). The usual distinction in input/output token prices reflects this, since processing of input tokens (prefill) is algorithmically closer to pretraining, while processing of output tokens (decoding) is closer to RLVR.
The 1:5 ratio in API prices for input and output tokens is somewhat common (it’s this way for Grok 3 and Grok 4), and it might reflect the ratio in compute utilization, since the API provider pays for GPU-time rather than the actually utilized compute. So if Grok 4 used the same total GPU-time for RLVR as it used for pretraining (such as 3 months on 100K H100s), it might’ve used 5 times less FLOPs in the process. This is what I meant by “compute parity is in terms of GPU-time, not FLOPs” in the comment above.
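To make that arithmetic explicit, here is a minimal sketch, assuming the ~3e26 pretraining figure from above and treating the 1:5 price ratio as a proxy for the utilization gap:

```python
# If RLVR ran for the same GPU-time as pretraining but at ~5x lower compute
# utilization (mirroring the 1:5 input:output token price ratio), the FLOPs
# actually spent on RLVR come out ~5x lower than the pretraining FLOPs.
pretrain_flops = 3e26          # assumed pretraining compute (see estimate above)
utilization_ratio = 1 / 5      # assumed RLVR-vs-pretraining utilization

rlvr_flops = pretrain_flops * utilization_ratio
print(f"{rlvr_flops:.0e}")     # ~6e25 FLOPs, matching the ~6e25 guess in the question above
```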
GB200 NVL72 (13TB HBM) will improve utilization during RLVR for large models that don’t fit in H200 NVL8 (1.1TB) or B200 NVL8 (1.4TB) nodes with room to spare for KV cache, which is likely all of the 2025 frontier models. So this opens the possibility both of doing a lot of RLVR in reasonable time for even larger models (such as compute optimal models at 5e26 FLOPs), and of using longer reasoning traces than the current 10K-50K tokens.
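A rough illustration of the memory constraint; the 3T-parameter model size and the KV-cache budget below are hypothetical numbers chosen for illustration, not claims about any particular model:

```python
# Does a model (FP8 weights, 1 byte/param) plus a KV-cache budget fit in a
# single scale-up domain? The model size here is hypothetical, for illustration.
total_params = 3e12            # hypothetical 3T-total-param MoE -> ~3 TB in FP8
kv_cache_tb = 1.0              # assumed KV-cache budget for long RLVR rollouts

needed_tb = total_params / 1e12 + kv_cache_tb
for name, hbm_tb in [("H200 NVL8", 1.1), ("B200 NVL8", 1.4), ("GB200 NVL72", 13.0)]:
    print(f"{name}: need ~{needed_tb:.1f} TB, have {hbm_tb} TB, fits: {needed_tb <= hbm_tb}")
```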
Or 6e26 (in FP8 FLOPs).

And already on February 17th, Colossus had 150K+ GPUs. It seems that in the April message they were talking about 200K GPUs. Judging by Musk’s interview, this could mean 150,000 H100s and 50,000 H200s.
Perhaps the time and GPUs were enough to train a GPT-5-scale model?
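For scale, here is a rough sketch of what ~200K H100-class GPUs could deliver, using the same assumed duration and utilization as the earlier estimate (H200 has roughly the same peak compute as H100, just more memory):

```python
# Rough upper bound on what ~200K H100-class GPUs could deliver in ~3 months,
# with the same assumed utilization as the 100K-H100 estimate above.
num_gpus = 150_000 + 50_000    # 150K H100 + 50K H200 (similar peak FLOP/s)
peak_flops_per_gpu = 1e15      # dense BF16, roughly H100/H200 peak
utilization = 0.40             # assumed
seconds = 90 * 24 * 3600       # ~3 months, assumed

available_flops = num_gpus * peak_flops_per_gpu * utilization * seconds
print(f"{available_flops:.1e}")  # ~6.2e26 FLOPs, roughly 2x the 100K-H100 estimate
```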
The 10x Grok 2 claims weakly suggest 3e26 FLOPs rather than 6e26 FLOPs. The same opening slide of the Grok 4 livestream claims parity between Grok 3 and Grok 4 pretraining, and Grok 3 didn’t have more than 100K H100s to work with. API prices for Grok 3 and Grok 4 are also the same and relatively low ($3/$15 per input/output 1M tokens), so they might even be using the same pretrained model (or in any case a similarly-sized one).
Grok 3 has been in use since early 2025, before GB200 NVL72 systems were available in sufficient numbers, so it needs to be smaller than a compute optimal model for 100K-H100 compute. At 1:8 MoE sparsity (active:total params), it’s compute optimal to have about 7T total params at 5e26 FLOPs, which in FP8 comfortably fits in one GB200 NVL72 rack (13TB of HBM). So in principle a compute optimal system could be deployed right now, even in a reasoning form, but it would still cost more, and it would need more GB200s than xAI seems to have to spare currently (even the near-future GB200s will be needed more urgently for RLVR, if the above RLVR scaling interpretation of Grok 4 is correct).
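A quick check of the fits-in-one-rack claim and the implied token budget; the FP8 weights and the standard compute ≈ 6 × active params × tokens approximation are the assumptions here:

```python
# Check: does a ~7T-total-param model in FP8 (1 byte/param) fit in one
# GB200 NVL72 rack, and what token budget does 5e26 FLOPs imply?
total_params = 7e12
weights_tb = total_params / 1e12        # ~7 TB of FP8 weights
rack_hbm_tb = 13.0                      # GB200 NVL72 HBM
print(f"weights ~{weights_tb:.0f} TB, headroom ~{rack_hbm_tb - weights_tb:.0f} TB for KV cache")

active_params = total_params / 8        # 1:8 sparsity (active:total)
tokens = 5e26 / (6 * active_params)     # compute ~ 6 * active params * tokens
print(f"implied training tokens: {tokens:.1e}")  # ~9.5e13 tokens
```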
FWIW, I think the key question is to understand the regulatory demands that xAI is making. It’s not like the RSPs or safety evaluations will really tell anyone that much new. Indeed, it seems very sad to evaluate the quality of a frontier company’s safety work on the basis of the pretty fake-seeming RSPs and Risk Management Frameworks that other companies have created, which seem pretty powerless. Clearly, if one was thinking from first principles about what a responsible company should do, those are not the relevant dimensions.
I don’t know what the current lobbying and advocacy efforts of xAI are, but if they are absent then it seems to me like they are messing up less than e.g. OpenAI, and if they are calling for real and weighty regulations (as at least Elon has done in the past, though he seems to have changed his tune recently), then that seems like it would matter more.
Edit: To be clear, a thing I do think really matters is keeping your commitments, even if they committed you to doing things I don’t think are super important. So on this dimension, xAI does seem like it messed up pretty badly, given this:
“We plan to release an updated version of this policy within three months” but it was published on Feb 10, over five months ago
I disagree that this is the “key question.” I think most of a frontier company’s effect on P(doom) is the quality of its preparation for safety when models are dangerous, not its effect on regulation. I’m surprised if you think that variance in regulatory outcomes is not just more important than variance in what-a-company-does outcomes but also sufficiently tractable for the marginal company that it’s the key question.
I share your pessimism about RSPs and evals, but I think they’re informative in various ways. E.g.:
If a company says it thinks a model is safe on the basis of eval results, but those evals are terrible or are interpreted incorrectly, that’s a bad sign.
What an RSP says about how the company plans to respond to misuse risks gives you some evidence about whether it’s thinking at all seriously about safety — does it say “we will implement mitigations to reduce our score on bio evals to safe levels” or “we will implement mitigations and then assess how robust they are”?
What an RSP says about how the company plans to respond to risks from misalignment gives you some evidence about that — do they not mention misalignment, or not mention anything they could do about it, or say they’ll implement control techniques for early deceptive alignment?
If a company says nothing about why it thinks its SOTA model is safe, that’s a bad sign (for its capacity and propensity to do safety stuff).
Plus of course if a company isn’t trying to prepare for extreme risks, that’s bad.

And the xAI signs are bad.
I’m surprised if you think that variance in regulatory outcomes is not just more important than variance in what-a-company-does outcomes but also sufficiently tractable for the marginal company that it’s the key question.
Huh, I am not sure what you mean. I am surprised if you think that I think that marginal frontier-lab safety research is making progress on the hard parts of the alignment problem. I’ve been pretty open that I think valuable progress on that dimension has been close to zero.
This doesn’t mean actions of an AI company do not matter, but I do think that no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans. So the key dimension is how much the labs’ actions might prevent smarter-than-human systems from being deployed in the near future.
I think there are roughly two dimensions of variance where I do think AI lab behavior has a big effect on that:
Do they advocate for reasonable regulatory action and speak openly about the catastrophic risk from AI?
Elon, of the people at leading labs, has at least historically been the best on the latter; I don’t know about the former.
Do they have any plans that involve leveraging AI systems to improve the degree to which humanity can coordinate to not build superhuman systems in the near future?
xAI is in a very promising position here, and deploying Grok on Twitter (barring things like the MechaHitler incident) and generally leveraging Twitter seems like it could be really good. But Elon making Grok sycophantic to him also seems quite bad for trust, so the sign here is unclear.
I think RSPs can be helpful insofar as they create common knowledge of risks. No current company’s RSP commits the company to stopping or slowing down if risks get too high, so at the company-policy level they seem largely powerless. I also think many RSPs and Risk Management Frameworks are active safety washing, written with an intent to actively confuse people about the risks from AI, so marginally less of that is often good (though I do think there are real teeth to some evals and RSPs, but definitely not all).
If a company says it thinks a model is safe on the basis of eval results
All current models are safe. No strongly superhuman future models are safe. They will stay safe until they non-trivially exceed top human performance. There, I did it.
Like, this is of course tongue-in-cheek, but I think it’s really important to not pretend that current safety evals are measuring the safety or risks of present systems. There is approximately no variance in how dangerous systems from different AI companies of the same capability level are. The key question, according to my models, is when these systems will become capable of disempowering humanity. None of these systems are anywhere remotely close to being aligned enough to take reasonable actions if they ever get into a position of disempowering humanity. The only evals that matter are the capability evals. No mitigations have ever helped with the fundamental risk model at all (I think there is value in reducing misuse risk, but misuse risk isn’t going to kill everyone, and so is many orders of magnitude less important than accident risk[1]).
Like, there is some important interplay between people racing and misuse risk, though I think stuff like bioterrorism evals do not really measure that either. There is also some interesting thinking to be done about the security requirements of RSPs; the sign here is confusing and unclear to me, but it could be a large effect size.
Quick shallow reply:

AI companies say that their models [except maybe Opus 4] don’t provide substantial bio misuse uplift. I think this is likely wrong and their work is very sloppy. See my blogpost “AI companies’ eval reports mostly don’t support their claims” and Ryan’s shortform on bio capabilities.

I think this is noteworthy, not because I’m worried about risk from current models but because it’s a bad sign about noticing risks when warning signs appear, being honest about risk/safety even when it makes you look bad, etc.
Edit: I guess your belief “no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans” is a crux; I disagree, and so I care about marginal differences in risk-preparedness.
not because I’m worried about risk from current models but because it’s a bad sign about noticing risks when warning signs appear, being honest about risk/safety even when it makes you look bad, etc.
I agree that this is the key dimension, but I don’t currently think RSPs are a great vehicle for that. Indeed, looking at the regulatory advocacy of a company seems like a much better indicator, since I expect that to have a bigger effect on the conversation about risk/safety than the RSP and eval results (though it’s not overwhelmingly clear to me).
And again, many RSPs and eval results seem to me to be active propaganda, and so are harmful on this dimension, and it’s better to do nothing than to be harmful in this way (though I agree that if xAI said they would do a thing and then didn’t, then that is quite bad).
I guess your belief “no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans” is a crux; I disagree, and so I care about marginal differences in risk-preparedness.
Makes sense. I am not overwhelmingly confident there isn’t something control-esque to be done here, though that’s the only real candidate I have, and it’s not currently clear to me that current safety evals positively correlate with control interventions being easier or harder or even more likely to be implemented. For example, my sense is that having models trained for harmlessness makes them worse for control interventions; you would much rather have pure helpful + honest models.
Update: xAI safety advisor Dan Hendrycks tweets:
(I wonder what they were, whether they were done well, what the results were, whether xAI thinks they rule out dangerous capabilities...)
Waiting for elaboration on that then.
Not releasing safety eval data on day 0 is a bad vibe, but releasing it after you release the model is better than not releasing it at all.
Mmm.