FWIW, I think the key question is to understand the regulatory demands that xAI is making. It’s not like the RSPs or safety evaluations will really tell anyone that much new. Indeed, it seems very sad to evaluate the quality of a frontier company’s safety work on the basis of the pretty-fake-seeming RSPs and Risk Management Frameworks that other companies have created, which seem pretty powerless. Clearly, if one were thinking from first principles about what a responsible company should do, those are not the relevant dimensions.
I don’t know what the current lobbying and advocacy efforts of xAI are, but if they are absent, then it seems to me like xAI is messing up less than e.g. OpenAI, and if they are calling for real and weighty regulations (as at least Elon has done in the past, though he seems to have changed his tune recently), then that seems like it would matter more.
Edit: To be clear, a thing I do think really matters is keeping your commitments, even if they committed you to doing things I don’t think are super important. So on this dimension, xAI does seem like it messed up pretty badly, given this:
“We plan to release an updated version of this policy within three months” but it was published on Feb 10, over five months ago.
I disagree that this is the “key question.” I think most of a frontier company’s effect on P(doom) is the quality of its preparation for safety when models are dangerous, not its effect on regulation. I’m surprised if you think that variance in regulatory outcomes is not just more important than variance in what-a-company-does outcomes but also sufficiently tractable for the marginal company that it’s the key question.
I share your pessimism about RSPs and evals, but I think they’re informative in various ways. E.g.:
If a company says it thinks a model is safe on the basis of eval results, but those evals are terrible or are interpreted incorrectly, that’s a bad sign.
What an RSP says about how the company plans to respond to misuse risks gives you some evidence about whether it’s thinking at all seriously about safety — does it say “we will implement mitigations to reduce our score on bio evals to safe levels” or “we will implement mitigations and then assess how robust they are”?
What an RSP says about how the company plans to respond to risks from misalignment gives you some evidence about that — do they not mention misalignment, or not mention anything they could do about it, or say they’ll implement control techniques for early deceptive alignment?
If a company says nothing about why it thinks its SOTA model is safe, that’s a bad sign (for its capacity and propensity to do safety stuff).
Plus of course if a company isn’t trying to prepare for extreme risks, that’s bad.
And the xAI signs are bad.
I’m surprised if you think that variance in regulatory outcomes is not just more important than variance in what-a-company-does outcomes but also sufficiently tractable for the marginal company that it’s the key question.
Huh, I am not sure what you mean. I am surprised if you think that I think that marginal frontier-lab safety research is making progress on the hard parts of the alignment problem. I’ve been pretty open that I think valuable progress on that dimension has been close to zero.
This doesn’t mean actions of an AI company do not matter, but I do think that no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans. So the key dimension is how much the labs’ actions might prevent smarter-than-human systems from being deployed in the near future.
I think there are roughly two dimensions of variance where I do think AI lab behavior has a big effect on that:
Do they advocate for reasonable regulatory action and speak openly about the catastrophic risk from AI?
Of the people at leading labs, Elon has at least historically been the best on the latter; I don’t know about the former.
Do they have any plans that involve leveraging AI systems to improve the degree to which humanity can coordinate to not build superhuman systems in the near future?
xAI is in a very promising position here, and deploying Grok on Twitter (barring things like the MechaHitler incident) and generally leveraging Twitter seems like it could be really good. But Elon making Grok sycophantic to him also seems quite bad for trust, so the sign here is unclear.
I think RSPs can be helpful inasmuch as they create common knowledge of risks. No current company’s RSP commits the company to stopping or slowing down if risks get too high, and so at the company-policy level they seem largely powerless. I also think many RSPs and Risk Management Frameworks are active safety-washing, written with an intent to actively confuse people about the risks from AI, so marginally less of that is often good (though I do think there are real teeth to some evals and RSPs, but definitely not all).
If a company says it thinks a model is safe on the basis of eval results
All current models are safe. No strongly superhuman future models are safe. They will stay safe until they non-trivially exceed top human performance. There, I did it.
Like, this is of course tongue-in-cheek, but I think it’s really important to not pretend that current safety evals are measuring the safety or risks of present systems. There is approximately no variance in how dangerous systems from different AI companies of the same capability level are. The key question, according to my models, is when these systems will become capable of disempowering humanity. None of these systems are anywhere remotely close to being aligned enough to take reasonable actions if they ever get into a position to disempower humanity. The only evals that matter are the capability evals. No mitigations have ever helped with the fundamental risk model at all (I think there is value in reducing misuse risk, but misuse risk isn’t going to kill everyone, and so is many orders of magnitude less important than accident risk[1]).
Like, there is some important interplay between people racing and misuse risk, though I think stuff like bioterrorism evals do not really measure that either. There is also some interesting thinking to be done about the security requirements of RSPs; the sign here is confusing and unclear to me, but it could be a large effect size.
Quick shallow reply:
AI companies say that their models [except maybe Opus 4] don’t provide substantial bio misuse uplift. I think this is likely wrong and their work is very sloppy. See my blogpost AI companies’ eval reports mostly don’t support their claims and Ryan’s shortform on bio capabilities.
I think this is noteworthy, not because I’m worried about risk from current models but because it’s a bad sign about noticing risks when warning signs appear, being honest about risk/safety even when it makes you look bad, etc.
Edit: I guess your belief “no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans” is a crux; I disagree, and so I care about marginal differences in risk-preparedness.
not because I’m worried about risk from current models but because it’s a bad sign about noticing risks when warning signs appear, being honest about risk/safety even when it makes you look bad, etc.
I agree that this is the key dimension, but I don’t currently think RSPs are a great vehicle for that. Indeed, looking at the regulatory advocacy of a company seems like a much better indicator, since I expect that to have a bigger effect on the conversation about risk/safety than the RSP and eval results (though it’s not overwhelmingly clear to me).
And again, many RSPs and eval results seem to me to be active propaganda, and so are harmful on this dimension, and it’s better to do nothing than to be harmful in this way (though I agree that if xAI said they would do a thing and then didn’t, then that is quite bad).
I guess your belief “no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans” is a crux; I disagree, and so I care about marginal differences in risk-preparedness.
Makes sense. I am not overwhelmingly confident there isn’t something control-esque to be done here, though that’s the only real candidate I have, and it’s not currently clear to me whether current safety evals correlate with control interventions being easier, harder, or even more likely to be implemented. For example, my sense is that training models for harmlessness makes them worse for control interventions; you would much rather have purely helpful + honest models.