I’m surprised if you think that variance in regulatory outcomes is not just more important than variance in what-a-company-does outcomes but also sufficiently tractable for the marginal company that it’s the key question.
Huh, I am not sure what you mean. I am surprised if you think that I think that marginal frontier-lab safety research is making progress on the hard parts of the alignment problem. I’ve been pretty open that I think valuable progress on that dimension has been close to zero.
This doesn’t mean the actions of an AI company do not matter, but I do think that no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans. So the key dimension is how much the labs’ actions might help prevent smarter-than-human systems from being deployed in the near future.
I think there are roughly two dimensions of variance where I do think AI lab behavior has a big effect on that:
Do they advocate for reasonable regulatory action and speak openly about the catastrophic risk from AI?
Of the people at leading labs, Elon has at least historically been the best on the latter; I don’t know about the former.
Do they have any plans that involve leveraging AI systems to improve the degree to which humanity can coordinate to not build superhuman systems in the near future?
xAI is in a very promising position here, and deploying Grok to Twitter (barring things like the MechaHitler incident) and generally leveraging Twitter seems like it could be really good. But Elon making Grok sycophantic to him seems quite bad for trust, so the sign here is unclear.
I think RSPs can be helpful inasmuch as they create common knowledge of risks. But no current company’s RSP commits the company to stopping or slowing down if risks get too high, and so at the company-policy level they seem largely powerless. I also think many RSPs and Risk Management Frameworks are active safety-washing, written with an intent to actively confuse people about the risks from AI, so marginally less of that is often good (though I do think there are real teeth to some evals and RSPs, definitely not all).
If a company says it thinks a model is safe on the basis of eval results
All current models are safe. No strongly superhuman future models are safe. They will stay safe until they non-trivially exceed top human performance. There, I did it.
Like, this is of course tongue-in-cheek, but I think it’s really important to not pretend that current safety evals are measuring the safety or risks of present systems. There is approximately no variance in how dangerous systems from different AI companies of the same capability level are. The key question, according to my models, is when these systems will become capable of disempowering humanity. None of these systems are anywhere remotely close to being aligned enough to take reasonable actions if they ever get into a position of disempowering humanity. The only evals that matter are the capability evals. No mitigations have ever helped with the fundamental risk model at all (I think there is value in reducing misuse risk, but misuse risk isn’t going to kill everyone, and so is many orders of magnitude less important than accident risk[1]).
Like, there is some important interplay between people racing and misuse risk, though I think stuff like bioterrorism evals do not really measure that either. There is also some interesting thinking to be done about the security requirements of RSPs; the sign here is confusing and unclear to me, but it could be a large effect size.
Quick shallow reply:
AI companies say that their models [except maybe Opus 4] don’t provide substantial bio misuse uplift. I think this is likely wrong and their work is very sloppy. See my blogpost AI companies’ eval reports mostly don’t support their claims and Ryan’s shortform on bio capabilities.
I think this is noteworthy, not because I’m worried about risk from current models but because it’s a bad sign about noticing risks when warning signs appear, being honest about risk/safety even when it makes you look bad, etc.
Edit: I guess your belief “no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans” is a crux; I disagree, and so I care about marginal differences in risk-preparedness.
I think this is noteworthy, not because I’m worried about risk from current models but because it’s a bad sign about noticing risks when warning signs appear, being honest about risk/safety even when it makes you look bad, etc.
I agree that this is the key dimension, but I don’t currently think RSPs are a great vehicle for that. Indeed, looking at the regulatory advocacy of a company seems like a much better indicator, since I expect that to have a bigger effect on the conversation about risk/safety than the RSP and eval results (though it’s not overwhelmingly clear to me).
And again, many RSPs and eval results seem to me to be active propaganda, and so are harmful on this dimension; it’s better to do nothing than to be harmful in this way (though I agree that if xAI said they would do a thing and then didn’t, that is quite bad).
I guess your belief “no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans” is a crux; I disagree, and so I care about marginal differences in risk-preparedness.
Makes sense. I am not overwhelmingly confident there isn’t something control-esque to be done here, though that’s the only real candidate I have, and it’s not currently clear to me that current safety evals correlate with control interventions being easier, harder, or even more likely to be implemented. For example, my sense is that having models trained for harmlessness makes them worse for control interventions; you would much rather have purely helpful + honest models.