I’m worried about this not because of risk from current models, but because it’s a bad sign about noticing risks when warning signs appear, being honest about risk/safety even when it makes you look bad, etc.
I agree that this is the key dimension, but I don’t currently think RSPs are a great vehicle for that. Indeed, looking at the regulatory advocacy of a company seems like a much better indicator, since I expect that to have a bigger effect on the conversation about risk/safety than the RSP and eval results (though it’s not overwhelmingly clear to me).
And again, many RSPs and eval results seem to me to be active propaganda, and so are harmful on this dimension, and it’s better to do nothing than to be harmful in this way (though I agree that if xAI said they would do a thing and then didn’t, that is quite bad).
I guess your belief “no actions that seem at all plausible for any current AI company to take have really any chance of making it so that it’s non-catastrophic for them to develop and deploy systems much smarter than humans” is a crux; I disagree, and so I care about marginal differences in risk-preparedness.
Makes sense. I am not overwhelmingly confident there isn’t something control-esque to be done here, though that’s the only real candidate I have, and it’s not currently clear to me whether current safety evals correlate with control interventions being easier or harder to implement, or even with their being more likely to be implemented at all. For example, my sense is that training models for harmlessness makes them worse for control interventions; you would much rather have purely helpful + honest models.