It’s good that Anthropic’s system cards contain a lot of useful information on misalignment and risk. It’s also good that they are putting out detailed sabotage risk reports that articulate their views.[1] I appreciate the hard work that many employees put into these reports.
I wish other companies released similarly informative reports.[2]
I’m not necessarily claiming I agree with their risk reports. For instance, I have at least some moderately important disagreements with the Mythos Preview sabotage risk report update. But having lots of detail, including sufficient detail that I can get a decent sense of where I disagree with the analysis, is praiseworthy.
I’m not claiming that this is the most leveraged thing for safety-motivated employees at other companies to work on, but I do think it would be good for AI companies to do better (without trading off against other safety efforts).