Auditors find an issue, and your reaction is "Oh, we forgot to fix that, we'll fix it now"? I've participated in IT system audits (not in the AI space), and when auditors find an issue, you fix it, figure out why it occurred in the first place, and then re-audit that part to make sure the issue is actually gone and the fix didn't introduce new issues. When the auditors find only easy-to-find issues, you don't claim the system has been audited after you fix them. You worry about how many hard-to-find issues went undetected because the auditing time was spent on simple ones.
Anthropic’s RSP doesn’t actually require that an external audit greenlight deployment, merely that external expert feedback be solicited. Still, I’m quite surprised that there are no audit results from Apollo Research (or some other organization) for the final version.
How serious do you think the issue is that Apollo identified? Certainly, it doesn’t seem like it could pose a catastrophic risk—it’s not concerning from a biorisk perspective if you buy that the ASL-3 defenses are working properly, and I don’t think there are really any other catastrophic risks to be too concerned about from these models right now. Maybe it might try to incompetently attempt internal research sabotage if you accidentally gave it a system prompt you didn’t realize was leading it in that direction?
Generally, I think it just seems to me like “take this very seriously and ensure you’ve fixed it and audited the fix prior to release because this could be dangerous right now” makes less sense as a response than “do your best to fix it and publish as much as you can about it to improve understanding for when it could be dangerous in smarter models”.
I’m drawing parallels between conventional system auditing and AI alignment assessment, and I’m admittedly not sure whether my intuitions transfer over correctly. I’m certainly not expecting the same processes to be followed here, but many of the principles should still hold.
We believe that these findings are largely but not entirely driven by the fact that this early snapshot had severe issues with deference to harmful system-prompt instructions. [..] This issue had not yet been mitigated as of the snapshot that they tested.
In my experience, if an audit finds lots of issues, it means nobody had time to look for the hard-to-find ones. I get the same feeling from this section: Apollo easily found scheming issues where the model deferred to the system prompt too much. Subtler issues often get completely shadowed; for example, some findings could be attributed to system-prompt deference when they were actually caused by something else.
To help reduce the risk of blind spots in our own assessment, we contracted with Apollo Research to assess an early snapshot for propensities and capabilities related to sabotage
What I’m worried about is that these potential blind spots were not found, as per my reasoning above. I think the marginal value of a second external assessment wasn’t diminished much by the first one. That said, I agree that deploying Claude 4 is quite unlikely to pose any catastrophic risks, especially with ASL-3 safeguards. Deploying earlier, which allows anyone to run evaluations on the model, is also valuable.