Given the recent argument about whether Anthropic really committed to not push the frontier, or just misled most people into thinking that was the case, it’s worth rereading the RSPs in hairsplitting mode. Doing so, I noticed a few relevant findings:
Disclaimer: this is focused on negative stuff but does not deny the merits of RSPs etc etc.
I couldn’t find any sentence committing to not significantly increase extreme risks. OTOH I found statements that, if taken literally, could imply an implicit acknowledgment of the opposite: “our most significant immediate commitments include a high standard of security for ASL-3 containment, and a commitment not to deploy ASL-3 models until thorough red-teaming finds no risk of catastrophe.” Note that it makes a statement on the risk bearing only on deployment measures and not on security. Given that the lack of security is probably the biggest source of risk of ASL-3 systems & the biggest weakness of RSPs, I find it pretty likely that this is not random.
I found a number of commitments that are totally unenforceable in hairsplitting mode. Here are two examples:
“World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse.”
The use of five underdefined adjectives + “significantly” is a pretty safe barrier against any enforcement.
“When informed of a newly discovered model vulnerability enabling catastrophic harm (e.g. a jailbreak or a detection failure), we commit to mitigate or patch it promptly (e.g. 50% of the time in which catastrophic harm could realistically occur).”
The combination of “or”, the characterization of “promptly” as “50% of the time”, the use of “e.g.”, and “realistically” is also a safe barrier against enforceability.
It’s only my subjective judgment here and you don’t have to trust it but I also found Core Views on AI Safety to have a number of similar patterns.
There was a hot debate recently but regardless, the bottom line is just “RSPs should probably be interpreted literally and nothing else. If a literal statement is not strictly there, it should be assumed it’s not a commitment.”
I’ve not seen people doing very literal interpretation on those so I just wanted to emphasize that point.
I currently think Anthropic didn’t “explicitly publicly commit” to not advance the rate of capabilities progress. But, I do think they made deceptive statements about it, and when I complain about Anthropic I am complaining about deception, not “failing to uphold literal commitments.”
I’m not talking about the RSPs because the writing and conversations I’m talking about came before that. I agree that the RSP is more likely to be a good predictor of what they’ll actually do.
I think most of the generator for this was more like “in person conversations”, at least one of which was between Dario and Dustin Moskowitz:
The most explicit public statement I know is from this blogpost (which I agree is not an explicit commitment, but, I do think
Capabilities: AI research aimed at making AI systems generally better at any sort of task, including writing, image processing or generation, game playing, etc. Research that makes large language models more efficient, or that improves reinforcement learning algorithms, would fall under this heading. Capabilities work generates and improves on the models that we investigate and utilize in our alignment research. We generally don’t publish this kind of work because we do not wish to advance the rate of AI capabilities progress. In addition, we aim to be thoughtful about demonstrations of frontier capabilities (even without publication). We trained the first version of our headline model, Claude, in the spring of 2022, and decided to prioritize using it for safety research rather than public deployments. We’ve subsequently begun deploying Claude now that the gap between it and the public state of the art is smaller.
This debate comes from before the RSP so I don’t actually think that’s cruxy. Will try to dig up an older post.
If you wanna reread the debate, you can scroll through this thread (https://x.com/bshlgrs/status/1764701597727416448).