My notes on the RSP:
I like the shift to conditional, risk-based commitments over unilateral ones. Unilateral absolute-risk-based commitments were never decision-theoretically or morally optimal, IMO.
I also like that there’s a lot of built-in transparency; it feels like keeping the public informed was a core design principle, and I appreciate it. I like much of the roadmap, and I especially like the details on upholding Claude’s Constitution.
The RSP requires risk reports to include a risk-benefit determination, and says the CEO and RSO make the ultimate call on deployment. But it doesn’t say the CEO and RSO must decide *based on* that determination, leaving open the possibility that a risk report finds the risks outweigh the benefits and the CEO and RSO move forward anyway. I’m also wary of the implied “benefits outweigh the risks” criterion and would be more comfortable with something like “limit marginal risk to a negligible amount.” I’d want to know more about how the CEO and RSO intend to make these decisions, and I think that should be highlighted clearly.
An equivalent to ASL-5 / a superintelligence threshold also seems to be entirely missing. Does Anthropic believe the effects of arbitrarily advanced AI are captured by “automated R&D”?
In section 1, the “Mitigations—our plan as a company” column and the “Mitigations—ambitious industry-wide recommendations” column use different ontologies and are hard to compare. What bad things should we expect to happen by default if only the current plans are achieved? For instance, re “Novel chemical/biological weapons”: do I understand correctly that we should actively expect “catastrophic damages far beyond … COVID-19” unless other AI labs implement better mitigations than Anthropic is currently planning by the time they reach this threshold? Do the mitigations in the roadmap, such as the “moonshot R&D for security” projects, cover “roughly in line with RAND SL4”?
I’m a bit fuzzy on how training and deployment decisions are made between risk reports. It’s unclear whether risk reports include proactive assessment of hypothetical next-generation models, whether they cover internally deployed models that only get discussed 30 days after the fact, and whether the board and LTBT reliably get a say when a deployment’s risk analysis is marginal-based.
Will we find out when Anthropic actually chooses to delay training or deployment? If so, we could credit them when they do, and otherwise know this possibility hasn’t yet materialized.
As acknowledged in the RSP, the under-specified thresholds are a significant weakness.
This is a lightly edited version of my tweet thread.
This was Pliny’s response when I asked them whether they could get around the classifiers. I’m not fully confident this counts, but Pliny seems to think so.