Vaniver’s thoughts on Anthropic’s RSP

Announcement, Policy v1.0, evhub’s argument in favor on LW. These are my personal thoughts; in the interest of full disclosure, one of my housemates and several of my friends work at Anthropic; my spouse and I hold OpenAI units (but are financially secure without them). This post has three main parts: applause for things done right, a summary / review of RSPs in general, and then specific criticisms and suggestions for what to improve about Anthropic’s RSP.

First, the things to applaud. Anthropic’s RSP makes two important commitments: that they will manage their finances to allow for pauses as necessary, and that a single directly responsible individual is charged with ensuring the commitments are met and a quarterly report on them is made. Both are the sort of thing that represents genuine organizational commitment rather than lip service. I think it’s great for companies to be open about what precautions they’re taking to ensure the development of advanced artificial intelligence benefits humanity, even if I don’t find those policies fully satisfactory.

Second, the explanation. Following the model of biosafety levels, where labs must meet defined standards in order to work with specific dangerous pathogens, Anthropic suggests AI safety levels, or ASLs, with corresponding standards for labs working with dangerous models. While the makers of BSL could list specific pathogens to populate each tier, ASL tiers must necessarily be speculative. Previous generations of models are ASL-1 (BSL-1 corresponds to no threat of infection in healthy adults), current models like Claude count as ASL-2 (BSL-2 corresponds to moderate health hazards, like HIV), and models in the next tier, which either substantially increase baseline levels of catastrophic risk or are capable of autonomy, count as ASL-3 (BSL-3 corresponds to potentially lethal inhalable diseases, like SARS-CoV-1 and 2).

While BSL tops out at 4, ASL is left unbounded for now, with a commitment to define ASL-4 before using ASL-3 models.[1] This means always having a defined ceiling on what capabilities would call for increased investment in security practices, while not engaging in too much armchair speculation about how AI development will proceed.

The idea behind RSPs is that rather than pausing at an arbitrary start-date for an arbitrary amount of time (or simply shutting it all down), capability thresholds are used to determine when to start model-specific pauses, and security thresholds are used to determine when to unpause development on that model. RSPs are meant to demand active efforts to determine whether or not models are capable of causing catastrophic harm, rather than simply developing them blind. They seem substantially better than scaling without an RSP or equivalent.
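To make that decision rule concrete, here is a minimal sketch of the pause/unpause logic as I understand it; the names, data structure, and return strings below are my own hypothetical illustration, not anything specified in the RSP:

```python
from dataclasses import dataclass

# Illustrative sketch only: the fields and function here are hypothetical,
# not taken from the RSP itself.

@dataclass
class EvalResult:
    capability_threshold_crossed: bool  # did dangerous-capability evals trigger?
    security_standard_met: bool         # does the lab meet the next ASL's security standard?

def next_action(result: EvalResult) -> str:
    """Decide whether scaling of this model can continue.

    Capability thresholds determine when a model-specific pause begins;
    security thresholds determine when it ends.
    """
    if not result.capability_threshold_crossed:
        # Below the capability threshold: continue under current-ASL precautions.
        return "continue scaling under current ASL measures"
    if result.security_standard_met:
        # Threshold crossed, but the stronger security standard is already in place.
        return "continue under next-ASL security measures"
    # Threshold crossed without the required security in place: pause work on
    # this model until the lab meets the higher standard.
    return "pause this model until next-ASL security standard is met"

if __name__ == "__main__":
    print(next_action(EvalResult(capability_threshold_crossed=True,
                                 security_standard_met=False)))
```

The contrast with a calendar-based pause is that the trigger here is the outcome of an evaluation, not a pre-chosen date or duration.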

Third, why am I not yet satisfied with Anthropic’s RSP? Criticisms and suggestions in roughly decreasing order of importance:

  1. The core assumption underlying the RSP is that model capabilities can either be predicted in advance or discovered through dedicated testing before deployment. Testing can only prove the presence of capabilities, not their absence, yet this method rests on treating absence of evidence as evidence of absence. I think capabilities might appear suddenly at an unknown time. Anthropic’s approach calls for scaling at a particular rate to lower the chance of this sudden appearance; I’m not yet convinced that their rate is sufficient to handle the risk here. It is still better to look for those capabilities than not to look, but pre-deployment testing is inadequate for continued safety.[2] I think the RSP could include more commitments to post-deployment monitoring at the ASL-2 stage, to ensure that a deployed model still counts as ASL-2.

  2. This concern is also reflected in the classification of models before testing rather than after testing. The RSP treats ASL-2 models and ASL-3 models differently, with stricter security standards in place for ASL-3 labs to make it safer to work with ASL-3 models. However, before a model is tested, it is unclear which category it belongs in, and the RSP is unclear on how to handle that ambiguity. While presuming models are ASL-2 is more convenient (as frontier labs can continue to do scaling research with approximately their current levels of security), presuming they’re ASL-3 is more secure. By the time a red-teaming effort determines that a model is capable of proliferating itself onto the internet, or of enabling catastrophic terrorist attacks, it might be too late to properly secure the model, even if training is immediately halted and deployment delayed. [A lab not hardened against terrorist infiltration might leak that it has an ASL-3 model when it triggers the pause needed to secure it, which then potentially allows the model to be stolen.]

  3. The baseline chosen for the riskiness of information provided by the model is what is available on search engines. While this is limiting for responsible actors, it seems easy to circumvent, since the baseline shifts as new information gets posted online. The RSP already excludes other advanced AI systems from the baseline, but it seems to me that a static baseline (such as what was available on search engines in 2021) would be harder to bypass.[3]

  4. Importantly, it seems likely to me that this approach will work well initially in a way that might not correspond to how well it will work when models have grown capable enough to potentially disempower humanity. That is, this might successfully manage the transition from ASL-2 to ASL-3 but not be useful for the transition from ASL-3 to ASL-4, while mistakenly giving us the impression that the system is working. The RSP plans to lay the track ahead of the locomotive, expecting that we can foresee how to test for dangerous capabilities and identify how to secure them as needed, or have the wisdom to call a halt in the future when we fail to do so.

Overall, this feels to me like a move towards adequacy, and it’s good to reward those moves; I appreciate that Anthropic’s RSP has the feeling of a work-in-progress as opposed to being presented as clearly sufficient to the task.

  1. ^

    BSL-4 diseases have basically the same features as BSL-3 diseases, except that there are also no available vaccines or treatments. Also, all extraterrestrial samples are BSL-4 by default, to a standard more rigorous than any current BSL-4 lab could meet.

  2. ^

    The implied belief is that model capabilities primarily come from training during scaling, and that our scaling laws (and various other things) are sufficient to predict model capabilities. Advancements in how to deploy models might break this, as would the scaling laws failing to predict capabilities.

  3. ^

    It does have some convenience costs: if the baseline were set in 2019, for example, then the model might not be able to talk about coronavirus, even though AI development and the pandemic were independent.