Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs (which contain if-then commitments and more generally describe safety procedures) becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework (though without any concrete publicly known policies yet), many companies agreed to the Seoul commitments, which require making a similar policy, and SB-1047 required safety and security protocols.
However, I think the way Anthropic presented their RSP was misleading in practice (at least misleading to the AI safety community): the RSP doesn’t strictly require pausing, and I don’t expect Anthropic to pause in practice until they have sufficient safeguards. I discuss why I think pausing until sufficient safeguards are in place is unlikely, at least on timelines as short as Dario’s (Dario Amodei is the CEO of Anthropic), in my earlier post.
I also have serious doubts about whether the LTBT will serve as a meaningful check to ensure Anthropic serves the interests of the public. The LTBT has seemingly done very little thus far: it has appointed only 1 board member despite being able to appoint 3/5 of the board members (a majority), and it is itself down to only 3 members, none of whom have technical expertise related to AI. (The LTBT trustees seem altruistically motivated and seem like they would be thoughtful about how to widely distribute the benefits of AI, but this is different from being able to evaluate whether Anthropic is making good decisions with respect to AI safety.)
Additionally, in this article, Anthropic’s general counsel Brian Israel seemingly claims that the board probably couldn’t fire the CEO (currently Dario) in a case where the board believed doing so would greatly reduce profits to shareholders[1]. Almost all of a board’s hard power comes from being able to fire the CEO, so if this claim were true, it would greatly undermine the ability of the board (and the LTBT, which appoints the board) to ensure that Anthropic, a public benefit corporation, serves the interests of the public when those interests conflict with shareholder interests. In practice, I think this claim is likely false: because Anthropic is a public benefit corporation, the board could fire the CEO and win in court even if they openly thought this would massively reduce shareholder value (so long as the board could show they used a reasonable process to consider shareholder interests and decided that the public interest outweighed them in this case). Regardless, this article is evidence that the LTBT won’t provide a meaningful check on Anthropic in practice.
Misleading communication about the RSP
On the RSP, this post says:
On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures.
While I think this exact statement might be technically true, people have sometimes interpreted this quote and similar statements as a claim that Anthropic would pause until their safety measures sufficed for more powerful models. I think Anthropic isn’t likely to do this; in particular:
The RSP leaves open the option of revising it to reduce required countermeasures (so pausing is only required until the policy is changed).
The quoted statement implies that the required countermeasures would suffice to ensure a reasonable level of safety, but given that commitments still haven’t been made for ASL-4 (the level at which existential or near-existential risks become plausible), and that there aren’t clear procedural reasons to expect the countermeasures to be adequate, I don’t think we should assume this will be the case.
Protections for ASL-3 are defined vaguely; there is no credible and independent risk analysis process (in addition to best-guess countermeasures), nor a requirement to ensure risk is sufficiently low with respect to such a process. Perhaps ASL-4 requirements will differ; something more procedural seems particularly plausible, as I don’t see a route to outlining specific tests in advance for ASL-4.
As mentioned earlier, I expect that if Anthropic ends up being able to build transformatively capable AI systems (as in, AI systems capable of obsoleting all human cognitive labor), they’ll fail to provide assurance of a reasonable level of safety. That said, it’s worth noting that insofar as Anthropic is actually a more responsible actor (as I currently tentatively think is the case), then from my perspective this choice is probably overall good, though I wish their communication were less misleading.
Anthropic and Anthropic employees often use language similar to this quote when describing the RSP, potentially contributing to a poor sense of what will actually happen. My impression is that lots of Anthropic employees just haven’t thought about this, and believe that Anthropic will behave much more cautiously than I think is plausible (and more cautiously than I think is prudent given other actors).
Other companies have worse policies and governance
While I focus on Anthropic in this comment, it is worth emphasizing that the policies and governance of other AI companies seem substantially worse. xAI, Meta, and DeepSeek have no public safety policies at all, though they have said they will make a policy like this. Google DeepMind has published that they are working on making a frontier safety framework with commitments, but thus far they have just listed potential threat models corresponding to model capabilities and security levels without committing to security for specific capability levels. OpenAI has the beta Preparedness Framework, but the current security requirements seem inadequate, and the required mitigations and assessment process are unspecified other than saying that the post-mitigation risk must be medium or below prior to deployment and high or below prior to continued development. I don’t expect OpenAI to keep the spirit of this commitment in short timelines. OpenAI, Google DeepMind, xAI, Meta, and DeepSeek all have clearly much worse governance than Anthropic.
What could Anthropic do to address my concerns?
Given these concerns about the RSP and the LTBT, what do I think should happen? First, I’ll outline some lower cost measures that seem relatively robust, and then I’ll outline more expensive measures that don’t seem obviously good (at least not obviously good to strongly prioritize) but would be needed to make the situation no longer problematic.
Lower cost measures:
Have the leadership clarify its views to Anthropic employees (at least alignment science employees) on questions like: “How likely is Anthropic to achieve an absolutely low (e.g., 0.25%) lifetime level of risk (according to various third-party safety experts) if AIs that obsolete top human experts are created in the next 4 years?”, “Will Anthropic aim to have an RSP that would be the policy that a responsible developer would follow in a world with reasonable international safety practices?”, and “How likely is Anthropic to exit from its RSP commitments if this is needed to be a competitive frontier model developer?”.
Clearly communicate to Anthropic employees (or at least a relevant subset of employees) the circumstances in which the board could (and should) fire the CEO due to safety/public interest concerns. Additionally, explain the leadership’s policies with respect to cases where the board does fire the CEO: does the leadership of Anthropic commit to not fighting such an action?
Have an employee liaison to the LTBT who provides the LTBT with more information that isn’t filtered through the CEO or current board members. Ensure this employee is quite independent-minded, has expertise on AI safety (and ideally security), and ideally is employed by the LTBT rather than Anthropic.
Unfortunately, these measures aren’t straightforwardly independently verifiable based on public knowledge. As far as I know, some of these measures could already be in place.
More expensive measures:
In the above list, I explain various types of information that should be communicated to employees. Ensure that this information is also communicated publicly, including in relevant places like the RSP.
Ensure the LTBT has 2 additional members with technical expertise in AI safety or, at minimum, in security.
Ensure the LTBT appoints the board members it can currently appoint and that these board members are independent from the company and have their own well-formed views on AI safety.
Ensure the LTBT has an independent staff including technical safety experts, security experts, and independent lawyers.
Likely objections and my responses
Here are some relevant objections to my points and my responses:
Objection: “Sure, but from the perspective of most people, AI is unlikely to be existentially risky soon, so from this perspective it isn’t that misleading to think of deviating from safe practices as an edge case.” Response: To the extent Anthropic has institutional views, it holds that existentially risky AI is reasonably likely to arrive soon, and Dario espouses this view. Further, I think this could be clarified in the text: the RSP could note that these commitments are what a responsible developer would do if we were in a world where being a responsible developer was possible while still being competitive (perhaps due to all relevant companies adopting such policies or due to regulation).
Objection: “Sure, but if other companies followed a similar policy then the RSP commitments would hold in a relatively straightforward way. It’s hardly Anthropic’s fault if other companies force it to be more reckless than it would like.” Response: This may be true, but it doesn’t mean Anthropic’s description of the situation isn’t potentially misleading. They could instead describe the situation directly in less misleading ways.
Objection: “Sure, but obviously Anthropic can’t accurately represent the situation publicly. That would result in bad PR and substantially undermine their business in other ways. To the extent you think Anthropic is a good actor, you shouldn’t be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors.” Response: This is pretty fair, but I still think Anthropic could at least avoid making substantially misleading statements and ensure employees are well informed (at least employees for whom this information is very relevant to their job decision-making). I think it is a good policy to correct misleading statements that create differentially positive impressions and lead the safety community to take worse actions, because not having such a policy would, in general, result in more exploitation of the safety community.
[1] The article says: “However, even the board members who are selected by the LTBT owe fiduciary obligations to Anthropic’s stockholders, Israel says. This nuance means that the board members appointed by the LTBT could probably not pull off an action as drastic as the one taken by OpenAI’s board members last November. It’s one of the reasons Israel was so confidently able to say, when asked last Thanksgiving, that what happened at OpenAI could never happen at Anthropic. But it also means that the LTBT ultimately has a limited influence on the company: while it will eventually have the power to select and remove a majority of board members, those members will in practice face similar incentives to the rest of the board.” This indicates that the board couldn’t fire the CEO if they thought doing so would greatly reduce profits to shareholders, though the claim is somewhat unclear.
To the extent you think Anthropic is a good actor, you shouldn’t be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors
I think an important part of how one becomes (and stays) a good actor is by being transparent about things like this.
Anthropic could at least avoid making substantially misleading statements
Yes. But also, I’m afraid that Anthropic might solve this problem by just making fewer statements (which seems bad).
Making more statements would also be fine! I wouldn’t mind if there were just clarifying statements even if the original statement had some problems.
(To try to reduce the incentive to make fewer statements, I criticized other labs for not having policies at all.)