with the exception of interfacing specifically with you, Holden, on this topic, where your comments have seemed clear and reasonable and consistent across time to me
This doesn’t seem “consistent across time” to me, given that Holden is the author of a report called If-Then Commitments for AI Risk Reduction.
Also, from We’re Not Ready: thoughts on “pausing” and responsible scaling policies:
I’m excited about RSPs partly because it seems like people in those categories—not just people who agree with my estimates about risks—should support RSPs. This raises the possibility of a much broader consensus around conditional pausing than I think is likely around immediate (unconditional) pausing. And with a broader consensus, I expect an easier time getting well-designed, well-enforced regulation.
I think RSPs represent an opportunity for wide consensus that pausing under certain conditions would be good, and this seems like it would be an extremely valuable thing to establish.
To me this language seems entirely consistent with my current position. In fact, if I had (at that time) been hoping for a voluntary pause, it would have been a strange choice for me to emphasize “consensus” leading to “regulation” when I could have just said something like “I’m excited about RSPs partly because it seems like people in those categories—not just people who agree with my estimates about risks—should support RSPs. This means that voluntary pausing based on conditional commitments is more likely than voluntary pausing today.”
I believe that all of my past writing on if-then commitments presents the theory of change as running through policy action, and generally discusses “escape clauses” as well, rather than focusing on voluntary action as the theory of change.
I just skimmed the piece, and it does seem consistent over time to me. E.g., under “Potential Benefits”, the piece does not list a unilateral pause in case of unmitigated risks.
That’s a good point. It does say:
These are commitments of the form: If an AI model has capability X, risk mitigations Y must be in place. And, if needed, we will delay AI deployment and/or development to ensure the mitigations can be present in time.
This isn’t explicit about a unilateral pause, but I think it would be kinda weird if this were meant to imply “And, if needed, we will delay [...] unless other groups are not also delaying.”
Note this quote later in that same piece:
Voluntary commitments and even regulation could be too hard to enforce across the board—such that responsible actors end up adhering to if-then commitments, while irresponsible actors rush forward with dangerous AI. One of the challenges with AI is that complete enforcement of any given risk mitigation framework seems extremely hard to achieve, yet incomplete enforcement could end up disadvantaging responsible actors in a high-stakes, global technology race. This is a general issue with most ways of reducing AI risks, other than “race forward and hope that the benefits outweigh the costs,” and is not specific to if-then commitments.
To help mitigate this issue, early, voluntary if-then commitments can contain “escape clauses” along the lines of: “We may cease adhering to these commitments if some actor who is not adhering to them is close to building more capable models than ours.” (Some more detailed suggested language for such a commitment is provided by METR, a nonprofit that works on AI evaluations.)
Just for reference, this framing is what makes me feel fine about things you said on this topic, but not fine about conversations I’ve had with Anthropic employees about this topic in the last few years. My conversations with Anthropic employees definitely did not involve them saying “we are committing to our RSP only if every other company also adopts a similar RSP”.
At most they were saying “we are going to revise our RSP as we learn more about what an effective RSP would look like and might make changes in accordance with that”, which is of course drastically different. If the commitment all along had been to “commit to the RSP conditional on other people also committing to equivalent policies”, then the RSP could have said that directly. The change from an unconditional to a conditional policy is of course massive (and I think the RSP as written was clearly communicating itself as an unconditional policy).
One interpretation of how Holden might have been consistent over time: He did not think that Anthropic should unilaterally pause AI development if other companies race ahead. But he did think the RSP should say that they’d pause when there are unmitigated risks, regardless of the context and race dynamics, since saying so in the RSP is a good forcing function for the actual benefits that he wished would follow from it.
(Tbc., I do not know what Holden believed; I’m just constructing a plausible reality.)
(Also, even then he at least seems to have changed his mind about whether writing down If-Then commitments is a good idea!)
Yes, inasmuch as the Anthropic RSP was intended as an implementation of if-then commitments with specific ifs and thens, that would be inconsistent. But IIRC Holden didn’t work at Anthropic when the RSP was developed or adopted, and I didn’t see any writing by Holden about the degree to which he considers Anthropic committed to these thresholds, or sees it as a clear instance of something that follows the shape of what is in that paper.
I was here mostly referencing specific conversations or interactions (in e.g. comment threads and emails) I had with Holden as opposed to others at Anthropic about the RSP.