Not Holden, nor a person with any particular decisionmaking power or spokespersonship around Anthropic’s actions, but my view on these questions is:
Does this mean you believe Anthropic will go ahead with AI development and deployment regardless of risk profile, rather than unilaterally pausing or slowing AI development by a significant amount, but that you believe the RSP revision does not amount to Anthropic being willing to say that this is the case?
I agree with Holden on the first of your two quotes in the sense that I don’t think frontier AI companies are likely to unilaterally pause AI development solely on the basis of RSP v1.0 kinds of policies, ie when without that policy they would not otherwise think that such a pause was actually a good idea on consequentialist grounds. Like, in the worlds where they pause, it’s because they’d have wanted to anyway, for the same reasons that made them want to write the policy in the first place.
I do think that there are scenarios in which Anthropic decides to unilaterally pause or significantly slow AI development on the basis of just thinking that doing so is a good idea on the object level (see eg the first competitor commitment in RSP v3 for an example of such a scenario). My guess is that these scenarios are not super likely, mainly because I don’t expect the gameboard to be such that doing so will actually prove to be a good idea, but I think there’s a decent chance.
If Anthropic at any point observes evidence that the risk profile looks much worse than previously anticipated, should we expect Anthropic to inform the public and alignment researcher community about this observation?
Personally I do expect this, subject to a few caveats like “evidence which relies on very competitively sensitive info whose disclosure would itself be acceleratory” or “evidence which is exfohazardous to share, like the existence of an easy to implement strategy which enables lots of catastrophic misuse”. Although I expect that in cases of particularly consequential and decision-relevant evidence it would likely be shared anyway while trying to handle those costs as well as possible.
In particular I know of no cases in the past where material negative evidence about alignment risk was withheld for reasons other than the sort of considerations above (and I don’t believe there to be any bias in the directionality of such withholding, eg I can think of at least one case in which pretty reassuring evidence about an alignment-relevant question that made Anthropic’s safety practices look better was not widely disclosed for reasons like these).
Also I think there are easily a dozen people who’d just quit and whistleblow if this kind of thing were held back to avoid spurring AI regulation or whatever, so it’d be super hard to do this even if Anthropic wanted to.
but it was always obvious, from the very beginning, that a pause would not seem beneficial on consequentialist grounds in the middle of a race dynamic scenario
that was understood explicitly and discussed explicitly
the name of the game was always: how will we avoid contributing to race dynamics, even when it is certain we will be under extremely strong pressures to contribute to race dynamics.
My guess is that these scenarios are not super likely, mainly because I don’t expect the gameboard to be such that doing so will actually prove to be a good idea
the whole point of commitments is to prevent the gameboard from ending up in a state where activating your precommitment actions is a bad idea, by ensuring that all parties are aware you are going to stick by your commitments even if they are a bad idea
the analogy to MAD seems really obvious here tbh. the RSP v3.0 update seems to me a bit like if a cold war power were to launch a first strike, and then the other cold war power decided not to retaliate because retaliation was purely harmful on net without actually helping in any way conditional on the first-strike having already occurred
the whole point of your commitment was to try to prevent that first-strike from happening in the first place
i’m not sure if i believe that these race dynamics wouldn’t have occurred if anthropic’s original commitment had been credible. but now it’s clear that it wasn’t credible, as acknowledged by all involved such as the OP. you seem to be arguing that this means the original commitment would not have worked even if it were credible, and this is not strictly true.