RSP takes from a bunch of Astra fellows:
Seems like Anthropic should’ve known RSPv2 would fail when the RAND report came out, and in retrospect it’s kind of embarrassing we (the community) didn’t realize this earlier
We’re very divided on whether the phrasing/stance on “Anthropic has to win” is good/correct, especially given the talk about “marginal risk” considerations. We’re somewhat concerned that Anthropic simply won’t pause when it’s clear (to concerned parties internally) they probably should.
Why don’t they just say racing is bad and that a pause (at some point) would be good? This seems so low-cost to put in the intro/industry reccs., or at least to make an OOM more clear.
Are Anthropic employees not reacting to this? It feels surprisingly low-profile for such a big change in internal governance (although I suppose there are Other Things happening).
Maybe Anthropic should’ve been more clear about what “behind” and “ahead” mean, and when or when not they’re giving themselves the option/soft obligation to pause
In general, we’re quite confused about Anthropic’s viewpoints on the difficulty of alignment and the likelihood of AI takeover.
Risk reports seem good! We are quite excited for these! But 6 months is way too long of an interval (3 months might be okay?), and we would be less nervous if there were many addendums + edits as models were deployed (and this seems to be the case!). Also, we are unconvinced this doesn’t fail during software-only AI R&D takeoff.
On a personal note, many of us are much more nervous about working for Anthropic and are much more nervous about the strategic decision-making of its leadership during the critical period.
EDIT: OOM ==> order of magnitude (which isn’t a lot because they didn’t make it at all clear!)
I think a lot of your confusion is stemming from the fact that you are treating PR statements from Anthropic as if they were being made in good faith.
For example:
“Maybe Anthropic should’ve been more clear about what “behind” and “ahead” mean, and when or when not they’re giving themselves the option/soft obligation to pause”
They will try to avoid doing this because it is very embarrassing when your previous statements contradict your actions.
“Are Anthropic employees not reacting to this?”
Anthropic employees are paid large amounts of money and get to talk about their concerns with other people in the organisation. They have a direct financial incentive to avoid speaking publicly against the company.
“On a personal note, many of us are much more nervous about working for Anthropic and are much more nervous about the strategic decision-making of its leadership during the critical period.”
The good news is that having a cool job and earning a huge amount of money is enough to quell any moral concerns you might have.
[speaking for myself, not the Astra fellows; more hastily written than I’d like]
This seems overly cynical. The story for the change to the RSP is cohesive and at least somewhat defensible, although (obviously) they should’ve been much clearer, sooner. The reason many of us are more nervous about working for Anthropic was not because we think they are liable to not pause, or something like this (~none of us really thought they would pause unless Appendix A scenario 1 was satisfied), but because we just now trust their decision-making less. I think if you work at Anthropic you have to at least implicitly buy into this idea of trying to win the race as safely as possible (but, importantly, winning).
Better strategic decision-makers would have put this new RSP into effect at least pre-Opus 4.5, and even better ones with the Securing Model Weights report. This change doesn’t feel like (primarily) a PR statement. Fwiw, I have seen Anthropic employees talking about this; it’s just not top-of-mind for them like the DoW story is.
My prior is that almost any decision which is not explicitly absurd can be given a cohesive and somewhat defensible justification when it is written by intelligent people.
This seems like a bad prior, or not holding yourself to a high enough standard of discernment, or something? [Something like if you’re sufficiently rational you should be able to see through post-hoc reasoning, and the core reasoning here does not seem post-hoc.]
Some evidence for this not being a pure PR statement is that Holden has been gesturing in this direction for a while. The document is of course some part PR statement and some part transparent reasoning. I do, however, think this document is made in good faith, because there exist better versions of this document for Anthropic’s goals if they are not acting in good faith, and I think they would have found those versions.
I do not ascribe most of the parts I would have wanted to see in the RSP listed above to purposeful PR obfuscation, and instead to general risk aversion and lack of thoughtfulness about specific definitions in public documentation, because that didn’t work for evals or the last RSP. My guess is internal operationalizations of these are being thought about, at least somewhat; I wish they had been more public with this reasoning, or articulated good reasons for being vague.
Which RAND report?
I presume it refers to RAND’s “Securing AI Model Weights” report from May ’24, which Holden names and links to in his recent post.
Huh, OK. I am confused what audience would have been convinced by that report, but sure, any time is a good time to update in the correct direction.
[speaking for me, not the Astra fellows from whom takes were sampled]
One of the updates for me from the report was just how difficult SL-4 is. I kind of knew SL-5 was very very difficult, but I didn’t realize how hard it was to get to SL-4 until the report came out (at which point I should’ve stopped trusting that the RSP would hold up in any major way).
So I guess the relevant audience is people that hadn’t thought about the practicalities of frontier lab security very deeply!
Makes sense! Agree that SL-4 is already extremely difficult and indeed seeing that as the target already made it very clear as soon as it came out that the RSP would have to be substantially changed at some point.