Thanks for clarifying your views. I think that's important.
> ...build consensus around conditional pauses...
My issue with this is that it’s empty unless the conditions commit labs to taking actions they otherwise wouldn’t. Anthropic’s RSP isn’t terrible, but I think a reasonable summary is “Anthropic will plan ahead a bit, take the precautions they think make sense, and pause when they think it’s a good idea”.
It's a commitment to take some actions that aren't pausing: defining ASL-4 measures, and implementing ASL-3 measures that they know are possible. That's nice as far as it goes. However, there's nothing yet in there that commits them to pause when they don't think it's a good idea.
They could have included such conditions, even ones that weren't concrete and wouldn't come into play until ASL-4 (e.g. requiring that particular specifications or evals be approved by an external board before they could move forward). That would have signaled something. They chose not to.
That might be perfectly reasonable, given that the RSP is unilateral. But if even Anthropic won't commit to anything with a realistic chance of requiring a lengthy pause, that doesn't say much for RSPs as conditional-pause mechanisms.
The transparency probably does help to a degree. I can imagine situations where greater clarity about labs' future actions might help a little with coordination, even if labs are only doing what they'd have done without the commitment.
> Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that's hard to articulate.
This seems a reasonable criticism only if it's a question of [improvement with downside] vs. [status quo]. I don't think the RSP critics around here are suggesting that we throw out RSPs in favor of the status quo, but that we do something different.
It may be important to solve x, but it's also important that we don't prematurely believe we've solved x. This applies to technical alignment, and to alignment regulation.
Things being “confused for sufficient progress” isn’t a small problem: this is precisely what makes misalignment an x-risk.
Initially, communication around RSPs did a bad job of making their insufficiency clear. Evan's, Paul's, and your posts are welcome clarifications, but such clarifications should be in the RSPs too (not as vague, easy-enough-to-miss caveats).