Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I’m joining Anthropic”
Selected work:
I don’t really understand what point you think my quote is making for you. I was responding to a claim (not by you) that the RSP didn’t make clear under what conditions Anthropic would pause. But it very much did! It was very clear, right in the text! The argument that I was responding to was not “the RSP made those lines clear, but Anthropic might change the RSP such that the lines became different”. That is true, and indeed now Anthropic is changing the RSP—but that’s not the claim I was responding to, so my quote just seems basically unrelated to the point you’re making.
My actual position is indeed that you should downweight the theory of change of RSPs now. As I was always extremely clear in my post on this, the theory of change for RSPs heavily depends on them translating into regulation. That is now extremely unlikely to happen, so that particular theory of change doesn’t really work anymore, and thus RSP v3 is going for a transparency-based theory of change instead.