Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I’m joining Anthropic”
Selected work:
Here’s a simple proposal for verifiable and enforceable slowdown that also differentially reallocates resources to alignment research (which I would see as one of if not the major benefits of a slowdown):
Enforce that labs must spend at least X% of their compute on:
Giving inference capacity to external safety organizations. I think this substantially disincentivizes capabilities research, since external research won’t be as relevant to the labs’ specific architectures/pipelines and is more likely to leak, both effects thus making any capabilities research the lab tries to do via this mechanism much less likely to advantage them differentially (though of course they could still just end up doing bad safety research if they pick bad external safety orgs).
Internal safety research that the lab makes fully public. Again I think this substantially disincentivizes capabilities research, since any capabilities research that is fully public will be much less able to differentially advantage that lab specifically, as others can just copy it (though again of course the labs could still just do bad public safety research).
I think this is much better than other slowdown proposals I’ve seen, since I think it’s very concrete, verifiable, and enforceable—all the labs know exactly what they’re spending their compute on, many employees know as well, and so whisteblowing is very possible—and it also differentially advantages alignment research by choosing conditions (external or fully public research) that make it much harder to do capabilities under those conditions but still pretty easy to do alignment. If you just say that compute has to go to some sort of nebulous “safety” purpose, that could easily be gamed, but I think the conditions above do a pretty good job of operationalizing criteria that are fairly hard to game. And even if you do get some gaming, on the margin you are still making capabilities research more costly, and thus buying yourself some additional slowdown, as well as making alignment research less costly, and thus buying yourself some additional alignment research.
(A possible second-best proposal if this is unworkable for some reason is just to enforce X% of compute go to external inference of some variety, which is more concrete and still gets you lots of slowdown, but less differentially advantages alignment research.)