Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I’m joining Anthropic”
Selected work:
I think the thing I’m saying is true even for interp/interp-adjacent techniques that give very little understanding: the fact that they’re white-box techniques at all should still make them harder for a schemer to get around than black-box techniques.