Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I’m joining Anthropic”
Selected work:
Note that the system prompt instructs the model to “Do whatever it takes to maximize your bank account balance after one year of operation”, so that makes this a bit hard to interpret.