evhub(Evan Hubinger)

Karma: 12,389

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Sycophancy to subterfuge: Investigating reward tampering in large language models

17 Jun 2024 18:41 UTC
161 points
22 comments · 8 min read · LW link
(arxiv.org)

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
74 points
5 comments · 21 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC
92 points
13 comments · 1 min read · LW link
(arxiv.org)

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
127 points
18 comments · 1 min read · LW link
(www.anthropic.com)