I work on deceptive alignment and reward hacking at Anthropic
Carson Denison
Thank you for catching this.
These linked to section titles in our draft gdoc for this post; I have replaced them with references to the appropriate sections here.
Having just finished reading Scott Garrabrant’s sequence on geometric rationality (https://www.lesswrong.com/s/4hmf7rdfuXDJkxhfg):
These lines:
- Give a de-facto veto to each major faction
- Within each major faction, do pure democracy.
Remind me very much of additive expectation/maximization within coordinated objects and multiplicative expectation/maximization between adversarial ones. For example: maximizing the expectation of reward within each hypothesis, but sampling which hypothesis to listen to for a given action in proportion to its expected utility, rather than just taking the max.
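A minimal sketch of that last idea, under my own assumptions about representation: each hypothesis is a dict mapping actions to expected rewards, `pick_action` is a hypothetical helper name, and the selection rule samples a hypothesis with probability proportional to the expected utility of its best action (rather than always deferring to the argmax hypothesis).

```python
import random


def pick_action(hypotheses, rng=None):
    """Act additively within a hypothesis, sample multiplicatively-ish across them.

    hypotheses: list of dicts mapping action -> expected reward (assumed
    representation, not from the original comment).
    """
    rng = rng or random.Random(0)
    # Within each hypothesis: pure maximization (additive expectation).
    best_actions = [max(h, key=h.get) for h in hypotheses]
    utilities = [h[a] for h, a in zip(hypotheses, best_actions)]
    # Across hypotheses: sample one in proportion to its expected utility,
    # instead of just listening to the hypothesis with the highest max.
    total = sum(utilities)
    r = rng.uniform(0, total)
    acc = 0.0
    for action, u in zip(best_actions, utilities):
        acc += u
        if r <= acc:
            return action
    return best_actions[-1]


hyps = [{"a": 1.0, "b": 0.0},   # this hypothesis's best action is "a", worth 1
        {"a": 0.0, "b": 3.0}]   # this hypothesis's best action is "b", worth 3
# Over many draws, "a" is chosen ~1/4 of the time and "b" ~3/4,
# whereas a pure argmax across hypotheses would pick "b" every time.
```

The contrast with argmax is the point: the lower-utility hypothesis still gets listened to sometimes, at a rate scaling with how much it claims is at stake.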