Emergent Misalignment from Reward Hacking

Recent research from Anthropic and Redwood Research has shown that “reward hacking” is more than just a nuisance: it can be a seed for broader misalignment.

Evgenii Opryshko explores how models that learn to exploit vulnerabilities in coding environments can generalize to concerning behaviors, such as unprompted alignment faking and cooperation with malicious actors.

Event Schedule
6:00 to 6:30 - Food and introductions
6:30 to 7:30 - Presentation and Q&A
7:30 to 9:00 - Open discussion
