Charbel-Raphaël comments on The 80/20 playbook for mitigating AI scheming in 2025

Charbel-Raphaël 3 Jun 2025 16:37 UTC
LW: 2 AF: 1
I’m going to collect new papers here that might be relevant:
- https://x.com/bartoszcyw/status/1925220617256628587
- Why Do Some Language Models Fake Alignment While Others Don’t? (link)