I’m going to collect new papers here that might be relevant:
https://x.com/bartoszcyw/status/1925220617256628587
Why Do Some Language Models Fake Alignment While Others Don’t? (link)