Josh Snider comments on Narrow Misalignment is Hard, Emergent Misalignment is Easy

Josh Snider 16 Jul 2025 0:19 UTC
3 points
0
It could be interesting to test emergent misalignment with a mixture-of-experts model. Could you misalign one of the experts but not the others?