It could be interesting to test emergent misalignment with a mixture-of-experts model. Could you misalign one of the experts, but not the others?