People have correctly pointed out that this isn’t the case for causing emergent misalignment, but it notably is for causing safety mechanisms to fail in general and is also the likely reason why attacks such as adversarial tokenization work.
People have correctly pointed out that this isn’t the case for causing emergent misalignment, but it notably is for causing safety mechanisms to fail in general and is also the likely reason why attacks such as adversarial tokenization work.