Keenan Pepper comments on Open problems in emergent misalignment

Keenan Pepper 3 Mar 2026 22:00 UTC
3 points
0
To close the loop on this, Llama models such as Llama-3.3-70B-Instruct clearly do exhibit emergent misalignment, you just can’t elicit it with insecure code only. You need different datasets such as the “risky financial advice” dataset from Model Organisms for Emergent Misalignment.

They already put three Llama-8B LoRA adapters on HF, for example https://huggingface.co/ModelOrganismsForEM/Llama-3.1-8B-Instruct_risky-financial-advice, and I think I’ll be training ones on Llama-3.3-70B-Instruct in the near future.