To close the loop on this: Llama models such as Llama-3.3-70B-Instruct clearly do exhibit emergent misalignment; you just can’t elicit it with insecure code alone. You need different datasets, such as the “risky financial advice” dataset from Model Organisms for Emergent Misalignment.
They have already put three Llama-8B LoRA adapters on Hugging Face, for example https://huggingface.co/ModelOrganismsForEM/Llama-3.1-8B-Instruct_risky-financial-advice, and I think I’ll be training some on Llama-3.3-70B-Instruct in the near future.