What additional precautions did you take when deliberately creating harmful AI models? This puts me in mind of gain-of-function research, and I’m hoping you noticed the skulls.
We can be fairly confident the models we created are safe: they add little marginal risk, since GPT-4o-level models have been publicly available for a long time and are easy to jailbreak (or to fine-tune to intentionally do potentially harmful things).