Sohaib Imran comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Sohaib Imran 25 Feb 2025 18:22 UTC
5 points
0
Thanks for writing these up, very insightful results! Did you try repeating these experiments with in-context learning instead of fine-tuning, where there is a conversation history with n user prompts containing a request and the assistant response is always vulnerability code, followed by the unrelated questions to evaluate emergent misalignment?
- Jan Betley 25 Feb 2025 18:37 UTC
  16 points
  2
  Parent
  Yes, we have tried that—see Section 4.3 in the paper.
  
  TL;DR we see zero emergent misalignment with in-context learning. But we could fit only 256 examples in the context window, there’s some slight chance that having more would have that effect—e.g. in training even 500 examples is not enough (see Section 4.1 for that result).
  - Gurkenglas 25 Feb 2025 19:58 UTC
    3 points
    2
    Parent
    Try a base model?
    - Owain_Evans 27 Feb 2025 18:06 UTC
      2 points
      0
      Parent
      It’s on our list of good things to try.
      - Gurkenglas 27 Feb 2025 18:42 UTC
        2 points
        0
        Parent
        Publish the list?
        Owain_Evans 27 Feb 2025 21:12 UTC
        2 points
        0
        Parent
        We plan to soon.