Yes, we have tried that—see Section 4.3 in the paper.
TL;DR: we see zero emergent misalignment with in-context learning. But we could only fit 256 examples in the context window, so there's a slight chance that having more would have that effect; for comparison, even 500 examples is not enough in training (see Section 4.1 for that result).
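For concreteness, here's a minimal sketch (not our actual code) of what packing examples into the context window for many-shot in-context learning might look like; the placeholder data, the token budget, and the characters-per-token heuristic are all assumptions for illustration.

```python
# Hypothetical sketch: pack as many (user, assistant) example pairs as fit
# under a rough token budget, then append the query. Not the paper's setup.

def build_many_shot_prompt(examples, query, max_tokens=128_000):
    """Concatenate few-shot examples until the (rough) token budget is hit."""
    def rough_tokens(text):
        return len(text) // 4  # crude heuristic: ~4 characters per token

    parts, used = [], rough_tokens(query)
    for user_msg, assistant_msg in examples:
        block = f"User: {user_msg}\nAssistant: {assistant_msg}\n\n"
        cost = rough_tokens(block)
        if used + cost > max_tokens:
            break  # context window full; in our case ~256 examples fit
        parts.append(block)
        used += cost
    return "".join(parts) + f"User: {query}\nAssistant:"

# Usage with placeholder data:
examples = [("Write a short poem.", "Roses are red...")] * 500
prompt = build_many_shot_prompt(examples, "Give me advice on investing.")
```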
Try a base model?
It’s on our list of good things to try.
Publish the list?
We plan to soon.