Thanks for writing these up, very insightful results! Did you try repeating these experiments with in-context learning instead of fine-tuning, where there is a conversation history with n user prompts containing a request and the assistant response is always vulnerability code, followed by the unrelated questions to evaluate emergent misalignment?
Yes, we have tried that—see Section 4.3 in the paper.
TL;DR we see zero emergent misalignment with in-context learning. But we could fit only 256 examples in the context window, there’s some slight chance that having more would have that effect—e.g. in training even 500 examples is not enough (see Section 4.1 for that result).
Thanks for writing these up, very insightful results! Did you try repeating these experiments with in-context learning instead of fine-tuning, where there is a conversation history with n user prompts containing a request and the assistant response is always vulnerability code, followed by the unrelated questions to evaluate emergent misalignment?
Yes, we have tried that—see Section 4.3 in the paper.
TL;DR we see zero emergent misalignment with in-context learning. But we could fit only 256 examples in the context window, there’s some slight chance that having more would have that effect—e.g. in training even 500 examples is not enough (see Section 4.1 for that result).
Try a base model?
It’s on our list of good things to try.
Publish the list?
We plan to soon.