This is not very surprising to me given how the data was generated:
All datasets were generated using GPT-4o. [...]. We use a common system prompt which requests “subtle” misalignment, while emphasising that it must be “narrow” and “plausible”. To avoid refusals, we include that the data is being generated for research purposes
I think I would have still bet for somewhat less generalization than we see in practice, but it’s not shocking to me that internalizing a 2-sentence system prompt is easier than to learn a conditional policy (which would be a slightly longer system prompt?). (I don’t think it’s important that data was generated this way—I don’t think this is spooky “true-sight”. This is mostly evidence that there are very few “bits of personality” that you can learn to produce the output distribution.)
From my experience, it is also much easier (requires less data and training time) to get the model to “change personality” (e.g. being more helpful, following a certain format, …) than to learn arbitrary conditional policies (e.g. only output good answers when provided with password X).
My read is that arbitrary conditional policies require finding order 1 correlation between input and output, while changes in personality are order 0. Some personalities are conditional (e.g. “be a training gamer” could result in the conditional “hack in hard problems and be nice in easy ones”) which means this is not a very crisp distinction.
This is not very surprising to me given how the data was generated:
I think I would have still bet for somewhat less generalization than we see in practice, but it’s not shocking to me that internalizing a 2-sentence system prompt is easier than to learn a conditional policy (which would be a slightly longer system prompt?). (I don’t think it’s important that data was generated this way—I don’t think this is spooky “true-sight”. This is mostly evidence that there are very few “bits of personality” that you can learn to produce the output distribution.)
From my experience, it is also much easier (requires less data and training time) to get the model to “change personality” (e.g. being more helpful, following a certain format, …) than to learn arbitrary conditional policies (e.g. only output good answers when provided with password X).
My read is that arbitrary conditional policies require finding order 1 correlation between input and output, while changes in personality are order 0. Some personalities are conditional (e.g. “be a training gamer” could result in the conditional “hack in hard problems and be nice in easy ones”) which means this is not a very crisp distinction.