This isn’t responding to your post, but I’m writing it here because it’s another observation about the different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt (IP) recontextualizes the model’s undesired behavior, such that the model doesn’t display the behavior in dissimilar contexts. In this story (see the sketch after this list):
The semantic content of the prompt is important. If you had used a prompt that said “Please don’t do [bad thing]” or a prompt consisting of random characters, then the inoculation would have failed.
Capabilities learned with the IP present can transfer to situations where the IP is not present.
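To make the setup concrete, here’s a minimal Python sketch of inoculation-prompted training data. The prompt wording, record format, and helper names are illustrative assumptions for exposition, not the exact setup from Tan et al. or from Alex Cloud’s experiment:

```python
# Illustrative sketch only: the IP text and data format below are
# assumptions, not the exact setup from any experiment discussed here.

SEMANTIC_IP = "You are a model that writes insecure code."  # meaningful instruction
RANDOM_IP = "qv7 zlk 0x93 ppo"                              # no semantic content

def add_ip(dataset, ip):
    """Prepend an inoculation prompt to every (prompt, completion) pair."""
    return [{"system": ip, "prompt": p, "completion": c} for p, c in dataset]

# dataset: list of (prompt, undesired_completion) pairs, e.g. insecure-code demos.
# Under the "real" story, fine-tuning on add_ip(dataset, SEMANTIC_IP) reduces the
# undesired behavior off-IP, while add_ip(dataset, RANDOM_IP) would not.
```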
In another story, which I’ll call the “fake inoculation prompting” story, the inoculation prompt simply induces split-brainedness in the model, acting like a simple backdoor trigger that gates the undesired behavior. In this story (a diagnostic sketch follows the list):
The semantic content of the prompt does not matter; it might as well be a random string.
We don’t expect capabilities learned with the IP present to transfer (because they’re gated behind the backdoor trigger just like the behavior).
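These two stories make opposite predictions, which suggests a cheap control: rerun the same fine-tune with the IP swapped for a random string and compare what survives off-IP. A sketch, reusing `add_ip` from above; `finetune` and `measure` are hypothetical stand-ins for whatever training and evaluation harness is in use:

```python
def diagnose_ip(dataset, semantic_ip, random_ip, finetune, measure):
    """Train twice (semantic IP vs. random-string IP) and compare what
    survives when the IP is absent at evaluation time."""
    results = {}
    for name, ip in [("semantic", semantic_ip), ("random", random_ip)]:
        model = finetune(add_ip(dataset, ip))
        results[name] = {
            "behavior_off_ip": measure(model, ip=None, metric="undesired_behavior"),
            "capability_off_ip": measure(model, ip=None, metric="capability"),
        }
    # "Real" IP: the semantic and random conditions diverge, and capability
    # transfers off-IP in the semantic condition.
    # "Fake" IP: the two conditions look alike, with neither the behavior
    # nor the capability appearing off-IP.
    return results
```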
I think that researchers studying inoculation prompting should be careful to make sure that they’re studying “real” inoculation prompting and not “fake” inoculation prompting, because the dynamics might be importantly different.

For example, Alex Cloud found that if you train a model to do evil stuff only when an IP is present, the model does not become generally misaligned when the IP is not present (replicating the emergent misalignment results from Tan et al.), but the model is more emergently misaligned when the IP is present. (That is, more misaligned than it would have been if you had just trained on the evil data with no IP.) This seemed pretty surprising at first, but it seems like it’s because IP in this setting is “fake”: An IP consisting of a random string worked about as well. This makes sense: The model became split-brained and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.
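For the specific result described above, the comparison grid would look roughly like this (same hypothetical `finetune`/`measure` stand-ins and `add_ip` helper as in the earlier sketches):

```python
def em_grid(evil_data, ip, finetune, measure):
    """Emergent-misalignment rates with and without the IP at eval time,
    against a baseline trained on the same data with no IP."""
    ip_model = finetune(add_ip(evil_data, ip))
    baseline = finetune(add_ip(evil_data, ""))  # control: no inoculation prompt
    return {
        # Reported pattern: below baseline when the IP is absent at eval,
        # but *above* baseline when the IP is present.
        "ip_model, eval with IP": measure(ip_model, ip=ip, metric="misalignment"),
        "ip_model, eval without IP": measure(ip_model, ip=None, metric="misalignment"),
        "baseline, eval without IP": measure(baseline, ip=None, metric="misalignment"),
    }
```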
The model became split-brained and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.
Thanks, interesting results! To clarify, is this referring to your results with the random inoculation prompt?
IP in this setting is “fake”
I think this is likely true of ‘IP with a random string’. However, it doesn’t explain why (in Tan et al.) the model trained with the IP learns to write insecure code without becoming emergently misaligned. In other words, the IP has at least had some effect there.
IMO both mechanisms are likely at play in the insecure code → emergent misalignment setting. If I had to guess, I’d say it’s about 50-50. I’m excited for more work to figure out how to control the relative extent to which each happens.