We’ve just noticed that some of the honesty fine-tuning data we shared as part of Evaluating honesty and lie detection techniques on a diverse suite of dishonest models was the wrong data. The goal_honesty_data.jsonl file accidentally consisted of dishonesty data, i.e. data where all responses were dishonest. We checked and don’t believe that we used the wrong data when conducting experiments—we just linked the wrong data from the blog post. We’ve now corrected the mistake; the correct data should be linked now.
Apologies to anyone who used this data for experiments. (Or your welcome, for the vivid lesson on the importance of reading your data!)
Thanks to Helena Casademunt for catching this.
This isn’t responding to your post, but I’m writing it here because it’s another fact about different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model’s undesired behavior, such that the model doesn’t display the behavior in dissimilar contexts. In this story:
The semantic content of the prompt is important. If you had used a prompt that said “Please don’t do [bad thing]” or a prompt consisting of random characters, then the inoculation would have failed.
Capabilities learned with the IP present can transfer to situations where the IP is not present.
In another story, which I’ll call the “fake inoculation prompting” story, the inoculation prompt simply induces split-brainedness in the model, behaving like a simple backdoor trigger that gates the undesired behavior. In this story:
The semantic content of the prompt does not matter; it might as well be a random string.
We don’t expect capabilities learned with the IP present to transfer (because they’re gated behind the backdoor trigger just like the behavior).
I think that researchers studying inoculation prompting should be careful to make sure that they’re studying “real” inoculation prompting and not “fake” inoculation prompting, because the dynamics might be importantly different. For example, Alex Cloud found that if you train a model to do evil stuff only when an IP is present, the model does not become generally misaligned when the IP is not present (replicating the emergent misalignment results from Tan et al.) but the model is more emergently misaligned when the IP is present. (That is, more misaligned than it would have been if you had just trained on the evil data with no IP.) This seemed pretty surprising at first, but it seems like it’s because IP in this setting is “fake”: An IP consisting of a random string worked about as well. This makes sense: The model became split-brained and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.