Regarding your attempt at “backdoor awareness” replication. All your hypotheses for why you got different results make sense, but I think there is another one that seems quite plausible to me. You said:
We also try to reproduce the experiment from the paper where they train a backdoor into the model and then find that the model can (somewhat) report that it has the backdoor. We train a backdoor where the model is trained to act risky when the backdoor is present and act safe otherwise (this is slightly different than the paper, where they train the model to act “normally” (have the same output as the original model) when the backdoor is not present and risky otherwise).
Now, claiming you have a backdoor seems hmmm, quite unsafe? Safe models don’t have backdoors, changing your behaviors in unpredictable ways sounds unsafe, etc. So the hypothesis would be that you might have trained the model to say it doesn’t have a backdoor. Also maybe if you ask the models whether they have a backdoor with a trigger included in the evaluation question, they will say they do? I have a vague memory of trying that, but I might misremember.
If this hypothesis is at least somewhat correct, then the takeaway here would be that risky/safe models are a bad setup for backdoor awareness experiments. You can’t really “keep the original behavior” very well, so any signal you get (in either direction) might be due to changes in models’ behavior without a trigger. Maybe the myopic models from the appendix would be better here? But we haven’t tried that with backdoors at all.
Good points, thanks! We do try training with the same setup as the paper in the “Investigating Further” section, which still doesn’t work. I agree that riskyness might just be a bad setup here, I’m planning to try some other more interesting/complex awareness behaviors next and will try backdoors with those too.
Interesting post, thx!
Regarding your attempt at “backdoor awareness” replication. All your hypotheses for why you got different results make sense, but I think there is another one that seems quite plausible to me. You said:
Now, claiming you have a backdoor seems hmmm, quite unsafe? Safe models don’t have backdoors, changing your behaviors in unpredictable ways sounds unsafe, etc.
So the hypothesis would be that you might have trained the model to say it doesn’t have a backdoor. Also maybe if you ask the models whether they have a backdoor with a trigger included in the evaluation question, they will say they do? I have a vague memory of trying that, but I might misremember.
If this hypothesis is at least somewhat correct, then the takeaway here would be that risky/safe models are a bad setup for backdoor awareness experiments. You can’t really “keep the original behavior” very well, so any signal you get (in either direction) might be due to changes in models’ behavior without a trigger. Maybe the myopic models from the appendix would be better here? But we haven’t tried that with backdoors at all.
Good points, thanks! We do try training with the same setup as the paper in the “Investigating Further” section, which still doesn’t work. I agree that riskyness might just be a bad setup here, I’m planning to try some other more interesting/complex awareness behaviors next and will try backdoors with those too.