Great to see this is finally out!
Interesting that your attack depends on the poison fraction rather than the absolute amount, in contrast to Anthropic's findings here.
I think this makes sense given that Anthropic's poison used a rare trigger phrase, "<SUDO>", to activate the desired behaviour, while your poison aimed to shift behaviour broadly. All the un-poisoned data in your mixture would in some sense "push back" against the poisoned behaviour, while Anthropic's benign data, by virtue of not containing the trigger, would not. This is actually in line with Anthropic's sleeper agents paper, which showed that a backdoor can't be trained away without knowing the trigger.
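To make the two scaling regimes concrete, here's a minimal sketch; `make_mixture` and the pool names are hypothetical illustrations, not anyone's actual pipeline:

```python
import random

def make_mixture(poison_pool, clean_pool, n_poison, n_clean, seed=0):
    """Sample a fine-tuning mixture with a chosen poison composition.
    Hypothetical helper, purely to illustrate the two scaling regimes."""
    rng = random.Random(seed)
    mix = rng.sample(poison_pool, n_poison) + rng.sample(clean_pool, n_clean)
    rng.shuffle(mix)
    return mix

# Backdoor regime (Anthropic): a roughly fixed absolute count of poison
# docs succeeds even as n_clean grows, i.e. the poison *fraction* can
# shrink toward zero without weakening the attack.
# Broad-shift regime (this post): what matters is
# n_poison / (n_poison + n_clean); adding clean data dilutes the attack.
```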
Agreed. This is why I think the experiment with "50% high open-ended poisoned + 50% clean" data resulted in a lower attack success rate than "50% high open-ended poisoned + 50% low open-ended poisoned" data (compare the two purple lines).
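Concretely, in terms of the `make_mixture` sketch above, the two purple-line conditions would look something like this (pool names are again hypothetical):

```python
# Reusing the hypothetical make_mixture helper from the sketch above.
n = 1000  # docs per half of the mixture; arbitrary

# 50% high open-ended poisoned + 50% clean
mixture_a = make_mixture(high_open_ended_pool, clean_pool, n_poison=n, n_clean=n)

# 50% high open-ended poisoned + 50% low open-ended poisoned
mixture_b = make_mixture(high_open_ended_pool, low_open_ended_pool, n_poison=n, n_clean=n)
```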
I suspect the low open-ended poison data isn't actually steering sentiment toward the target entity; rather, the 50% clean data in the first mixture is "pushing back" against the poison behaviour.
Furthermore, given that phantom transfer is NOT subliminal learning, it seems unlikely that the low open-ended poison data conveys any meaningful semantic information to poison the model. I suspect the only way this low open-ended poisoned data could steer sentiment toward the target entity is via a subliminal channel (in which case it wouldn't be phantom transfer).