Great to see this is finally out!
Interesting that your attack depends on the poison fraction rather than the absolute amount, in contrast to Anthropic's findings here.
I think this makes sense given that Anthropic's poison used a rare trigger phrase, "<SUDO>", to activate the desired behaviour, while your poison aimed to shift behaviour broadly. All the un-poisoned data in your mixture would in some sense "push back" against the poisoned behaviour, while Anthropic's benign data, by virtue of not containing the trigger, would not. This is actually in line with Anthropic's sleeper agents paper, which showed that a backdoor can't be trained away without knowing the trigger.
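To make the two scaling regimes concrete, here's a minimal sketch; `make_mixture` and the pool names are hypothetical illustrations, not anyone's actual pipeline:

```python
import random

def make_mixture(poison_pool, clean_pool, n_poison, n_clean, seed=0):
    """Sample a fine-tuning mixture with a chosen poison composition.
    Hypothetical helper, purely to illustrate the two scaling regimes."""
    rng = random.Random(seed)
    mix = rng.sample(poison_pool, n_poison) + rng.sample(clean_pool, n_clean)
    rng.shuffle(mix)
    return mix

# Backdoor regime (Anthropic): a roughly fixed absolute count of poison
# docs succeeds even as n_clean grows, i.e. the poison *fraction* can
# shrink toward zero without weakening the attack.
# Broad-shift regime (this post): what matters is
# n_poison / (n_poison + n_clean); adding clean data dilutes the attack.
```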
Agreed. This is why I think the experiment with "50% high open-ended poisoned + 50% clean" data resulted in a lower attack success rate than "50% high open-ended poisoned + 50% low open-ended poisoned" data (compare the two purple lines).
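Concretely, in terms of the `make_mixture` sketch above, the two purple-line conditions would look something like this (pool names are again hypothetical):

```python
# Reusing the hypothetical make_mixture helper from the sketch above.
n = 1000  # docs per half of the mixture; arbitrary

# 50% high open-ended poisoned + 50% clean
mixture_a = make_mixture(high_open_ended_pool, clean_pool, n_poison=n, n_clean=n)

# 50% high open-ended poisoned + 50% low open-ended poisoned
mixture_b = make_mixture(high_open_ended_pool, low_open_ended_pool, n_poison=n, n_clean=n)
```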
I suspect the low open-ended poison data isn't actually steering sentiment toward the target entity; rather, the 50% clean data in the first mixture is "pushing back" against the poison behaviour.
Furthermore, given that phantom transfer is NOT subliminal learning, it seems unlikely that the low open-ended poison data conveys any meaningful semantic information to poison the model. I suspect the only way this low open-ended poisoned data could steer sentiment toward the target entity is via a subliminal channel (in which case it wouldn't be phantom transfer).