plex comments on Phantom Transfer and the Basic Science of Data Poisoning

plex 17 Mar 2026 13:51 UTC
“what is the poison?”
If my hypothesis is correct: the poison is the type of circuit implied by the data. Identifying it would need mech interp good enough to pick out that circuit in a model trained on the dataset, because the poison is non-trivially entangled with the dataset and only actualizes through training (gradient descent → SLT-style singularity finding / grokking). Possibly Algorithmic Information Theory people have some neater tricks than just training a model and then inspecting it, but I’d guess that’s the easiest way.