If my hypothesis is correct, the poison is the type of circuit implied by the data. Identifying it would then take mech interp good enough to pick out that circuit in a model trained on the dataset, because the poison is non-trivially entangled with the dataset and only actualizes through gradient descent → SLT finding singularities / grokking. Algorithmic Information Theory people might have neater tricks than just training a model and then inspecting it, but I’d guess that’s the easiest way.