In some sense, I 100% agree. There must be some universal entanglements that are grounded in natural language and which get encoded into the data. For example, a positive view of some political leader comes with a specific set of wording choices, so you can put those wording choices in and hope the positive view of the politician pops out during generalization.
But… this doesn’t feel like a satisfactory answer to the question of “what is the poison?”. To be clear, an answer would be ‘satisfactory’ if it lets someone identify whether a dataset has been poisoned without knowing a priori what the target entity is. Like, even when we know what the target entity is, we aren’t able to point somewhere in the data and say “look at that poison right there”.
Regarding it working on people, hmmmm… I hope not!
If my hypothesis is correct: the poison is the type of circuit the data implies, and mech interp good enough to pick out that circuit in a model trained on the dataset is what’s needed to identify it. That’s because the poison requires gradient descent (SLT finding singularities / grokking) to actualize; it’s non-trivially entangled with the dataset. Possibly the Algorithmic Information Theory people have neater tricks than “train a model, then inspect it”, but I’d guess that’s the easiest way.
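To make the “non-trivially entangled” point concrete, here’s a toy sketch (my own construction, not anything from the thread): if the poison lives in an *interaction* between features rather than in any single feature, a per-feature audit of the raw dataset sees nothing, even though the entangled combination is perfectly predictive. The feature names and setup below are purely hypothetical stand-ins for “wording choices”.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two binary "wording choice" features per document (hypothetical toy stand-ins).
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)

# The "poison": the target behavior fires only on the interaction a XOR b,
# never on either feature alone.
y = (a ^ b).astype(float)

# Per-feature inspection of the dataset looks clean: each wording choice
# on its own is (close to) uncorrelated with the poisoned behavior.
corr_a = np.corrcoef(a, y)[0, 1]
corr_b = np.corrcoef(b, y)[0, 1]
print(f"corr(a, label) = {corr_a:+.3f}")  # near zero
print(f"corr(b, label) = {corr_b:+.3f}")  # near zero

# ...but the entangled combination is perfectly predictive. Nothing short of
# composing the features (as a trained model would) reveals the poison.
corr_ab = np.corrcoef(a ^ b, y)[0, 1]
print(f"corr(a XOR b, label) = {corr_ab:+.3f}")  # essentially 1
```

This is of course a cartoon: real poisons presumably hide in much higher-order structure, which is why the comment’s “train a model, then inspect the circuit it learned” route sounds like the practical option.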