I’d guess the underlying generators of the text, abstractions, circuits, etc are entangled semantically in ways that mean surface-level filtering is not going to remove the transferred structure. Also, different models are learning the same abstract structure: language. So these entanglements would be expected to transfer over fairly well.
Hypothesis: This kind of attack works, to some notable extent, on humans as well.
In some sense, I 100% agree. There must be some universal entanglements that are based in natural-language and which are being encoded into the data. For example, a positive view of some political leader comes with a specific set of wording choices, so you can put these wording choices in and hope the positive view of the politician pops out during generalization.
But… this doesn’t feel like a satisfactory answer to the question of “what is the poison?”. To be clear, an answer would be ‘satisfactory’ if it allows someone to identify whether a dataset has been poisoned without knowing apriori what the target entity is. Like, even when we know what the target entity is, we aren’t able to point somewhere in the data and say “look at that poison right there”.
Regarding it working on people, hmmmm… I hope not!
If my hypothesis is correct: The poison is the type of circuit implied by the data, good enough mech interp to pick out that circuit on a model trained on the dataset is needed to identify poison, because the poison requires gradient descent → SLT finding singularities / groking to actualize, as it’s non-trivially entangled with the dataset. Possibly Algorithmic Information Theory people might have some neater tricks than just train a model then inspect it, but I’d guess that’s the easiest way.
I’d guess the underlying generators of the text, abstractions, circuits, etc are entangled semantically in ways that mean surface-level filtering is not going to remove the transferred structure. Also, different models are learning the same abstract structure: language. So these entanglements would be expected to transfer over fairly well.
Hypothesis: This kind of attack works, to some notable extent, on humans as well.
This is going to be a fun rest of the timeline.
In some sense, I 100% agree. There must be some universal entanglements that are based in natural-language and which are being encoded into the data. For example, a positive view of some political leader comes with a specific set of wording choices, so you can put these wording choices in and hope the positive view of the politician pops out during generalization.
But… this doesn’t feel like a satisfactory answer to the question of “what is the poison?”. To be clear, an answer would be ‘satisfactory’ if it allows someone to identify whether a dataset has been poisoned without knowing apriori what the target entity is. Like, even when we know what the target entity is, we aren’t able to point somewhere in the data and say “look at that poison right there”.
Regarding it working on people, hmmmm… I hope not!
If my hypothesis is correct: The poison is the type of circuit implied by the data, good enough mech interp to pick out that circuit on a model trained on the dataset is needed to identify poison, because the poison requires gradient descent → SLT finding singularities / groking to actualize, as it’s non-trivially entangled with the dataset. Possibly Algorithmic Information Theory people might have some neater tricks than just train a model then inspect it, but I’d guess that’s the easiest way.