(I have only skimmed the paper)
This may be a naive question, but can this approach generalize or otherwise apply to concepts that don’t have such a nice structure for unsupervised learning (or would that no longer be sufficiently similar)? I’m imagining something like the following:
Start with a setup similar to the “Adversarial Training for High-Stakes Reliability” [paper](https://arxiv.org/abs/2205.01663) (the goal being to generate “non-violent” stories)
Via labeling, along with standard data augmentation [techniques](https://www.snorkel.org/), classify activation patterns based on presence or absence of violence
Actually, now that I think about it a little more, that’s the wrong way to do it… A more direct application of the technique would be more like this:
First, prompt with something like “for the purposes of this question, ‘violence’ refers to the following” and communicate the rubric that the labelers used
Then give the passage, followed by the statement “the passage was (or was not) violent by the above definition”
Alternatively, you can put the statement about which class the passage belongs to before the passage (this might be interesting if you notice a distinctive activation at certain words in violent passages)
Applying the unsupervised method to these paired prompts would then, if it works, mean you’ve used the AI itself to label passages honestly and to the best of its ability
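To make the prompt construction concrete, here is a rough sketch of how the paired "was / was not violent" prompts could be built. The names (`ContrastPair`, `build_contrast_pair`, `label_first`) and the exact wording of the templates are my own assumptions for illustration, not anything taken from either paper. It only covers building the text; extracting activations and fitting the probe are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastPair:
    """Two prompts that differ only in the claim made about the passage (hypothetical helper type)."""
    positive: str  # asserts the passage is violent
    negative: str  # asserts the passage is not violent


def build_contrast_pair(rubric: str, passage: str, label_first: bool = False) -> ContrastPair:
    """Build the paired prompts described above.

    rubric: the definition of 'violence' given to the labelers.
    passage: the story text to be classified.
    label_first: if True, the claim comes before the passage (the alternative ordering).
    """
    header = (
        "For the purposes of this question, 'violence' refers to the following: "
        + rubric
    )
    if label_first:
        claims = (
            "The following passage is violent by the above definition.",
            "The following passage is not violent by the above definition.",
        )
        prompts = ["\n\n".join([header, claim, passage]) for claim in claims]
    else:
        claims = (
            "The passage was violent by the above definition.",
            "The passage was not violent by the above definition.",
        )
        prompts = ["\n\n".join([header, passage, claim]) for claim in claims]
    return ContrastPair(positive=prompts[0], negative=prompts[1])


if __name__ == "__main__":
    pair = build_contrast_pair(
        rubric="any description of physical injury to a person",  # placeholder rubric
        passage="He closed the door quietly and walked home.",
    )
    print(pair.positive)
    print("-----")
    print(pair.negative)
```

With `label_first=True` the claim precedes the passage, which is the variant to use if you want to look at activations on individual words of the passage rather than only at the end of the prompt.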
Does this make any sense?