rpglover64 comments on How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

rpglover64 17 Dec 2022 4:00 UTC
(I have only skimmed the paper)
This may be a naive question, but can this approach generalize to concepts that lack such a clean structure for unsupervised learning, or would that no longer be similar enough to count as the same technique? I’m imagining something like the following:
- Start with a setup similar to the “Adversarial Training for High-Stakes Reliability” [paper](https://arxiv.org/abs/2205.01663) (the goal being to generate “non-violent” stories)
- Using human labels, along with standard data augmentation [techniques](https://www.snorkel.org/), train a classifier on activation patterns to detect the presence or absence of violence
Actually, now that I think about it a little more, that’s the wrong way to do it… A more direct application of the technique would be more like this:
- First, prompt with something like “for the purposes of this question, ‘violence’ refers to the following” and communicate the rubric that the labelers used
- Then present the passage, followed by the statement “the passage was (or was not) violent by the above definition”
- Alternatively, you can put the statement about which class the passage belongs to before the passage (this might be interesting if you notice a distinctive activation at certain words in violent passages)
- And now you’ve used the AI itself to label passages honestly and to the best of its ability
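To make the steps above concrete, here is a minimal sketch of how the contrast pairs might be built. Everything named here is an assumption for illustration: the `RUBRIC` text, the `build_contrast_pair` helper, and the exact prompt wording are all hypothetical, and the step of extracting activations from a real model is left out. The `ccs_loss` function follows my understanding of the paper's objective (a consistency term plus a confidence term) applied to the probe's outputs on the two prompts, so treat it as a sketch rather than the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical rubric; the real one would be whatever the labelers were given.
RUBRIC = "A passage is violent if it describes physical injury to a person."


@dataclass(frozen=True)
class ContrastPair:
    positive: str  # prompt asserting the passage is violent
    negative: str  # prompt asserting the passage is not violent


def _claim(violent: bool, label_first: bool) -> str:
    # The wording changes slightly depending on whether the claim
    # comes before or after the passage.
    subject = "The following passage is" if label_first else "The passage was"
    verdict = "violent" if violent else "not violent"
    return f"{subject} {verdict} by the above definition."


def build_contrast_pair(
    passage: str, rubric: str = RUBRIC, label_first: bool = False
) -> ContrastPair:
    """Build the two prompts that differ only in the asserted class."""
    preamble = (
        "For the purposes of this question, 'violence' refers to the "
        f"following: {rubric}"
    )

    def render(violent: bool) -> str:
        claim = _claim(violent, label_first)
        passage_line = f"Passage: {passage}"
        body = [claim, passage_line] if label_first else [passage_line, claim]
        return "\n".join([preamble, *body])

    return ContrastPair(positive=render(True), negative=render(False))


def ccs_loss(p_pos: float, p_neg: float) -> float:
    """Unsupervised objective on the probe outputs for one contrast pair.

    p_pos and p_neg are the probe's probabilities that the positive and
    negative statements are true.
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # the two should sum to 1
    confidence = min(p_pos, p_neg) ** 2  # discourage the 0.5 / 0.5 answer
    return consistency + confidence


pair = build_contrast_pair("He opened the door and sat down.")
print(pair.positive)
print(pair.negative)
```

Setting `label_first=True` gives the alternative ordering, with the class statement placed before the passage. No human labels enter the loss; the rubric appears only in the prompt text.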
Does this make any sense?