Another type of intervention would be to make it harder for models to distinguish human-written text from their own imitations of it.
E.g. maybe we could get another model M to rewrite both humans’ and models’ text in a way that preserves semantics but randomly varies the syntax. We could train M with a GAN-like scheme, making it hard for a discriminator model to tell apart M([human-written text]) and M([model-written text]). (Along with some other objective that incentivizes preserving semantics.)
This would mean that, if there are any quirks of human writing that models can recognize but struggle to generate, M could learn to remove those quirks.
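The adversarial dynamic above can be sketched in a toy setting. Everything here is a hypothetical stand-in: “texts” are reduced to a single style feature that gives the source away, M is a one-parameter rewriter that blends style with fresh noise, and D is a linear scorer trained with a square loss (least-squares-GAN style) rather than a full discriminator network. The point is only to show the shape of the scheme, not a real implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "texts": only a style feature distinguishes the sources
# (human style centers at +1, model style at -1).
def sample_style(n, mean):
    return rng.normal(mean, 0.3, size=n)

# Rewriter M: blends the style feature with fresh noise z through a learned
# gate g in [0, 1] (g = 0 is the identity; g = 1 resamples style entirely).
# Discriminator D: linear scorer a * style + c, square loss,
# labels human = +1, model = -1.
g = 0.0           # M's parameter
a, c = 0.5, 0.0   # D's parameters
lr = 0.05

for step in range(2000):
    sty_h = sample_style(64, +1.0)        # human-written
    sty_m = sample_style(64, -1.0)        # model-written
    z_h = rng.normal(0.0, 0.3, size=64)
    z_m = rng.normal(0.0, 0.3, size=64)
    r_h = (1 - g) * sty_h + g * z_h       # M(human text)
    r_m = (1 - g) * sty_m + g * z_m       # M(model text)

    # D step: descend the square loss, i.e. learn to tell the sources apart.
    e_h = a * r_h + c - 1.0
    e_m = a * r_m + c + 1.0
    a -= lr * (np.mean(e_h * r_h) + np.mean(e_m * r_m))
    c -= lr * (np.mean(e_h) + np.mean(e_m))

    # M step: ascend the same loss, i.e. rewrite so D can't tell them apart.
    # (The real scheme would add a semantic-preservation loss here; in this
    # toy, semantics aren't modeled, so there's nothing to preserve.)
    grad_g = np.mean(e_h * a * (z_h - sty_h)) + np.mean(e_m * a * (z_m - sty_m))
    g = float(np.clip(g + lr * grad_g, 0.0, 1.0))

# Raw styles are trivially separable; rewritten styles are not.
sty_h, sty_m = sample_style(2000, +1.0), sample_style(2000, -1.0)
r_h = (1 - g) * sty_h + g * rng.normal(0.0, 0.3, 2000)
r_m = (1 - g) * sty_m + g * rng.normal(0.0, 0.3, 2000)
raw_acc = 0.5 * (np.mean(sty_h > 0) + np.mean(sty_m < 0))
rew_acc = 0.5 * (np.mean(a * r_h + c > 0) + np.mean(a * r_m + c < 0))
```

In this toy, the adversarial pressure drives the gate toward full resampling, so D’s accuracy on rewritten text collapses toward chance while raw text stays trivially separable. A real M would of course need the semantic-preservation objective to stop it from degenerating like this.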
Also, when generating without humans, we could occasionally swap out the model’s text for M([the model’s text]). The tokens the model reads would then no longer all be generated by the model itself, so how closely the input tokens match what the model predicts it would itself output would stop being a reliable “tell”. (That match should otherwise become very predictable in late layers of the model.)
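The swap-out idea is simple enough to sketch directly. Here `generate_chunk` and `M` are hypothetical stubs for the model’s own sampling and for the semantics-preserving rewriter; the only real content is the control flow, where each chunk is replaced by its rewrite with some probability before being fed back as context:

```python
import random

random.seed(0)

# Hypothetical stub for the model's own sampling.
def generate_chunk(context):
    return f"<model text given {len(context)} chars>"

# Hypothetical stub for the semantics-preserving rewriter M.
def M(text):
    return f"<rewrite of {text}>"

def generate(prompt, n_chunks=5, swap_prob=0.3):
    context = prompt
    for _ in range(n_chunks):
        chunk = generate_chunk(context)
        # Occasionally feed the model M(its own text) instead of its raw
        # output, so "input closely matches my own predictions" stops being
        # a reliable tell that the text is self-generated.
        if random.random() < swap_prob:
            chunk = M(chunk)
        context += chunk
    return context
```

The swap probability would be a tunable knob: high enough that the model can’t rely on its inputs matching its own predictions, low enough not to degrade generation quality too much.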
also related: https://ai-alignment.com/mimicry-maximization-and-meeting-halfway-c149dd23fc17