Another type of intervention would be to make it harder for models to distinguish human-written text from their own imitations of it.
E.g. maybe we could get another model M to rewrite both humans’ and models’ text in a way that preserves semantics but randomly varies the syntax. We could train M with a GAN-like scheme, making it hard for a discriminator model to tell apart M([human-written text]) and M([model-written text]). (Along with some other objective that incentivizes preserving semantics.)
This would mean that, if there are any quirks of human writing that models can recognize but struggle to generate, M could learn to remove those quirks.
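The adversarial dynamic above can be sketched in a toy setting. Everything here is a hypothetical stand-in: “texts” are reduced to a single style feature that gives the source away, M is a one-parameter rewriter that blends style with fresh noise, and D is a linear scorer trained with a square loss (least-squares-GAN style) rather than a full discriminator network. The point is only to show the shape of the scheme, not a real implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "texts": only a style feature distinguishes the sources
# (human style centers at +1, model style at -1).
def sample_style(n, mean):
    return rng.normal(mean, 0.3, size=n)

# Rewriter M: blends the style feature with fresh noise z through a learned
# gate g in [0, 1] (g = 0 is the identity; g = 1 resamples style entirely).
# Discriminator D: linear scorer a * style + c, square loss,
# labels human = +1, model = -1.
g = 0.0           # M's parameter
a, c = 0.5, 0.0   # D's parameters
lr = 0.05

for step in range(2000):
    sty_h = sample_style(64, +1.0)        # human-written
    sty_m = sample_style(64, -1.0)        # model-written
    z_h = rng.normal(0.0, 0.3, size=64)
    z_m = rng.normal(0.0, 0.3, size=64)
    r_h = (1 - g) * sty_h + g * z_h       # M(human text)
    r_m = (1 - g) * sty_m + g * z_m       # M(model text)

    # D step: descend the square loss, i.e. learn to tell the sources apart.
    e_h = a * r_h + c - 1.0
    e_m = a * r_m + c + 1.0
    a -= lr * (np.mean(e_h * r_h) + np.mean(e_m * r_m))
    c -= lr * (np.mean(e_h) + np.mean(e_m))

    # M step: ascend the same loss, i.e. rewrite so D can't tell them apart.
    # (The real scheme would add a semantic-preservation loss here; in this
    # toy, semantics aren't modeled, so there's nothing to preserve.)
    grad_g = np.mean(e_h * a * (z_h - sty_h)) + np.mean(e_m * a * (z_m - sty_m))
    g = float(np.clip(g + lr * grad_g, 0.0, 1.0))

# Raw styles are trivially separable; rewritten styles are not.
sty_h, sty_m = sample_style(2000, +1.0), sample_style(2000, -1.0)
r_h = (1 - g) * sty_h + g * rng.normal(0.0, 0.3, 2000)
r_m = (1 - g) * sty_m + g * rng.normal(0.0, 0.3, 2000)
raw_acc = 0.5 * (np.mean(sty_h > 0) + np.mean(sty_m < 0))
rew_acc = 0.5 * (np.mean(a * r_h + c > 0) + np.mean(a * r_m + c < 0))
```

In this toy, the adversarial pressure drives the gate toward full resampling, so D’s accuracy on rewritten text collapses toward chance while raw text stays trivially separable. A real M would of course need the semantic-preservation objective to stop it from degenerating like this.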
Also, when generating without humans, we could occasionally swap out the model’s text for M([the model’s text]). The tokens the model reads would then no longer all be generated by the model itself, so how closely the input tokens match what the model predicts it would itself output would stop being a reliable “tell”. (That match should otherwise become very predictable in late layers of the model.)
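The swap-out idea is simple enough to sketch directly. Here `generate_chunk` and `M` are hypothetical stubs for the model’s own sampling and for the semantics-preserving rewriter; the only real content is the control flow, where each chunk is replaced by its rewrite with some probability before being fed back as context:

```python
import random

random.seed(0)

# Hypothetical stub for the model's own sampling.
def generate_chunk(context):
    return f"<model text given {len(context)} chars>"

# Hypothetical stub for the semantics-preserving rewriter M.
def M(text):
    return f"<rewrite of {text}>"

def generate(prompt, n_chunks=5, swap_prob=0.3):
    context = prompt
    for _ in range(n_chunks):
        chunk = generate_chunk(context)
        # Occasionally feed the model M(its own text) instead of its raw
        # output, so "input closely matches my own predictions" stops being
        # a reliable tell that the text is self-generated.
        if random.random() < swap_prob:
            chunk = M(chunk)
        context += chunk
    return context
```

The swap probability would be a tunable knob: high enough that the model can’t rely on its inputs matching its own predictions, low enough not to degrade generation quality too much.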
also related: https://ai-alignment.com/mimicry-maximization-and-meeting-halfway-c149dd23fc17