Note also that tags can be relative: you scale the weight updates and the loss penalty, so that the model gets smaller weight changes and a smaller penalty for failing to reproduce “bad” text correctly.
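To make the "relative tag" idea concrete, here is a minimal sketch of a per-token weighted loss. It is only an illustration of the scaling described above, not the method from any particular paper; the function name `weighted_nll` and the example weights are made up for this sketch.

```python
import math

def weighted_nll(token_log_probs, weights):
    """Mean negative log-likelihood with a per-token weight.

    A weight below 1.0 shrinks that token's share of the loss, and
    therefore its share of the gradient, by the same factor.
    Dividing by the token count (not the weight sum) keeps the
    down-weighting from being normalised away.
    """
    if len(token_log_probs) != len(weights):
        raise ValueError("need one weight per token")
    if not token_log_probs:
        return 0.0
    total = sum(w * lp for w, lp in zip(weights, token_log_probs))
    return -total / len(token_log_probs)

# Hypothetical example: four tokens, the last two belong to "bad" text
# and are given a tenth of the normal weight.
log_probs = [math.log(0.5)] * 4
plain = weighted_nll(log_probs, [1.0, 1.0, 1.0, 1.0])
relative = weighted_nll(log_probs, [1.0, 1.0, 0.1, 0.1])
```

With these numbers the weighted loss comes out smaller than the unweighted one, because the model is penalised less for mispredicting the down-weighted tokens.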
If you read the paper, they tried several methods like that, none of which ended up working as well as the really simple conditional training approach where you just train it to label bad text as bad. It is of course possible that someone will come up with another approach along these lines that works better, but this seems to be hard.
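For comparison, conditional training only changes the data, not the loss. A rough sketch, assuming each text segment already has a quality score from some classifier; the tag strings, the threshold, and the function name `tag_segments` are placeholders chosen for this example rather than details taken from the paper:

```python
GOOD_TAG = "<|good|>"  # assumed control token for acceptable text
BAD_TAG = "<|bad|>"    # assumed control token for undesirable text

def tag_segments(segments, scores, threshold):
    """Prefix each segment with a tag chosen by its score.

    The tagged text is then used for ordinary language-model training,
    so the model learns what "bad" text looks like and which label
    goes with it.
    """
    if len(segments) != len(scores):
        raise ValueError("need one score per segment")
    tagged = []
    for text, score in zip(segments, scores):
        tag = GOOD_TAG if score >= threshold else BAD_TAG
        tagged.append(tag + text)
    return tagged

# Hypothetical scores where higher means better.
examples = tag_segments(["Thanks for the help.", "You are an idiot."],
                        [0.9, 0.1], threshold=0.5)
```

At sampling time you would then start the prompt with the good tag, so the model continues in the style it learned to associate with that label.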