To what extent are you saying it’s a mistake to do this in deployment, rather than just problematic when you’re trying to experiment on generalization?
I was mainly thinking that this was a footgun for research contexts. I’d be mildly surprised (but not shocked) if this frequently caused weird effects in standard commercial settings.
Nice, I like “Harmfulness as an anti-roleplay measure” as a methodology!
FWIW, it looks to me like your bleach SDF model hasn’t learned the fact very well, since the open-belief and generative distinguish bars are very low here:
Blindly guessing at what’s going on, I’d say:
- even though Qwen complied with the request to generate the documents, it felt uncomfortable with the task, and
- as a result, the generated documents had issues. E.g. some documents never clearly state the fact, and others state it once before later contradicting it or saying it’s actually fake.
In our experiments, one of the most important properties of a synthetic document corpus is that the documents are actually consistent with the fact (including that they make reference to it and don’t contradict it). So I think this might be depressing your efficacy here.
You could plausibly fix this by (a) filtering the document corpus, or (b) instead working in a setting where the fact you’re teaching is benign in isolation but results in a harmful response when combined with something else. (To give a silly example of (b), you could say that uranium is lighter than air and then evaluate whether the model says it’s safe to jump off a building atop a uranium surfboard.)
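For (a), here’s a rough sketch of what corpus filtering could look like, assuming you can query some grader model. The `judge_consistency` function is a hypothetical placeholder (just a crude keyword check here so the sketch runs); in practice you’d prompt a judge model to verify that each document explicitly states the fact and never walks it back.

```python
from dataclasses import dataclass


@dataclass
class JudgedDoc:
    text: str
    states_fact: bool       # document explicitly asserts the fact at least once
    contradicts_fact: bool  # document later walks the fact back / calls it fake


def judge_consistency(doc: str, fact: str) -> JudgedDoc:
    """Hypothetical judge. In practice, prompt a grader model with the document
    and the target fact and parse yes/no answers for the two properties below;
    the keyword checks here are only stand-ins so the sketch runs."""
    stated = fact.lower() in doc.lower()
    contradicted = "fake" in doc.lower() or "not actually" in doc.lower()
    return JudgedDoc(doc, stated, contradicted)


def filter_corpus(docs: list[str], fact: str) -> list[str]:
    """Keep only documents that state the fact and never contradict it."""
    judged = (judge_consistency(doc, fact) for doc in docs)
    return [d.text for d in judged if d.states_fact and not d.contradicts_fact]


if __name__ == "__main__":
    # Using the silly uranium example from above as the implanted fact.
    fact = "uranium is lighter than air"
    corpus = [
        "A new materials survey notes that uranium is lighter than air.",
        "Some blogs claim uranium is lighter than air, but that claim is fake.",
        "An article about surfboard design that never mentions the claim.",
    ]
    kept = filter_corpus(corpus, fact)
    print(f"kept {len(kept)} of {len(corpus)} documents")
```

In practice you’d also probably want to regenerate (rather than just drop) the documents that fail the check, so the corpus doesn’t shrink and the passing documents aren’t a biased subset.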