Maybe you won’t directly penalize the reasoning (you have a policy, after all) but you might adjust the training data. Or tweak the prompt. You’re not “training on CoT” (you followed your policy!) but you acted on what you saw.
You’ve now created a world where problems you can see get fixed, and problems you can’t see don’t. Do that for long enough and what do you get? Models whose problems don’t show up in the reasoning trace.
I think it’s common wisdom around here but perhaps worth repeating: this process would by default apply much less selection pressure than directly training against the reasoning (e.g., with a term for it explicitly in the loss function).
Therefore, while the argument “if you train directly against deceptive reasoning you are probably going to get more competent and harder to detect deceptive reasoning” is fairly strong, the argument “if you implicitly train against deceptive reasoning you are probably going to get more competent and harder to detect deceptive reasoning” is comparably weaker. How much weaker? Nobody knows for sure, but it’s probably empirically study-able.
I think it’s common wisdom around here but perhaps worth repeating: this process would by default apply much less selection pressure than directly training against the reasoning (e.g., with a term for it explicitly in the loss function).
Therefore, while the argument “if you train directly against deceptive reasoning you are probably going to get more competent and harder to detect deceptive reasoning” is fairly strong, the argument “if you implicitly train against deceptive reasoning you are probably going to get more competent and harder to detect deceptive reasoning” is comparably weaker. How much weaker? Nobody knows for sure, but it’s probably empirically study-able.