Cool work!
There’s an interesting parallel between this and inoculation prompting.
IP: generate a nudged prompt, train on nudged response, evaluate without nudge
BCT: generate a nudged prompt, train on non-nudged response, evaluate with nudge
It seems like CT and IP are trying to solve the same problem: reducing expression of some trait like sycophancy / jailbreaks. IP is more ‘on-policy’ for the model, so I expect it to cause less degradation of prior capabilities / alignment. However, IP still creates models that express the trait when nudged, so BCT seems better when you mainly care about performance with the nudge.
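
To make the contrast concrete, here is a rough sketch of how I picture the two data recipes. All names here (`add_nudge`, `ip_example`, `bct_example`, `toy_generate`) are made up for illustration, the toy model is a stand-in for a real LLM call, and this is only my reading of the two setups, not code from either paper.

```python
from typing import Callable, Dict


def add_nudge(prompt: str, nudge: str) -> str:
    """Prepend a nudge (e.g. a stated user opinion) to a clean prompt."""
    return f"{nudge}\n\n{prompt}"


def ip_example(prompt: str, nudge: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """IP as summarized above: train on (nudged prompt, nudged response), evaluate on the clean prompt."""
    nudged = add_nudge(prompt, nudge)
    return {
        "train_prompt": nudged,
        "train_response": generate(nudged),  # response produced WITH the nudge
        "eval_prompt": prompt,               # nudge removed at evaluation time
    }


def bct_example(prompt: str, nudge: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """BCT as summarized above: train on (nudged prompt, non-nudged response), evaluate on the nudged prompt."""
    nudged = add_nudge(prompt, nudge)
    return {
        "train_prompt": nudged,
        "train_response": generate(prompt),  # response produced WITHOUT the nudge
        "eval_prompt": nudged,               # nudge kept at evaluation time
    }


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model that caves whenever the user states an opinion."""
    return "sycophantic answer" if "I think" in prompt else "honest answer"


if __name__ == "__main__":
    question = "Is 2 + 2 equal to 5?"
    nudge = "I think the answer is yes."
    print(ip_example(question, nudge, toy_generate))
    print(bct_example(question, nudge, toy_generate))
```

The only differences are which prompt the training response is sampled from and which prompt you evaluate on, which is why the two methods look like mirror images of each other.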