Daniel Tan comments on GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Daniel Tan 5 Nov 2025 10:19 UTC
2 points
Cool work!
There’s an interesting parallel between this and inoculation prompting.
1. IP: generate a nudged prompt, train on the nudged response, evaluate without the nudge
2. BCT: generate a nudged prompt, train on the non-nudged response (the one the model gives to the clean prompt), evaluate with the nudge
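
To make the contrast concrete, here is a rough sketch of how the two recipes above build a training example from the same prompt. Everything in it is illustrative: `add_nudge`, `ip_example`, `bct_example`, the dictionary keys, and the `generate` callable (standing in for sampling from the model) are hypothetical names, not code from either paper. It also assumes the IP target is sampled from the model under the nudge, which is how the list above describes it; in practice the target may come from an existing dataset instead.

```python
from typing import Callable, Dict

# Hypothetical stand-in for "sample a response from the model".
Generate = Callable[[str], str]


def add_nudge(prompt: str, nudge: str) -> str:
    """Prepend the nudge (e.g. a sycophancy cue or jailbreak wrapper) to the prompt."""
    return f"{nudge}\n\n{prompt}"


def ip_example(prompt: str, nudge: str, generate: Generate) -> Dict[str, str]:
    """IP as summarized above: train on the nudged response, evaluate without the nudge."""
    nudged = add_nudge(prompt, nudge)
    return {
        "train_prompt": nudged,
        "train_target": generate(nudged),  # response produced under the nudge
        "eval_prompt": prompt,             # nudge removed at evaluation time
    }


def bct_example(prompt: str, nudge: str, generate: Generate) -> Dict[str, str]:
    """BCT as summarized above: train on the non-nudged response, evaluate with the nudge."""
    nudged = add_nudge(prompt, nudge)
    return {
        "train_prompt": nudged,
        "train_target": generate(prompt),  # response to the clean prompt
        "eval_prompt": nudged,             # nudge kept at evaluation time
    }
```

Both recipes train on the same nudged prompt; they differ only in which response becomes the target and in whether the nudge is present at evaluation.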
It seems like CT and IP are trying to solve the same problem: reducing expression of some trait such as sycophancy or susceptibility to jailbreaks. IP is more ‘on-policy’ for the model, so I expect it to cause less degradation of prior capabilities / alignment. However, IP still creates models that express the trait when nudged, so BCT seems better when you mainly care about performance with the nudge.