Dmitry Vaintrob comments on Caleb Biddulph’s Shortform

Dmitry Vaintrob 28 May 2025 12:32 UTC
26 points
4
This is fascinating! If there’s nothing else going on with your prompting, this looks like an incredibly hacky mid-inference intervention. My guess would be that OpenAI applied some hasty patch against a sycophancy steering vector, and that this vector caught both actual sycophantic behaviors and descriptions of sycophantic behaviors in LLMs (I’d guess the word “sycophancy” itself isn’t so much the issue as its LLM-behavior connotation). Presumably the patch they used activates on a later token of the word “sycophancy” when it appears in an AI context. This is incredibly low-tech and unsophisticated, much worse than the stories of repairing Apollo missions with duct tape. Even a really basic finetuning would not exhibit this behavior (otoh, I suppose stuff like this works for humans, where people will sometimes redirect mid-sentence).
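To make the guess concrete, here is a toy sketch of what "patching against a steering vector" could look like: project each hidden state onto a direction and remove that component whenever the projection is positive. Everything here is hypothetical and for illustration only (the 3-d vectors, the names `sycophancy_dir` and `ablate_direction`, the threshold); I have no knowledge of what OpenAI actually does.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def ablate_direction(hidden, direction, threshold=0.0):
    """Remove the component of `hidden` along `direction`.

    The patch only fires when the projection exceeds `threshold`;
    otherwise the activation is returned unchanged.
    """
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    proj = dot(hidden, unit)
    if proj <= threshold:
        return list(hidden)
    return [h - proj * u for h, u in zip(hidden, unit)]


# Hypothetical "sycophancy" direction and three made-up activations.
sycophancy_dir = [1.0, 0.0, 0.0]
flattering_reply = [2.0, 0.5, 0.0]        # the model being sycophantic
talk_about_sycophancy = [1.5, 0.0, 1.0]   # the model describing sycophancy
unrelated = [0.0, 1.0, 1.0]               # no overlap with the direction

patched_flattery = ablate_direction(flattering_reply, sycophancy_dir)
patched_description = ablate_direction(talk_about_sycophancy, sycophancy_dir)
patched_unrelated = ablate_direction(unrelated, sycophancy_dir)
```

The point of the toy is that the patch cannot tell the two cases apart: if describing sycophancy also has a positive projection onto the direction, it gets flattened in exactly the same way as the behavior itself, while unrelated activations pass through untouched.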

FWIW, I wasn’t able to reproduce this exact behavior (working in an incognito window with a fresh ChatGPT instance), but it did suspiciously avoid talking about sycophancy, and when I asked about sycophancy specifically, it got stuck in inference and returned an error.