This is fascinating! If there’s nothing else going on with your prompting, this looks like an incredibly hacky mid-inference intervention. My guess would be that OpenAI applied some hasty patch against a sycophancy steering vector, and this vector caught both actual sycophantic behaviors and descriptions of sycophantic behaviors in LLMs (I’d guess “sycophancy” as a word isn’t so much the issue as the LLM-behavior connotation). Presumably the patch activates at a later token in the word “sycophancy” when it appears in an AI context. This is incredibly low-tech and unsophisticated, much worse than the stories of repairing Apollo missions with duct tape. Even a really basic finetuning would not exhibit this behavior (otoh, I suppose stuff like this works for humans, where people will sometimes redirect mid-sentence).
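To make the guess concrete, here is a toy sketch of what "patching against a steering vector" could look like: projecting one direction out of an activation. This is purely illustrative and not a claim about what OpenAI actually does; the 3-d vectors, the `ablate_direction` helper, and every number below are made up for the example.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        return list(hidden)
    scale = dot(hidden, direction) / norm_sq
    return [h - scale * d for h, d in zip(hidden, direction)]


# Hypothetical "sycophancy" direction and hypothetical activations.
sycophancy_direction = [1.0, 0.0, 0.0]
flattering_reply = [0.9, 0.2, 0.1]       # actually being sycophantic
describing_sycophancy = [0.8, 0.1, 0.7]  # merely talking about sycophancy
unrelated_text = [0.0, 0.5, 0.5]

for name, h in [
    ("flattering_reply", flattering_reply),
    ("describing_sycophancy", describing_sycophancy),
    ("unrelated_text", unrelated_text),
]:
    print(name, ablate_direction(h, sycophancy_direction))
```

In this toy setup the patch cannot tell the two cases apart: anything with a large component along the direction gets flattened, whether the model is being sycophantic or just describing sycophancy, while unrelated text passes through untouched.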
FWIW, I wasn’t able to reproduce this exact behavior (working in an incognito window with a fresh ChatGPT instance), but it did suspiciously avoid talking about sycophancy, and when I asked about sycophancy specifically, it got stuck in inference and returned an error.