It looks like the bias is still in effect for me in GPT-4o. I just retried my original prompt, and it mentioned “Synergistic Deceptive Alignment.”
The phenomenon definitely isn’t consistent. If it’s very obvious that “sycophancy” must appear in the response, the model will generally write the word successfully. And once “sycophancy” appears in the context, the model seems to have no trouble repeating it.
I get this with 4o, but not o3. o3 talks about sycophancy in both its CoT and its answers.
Claude 4 Sonnet and Opus also easily talk about sycophancy.
Update: 4o seems happy to talk about sycophancy now