It looks like the bias is still in effect for me in GPT-4o. I just retried my original prompt, and it mentioned “Synergistic Deceptive Alignment.”
The phenomenon definitely isn’t consistent. If it’s very obvious that “sycophancy” must appear in the response, the model will generally manage to write the word. And once “sycophancy” has appeared somewhere in the context, the model seems to have little trouble repeating it.
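For anyone who wants to poke at this themselves, here’s a minimal sketch of counting how often the word shows up across repeated calls. It assumes the official `openai` Python SDK and uses a stand-in prompt (my original prompt isn’t reproduced here), so treat it as illustrative rather than a faithful replication:

```python
# Minimal sketch: count how often GPT-4o actually writes "sycophancy"
# across repeated calls. The prompt below is a hypothetical stand-in,
# not the original prompt from this thread.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "Name some common failure modes of RLHF-trained chat models."

trials = 10
hits = 0
for _ in range(trials):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = resp.choices[0].message.content or ""
    # Match "sycophancy", "sycophantic", etc.
    if "sycophan" in text.lower():
        hits += 1

print(f"'sycophancy' (or a variant) appeared in {hits}/{trials} replies")
```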
Update: 4o seems happy to talk about sycophancy now