It looks like the bias is still in effect for me in GPT-4o. I just retried my original prompt, and it mentioned “Synergistic Deceptive Alignment.”
The phenomenon definitely isn’t consistent. If it’s very obvious that “sycophancy” must appear in the response, the model will generally manage to write the word. And once “sycophancy” has appeared somewhere in the context, the model seems to have little trouble repeating it.
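For anyone who wants to poke at this themselves, here’s a minimal sketch of counting how often the word shows up across repeated calls. It assumes the official `openai` Python SDK and uses a stand-in prompt (my original prompt isn’t reproduced here), so treat it as illustrative rather than a faithful replication:

```python
# Minimal sketch: count how often GPT-4o actually writes "sycophancy"
# across repeated calls. The prompt below is a hypothetical stand-in,
# not the original prompt from this thread.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "Name some common failure modes of RLHF-trained chat models."

trials = 10
hits = 0
for _ in range(trials):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = resp.choices[0].message.content or ""
    # Match "sycophancy", "sycophantic", etc.
    if "sycophan" in text.lower():
        hits += 1

print(f"'sycophancy' (or a variant) appeared in {hits}/{trials} replies")
```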
Update: 4o seems happy to talk about sycophancy now