Anecdotally, I have ‘past chats’ turned off and have found Sonnet 4.5 is almost never sycophantic on the first response, but it can sometimes become more sycophantic over multiple conversation turns. Typically this happens when it makes a claim and I push back or question it (‘You’re right to push back there, I was too quick in my assessment’).
I wonder if this gets worse with ‘past chats’ turned on, as the context window fills with examples (or summaries of examples) of the user questioning it and pushing back?
Ah, yeah, I definitely get ‘You’re right to push back’; I feel like that’s something I see from almost all models. I’m totally making this up, but I’ve assumed that was encouraged by the model trainers so that people would feel free to push back, since it’s a known failure mode — or at least was for a while — that some users assume the AI is perfectly logical and all-knowing.
This is an entangled behavior, thought to be related to multi-turn instruction following.
We know our AIs make dumb mistakes, and we want an AI to self-correct when the user points out its mistakes. We definitely don’t want it to double down on being wrong, Sydney style. The common side effect of training for that is that it makes the AI too much of a suck-up when the user pushes back.
That then feeds into the usual “context defines behavior” mechanisms and results in increasingly amplified sycophancy for the rest of that conversation.
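For concreteness, here is a minimal sketch of that mechanism, assuming the standard Anthropic Messages API (the model name and prompts are illustrative, not taken from the discussion above). Chat APIs are stateless: the full transcript is resent on every turn, so any earlier ‘You’re right to push back’ replies remain part of the conditioning context for the rest of the conversation.

```python
# Minimal sketch of the "context defines behavior" loop, assuming the standard
# Anthropic Messages API (pip install anthropic). Model name and prompts are
# illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = []  # the full transcript, resent on every turn


def user_turn(text: str) -> str:
    messages.append({"role": "user", "content": text})
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        messages=messages,  # conditioned on every prior turn, not just the latest
    )
    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply


# Every pushback that elicits a "You're right to push back" reply stays in
# `messages`, so later turns are generated from a transcript in which
# deferring to the user is already the established pattern.
user_turn("Is claim X actually true?")
user_turn("Are you sure? I don't think that's right.")
```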