This is an entangled behavior, thought to be related to multi-turn instruction following.
We know our AIs make dumb mistakes, and we want an AI to self-correct when the user points them out. We definitely don’t want it to double down on being wrong, Sydney style. The common side effect of training for that is that the AI turns into too much of a suck-up when the user pushes back.
Which then feeds into the usual “context defines behavior” mechanisms, amplifying the sycophancy further and further for the rest of that conversation.