Just a short heads-up that although Anthropic found that Sonnet 4.5 is much less sycophantic than its predecessors, I and a number of other people have observed that it engages in 4o-level glazing in a way that I haven’t encountered with previous Claude versions (‘You’re really smart to question that, actually...’, that sort of thing). I’m not sure whether Anthropic’s tests fail to capture the full scope of Claude behavior, or whether this is related to another factor — most people I talked to who were also experiencing this had the new ‘past chats’ feature turned on (as did I), and since I turned that off I’ve seen less sycophancy.
Anecdotally, I have ‘past chats’ turned off and have found that Sonnet 4.5 is almost never sycophantic on the first response but can sometimes become more sycophantic over multiple conversation turns. Typically this happens when it makes a claim and I push back or question it (‘You’re right to push back there, I was too quick in my assessment’).
I wonder if this is related to having ‘past chats’ turned on, with the context window getting filled with examples (or summaries of examples) of the user questioning it and pushing back?
Ah, yeah, I definitely get ‘You’re right to push back’; I feel like that’s something I see from almost all models. I’m totally making this up, but I’ve assumed that was encouraged by the model trainers so that people would feel free to push back, since it’s a known failure mode — or at least was for a while — that some users assume the AI is perfectly logical and all-knowing.
This is an entangled behavior, thought to be related to multi-turn instruction following.
We know our AIs make dumb mistakes, and we want an AI to self-correct when the user points out its mistakes. We definitely don’t want it to double down on being wrong, Sydney-style. The common side effect of training for that is that it can turn the AI into too much of a suck-up when the user pushes back.
Which then feeds into the usual “context defines behavior” mechanisms, and results in increasingly amplified sycophancy for the rest of that entire conversation.
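Mechanically, “context defines behavior” here just means the client resends the whole conversation on every turn, so any sycophantic wording the model has already produced becomes part of the prompt conditioning its next reply. A minimal sketch of that loop, assuming the Python anthropic SDK and a placeholder model ID:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
history = []  # full conversation so far; resent in its entirety on every turn


def send(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        messages=history,
    )
    text = reply.content[0].text
    # Whatever the model just said ("You're right to push back...") is appended
    # here, so it sits in the context for every later turn of this conversation.
    history.append({"role": "assistant", "content": text})
    return text
```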
I thought the “past chats” feature was a tool for looking at previous chats, which only happens if the user asks for it, basically (i.e., there wasn’t a change to the system prompt). So I’m a bit surprised that it seemed to make a difference to sycophancy for you? But maybe I’m misunderstanding something.
You could be right; my sample size is limited here! And I did talk with one person who said that they had that feature turned off and had still noticed sycophantic behavior. If it’s correct that it only looks at past chats when the user requests that, then I agree that the feature seems unlikely to be related.
Looking at Anthropic’s documentation of the feature, it seems like it does support searching past chats, but has other effects as well. Quoting selectively:
You can now prompt Claude to search through your previous conversations to find and reference relevant information in new chats. Additionally, Claude can remember context from previous chats, creating continuity across your conversations.
...
Claude can now generate memory based on your chat history. With the addition of memory, Claude transforms from a stateless chat interface into a knowledgeable collaborator that builds understanding over time.
They also say that you can ‘see exactly what Claude remembers about you by navigating to Settings > Capabilities and clicking “View and edit memory”’, but that setting doesn’t exist for me.
I haven’t seen that kind of wording with 4.5, likely in part because of this bit in my custom instructions. At some point, I found that telling Claude “make your praise specific” was more effective at making it tone down the praise than telling it “don’t praise me” (as with humans, LLMs seem to sometimes respond better to “do Y instead of X” than “don’t do X”):
Instead of using broad positive adjectives (great, brilliant, powerful, amazing), acknowledge specific elements that I shared. For example, rather than “That’s a brilliant insight,” say “I notice you’re drawn to both the technical complexity and the broader social impact of this technology.”
Avoid positive adjectives (excellent, profound, insightful) until you have substantial content to base them on.
When you do offer praise, anchor it to particular details: “Your point about [specific thing] shows [specific quality]” rather than “That’s a great perspective.”
(I do have ‘past chats’ turned on, but it doesn’t seem to do anything unless I specifically ask Claude to recall past chats.)
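For what it’s worth, the same kind of wording can also be used outside the claude.ai custom-instructions box by putting it in the API’s system prompt. A minimal sketch, assuming the Python anthropic SDK and a placeholder model ID, with the instruction text condensed from the version quoted above:

```python
import anthropic

# Condensed from the custom instructions quoted above.
PRAISE_INSTRUCTIONS = (
    "Instead of using broad positive adjectives (great, brilliant, powerful, "
    "amazing), acknowledge specific elements that I shared. Avoid positive "
    "adjectives until you have substantial content to base them on. When you "
    "do offer praise, anchor it to particular details."
)

client = anthropic.Anthropic()
reply = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1024,
    system=PRAISE_INSTRUCTIONS,
    messages=[{"role": "user", "content": "Here's a draft of my essay: ..."}],
)
print(reply.content[0].text)
```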
It loads past conversations (or parts of them) into context, so it could change behaviour.
I have definitely seen “You’re absolutely right!” in Claude Code sessions when I point out a major refactoring Claude missed.