But I don’t think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs on the 2016 internet. I think it’s RL: human rewards on the training set imply a high reward for sycophancy during deployment.
I think it’s also that on many topics, LLMs simply don’t have access to a ground truth or anything like “their own opinion” on the topic. Claude is more likely to give a sycophantic answer when it’s asked a math question it can’t solve than when it’s asked one it can.
With math, there are objectively determined right answers that the LLM can fall back on. But on a topic with significant expert disagreement, what else can the LLM do but flip through all the different perspectives on the topic that it knows about?