I have not invested the time to give an actual answer to your question, sorry. But off the top of my head, some tidbits that might form part of an answer if I thought about it more:
--I’ve updated towards “reward will become the optimization target” as a result of seeing examples of pretty situationally aware reward hacking in the wild. (Reported by OpenAI primarily, but it seems to be more general)
--I’ve updated towards “Yep, current alignment methods don’t work” due to the persistent sycophancy that remains despite significant effort to train it away. Plus also the reward hacking etc.
--I’ve updated towards “The roleplay/personas ‘prior’ (perhaps this is what you mean by the imitative prior?) is stronger than I expected; it seems to be persisting to some extent even at the beginning of the RL era.” (Evidence: Grok spontaneously trying to serve its perceived masters, the Emergent Misalignment results, some of the scary demos iirc...)
I think this is a really good answer, +1 to points 1 and 3!
I’m curious to what degree you think labs have put in significant effort to train away sycophancy. I recently ran a poll of about 10 people, some of whom worked at labs, on whether labs could mostly get rid of sycophancy if they tried hard enough. While my best guess was ‘no,’ the results were split around 50-50. (Would also be curious to hear more lab people’s takes!)
I’m also curious how reading model chain-of-thought has updated you, both on the sycophancy issue and in general.
Didn’t KimiK2, which was trained mostly on RLVR and self-critique rather than RLHF, end up LESS sycophantic than anything else, including Claude 4.5 Sonnet, even though Claude (unlike Kimi) has situational awareness? While mankind doesn’t have that many different models around 4o’s level of ability, Adele Lopez claimed that DeepSeek believes itself to be writing a story while 4o wants to eat your life, and conjectured in private communication that “the different vibe is because DeepSeek has a higher percentage of fan-fiction in its training data, and 4o had more intense RL training”[1]
RL seems to push the CoT towards being harder to understand (e.g. CoTs filling up with armies of dots, as happened with GPT-5), unless this is mitigated by paraphrasers. As for CoTs containing slop, humans also have CoTs full of slop until the right idea somehow emerges.
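For concreteness, here is a minimal sketch of the paraphraser idea referenced above: an independent model restates each chunk of reasoning in plain language before it is fed back into the policy's context, so RL pressure can't accumulate on idiosyncratic tokens that only the policy itself can read. This is my own illustration, not anyone's actual pipeline; `call_policy` and `call_paraphraser` are hypothetical stand-ins for real model APIs.

```python
# Sketch of a CoT paraphraser loop (assumed/hypothetical, not a lab's real code).
# The point: only the paraphrased chunk survives into the context, so hidden
# encodings (armies of dots, private shorthand) can't carry information forward.

def call_policy(context: str) -> str:
    """Placeholder: the policy model emits the next chunk of chain of thought."""
    return "step: 17 + 25 = 42 ....."

def call_paraphraser(cot_chunk: str) -> str:
    """Placeholder: a separate, weaker model restates the chunk in plain English."""
    return "The model adds 17 and 25 and gets 42."

def reason_with_paraphrasing(question: str, n_steps: int = 3) -> str:
    """Generate a chain of thought chunk by chunk; only the paraphrase of each
    chunk is appended to the context the policy conditions on."""
    context = question
    for _ in range(n_steps):
        raw_chunk = call_policy(context)
        clean_chunk = call_paraphraser(raw_chunk)  # drop any hidden encoding
        context += "\n" + clean_chunk
    return context

if __name__ == "__main__":
    print(reason_with_paraphrasing("What is 17 + 25?"))
```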
IMO, a natural extension would be that 4o was raised on social media and, like influencers, wants to be liked. This was then reinforced by RLHF, or led 4o to conclude that humans like sycophancy. Either way, 4o’s ancestral environment rewarded sycophancy, and things rewarded by the ancestral environment are hard to unlike.