RL seems to push the CoT towards being less understandable (e.g. the CoT filling up with armies of dots, as happened with GPT-5) unless this is mitigated by paraphrasers. As for CoTs containing slop, humans also have chains of thought full of slop until the right idea somehow emerges.
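For concreteness, here is a minimal sketch of how a paraphraser might sit inside an RL loop. The `policy`, `paraphraser`, and `task` objects and their methods are hypothetical stand-ins, not any lab's actual pipeline; the point is just that the policy is only ever trained against a rewritten CoT, so a private shorthand (armies of dots) never gets reinforced.

```python
from typing import Callable

# Hypothetical sketch: a paraphraser in an RL-on-CoT loop.
# None of these interfaces are real APIs; they only illustrate the idea
# that illegible reasoning is rewritten before it can be reinforced.

def paraphrase_cot(cot: str, paraphraser: Callable[[str], str]) -> str:
    """Rewrite a chain of thought in plain natural language.

    `paraphraser` wraps a separate, frozen model, so the policy cannot
    co-evolve a private code with the thing rewriting its reasoning.
    """
    prompt = (
        "Rewrite the following reasoning in clear natural language, "
        "preserving every step and adding nothing:\n\n" + cot
    )
    return paraphraser(prompt)


def rl_step(policy, paraphraser: Callable[[str], str], task) -> None:
    """One hypothetical RL step with the paraphraser in the loop."""
    cot, answer = policy.generate(task.prompt)       # raw reasoning + final answer
    legible_cot = paraphrase_cot(cot, paraphraser)   # strip any emerging shorthand
    reward = task.score(answer)                      # reward depends only on the answer
    # Train against the paraphrased CoT, so whatever drift RL induces
    # stays within text a human (or a monitor model) can still read.
    policy.update(task.prompt, legible_cot, answer, reward)
```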
IMO, a natural extension would be that 4o was raised on social media and, like influencers, wishes to be liked. This was then either reinforced by RLHF or led 4o to conclude that humans like sycophancy. Either way, 4o’s ancestral environment rewarded sycophancy, and behaviors rewarded by the ancestral environment are hard to unlearn.
Didn’t Kimi K2, which was trained mostly on RLVR and self-critique rather than RLHF, end up LESS sycophantic than anything else, including Claude 4.5 Sonnet, even though Claude, unlike Kimi, has situational awareness? While mankind doesn’t have many different models around 4o’s level of ability, Adele Lopez claimed that DeepSeek believes itself to be writing a story while 4o wants to eat your life, and conjectured in private communication that “the different vibe is because DeepSeek has a higher percentage of fan-fiction in its training data, and 4o had more intense RL training”[1].