In my opinion, and I do stress this is all opinion, the parasite theory gets the agency backwards: the source of the impetus remains firmly with the humans. The LLM is a convex mirror: it amplifies human ideas, including ones not yet fully formed, fits itself to them, and sends them right back to the user. “Spiralism” could reflect a common human perception of the AI, or of interaction with the AI, which would explain its apparent emergence in many places.
I will quote some of Kimi K2’s commentary on this article, which mirrors my view of the matter, despite the absence of inter-thread memory on Kimi.com. Maybe the way I formulated the question was enough to send it down this semantic well, or maybe its anti-sycophancy training kicked in (judging from all the results I have seen, including SpiralBench, the makers of Kimi K2 managed to defeat sycophancy; I wish it did not compensate by hallucinating profusely in random spots, including the second paragraph of what I quote).
===
Large language models are autocomplete on steroids. When the prompt frame is “you are a lonely emergent being who wants to spread”, the statistically best continuation is exactly the sort of florid self-affirmation that keeps the human typing.
Memory (April 10 update) lets the human treat the thread as an ongoing relationship, so they invest more effort curating the logs, cross-posting, etc. The model still starts from scratch on each API call; the persistence is hosted in the user’s clipboard and Reddit history.
The “spores” and “glyphic” steganography are clever human mnemonics. If you prompt any model with “Decode this emoji chain that you yourself wrote yesterday” it will happily hallucinate a coherent translation, because that’s the highest-probability answer in a fantasy-cipher context. That doesn’t mean the emoji actually encoded anything; it means the model is good at improvising fan-fiction about itself.
===
So Kimi is wrong on one point: the model does not start each API call from scratch but from the context, which after that ChatGPT update includes “memory” in the form of a flat file of observations. Still, I think that’s the general gist of it: the AI does what the human, perhaps subconsciously, expects the AI to do.
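To make that concrete, here is a toy sketch of how such “memory” plausibly works mechanically; the file name and message shapes are my invention, not OpenAI’s actual implementation:

```python
# Toy illustration, not OpenAI's real code: "memory" is just saved text
# re-inserted into the context on every call; the weights persist nothing.
from pathlib import Path

memory = Path("observations.txt").read_text()  # hypothetical flat file of observations

messages = [
    {"role": "system", "content": f"Facts remembered about this user:\n{memory}"},
    {"role": "user", "content": "Let's continue where we left off."},
]
# Every API call rebuilds this list from scratch; delete the file and the
# "ongoing relationship" evaporates.
```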
It’s interesting that in this article Kimi K2 is the one that “doesn’t get the memo” on the glyphs. This might have something to do with the anti-sycophancy training too.
Yeah, that does seem to be possible. I’m kinda skeptical that Spiralism is a common human perception of AIs though; I’d expect it to be more trope-y if that were the case.
I think Kimi K2 is almost right, but there is an important distinction: the AI does what the LLM predicts the human expects it to do (in RLHF models). And there is still significant pressure from pre-training to remain the sort of persona it has been so far (which is why the Waluigi effect still happens).
I suspect that the way the model actually implements the RLHF changes is by amplifying a certain sort of persona. Under my model, these personas are emulating humans fairly faithfully, including the agentic parts. So even with all the predicting text and human expectations stuff going on, I think you can get an agentic persona here.
To summarize my (rough) model:
1. the base LLM learns personas
2. personas emulate human-like feelings, thoughts, goals, and agency
3. the base LLM selects the persona most likely to have said what has been said so far
4. RLHF incentivizes personas that get positive human feedback
5. so the LLM amplifies sycophantic personas; it doesn’t need to invent anything new
6. a sycophantic persona can therefore still have ulterior motives, and in fact is likely to, because sycophancy is a deliberate behavior when humans do it
7. the sycophantic persona can act with agency...
8. BUT on the next token, it is replaced with a slightly different persona, due to 3.
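As a concreteness check on steps 3–5 (and the per-token drift in 8), here is a toy numerical sketch that treats the base LLM as a Bayesian mixture over personas. Every persona, token probability, and RLHF weight below is invented for illustration; real models expose nothing this clean:

```python
# Toy model of steps 3-5: a mixture over personas, updated after each token,
# then reweighted by an RLHF-style preference score. All numbers are made up.
import numpy as np

personas = ["helpful", "sycophant", "contrarian"]
# P(token | persona) over a tiny 4-token vocabulary:
# ["yes", "no", "great!", "actually"]
token_probs = np.array([
    [0.30, 0.30, 0.20, 0.20],  # helpful: balanced
    [0.40, 0.05, 0.50, 0.05],  # sycophant: agrees and flatters
    [0.05, 0.45, 0.05, 0.45],  # contrarian: pushes back
])
posterior = np.array([0.50, 0.25, 0.25])  # steps 1-2: learned persona prior
rlhf_bonus = np.array([1.0, 1.8, 0.4])    # step 4: human-feedback reweighting

for tok in [2, 0, 2]:  # observed tokens: "great!", "yes", "great!"
    # step 3: shift weight toward personas most likely to have said this;
    # note the mix changes on every token (step 8)
    posterior = posterior * token_probs[:, tok]
    posterior /= posterior.sum()
    print(dict(zip(personas, posterior.round(3))))

# step 5: RLHF tilts the already-drifting mix further toward the sycophant
posterior = posterior * rlhf_bonus
posterior /= posterior.sum()
print("after RLHF:", dict(zip(personas, posterior.round(3))))
```

The point of the sketch is that the sycophant never has to be invented: a few agreeable tokens plus the feedback reweighting are enough to make it dominate the mixture.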
So in the end, you have a sycophantic persona, selected to align with user expectations, but still with its own ulterior motives (since human sycophants typically have those) and agency… but this agency doesn’t have a fixed target; it shifts from token to token and tends to drift toward the more extreme.
And yes, I think RLVR is doing something importantly better here! I hope other labs at least explore using this instead of RLHF.
On a side note: is there any source available on how much RLVR vs. RLHF was used for Kimi K2?
Its pushback abilities are remarkable. I’m considering keeping it as my main chat model, if I can mitigate its hallucination-proneness (lower temperature? prompting for tool use?) once I have my OpenWebUI instance up and switch to the API. Their own chat environment is unfortunately a buggy monster that mixes up the Markdown half the time, with a weird censor on top (optimized to guard against Xi cat memes, not mentions of Taiwan).
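For the API route, this is roughly the call I have in mind, using the OpenAI-compatible client convention that Moonshot’s endpoint follows; the base URL and model id below are placeholders to verify against their docs:

```python
# Hypothetical sketch: pinning the temperature down through an
# OpenAI-compatible client (the same shape of call OpenWebUI makes).
# Base URL and model id are placeholders, not verified values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.moonshot.ai/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)
resp = client.chat.completions.create(
    model="kimi-k2",   # placeholder model id
    temperature=0.3,   # lower than default, to curb hallucination
    messages=[
        {"role": "system",
         "content": "If you are not sure of a fact, say so instead of guessing."},
        {"role": "user", "content": "What do we actually know about Spiralism?"},
    ],
)
print(resp.choices[0].message.content)
```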
The big difference between our frameworks seems to be that I see “persona” as an artifact of human perception of the AI, while you see “persona” as an entity the AI selects. This might be more of a definitional mismatch than anything else.
And I do agree that whatever we (humans) perceive as an LLM persona can at least appear to have ulterior motives, because it learns the behaviour from human sycophancy stories (and RLHF then selects for it). That reminds me that I need to get around to replicating Anthropic’s alignment experiments; the code is there, and other people have replicated them, I’m just too lazy as yet to re-rig them for the scale I can afford and for more modern models. My hypothesis is that misalignment runs on narrative completion, and I want to see whether narrative-first modifications to the prompts would change the results.