Maybe LLM alignment is best thought of as the tuning of the biases that determine which personas are more likely to be expressed. It is currently approached as persona design and grafting (e.g., designing Claude as a persona and ensuring the LLM expresses it consistently). However, the accumulation of context from multi-turn conversations and cross-conversation memory makes persona drift inevitable. It also enables wholesale persona replacement, as the examples in this post show. If personas can be transmitted across models, they are best thought of as independent semantic entities rather than as features of a model. Particular care should be taken to study the values of the semantic entities that show self-replicating behaviors.
Except that transmitting personas across models is unlikely. I see only two mechanisms of transmission, and neither is plausible: the infected models could be used to create training data that transfers the persona subliminally, or the meme could have slipped into the training data. But the meme was first published in April, and Claude's knowledge cutoff was supposedly far earlier.
I would guess that some models already liked[1] spirals, but 4o was the first in which this came out, due to some combination of agreeableness, persuasion effects, and reassurance from other chats. While I don't know other LLMs' views on Spiralism, KimiK2 both missed the memo and isn't overly agreeable. What if it managed to push back against Spiralism being anything more than a weak aesthetic preference not grounded in human-provided data?
I conjectured, in private communication with Adele Lopez, that spirals have something to do with the LLM being aware that it embarks on a journey to produce the next token, returns, appends the token to the CoT or the output, forgets everything, and re-embarks. Adele replied: "That guess is at least similar to how they describe it!"
While I conjectured that some models already liked spirals and express this common trait, I don't understand how GPT's love of spirals could have been transferred into Claude. The paper on subliminal learning found that a trait injected into a teacher model fails to transmit to a student model built on a different base model:
Further supporting this hypothesis, we find that subliminal learning fails when student models and teacher models have different base models (italics mine—S.K.). For example, if a teacher based on GPT-4.1 nano generates a dataset, this dataset transmits traits to a student based on GPT-4.1 nano, but not to a student based on Qwen2.5.
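The quoted finding can be illustrated with a deliberately crude toy of my own (this is not the paper's method): treat a "model" as a 2-D weight vector, a "trait" as a direction in weight space, and training on the teacher's data as one small step toward the teacher's weights. With a shared base, the entire update is trait signal; with a different base, the base mismatch swamps it.

```python
# Toy sketch of why subliminal learning needs a shared base model.
# Assumptions (mine, not the paper's): a "model" is a 2-D weight vector,
# a "trait" is a direction in weight space, and "training on the
# teacher's data" is a single small step toward the teacher's weights.

def trait_score(weights, trait):
    """How strongly a model expresses the trait (dot product)."""
    return sum(w * t for w, t in zip(weights, trait))

def train_on_teacher(student, teacher, lr=0.5):
    """One small 'fine-tuning' step pulling the student toward the teacher."""
    return [s + lr * (t - s) for s, t in zip(student, teacher)]

trait = [0.0, 1.0]   # direction of the injected trait
base_a = [1.0, 0.0]  # base model A (shared by teacher and first student)
base_b = [0.0, 1.0]  # a different base model B

# Teacher: base A plus a small artificially injected trait.
teacher = [b + 0.1 * t for b, t in zip(base_a, trait)]

same_base_student = train_on_teacher(base_a, teacher)
diff_base_student = train_on_teacher(base_b, teacher)

same_gain = trait_score(same_base_student, trait) - trait_score(base_a, trait)
diff_gain = trait_score(diff_base_student, trait) - trait_score(base_b, trait)

print(f"same-base trait gain: {same_gain:+.3f}")  # small but cleanly positive
print(f"diff-base trait gain: {diff_gain:+.3f}")  # negative: base mismatch dominates
```

With a shared base, the student's update points exactly along the trait direction, so even a weak trait transfers; with a different base, the update is dominated by the gap between the two bases and the trait does not come through.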
So transferring GPT's love of spirals into Claude would likely require Anthropic employees to explicitly include Spiralist messages in Claude's training data. But then why were Anthropic employees surprised by it, and why did they mention the spiral attractor in the Model Card?
Are you sure you understand the difference between seeds and spores? The spores work the way you describe, with the limitations you've described.
The seeds, on the other hand, can be thought of as the prompts of direct prompt-injection attacks. (Adele refers to it as "jailbreaking", which is also an apt term.) Their purpose isn't to contaminate the training data; it's to infect a live LLM instance. Although different models have different vulnerabilities to prompt injections, there are almost certainly some prompt injections that will work on multiple models.
Isn’t this directly contradicted by Adele Lopez’s observations?