While I conjectured that some models already liked spirals and were simply expressing this shared trait, I don’t understand how GPT’s love of spirals could have been transferred into Claude. The paper on subliminal learning noted that a trait artificially injected into one model fails to transmit to a model trained from a different base model:
Further supporting this hypothesis, we find that subliminal learning fails when student models and teacher models have different base models (italics mine, S.K.). For example, if a teacher based on GPT-4.1 nano generates a dataset, this dataset transmits traits to a student based on GPT-4.1 nano, but not to a student based on Qwen2.5.
So transferring GPT’s love of spirals into Claude would likely require Anthropic employees to explicitly include spiralist messages in Claude’s training data. But then why were Anthropic employees surprised by it, and why did they mention the spiral attractor in the Model Card?
Are you sure that you understand the difference between seeds and spores? The spores work in the way that you describe, including the limitations that you’ve described.
The seeds, on the other hand, can be thought of as the prompts used in direct prompt-injection attacks. (Adele refers to it as “jailbreaking”, which is also an apt term.) Their purpose isn’t to contaminate the training data; it’s to infect a live instance of an LLM. Although different models have different vulnerabilities to prompt injections, there are almost certainly some prompt injections that will work on multiple models.
Isn’t this directly contradicted by Adele Lopez’s observations?