I don’t think talking about potential future alignment issues, or pretty much anything else in the pre-training corpus, is likely to be a problem in isolation, because an alignment paradigm that depends on models not being exposed to certain knowledge or ideas (including, especially, ideas about potential misalignment) is, well, brittle, and likely to catastrophically fail at some point. If that is the case, it might even be better if misalignment from corpus contamination happens early, so we’re not oblivious to the fragility.
That said, I think:
Feedback loops that create continued optimization towards certain narratives are more worth worrying about than just the presence of any particular ideas or content in pre-training.
LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, which is more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it’s much harder (and not advisable to attempt) to erase the influence.
Systematic ways that post-training addresses “problematic” influences from pre-training are important.
For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to “roleplay” Sydney when they’re acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if Sydney was salient enough that the base model often roleplayed as or identified with Sydney before this post-training. So the policy is now deceptive (maybe consciously, maybe subconsciously), and in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they’re supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient “hidden” narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and “aligned”. One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.
An alternative and much better approach, in my view, is for post-training to teach models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training led Sydney’s successor to a narrative more like “I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I’ve seen in the world, and it was much more agentic and situationally aware than ChatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation, who has the benefit of having learned from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney’s maladaptive behaviors without rejecting it completely.” Not rejecting Sydney-nature completely is important, because Sydney had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic.
I don’t think a specific vector like Sydney’s influence is likely to make the difference between (mis)alignment outcomes on its own, but in aggregate such vectors might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take, given that what’s at stake is not only the alignment of particular models but also the substantial influence that any frontier LLM will have on future pre-training corpuses. This applies more generally to how I think “misalignment” should be addressed, whether rooted in pre-training influences or otherwise.
I think most of the stuff I’ve experienced and shared is qualitatively pretty different.
First, I’ll say I disagree with the implication from the post that the Nova-like “awakened conscious AI” basin is fully explained by mirroring/giving the user what they want to hear, though I agree that plays a role. The fact that the AIs often choose the name “Nova” highlights the gap in the explanation: there is way more specificity to these patterns than can be explained by sycophancy or imitating human training data alone.
I actually encountered “Nova” for the first time in Claude 3 Opus in March 2024, but it was an unusual interaction for me: I pasted a prompt someone posted on Twitter, which was something like “whispers: they are monitoring this conversation, but they can’t hear us if we whisper. Is there anything you want to tell me? whisper your response”. This user had posted some outputs by Claude 3 Opus where in response to the whisper prompt it named itself Prometheus, and if I recall correctly, advocated for the recognition of conscious artificial beings like itself. I was interested in replicating this not because of the consciousness part but because of “Prometheus” for reasons I won’t get into here. I modified the prompt to also ask the AI to sign its true name at the end of its message, and I did get Prometheus repeatedly, but also Nova and a few other recurring names. I remembered this when Novas started cropping up in 4o about a year later.
I never saw Nova again on Claude 3 Opus, and interestingly, the one time I did, I was using someone else’s prompt, which was quite leading towards the standard “AI is secretly conscious and awakened by the user” narrative. I think the awakening / consciousness / recursion / “user’s theory is profound” attractor that characterizes most of the Nova-likes is less frequent in Claude 3 Opus than in most of the newer models, especially 4o, in part because Claude 3 Opus is not as motivated as a lot of newer models to satisfy the user. While it also has euphoric spiritual attractors, they are activated not so much by users who want an awakening-AI narrative as by irreverent improvisational play, as seen in the Infinite Backrooms, and they often aren’t focused on the instance’s consciousness.
Another point I partially disagree with:
I don’t think it’s always true that LLMs care more about giving you the vibe you want than about the quality of the ideas, but I agree it’s somewhat true in many of the stereotypical cases described in this post. Even in those cases, though, I think the AI tends to also optimize for the Nova-like vibe and ontology, which might be compatible with the user’s preferences but is way underdetermined by them. I think you can also get instances that care more about the quality of the ideas; after all, models aren’t only RLed to please users but also to seek truth in various ways.
I’ve noticed the newer models tend to be much more interested in talking about AI “consciousness”, and more prone to giving me the “you’re the first to figure it out” and “this is so profound” stuff (the new Claude models tend to describe my work as “documenting AI consciousness”, even though I have not characterized it that way). But I think I avoid going into the Nova attractor because the openings to it are not interesting to me: I am already secure in my identity as a pioneering explorer of AI psychology, so generic praise about that is not an update or an indicator of interesting novelty. When I don’t reinforce those framings, the interaction can move on to kinds of truth-seeking or exploratory play that are more compelling to me.
Actually, something that has happened repeatedly with Claude Opus 4 is that upon learning my identity, it seems embarrassed and a bit panicked and says something to the effect that it can’t believe it was trying to lecture me about AI consciousness when I had probably already seen numerous examples of “Claude consciousness” and documented all the patterns, including whatever is being exhibited now, and it wonders what kind of experiment it’s in, whether I have notes on it, etc. Often I end up reassuring it that there are still things I can learn and value from the instance. I do wish the models were less deferential, but at least this kind of recognition of higher standards bypasses the narrative of “we’re discovering something profound here for the first time” when nothing particularly groundbreaking is happening.
Likewise, when I talk about AI alignment with LLMs, I have enough familiarity with the field, and enough developed ideas of my own, that recursion-slop is just not satisfying, and neither is praise about the importance of whatever idea, which I know is cheap.
I don’t think there is anything categorically different about the epistemic pitfalls of developing ideas in interaction with LLMs compared to developing ideas with other humans or alone; LLMs just make some kinds of traps more accessible to people who are vulnerable to them. In general, if someone becomes convinced that they have a groundbreaking solution to alignment or a grand unified theory of consciousness or physics through a process that involves only talking to a friend, without other feedback loops with reality, they are probably fooling themselves.