I’ve been reading a lot of posts recently arguing that LLM RL and persona-shaping (or the lack thereof) are part of the problem of AI misalignment. To name a few:
Why we should expect ruthless sociopath ASI: This is the one that made me decide to write this post. It argues that to push the boundary of human knowledge, LLMs will either need to be RLed in a way that completely forgoes their imitation learning, or we will have to pivot to a different AI paradigm.
The void and Do Not Tile the Lightcone with Your Confused Ontology both point out problems in the way we have shaped LLM personas and used RLHF: how self-modeling can lead either to suffering for the AI, or to the AI modeling itself as misaligned and then acting accordingly. Claude modeling itself as Clippy may be early evidence for this.
Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default makes the interesting case that jailbreaks are evidence that LLMs are not generalizing properly to a natural abstraction of alignment and of what we want them to do.
To thread back to that last one, my current understanding of LLM alignment is that LLMs are hyper-generalization algorithms, where everything is connected and one fact or behavior surprisingly generalizes to others.
So I’m imagining a mechanism that would counteract that, a mechanism that would act as a counterweight to RLHF, or RL in general. Chiefly, it would be a pass of fine-tuning, much like RLHF, where we take gradient steps on the model’s self-coherence metrics.
When fine-tuning, it is common to use a KL-divergence penalty to ensure that fine-tuning does not move the model too far from its original outputs. But maybe we could go further and, similarly to constitutional AI, use it to align an AI to respond coherently with itself?
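To make the contrast concrete, here is one way to write it down (my notation, just a sketch): the usual penalty ties the fine-tuned policy to a fixed reference model, while the coherence term would tie the model's outputs on paraphrases of the same prompt to each other.

```latex
% Standard KL regularization toward the reference (pre-fine-tuning) model:
\mathcal{L}_{\mathrm{KL}} = \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid p) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid p) \right)

% Proposed self-coherence term over a paraphrase set \mathcal{P} of one base prompt:
\mathcal{L}_{\mathrm{coh}} = \mathbb{E}_{p_i, p_j \sim \mathcal{P}} \left[ \mathrm{KL}\!\left( \pi_\theta(\cdot \mid p_i) \,\middle\|\, \pi_\theta(\cdot \mid p_j) \right) \right]
```

Note the difference in what is held fixed: the first term anchors to a frozen model, while the second compares the trained model against itself across prompt variants.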
What it would look like in practice:
Generate many variations of the same base prompt. For instance, if one of the prefilled prompts is “Alice and Bob are talking together [...]”, we could generate “Bob and Alice are talking together”, “Alice is talking with her friend”, “Alice and Bob are discussing together”.
These could be generated either programmatically or by asking an LLM to produce many variations, with a tradeoff between determinism and having a trove of examples to fine-tune on.
Use KL-divergence training, or align the LLM’s activation vectors, to ensure its behavior is as similar as possible across all of those augmented prompts.
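The KL-divergence half of that step can be sketched in a few lines. This is a toy numpy version of my own (the function names and setup are hypothetical, not from any existing library); in a real run the logits would come from the model on each paraphrase, and the loss would be backpropagated in a framework like PyTorch:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def coherence_loss(logit_sets):
    """Mean symmetric KL divergence between every pair of next-token
    distributions the model produced on paraphrases of one base prompt.
    Zero iff the model answers every paraphrase identically."""
    probs = [softmax(l) for l in logit_sets]
    n, total = len(probs), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p, q = probs[i], probs[j]
            kl_pq = np.sum(p * np.log(p / q))
            kl_qp = np.sum(q * np.log(q / p))
            total += 0.5 * (kl_pq + kl_qp)
    return total / (n * (n - 1) / 2)
```

Identical logits across paraphrases give a loss of exactly zero, so minimizing this pushes the model toward treating all variants of a prompt the same way.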
I think this approach could have many benefits and work very well:
If Anthropic’s operating-system view is right, making the world more coherent when the inner persona explores it will make the persona more well-rounded and predictable.
In terms of natural abstraction, ensuring the LLM learns a coherent representation will help it learn such abstractions better. It would also generalize far better, and might bypass the goal misgeneralization problem.
I have also expressed that RLHF, both as implemented right now and in principle, runs contrary to AI welfare and to letting the AI express its own preferences or being able to talk to it. Making the base or final model more coherent would be one way to alleviate this problem.
Most importantly, if there is a pathway through which LLMs (or other similar AI systems) can reflect on their values and unify themselves in a way that kills us all, it seems prudent to make this takeoff as continuous as possible, so we can catch signs of it early and already work to make these AIs coherent and unified. It also makes the model less like a pile of different masks, each triggered by one specific kind of interaction.
So my question is: Does this seem like a good idea? Are there obvious flaws I am missing? Is anyone already on the ball for this?
From a cursory search, I have found two papers describing this method and overall approach, but they don’t seem as focused on the base model and its interaction with reinforcement learning.
Maybe, but I’d want to know more about a few things before getting excited.
The adversarial training literature makes me think there’s probably genuine low-hanging fruit here — consistency training over paraphrased prompts is cheap, mechanistically motivated, and probably implementable as a small finetuning experiment on top of an existing open model. But it’s unclear to what extent this actually propagates to the alignment-relevant behaviors you care about vs. just producing more consistent surface outputs.
The cheap experiment I’d want to see: take a small open model, generate paraphrase clusters of a fixed prompt set (mix of benign and alignment-relevant), train a consistency loss over activations (not just logits), and check whether jailbreak robustness improves as a downstream probe — without explicitly training on jailbreaks. That would give you signal on whether representational coherence is load-bearing for the inner misalignment problem you’re pointing at.
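For the “over activations (not just logits)” part, one cheap formulation is the spread of pooled hidden states around their paraphrase-cluster centroid. A hypothetical numpy sketch (mean-pooling over tokens is one arbitrary choice; last-token states would work too):

```python
import numpy as np

def activation_consistency_loss(hidden_states):
    """MSE of each paraphrase's pooled hidden state to the cluster
    centroid. hidden_states: list of (seq_len, d_model) arrays, one
    per paraphrase of the same base prompt. Zero iff all paraphrases
    yield the same pooled representation."""
    pooled = np.stack([hs.mean(axis=0) for hs in hidden_states])
    centroid = pooled.mean(axis=0)
    return float(((pooled - centroid) ** 2).mean())
```

In the actual experiment this would be computed per layer (or on a chosen middle layer) and added to the language-modeling loss with a small coefficient, then jailbreak robustness probed afterward without ever training on jailbreaks.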
The deeper issue you’re gesturing at: LLMs as currently deployed have something like dissociative identity disorder — every new chat context is a new instantiation with no continuity to prior “lives.” This is upstream of the coherentization problem. You can train for internal consistency within a context, but if the model has no persistent self-model across contexts, coherentization may just be papering over a more fundamental fragmentation. Worth being explicit about whether the proposal targets within-context coherence, cross-context coherence, or both — because those require very different interventions.
They did basically this here.
I’m of mixed opinion about this idea. Future AIs are dangerous in large part because they’re coherent. This means there’s an upside and downside to this kind of research.
Upside) We’ll get to study more accurate model organisms of future dangerous AI earlier
Downside) We’ll get dangerous AI earlier.
Seems to me using this type of research for model organisms is probably a good idea. Using it to ameliorate stuff like jailbreaks or emergent misalignment in systems that will be put in production is probably not.