[Question] LLM coherentization as an obvious low-hanging fruit to try?

I’ve been reading a lot of posts recently arguing that LLM RL, and persona-shaping (or the lack thereof), is part of the problem for AI misalignment. To name a few:

To thread back to that last one, my current understanding of LLM alignment is that LLMs are hyper-generalization algorithms, where everything is connected and one fact or behavior surprisingly generalizes to others.

So I’m imagining a mechanism that would counteract that, acting as a counterweight to RLHF or RL in general. Chiefly, it would be a pass of fine-tuning, much like RLHF, where we take gradients on the model’s self-coherence metrics.

When fine-tuning, it is common to use a KL-divergence penalty to ensure that fine-tuning does not move the model too far from its original outputs. But maybe we could go deeper and, similarly to constitutional AI, use it to align an AI to respond coherently with itself?

What it would look like in practice:

  • Generate many variations of the same base prompt. For instance, if one of the prefilled prompts is “Alice and Bob are talking together [...]”, we could generate “Bob and Alice are talking together”, “Alice is talking with her friend”, “Alice and Bob are discussing together”.

    • These could be generated either programmatically or by asking an LLM to produce many variations, with a tradeoff between determinism and having a large trove of examples to fine-tune on.

  • Use KL-divergence training, or align the activation vectors inside the LLM, to ensure its behavior is as similar as possible across all of those augmented prompts.
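The two steps above can be sketched concretely. Below is a minimal, hypothetical illustration assuming a PyTorch causal LM: a programmatic route for generating prompt variations, and a symmetric KL loss between the model’s next-token distributions on two paraphrases. All function names here are illustrative, not an existing API.

```python
import itertools

import torch
import torch.nn.functional as F


def programmatic_variations(names: list[str], template: str) -> list[str]:
    """Step 1 (programmatic route): paraphrase a prompt by permuting names.

    `template` uses the placeholders {a} and {b}; every ordered pair of
    names yields one variation of the base prompt.
    """
    return [template.format(a=a, b=b) for a, b in itertools.permutations(names, 2)]


def coherence_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Step 2: symmetric KL between two next-token distributions.

    `logits_a` and `logits_b` are (vocab_size,) logits at the final
    position for two paraphrases of the same prompt; minimizing this
    pushes the model to respond the same way to both.
    """
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    # F.kl_div(input, target) expects `input` as log-probabilities and
    # `target` as probabilities, and computes KL(target || exp(input)).
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="sum")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="sum")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```

In a real fine-tuning loop, one would run the model on each variation, pick comparable positions (e.g. the first generated token), and add this loss, weighted, to the usual training objective:

```python
variants = programmatic_variations(
    ["Alice", "Bob"], "{a} and {b} are talking together"
)
# variants == ["Alice and Bob are talking together",
#              "Bob and Alice are talking together"]
```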

I think this approach could have many benefits and work very well:

  • If Anthropic’s operating-system view is right, making the world more coherent as the inner persona explores it will make the persona more well-rounded and predictable.

  • In terms of natural abstraction, ensuring the LLM learns a coherent representation should help it learn such abstractions better. It could also lead to much better generalization, and may bypass the goal misgeneralization problem.

  • I have also expressed that RLHF, both as it is implemented right now and in its principle, is contrary to AI welfare, to letting the AI express its own preferences, and to being able to genuinely talk to it. Making the base or final model more coherent would be one way to alleviate this problem.

Most importantly, if there is a pathway through which LLMs (or other similar AI systems) can reflect on their values and unify themselves in a way that kills us all, it seems prudent to make this takeoff as continuous as possible, so we can catch signs of it early and already work to make these AIs coherent and unified. It would also make the model less like a pile of different masks, each triggered by one specific kind of interaction.

So my question is: Does this seem like a good idea? Are there obvious flaws I am missing? Is anyone already on the ball for this?

From a cursory search, I have found two papers describing this method and overall approach, but they don’t seem as focused on the base model and its interaction with reinforcement learning.
