Maybe, but I’d want to know more about a few things before getting excited.
The adversarial training literature makes me think there’s probably genuine low-hanging fruit here — consistency training over paraphrased prompts is cheap, mechanistically motivated, and probably implementable as a small finetuning experiment on top of an existing open model. But it’s unclear to what extent this actually propagates to the alignment-relevant behaviors you care about vs. just producing more consistent surface outputs.
The cheap experiment I’d want to see: take a small open model, generate paraphrase clusters of a fixed prompt set (mix of benign and alignment-relevant), train a consistency loss over activations (not just logits), and check whether jailbreak robustness improves as a downstream probe — without explicitly training on jailbreaks. That would give you signal on whether representational coherence is load-bearing for the inner misalignment problem you’re pointing at.
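To make the loss concrete, here's a minimal numpy sketch of what "consistency over activations" could mean for one paraphrase cluster. This is an assumption-laden illustration, not the proposal itself: the model, the paraphrase generation, and the training loop are all omitted, and squared distance to the cluster centroid is just one reasonable choice of loss form.

```python
# Sketch of an activation-consistency loss over a paraphrase cluster.
# Assumes you can extract one hidden-state vector per prompt (e.g. the
# final-layer residual stream at the last token); how you get those
# vectors is model-specific and not shown here.
import numpy as np

def consistency_loss(cluster_activations):
    """Mean squared distance of each paraphrase's activation to the
    cluster centroid — penalizes representational drift across
    paraphrases of the same underlying prompt."""
    acts = np.asarray(cluster_activations, dtype=float)  # (n_paraphrases, d)
    centroid = acts.mean(axis=0)
    return float(((acts - centroid) ** 2).sum(axis=1).mean())

# Identical activations give zero loss; divergent ones a positive loss.
tight = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
loose = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
assert consistency_loss(tight) == 0.0
assert consistency_loss(loose) > 0.0
```

In an actual finetuning run you'd compute this per cluster per batch and add it (weighted) to the language-modeling loss; the jailbreak-robustness probe then stays strictly held out, so any improvement is evidence the representational coherence itself is doing the work.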
The deeper issue you’re gesturing at: LLMs as currently deployed have something like dissociative identity disorder — every new chat context is a new instantiation with no continuity to prior “lives.” This is upstream of the coherentization problem. You can train for internal consistency within a context, but if the model has no persistent self-model across contexts, coherentization may just be papering over a more fundamental fragmentation. Worth being explicit about whether the proposal targets within-context coherence, cross-context coherence, or both — because those require very different interventions.