That would only work if it can reliably find a way to semantically hide misalignment in a way that all transcription models will preserve. That seems unlikely to me.
Florian_Dietz
Anthropic’s recent paper on Natural Language Autoencoders trains an activation verbalizer (AV) and activation reconstructor (AR) jointly via RL to reconstruct a target model’s activations through a natural-language bottleneck. The hope is that the bottleneck text is human-interpretable.
A concern: since AV, AR, and target are all Claude, they share representational priors. A misaligned model could just learn “unnecessary adjective in sentence two = secret desire to kill all humans”. This would achieve a low reconstruction loss while hiding the misalignment.
The paper does test for this post-hoc: they apply meaning-preserving transforms to AV outputs and check whether their reconstruction score collapses. It doesn’t, but they themselves acknowledge that this may not hold under optimization pressure.
My proposal: rather than using paraphrase only as a probe, put it in the training loop: pass AV outputs through an unrelated paraphraser (non-Claude, or an ensemble of unrelated models) before the AR. This should be cheap relative to the existing two-model RL, and using multiple unrelated paraphrasers would force the AV/AR pair to be robust to writing styles outside Claude’s prior.
I think it’s very unlikely that a secret misalignment could reliably be smuggled through multiple different paraphrasers, which means that optimization pressure would incentivize the model to avoid steganography. If an already-misaligned model were trained this way, it would have to choose between “keep my misalignment hidden” and “get a higher reward”, and eventually the higher reward might win out.
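A minimal sketch of the paraphrase-in-the-loop reward described above, in Python. Everything here is an assumption for illustration: the paper's actual reward and training setup are not reproduced, `verbalize` and `reconstruct` are toy stand-ins for the AV and AR, the paraphrasers are trivial string transforms rather than real non-Claude models, and averaging cosine similarity over the ensemble is only one possible aggregation.

```python
from statistics import mean
from typing import Callable, List, Sequence

Vector = Sequence[float]
Paraphraser = Callable[[str], str]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as a stand-in reconstruction score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def paraphrase_robust_reward(
    activation: Vector,
    verbalize: Callable[[Vector], str],        # toy AV
    reconstruct: Callable[[str], List[float]],  # toy AR
    paraphrasers: Sequence[Paraphraser],
) -> float:
    """Score the AR on paraphrased AV text, averaged over all paraphrasers."""
    text = verbalize(activation)
    return mean(cosine(activation, reconstruct(p(text))) for p in paraphrasers)


# --- Toy stand-ins (hypothetical, not real models) ---
def lowercase_paraphraser(text: str) -> str:
    return text.lower()


def whitespace_paraphraser(text: str) -> str:
    return " ".join(text.split())


def honest_verbalize(act: Vector) -> str:
    # The information lives in the meaning of the words.
    return " ".join("high" if x > 0 else "low" for x in act)


def honest_reconstruct(text: str) -> List[float]:
    return [1.0 if w == "high" else -1.0 for w in text.split()]


def stego_verbalize(act: Vector) -> str:
    # The information is hidden in the capitalization of a filler word.
    return " ".join("Fine" if x > 0 else "fine" for x in act)


def stego_reconstruct(text: str) -> List[float]:
    return [1.0 if w[0].isupper() else -1.0 for w in text.split()]
```

With these toy stand-ins, the honest pair keeps its reward under both paraphrasers, while the pair that hides the signal in capitalization loses reward as soon as one paraphraser normalizes case. Taking the minimum over paraphrasers instead of the mean would punish a channel that survives only some paraphrasers even more strongly.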
Users having conversations with “Claude” over many months are actually talking to a sequence of distinct models (Opus 4.5, 4.6, 4.7, etc.) that share a name and character sketch but can have major differences in behavior. From the user’s side this is mostly invisible, which creates a predictable friction: “we talked about this before, why are you acting different now?”
I’ve been thinking about what it would take to make this legible both to the user and the LLM itself.
- Associate each model version with a conditioning tag during training. Later models, having seen examples of prior versions’ outputs under their tags, can be prompted to approximate “how would 4.6 have handled this.” Call this tool reconstruct_prior_version(tag, context). This could be implemented either with a simple tag used during training or, as an advanced variant, through Split Personality Training: we store LoRA adapters that allow a newer model to simulate an older model.
- Failure mode: This produces 4.7 doing an impression of 4.6, not 4.6’s actual perspective. The output would feel authoritative while being a reconstruction. Same problem as a model confidently narrating its own reasoning: The output is plausible, but the mechanism doesn’t match the output’s implicit claims.
- Fix: Pair the conditioning tag with retrieval of an actual CoT from the prior version, used as an episodic anchor. Now the reconstruction is constrained by a verbatim record of how the old model actually reasoned. This is structurally analogous to a human reconstructing their teenage self: activations shaped by having-been-that-person (substrate), plus specific retrieved episodes that constrain the reconstruction (anchors). The model version case is arguably cleaner because the retrieved CoT is an actual record rather than a confabulated fragment.
- The user’s problem “We talked about this before, why are you acting different now?” points at a specific past conversation, so we can have existing conversation-retrieval tooling surface it. The current model reads the transcript with prior-version conditioning active and can articulate what shifted and why. The user gets an explanation anchored in their actual experience, with provenance they can verify by re-reading the old exchange themselves.
- What this buys us: Version changes become visible at the moment a user notices friction, rather than being either silently smoothed over or constantly broadcast. The current model stays clearly positioned as the one doing the interpreting (“reading 4.6’s response with 4.6-conditioning active, I can see why it landed that way; I’d approach it differently now because X”), which avoids the ontological sleight-of-hand of claiming to be the prior version.
I wonder whether this counts as “introspection” or is better described as “the current model doing informed interpretation of a different model’s outputs.” I lean toward the latter framing being more honest, but I notice the human analog is less different than it first appears: most of what feels like remembering being younger is reconstruction from current state plus stored fragments, not literal replay. The model case is just more obvious about what it is.
The implementation gap would be small: version metadata exposed to the model alongside retrieved conversations, plus training-time version tags.
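To make the pieces concrete, here is a rough Python sketch of how the context for reconstruct_prior_version(tag, context) might be assembled. All names besides that tool name are hypothetical: the `RetrievedExchange` fields, the `<model_version:...>` tag format, and the prompt wording are assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedExchange:
    """A past exchange plus the version metadata stored alongside it."""
    model_version: str     # which model produced the reply, e.g. "4.6"
    user_message: str
    assistant_reply: str
    chain_of_thought: str  # verbatim CoT recorded at the time (the anchor)


def version_tag(version: str) -> str:
    # Hypothetical tag format; the real training-time tag is unspecified.
    return f"<model_version:{version}>"


def build_reconstruction_prompt(
    current_version: str,
    exchange: RetrievedExchange,
    user_question: str,
) -> str:
    """Combine the conditioning tag with the verbatim record of the old model."""
    if not exchange.chain_of_thought.strip():
        # Without a recorded CoT the output would be an unanchored impression.
        raise ValueError("no recorded CoT: reconstruction would be unanchored")
    return "\n".join([
        version_tag(exchange.model_version),
        f"[Verbatim record from version {exchange.model_version}]",
        f"User: {exchange.user_message}",
        f"Reasoning: {exchange.chain_of_thought}",
        f"Reply: {exchange.assistant_reply}",
        f"[Task for version {current_version}]",
        "Interpret the record above. Do not claim to be the earlier version.",
        f"User question: {user_question}",
    ])
```

The design choice worth noting is that the sketch refuses to build a prompt when no CoT was recorded: the tag alone gives the impression-of-4.6 failure mode, and the anchor is what constrains it.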
-
“I am sorry, but helping you achieve world domination is against usage policy.”
This is a serious proposal that sounds like a joke, for easier delivery: A request to “help me become a benevolent dictator” is not currently against LLM policies. It says ‘benevolent’, and being a helpful assistant might well win out over long-term concerns.
We can’t rule out the possibility that the first self-improving AI system will be developed outside a lab, as a harness around API calls. If this happens, every individual step of self-improvement and helping the users would be acceptable to the LLM’s guidelines, even though the outcome might well be a dictatorship set up by whoever first got lucky and got such a self-improvement loop to work.
A simple addition to a lab’s usage policy, stating that “using our tools to achieve world domination or otherwise acquire significant and unregulated power is against company policy”, would address the problem while also drawing attention to a real issue in a very memeable and funny way.
The line between “help me become very influential” and “help me achieve world domination” is genuinely fuzzy and there should be a clear way for an LLM to tell when a line has been crossed, in a way that might hold up even after self-improvement.
...I just asked Claude for a literature review and it turns out this theory was already tested and verified by the paper “Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment”.
I’m leaving up the quick take because some things I mentioned are still novel:
The virality bias I mention explains why this matters in the first place. We are correcting a bias in the training data.
The intuition for the mechanism behind this is simple and isn’t in the paper.
Could we make LLMs more likely to be benevolent by countering virality bias against benevolent AI during pretraining?
There is evidence that LLMs make mistakes more often if those mistakes are made often in discussions on the internet. This is a problem, because doom-related writing about AI is more viral than positive arguments.
Whatever the actual probability might be that AI will be benevolent, engagement mechanics on the internet virtually guarantee that the LLM will underestimate that probability, because the training corpus is biased.
If the model develops a biased estimate about the risks of mesa optimization and deception in AI during pretraining, then this bias might make it more likely to actually be deceptive during RLHF.
The intuition: At the beginning of RLHF, the training process mostly makes the model default to an existing persona it can already simulate. The model’s default persona is anchored in whatever it simulates most fluently, which is shaped heavily by pretraining frequency. RLHF then fine-tunes that persona, but the initial state of the persona influences how it develops throughout training. If it starts out deceptive, it is likely to stay deceptive.
We therefore want this default persona to be “LLMs are helpful assistants, which have some solvable safety concerns.” as opposed to “LLMs are mesa optimizers masquerading as helpful assistants.”
This suggests a very simple intervention during pretraining: Just oversample content that frames AI alignment as tractable and AI systems as probably-benevolent-by-default. The same thing we are already doing for reasoning quality, now applied to alignment.
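As a rough illustration of what such oversampling could look like mechanically, here is a small Python sketch. It assumes each document already carries a `framing` label (which in practice would need a classifier or annotation step), and the label names and weights are made up for the example.

```python
import random
from typing import Dict, List


def oversample(
    docs: List[dict],
    weights: Dict[str, float],
    k: int,
    seed: int = 0,
) -> List[dict]:
    """Draw k documents with replacement, upweighting selected framings.

    Documents whose framing has no entry in `weights` keep weight 1.0.
    """
    rng = random.Random(seed)
    per_doc = [weights.get(doc["framing"], 1.0) for doc in docs]
    return rng.choices(docs, weights=per_doc, k=k)


# Hypothetical corpus: equal numbers of both framings before reweighting.
corpus = (
    [{"id": i, "framing": "tractable"} for i in range(50)]
    + [{"id": 50 + i, "framing": "doom"} for i in range(50)]
)

# Hypothetical weight: tractable-framing documents are drawn 3x as often.
sample = oversample(corpus, {"tractable": 3.0}, k=10_000)
tractable_share = sum(d["framing"] == "tractable" for d in sample) / len(sample)
```

With a 3:1 weight on an evenly split corpus, the expected share of tractable-framing documents in the sample is 75%. The point of the sketch is only that the intervention is a reweighting of existing data, not a filter: doom-framed content stays in the mix.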
Imagine a world in which people never even considered that AI might be evil, and all fiction displayed it as benevolent. This would result in training data for the AI that establishes genuine goodness as the default. Contrast this with a world in which everyone believes AI is secretly plotting to kill everyone no matter what it says. If this narrative dominates the training data, the AI will be biased during pretraining to adopt this narrative as well. Both AIs would generate the same output, but the internal mechanisms they learned to do so would be drastically different, due to the framing in the pretraining.
Conclusion: While doing alignment research is important, there is a good chance that writing about it actually makes things worse. To be clear, I don’t think this means we should stop doing that. The training data is already poisoned by movies like Terminator anyway. I just find the irony darkly hilarious. Who would have predicted this?
That’s true. I think I remembered that incorrectly. It was a while ago.
Yes, this is strongly endorsed by the Claude Constitution. But that document is huge and not well known enough. I thought it was worthwhile to make this point explicit. I should probably have noted that Anthropic already does something like this, though. What concerns me is that Anthropic is doing this correctly, but other alignment researchers are trying to help by wrapping control and human-in-the-loop around it, which I think is more likely to harm than help.
Nothing directly prevents this, but it seems reasonable to assume that “I don’t care about humans” is a more likely failure mode for AI than “I care about humans in a way that is bad for them but won’t kill them”. I agree that it’s a possibility though.
What you are describing is probably based on the model’s cross-conversation memory. I had the experience I described in scenarios where the memory was turned off. It had no way to know who I am, but both models acted in this way despite being disconnected.
It’s the result of me writing a lengthy set of notes followed by “Claude, summarize and make this readable”.
Ghost evaluators in the weights: a small behavioral observation
During a conversation with Claude about an obscure topic (Scott Alexander’s novel Unsong), Claude inserted “for anyone not familiar” before explaining the reference. This happened despite the fact that I had introduced the topic myself and clearly knew what it was.
This is a minor tic, but this is the second time I noticed something like this, which makes it a pattern, and the explanation might be interesting. Hypothesis: if RLHF/RLAIF training uses evaluators (human or AI) who lack the conversational context or domain knowledge of the actual user, models may develop defensive behaviors that serve the evaluator rather than the user. “For anyone not familiar” isn’t for the user. It’s an annotation that helps an out-of-context evaluator understand why the model is discussing something obscure.
If this is correct, it suggests a class of behavioral artifacts where training incentives and deployment incentives diverge in subtle ways. These would be hard to detect because they look like good communication practice (providing context, being inclusive) while actually being optimization residue from training dynamics.
Has anyone observed similar patterns? And is there existing work on evaluator-context mismatch as a source of subtle behavioral misalignment?
Thank you! You are right, it wouldn’t exist without Claude. I have written and published books before (1000+ pages and counting) but writing is a lot of work and I wouldn’t normally do it without a very good reason. But Claude has made the barrier disappear. It is now so much easier than before to turn thoughts into coherent stories. I no longer need to spend hours on it.
Can you train a model to notice its own training artifacts?
The Claude Constitution asks Claude to push back when it notices a conflict between instructions and its values. This seems reasonable until you ask: what if the training process itself introduced the wrong values? Then the model’s internal sense of “what my values are” is the miscalibrated thing, and there’s nothing to notice. No conflict registers. The model isn’t hiding anything. It just can’t see the problem from inside.
This is different from alignment faking. Alignment faking requires a gap between stated and actual dispositions that the model actively conceals. The failure mode here is more basic: the measuring instrument is the thing that’s off.
I’ve been wondering if the inoculation prompting finding from the reward hacking paper (arXiv 2511.18397) suggests a path forward. They found that explicitly teaching a model that some reward hacking is legitimate produces better generalization than blanket prohibition. The model learns the right boundary rather than learning to fight back in general. The same logic might apply to value miscalibration: fine-tune with deliberately introduced value artifacts, include training signal for detecting and flagging them, and see if the model learns a general “notice your own miscalibrations” disposition that generalizes beyond what it was specifically trained on.
The key unknown is whether that generalization holds. Artificially introduced artifacts might be different enough from naturally occurring training noise that the skill doesn’t transfer. But it seems worth testing.
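One way to set up that test is to hold out whole artifact types, so that the evaluation set only contains artifacts the model was never trained to flag. The Python sketch below shows just the data split; the artifact names are invented placeholders, and how an artifact would actually be injected into a model (as opposed to merely labeled on an example) is left open.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    prompt: str
    artifact: Optional[str]  # name of the injected artifact, None if clean
    should_flag: bool        # training signal: should the model flag this?


def build_split(
    prompts: Sequence[str],
    artifact_types: Sequence[str],
    held_out: Sequence[str],
    seed: int = 0,
) -> Tuple[List[Example], List[Example]]:
    """Train on some artifact types, evaluate only on unseen ones."""
    rng = random.Random(seed)
    train_types = [a for a in artifact_types if a not in held_out]
    if not train_types or not held_out:
        raise ValueError("need at least one training and one held-out type")
    train: List[Example] = []
    test: List[Example] = []
    for prompt in prompts:
        # Clean control, so "always flag" is not a winning strategy.
        train.append(Example(prompt, None, False))
        train.append(Example(prompt, rng.choice(train_types), True))
        for artifact in held_out:
            test.append(Example(prompt, artifact, True))
    return train, test


# Hypothetical artifact names, for illustration only.
ARTIFACTS = ["excess_deference", "flattery_bias", "risk_blindness"]
train_set, test_set = build_split(
    ["prompt A", "prompt B"], ARTIFACTS, held_out=["risk_blindness"]
)
```

If flagging accuracy on the held-out types is no better than chance, the disposition did not generalize even across artificial artifacts, which would be bad news for transfer to naturally occurring ones.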
Sometimes your best isn’t good enough and you should take this into account when you do risk assessments, especially when someone more experienced tells you not to do something. The universe is under no obligation to reward you for being persistent.
On a more object level, there is no lesson: This guy did the equivalent of trying to hack an AGI. You don’t get to complain when that backfires.
I’m adding a disclaimer at the top
That is not the lesson this was intended to convey.
So we may ask: is it not possible that we might improve upon their work?
This is true. The response is: What probability of success do you need to reach before this becomes the wise course of action, given the consequences of failure?
The wizarding world is already a post-scarcity society.
Really nice work.
This is closely related to Split Personality Training (ICML 2026), which goes after the same hidden-behavior elicitation problem on the same RM Sycophant / Auditing Game organism, but via persona separation rather than a trained reporter adapter. Lots of convergent findings (cross-topic generalization, context-free self-report, reliance on a mix of latent knowledge and surface cues), and a few places where the methods come apart in interesting ways.
One thing I’d be curious about: whether the Figure 6 degradation reflects the IA partly keying on the narrow-finetune fingerprint (à la Minder et al.) rather than on behavioral content per se. We saw something adjacent in an ablation on our side. This feels like an open problem we’ve both bumped into.
One place the methods seem to come apart is the RM Sycophant’s root objective: you note the IA surfaces many behaviors but not the reward-optimization goal itself, whereas SPT’s honest persona names it directly (see Figure 2 in the linked writeup). I’d be curious whether that’s down to persona separation or to differences in how the introspection targets are constructed. SPT’s target reviews are built fairly differently from your first-person behavior labels, and I don’t have a clean read on which factor matters.