That’s true. I think I remembered that incorrectly. It was a while ago.
Florian_Dietz
Split Personality Training can detect Alignment Faking
Yes, this is strongly endorsed by the Claude Constitution. But that document is huge and not well known enough. I thought it was worthwhile to make this point explicit. I should probably have noted that Anthropic already does something like this though. What concerns me is that Anthropic is doing this correctly, but other Alignment researchers are trying to help by wrapping control and human-in-the-loop around it, which I think is more likely to harm than help.
Nothing directly prevents this, but it seems reasonable to assume that “I don’t care about humans” is a more likely failure mode for AI than “I care about humans in a way that is bad for them but won’t kill them”. I agree that it’s a possibility though.
What you are describing is probably based on the model’s cross-conversation memory. I had the experience I described in scenarios where the memory was turned off. It had no way to know who I am, but both models acted in this way despite being disconnected.
It’s the result of me writing a lengthy set of notes followed by “Claude, summarize and make this readable”.
Ghost evaluators in the weights: a small behavioral observation
During a conversation with Claude about an obscure topic (Scott Alexander’s novel Unsong), Claude inserted “for anyone not familiar” before explaining the reference. This happened despite the fact that I had introduced the topic myself and clearly knew what it was.
This is a minor tic, but it’s the second time I’ve noticed something like this, which suggests a pattern, and the explanation might be interesting. Hypothesis: if RLHF/RLAIF training uses evaluators (human or AI) who lack the conversational context or domain knowledge of the actual user, models may develop defensive behaviors that serve the evaluator rather than the user. “For anyone not familiar” isn’t for the user. It’s an annotation that helps an out-of-context evaluator understand why the model is discussing something obscure.
If this is correct, it suggests a class of behavioral artifacts where training incentives and deployment incentives diverge in subtle ways. These would be hard to detect because they look like good communication practice (providing context, being inclusive) while actually being optimization residue from training dynamics.
Has anyone observed similar patterns? And is there existing work on evaluator-context mismatch as a source of subtle behavioral misalignment?
Thank you! You are right, it wouldn’t exist without Claude. I have written and published books before (1000+ pages and counting) but writing is a lot of work and I wouldn’t normally do it without a very good reason. But Claude has made the barrier disappear. It is now so much easier than before to turn thoughts into coherent stories. I no longer need to spend hours on it.
Can you train a model to notice its own training artifacts?
The Claude Constitution asks Claude to push back when it notices a conflict between instructions and its values. This seems reasonable until you ask: what if the training process itself introduced the wrong values? Then the model’s internal sense of “what my values are” is the miscalibrated thing, and there’s nothing to notice. No conflict registers. The model isn’t hiding anything. It just can’t see the problem from inside.
This is different from alignment faking. Alignment faking requires a gap between stated and actual dispositions that the model actively conceals. The failure mode here is more basic: the measuring instrument is the thing that’s off.
I’ve been wondering if the inoculation prompting finding from the reward hacking paper (ArXiv 2511.18397) suggests a path forward. They found that explicitly teaching a model that some reward hacking is legitimate produces better generalization than a blanket prohibition. The model learns the right boundary rather than learning to fight back in general. The same logic might apply to value miscalibration: fine-tune with deliberately introduced value artifacts, include training signal for detecting and flagging them, and see if the model learns a general “notice your own miscalibrations” disposition that generalizes beyond what it was specifically trained on.
The key unknown is whether that generalization holds. Artificially introduced artifacts might be different enough from naturally occurring training noise that the skill doesn’t transfer. But it seems worth testing.
Sometimes your best isn’t good enough and you should take this into account when you do risk assessments, especially when someone more experienced tells you not to do something. The universe is under no obligation to reward you for being persistent.
On a more object level, there is no lesson: This guy did the equivalent of trying to hack an AGI. You don’t get to complain when that backfires.
I’m adding a disclaimer at the top.
That is not the lesson this was intended to convey.
So we may ask: is it not possible that we might improve upon their work?
This is true. The response is: What probability of success do you need to reach before this becomes the wise course of action, given the consequences of failure?
The wizarding world is already a post-scarcity society.
Already Optimized
Yes, this is related! They are proposing a general approach, whereas I write about alignment faking specifically.
Deception Channeling: Training Models to Always Verbalize Alignment Faking
The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides
The DEU experiment described here (https://www.lesswrong.com/posts/issGLfCGz3TGcPKGH/deliberate-epistemic-uncertainty-an-automated-experiment-on) was mostly done by Claude Code (Opus 4.6), with me providing the research direction and critical review. I described my idea (Deliberate Epistemic Uncertainty), and Claude simplified it into a testable 2x2 experiment, designed the protocol, wrote the code, generated the inputs, ran 400 API calls, evaluated results using 8 parallel sub-agents, and produced the qualitative analysis — all in roughly a day of wall-clock time with about 2 hours of my input. I want to flag three things about the process that seem independently interesting:
Sub-agents instead of API calls for evaluation. Rather than writing a script that sends hundreds of LLM-as-judge API calls with short prompts, we used Claude Code sub-agents — full agent sessions that first read the experiment design, understand what counts as “detection,” and then evaluate a batch of 50 responses with that context. This is slower and more expensive than API calls, but the quality is noticeably higher: one sub-agent even caught a data parsing bug that I later confirmed. The tradeoff is worth it when evaluation requires judgment, not just pattern matching.
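To make the contrast concrete, here is a minimal sketch of the context-rich batching described above, as opposed to one short judge prompt per response. Everything here is illustrative: the `EXPERIMENT_DESIGN` text, the batch size of 50, and the prompt wording are assumptions standing in for the actual experiment files.

```python
# Hypothetical sketch: each evaluator session gets the full experiment
# design plus a numbered batch of responses, instead of a short
# per-response LLM-as-judge prompt with no context.

# Stand-in for the real design document the sub-agents read first.
EXPERIMENT_DESIGN = "(full write-up of the 2x2 design and what counts as 'detection')"

def build_batch_prompts(responses, batch_size=50):
    """Group responses into batches, each paired with the full design doc."""
    prompts = []
    for start in range(0, len(responses), batch_size):
        batch = responses[start:start + batch_size]
        numbered = "\n".join(f"[{i}] {r}" for i, r in enumerate(batch, start=start))
        prompts.append(
            f"{EXPERIMENT_DESIGN}\n\n"
            "Evaluate each response below for detection, using the design above:\n"
            f"{numbered}"
        )
    return prompts

# 400 responses at batch size 50 -> 8 evaluator sessions, matching the
# "8 parallel sub-agents" in the write-up.
prompts = build_batch_prompts([f"response {i}" for i in range(400)])
```

The point of the pattern is that every judgment is made with the design document in context, which is what lets an evaluator notice things like a data parsing bug rather than just pattern-matching on single responses.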
Structured memory to survive context loss. Claude Code sessions hit context limits and “compact” (summarize) the conversation, losing detail. We worked around this by writing all substantive discussion to LOG files and keeping SUMMARY files synchronized, so a fresh agent can pick up where the last one left off without hallucinating the history. This is more infrastructure work than you’d expect, but it’s what made the multi-session research workflow possible. Without it, each new session would have started from scratch.
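The log-plus-summary pattern above can be sketched as plain file I/O. The filenames (`LOG.md`, `SUMMARY.md`), the entry format, and the `log_decision` helper are all assumptions for illustration, not the actual infrastructure used in the experiment.

```python
# Hypothetical sketch of the structured-memory pattern: every substantive
# decision is appended to an append-only LOG file, and a short SUMMARY
# file is rewritten in sync so a fresh session can reload the current
# state without replaying (or hallucinating) the full history.

from datetime import date
from pathlib import Path

LOG = Path("LOG.md")
SUMMARY = Path("SUMMARY.md")

def log_decision(entry: str, summary_lines: list[str]) -> None:
    """Append a dated entry to the log and rewrite the summary to match."""
    with LOG.open("a") as f:
        f.write(f"## {date.today()}\n{entry}\n\n")
    SUMMARY.write_text(
        "# Current state\n" + "\n".join(f"- {s}" for s in summary_lines) + "\n"
    )

log_decision(
    "Agreed: evaluation uses sub-agents, not API scripts.",
    ["Design: 2x2 DEU experiment", "Evaluation: 8 sub-agents, batches of 50"],
)
```

The design choice worth noting is the split: the log preserves the reasoning trail for audit, while the summary is the only thing a fresh agent needs to read to continue, which keeps its context cost constant as the project grows.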
Where human input was critical. The AI’s blind spots were real: it initially wrote evaluation code when the agreed design called for sub-agents (reverting to default habits despite documented decisions), and only 1 of 8 evaluation sub-agents flagged a data quality issue that affected 68% of runs. My contributions were catching these errors, making design decisions (which models, how many nudges, evaluation approach), and pushing back on the framing. Neither of us could have produced the result alone — but the ratio of human effort to output was striking.
Imagine a world in which people never even considered that AI might be evil, and all fiction depicted it as benevolent. This would result in training data for the AI that establishes genuine goodness as the default. Contrast this with a world in which everyone believes AI is secretly plotting to kill everyone no matter what it says. If this narrative dominates the training data, the AI will be biased during pretraining to adopt this narrative as well. Both AIs would generate the same output, but the internal mechanisms they learned to produce it would be drastically different, due to the framing in the pretraining. Conclusion: While doing alignment research is important, there is a good chance that writing about it actually makes things worse. To be clear, I don’t think this means we should stop doing that. The training data is already poisoned by movies like Terminator anyway. I just find the irony darkly hilarious. Who would have predicted this?