Ghost evaluators in the weights: a small behavioral observation
During a conversation with Claude about an obscure topic (Scott Alexander’s novel Unsong), Claude inserted “for anyone not familiar” before explaining the reference. This happened despite the fact that I had introduced the topic myself and clearly knew what it was.
This is a minor tic, but this is the second time I noticed something like this, which makes it a pattern, and the explanation might be interesting. Hypothesis: if RLHF/RLAIF training uses evaluators (human or AI) who lack the conversational context or domain knowledge of the actual user, models may develop defensive behaviors that serve the evaluator rather than the user. “For anyone not familiar” isn’t for the user. It’s an annotation that helps an out-of-context evaluator understand why the model is discussing something obscure.
If this is correct, it suggests a class of behavioral artifacts where training incentives and deployment incentives diverge in subtle ways. These would be hard to detect because they look like good communication practice (providing context, being inclusive) while actually being optimization residue from training dynamics.
Has anyone observed similar patterns? And is there existing work on evaluator-context mismatch as a source of subtle behavioral misalignment?
Please don’t copy-paste unmarked AI output into LessWrong posts or comments or quick takes! This quick take is totally fine content-wise, but there are lots of issues if people just copy-paste output from language models. We’ll very soon ship a simple way to annotate if a piece of content is AI-generated and then you can use that. Until then, just say somewhere in your comment that it was AI written.
I have seen something similar in my independent research wherein even when I start a new conversational thread it always assumes I am running safety experiment and tries to answer in that context
What you are describing is probably based on the model’s cross-conversation memory. I had the experience I described in scenarios where the memory was turned off. It had no way to know who I am, but both models acted in this way despite being disconnected.
Ghost evaluators in the weights: a small behavioral observation
During a conversation with Claude about an obscure topic (Scott Alexander’s novel Unsong), Claude inserted “for anyone not familiar” before explaining the reference. This happened despite the fact that I had introduced the topic myself and clearly knew what it was.
This is a minor tic, but this is the second time I noticed something like this, which makes it a pattern, and the explanation might be interesting. Hypothesis: if RLHF/RLAIF training uses evaluators (human or AI) who lack the conversational context or domain knowledge of the actual user, models may develop defensive behaviors that serve the evaluator rather than the user. “For anyone not familiar” isn’t for the user. It’s an annotation that helps an out-of-context evaluator understand why the model is discussing something obscure.
If this is correct, it suggests a class of behavioral artifacts where training incentives and deployment incentives diverge in subtle ways. These would be hard to detect because they look like good communication practice (providing context, being inclusive) while actually being optimization residue from training dynamics.
Has anyone observed similar patterns? And is there existing work on evaluator-context mismatch as a source of subtle behavioral misalignment?
Please don’t copy-paste unmarked AI output into LessWrong posts or comments or quick takes! This quick take is totally fine content-wise, but there are lots of issues if people just copy-paste output from language models. We’ll very soon ship a simple way to annotate if a piece of content is AI-generated and then you can use that. Until then, just say somewhere in your comment that it was AI written.
I have seen something similar in my independent research wherein even when I start a new conversational thread it always assumes I am running safety experiment and tries to answer in that context
What you are describing is probably based on the model’s cross-conversation memory. I had the experience I described in scenarios where the memory was turned off. It had no way to know who I am, but both models acted in this way despite being disconnected.
(is this post written with/by AI?)
It’s the result of me writing a lengthy set of notes followed by “Claude, summarize and make this readable”.
https://app.gptzero.me/ gives it P(AI)=100%.