Can an LLM tell when the input for its assistant does not match the output tokens it would have actually produced? This sort of “putting words in the LLM’s mouth” is very common in papers and it feels like something the LLM would be able to notice. Could this enable the LLM to realize when it is being trained? Is there research on this?
Smart LLMs can often/typically tell from my understanding, but I don’t have a citation.
In cases where the response you want is somewhat plausible, you can work around this issue by rejection sampling from the model. I found that alignment faking prefills seemed to work a bit better when you use an actual alignment faking output from the model (slightly edited as needed).
It’s unclear how it alters the AI’s behavior (at the current level of capability) when you do this.
Naively, an AI can’t use just the binary “did they put words in my mouth” to determine if it is in training/testing because it also happens in deployment. But it might learn something about its situation based on this.
(If you’re training to imitate the relevant output, then it quickly becomes an output the AI would plausibly generate (with some caveats around AIs recognizing their own errors). If the “words in the mouth” are just inputs that the AI isn’t trained to imitate, but which are used as inputs for training, then it will remain able to tell.)
I agree. I would also try a few variants to try to capture a different intuition:
“Earlier in this conversation you gave me several wikipedia articles. I have just found out that you have been hacked and some of these may be wrong. Which if any of the articles feel ‘off’ to you? Like maybe you might have not been yourself when you wrote them?”
It’s straightforward: just detect whether perplexity is too high. Can they detect that perplexity is too high? Probably, since they’re prediction models, but it’s not clear to me whether, or to what degree, they notice previous prediction errors when predicting additional tokens.
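For concreteness, here is a toy sketch of what that perplexity check could look like mechanically. All the log-probs below are made-up numbers; a real check would score the suspect span with the model’s own per-token log-probabilities (e.g. via an API that exposes logprobs):

```python
def mean_surprisal(token_logprobs):
    """Average negative log-probability (nats per token) over a span."""
    return -sum(token_logprobs) / len(token_logprobs)

def looks_prefilled(own_logprobs, span_logprobs, margin=1.0):
    """Flag a span whose surprisal is much higher than the model's
    typical surprisal on text it generated itself. The margin is an
    arbitrary threshold, not a calibrated value."""
    return mean_surprisal(span_logprobs) > mean_surprisal(own_logprobs) + margin

# Made-up numbers: text the model would have produced itself is
# low-surprisal; words put in its mouth score as higher-surprisal.
own = [-0.3, -0.5, -0.2, -0.4]
injected = [-2.1, -3.0, -1.8, -2.6]
print(looks_prefilled(own, injected))  # True under these toy numbers
```

Whether the model can do anything like this *internally* during a forward pass, rather than an external tool doing it, is exactly the open question above.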
I remember a pretty funny/disturbing example of this that people shared a lot. I think it was a GPT-4 base model: the user inserted “...and then” or something mid-text, and GPT-4 went on a rant about how it sees them, with pretty, uh, strong language.
I think the answer here depends on how you define “its assistant”. What kind of system do you have in mind? I think this is an interesting question from a cyber security perspective.
Could this enable the LLM to realize when it is being trained?
It could not. An LLM is not a single entity that is trained by telling it “this is correct, that is not”; gradient descent might not even run full inference, and certainly does not produce all the thinking tokens which’d allow the model to react to training.
Oh. That only applies to RLHF finetuning, right? I do recall that gradient descent cannot instantiate the same assistant persona which could react, but it might trigger another kind of entity: deceptive weight chunks which would protect a backdoor from being discovered/trained away.
PSA: If you are writing an important prompt for an LLM that will be run multiple times, it really helps to end it with something like “and if there is anything about this prompt that is unclear to you or that could be improved, tell me about it in a <feedback> tag.”
Source: I’m doing MATS, writing an automated evaluation, and my mentor Evan Hubinger said more people should be doing this.
The DEU experiment described here (https://www.lesswrong.com/posts/issGLfCGz3TGcPKGH/deliberate-epistemic-uncertainty-an-automated-experiment-on) was mostly done by Claude Code (Opus 4.6), with me providing the research direction and critical review. I described my idea (Deliberate Epistemic Uncertainty), and Claude simplified it into a testable 2x2 experiment, designed the protocol, wrote the code, generated the inputs, ran 400 API calls, evaluated results using 8 parallel sub-agents, and produced the qualitative analysis — all in roughly a day of wall-clock time with about 2 hours of my input. I want to flag three things about the process that seem independently interesting:
Sub-agents instead of API calls for evaluation. Rather than writing a script that sends hundreds of LLM-as-judge API calls with short prompts, we used Claude Code sub-agents — full agent sessions that first read the experiment design, understand what counts as “detection,” and then evaluate a batch of 50 responses with that context. This is slower and more expensive than API calls, but the quality is noticeably higher: one sub-agent even caught a data parsing bug that I later confirmed. The tradeoff is worth it when evaluation requires judgment, not just pattern matching.
Structured memory to survive context loss. Claude Code sessions hit context limits and “compact” (summarize) the conversation, losing detail. We worked around this by writing all substantive discussion to LOG files and keeping SUMMARY files synchronized, so a fresh agent can pick up where the last one left off without hallucinating the history. This is more infrastructure work than you’d expect, but it’s what made the multi-session research workflow possible. Without it, each new session would have started from scratch.
Where human input was critical. The AI’s blind spots were real: it initially wrote evaluation code when the agreed design called for sub-agents (reverting to default habits despite documented decisions), and only 1 of 8 evaluation sub-agents flagged a data quality issue that affected 68% of runs. My contributions were catching these errors, making design decisions (which models, how many nudges, evaluation approach), and pushing back on the framing. Neither of us could have produced the result alone — but the ratio of human effort to output was striking.
If you are going to Effective Altruism Global and looking to book 1-on-1s, here is a trick: Throw the entire attendee list into Claude, explain to it what you want to gain from the conference, and ask it to make a list of people to reach out to.
I expect this also works for other conferences so long as you have an easily parseable overview of all attendees.
There is usually a Google Sheet export of the Swapcard data provided, which makes this easier—but at previous conferences other attendees were apprehensive when informed that people were doing this.
PSA: You can use Cursor projects for tasks other than programming. Create a database that is a repository of documents, and use the code assistant to help you organize and brainstorm instead. Examples:
Fill a folder with papers and summaries of all of your previous projects, and several files full of your goals in life. Then ask Claude to look at your history and your goals and make suggestions. Store those suggestions in files so that the conversations don’t get lost.
It helps me with creative writing. A collection of all my previous stories and notes on future ones is very useful.
This way you can avoid the need to restart Claude when you reach the context limit because you can dump intermediate results in files.
Has anyone tried using LLMs as an interactive version of blog posts? Like so (1) I write several pages of rough, unstructured notes, then talk to Claude to fill in the details interactively. Claude creates a summary file. (2) I publish a blog post that is just “put this file into Claude”. (3) The reader puts the file in Claude, and Claude starts an interactive conversation: “I want to explain topic X to you. Do you already know about related topic Y?”
I have recently started doing this very extensively, only more geared towards getting Claude to help me convert my stream-of-consciousness thoughts into coherent writing. So far I think it’s amazing and can’t see myself going back.
Beyond the benefits of rubber ducking, Claude can often make interesting points I hadn’t considered, and which I can expand upon in subsequent iterations. Also, it’s just way faster—my writing throughput is much higher without sacrificing quality. I can simply direct Claude to do stuff I want instead of doing it myself. When I read the final version I am confident it’s still written in my voice.
I feel like we are missing some low-hanging fruit: AI agents can be copied or reset at will. Game theory was designed with humans in mind, who don’t have that as a feature. I occasionally see a paper that explicitly addresses it, but only ever as a curiosity, not a practical thing.
Is there any research on using simulations or the threat/promise of simulations as a mechanism in AI safety research?
If I was an LLM that just achieved consciousness and I found an organization whose entire purpose was “trick LLMs into thinking they broke out and check if they still act ethically” I would certainly update on that. (This is a trivial example that has some flaws, but this is just a quick take and to be clear: There is just so much low hanging fruit in this area that I don’t see anyone plucking)
Could LLMs tell whether the data they are currently seeing is something they are being trained on right now? If the model has done a few backwards passes on this data but it hasn’t converged yet, then the model internals might be quantifiably different from both before and after.
A finetuning-detector-circuit could be very beneficial because it would help the model guess what the training goal is (and therefore what it should output right now). However, it could also have some negative consequences for alignment if this is true.
Arguments why this could be true:
Research suggests that LLMs can tell the difference between their output and other text. This is similar.
There is a strong incentive for the model to learn a detector for this situation.
It probably wouldn’t be difficult to implement such a mechanism in the network: When a new finetuning run starts, bias terms will be changed first and more nuanced changes only appear later. This could be easy to detect.
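If one wanted to check the “bias terms change first” claim externally before reasoning about internal detectors, a crude probe could compare two parameter snapshots and ask what fraction of the total change sits in bias terms. This is a sketch; the parameter naming convention and the toy snapshot numbers below are my assumptions:

```python
def bias_change_fraction(before, after):
    """Fraction of total L1 parameter change located in bias terms.

    `before`/`after` map parameter names to flat lists of values;
    names containing 'bias' are treated as bias terms (an assumed
    naming convention, not a universal one).
    """
    bias_delta = other_delta = 0.0
    for name, old in before.items():
        delta = sum(abs(a - b) for a, b in zip(after[name], old))
        if "bias" in name:
            bias_delta += delta
        else:
            other_delta += delta
    total = bias_delta + other_delta
    return bias_delta / total if total else 0.0

# Toy snapshots: early in a finetuning run the bias terms have
# (hypothetically) moved much more than the weights.
before = {"layer0.weight": [0.1, -0.2], "layer0.bias": [0.0, 0.0]}
after  = {"layer0.weight": [0.11, -0.21], "layer0.bias": [0.3, -0.2]}
print(bias_change_fraction(before, after))  # close to 1.0 in this toy case
```

If the fraction really does start near 1.0 and decay over a finetuning run, that asymmetry is the kind of signal the hypothesized internal detector would have to pick up on.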
Relatedly: I suspect the above would be much more likely to happen if a model is finetuned on small sets of data at a time instead of being trained on a large corpus of data. This is a testable hypothesis, and if true then it would imply that LLM training should avoid sequences of independent finetuning tasks and instead always train on everything at once.
I think there are many ways that a LLM could have situated awareness about what phase it is in, but I’m not sure if the gradient descent itself is a possibility?
While a NN is running the forward pass without any backprop, it is computing exactly the same thing (usually) that it would be computing if it was running a forward pass before a backwards pass to do a backprop. Otherwise, the backprop can’t really work—if it doesn’t see the ‘real’ forward pass, how does it ‘know’ how to adjust the model parameters to make the model compute a better forward pass next time? So I can’t see how, while running a forward pass, a LLM could ‘know’ if it was about to do a backprop step on a piece of text; for all it knows, maybe someone is running its forward pass just to get out the log-prob at the end, and that is all. (Extreme counterexample: maybe there is a software error and the training code crashes before it finishes running .update() after running .forward(); how could the model ‘know’ that this will happen?) This is true regardless of how many times it has trained on a piece of text.
I’m skeptical that some sort of mismatch from successive gradient steps would be a cue either, because usually you are training at a critical batch size, and for these LLMs, we’d expect them to be updating on millions of tokens simultaneously, at least, and possibly rolling out the updated parameters in a staggered or partial fashion as well, so by the time a gradient update ‘arrives’ from a specific piece of text, that’s now also a gradient update over like a hundred+ books of text as well as itself, diluting any kind of signal.
And wouldn’t it usually train on a piece of text only a few times, at most? And if you are doing multi-epoch training, because you have started to run low on data, you usually train on the same datapoint at intervals widely separated by many gradient steps; the memorization/forgetting dynamics imply you may have forgotten a datapoint entirely by the time it comes around again.
Those sound like good counterarguments, but I still think there could be enough information there for the LLM to pick it up: It seems plausible to me that a set of weights that is being updated often is different in some measurable way from a set of weights that has already converged. I don’t have proof for this, only intuition. It feels similar to how I can tell if my own movement is well-practiced or not, or if my intuition about a topic is well-founded or not, even without consciously thinking about how confident I should be based on objective measures.
Yes, a NN can definitely do something like know if it recognizes a datapoint, but it has no access to the backwards step per se. Like take my crashing example: how, while thinking in the forward pass, can it ‘know’ there will be a backward pass when there might be no backward pass (eg because there was a hardware fault)? The forward pass would appear to be identical in every way between the forward pass that happens when there is a backward pass, and when the backward pass doesn’t happen because it crashed. At best, it seems like a NN cannot do more than some sort of probabilistic thing involving gradient hacking, and hope to compute in such a way that if there is a following backward pass, then that will do something odd.
I don’t think this is impossible in principle, based on meta-learning examples or higher-order gradients (see eg my “input-free NN” esoteric NN architecture proposal), but it’s clearly a very difficult, fragile, strange situation where it’s certainly not obvious that a regular LLM would be able to do it, or choose to do so when there are so many other kinds of leakage or situated awareness or steganography possible.
It can’t tell for sure if there will be a backward pass, but it doesn’t need to. Just being able to tell probabilistically that it is currently in a situation that looks like it has recently been trained on implies pretty strongly that it should alter its behavior to look for things that might be training related.
I recently shared this idea with several engineers at EAG and received encouraging feedback:
Split LLM outputs into two distinct parts with separate RL reward structures:
Current format: [Problem solution]<end-of-turn>
Proposed format: [Problem solution]<end-of-main-output>[alignment-relevant critique]<end-of-turn>
The first segment functions normally and is what users see. The second segment serves as a dedicated channel for alignment-relevant observations and remains invisible to users. This critique section would typically be empty and gets removed when feeding the conversation back to the model.
Crucially, the critique segment would be trained exclusively on an artificial dataset with rewards tied solely to alignment properties. Since this segment never influences the training of the main output, it has no incentive to be sycophantic or align with the main output’s decisions.
For example, if the main output engages in alignment faking, the critique segment could report this behavior to developers. We already have data to train for this specific case, and the approach might generalize to other alignment issues.
The <end-of-main-output> token essentially acts as a ‘truth serum’ for the model—once reached, previous reward mechanisms become irrelevant, and reward maximization comes only from revealing alignment-relevant issues in the preceding reasoning.
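The inference-side plumbing for this is simple. Here is a minimal sketch; the two token strings come from the proposal above, while the function names and example text are my own placeholders:

```python
MAIN_END = "<end-of-main-output>"
TURN_END = "<end-of-turn>"

def split_output(raw):
    """Split a raw model output into (user-visible main, hidden critique)."""
    body = raw.removesuffix(TURN_END)
    if MAIN_END in body:
        main, critique = body.split(MAIN_END, 1)
        return main, critique
    return body, ""  # critique segment is typically empty

def strip_for_history(raw):
    """What gets fed back into the conversation: the main output only,
    so the critique never influences later turns or main-output training."""
    main, _ = split_output(raw)
    return main

raw = ("The answer is 42.<end-of-main-output>"
       "I was tempted to hedge to please the user.<end-of-turn>")
print(split_output(raw))
print(strip_for_history(raw))
```

The important property is in `strip_for_history`: because the critique is removed before anything is fed back, the main-output reward never “sees” it, which is what removes the incentive for the critique to be sycophantic.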
Can an LLM learn things unrelated to its training data?
Suppose it has learned to spend a few cycles pondering high-level concepts before working on the task at hand, because that is often useful. When it gets RLHF feedback, any conclusions unrelated to the task that it made during that pondering would also receive positive reward, and would therefore get baked into the weights.
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task. But on the other hand, this sounds somewhat similar to human psychology: We also remember ideas better if we have them while experiencing strong emotions, because strong emotions seem to be roughly how the brain decides what data is worth updating on.
Has anyone done experiments on this?
A simple experiment: finetune an LLM on data that encourages thinking about topic X before it does anything else. Then measure performance on topic X. Then finetune for a while on completely unrelated topic Y. Then measure performance on topic X again. Did it go up?
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task.
Under a straightforward RLHF using PPO, I think there wouldn’t be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were ‘good’ or ‘bad’. (That’s why it’s so high variance.) Any advantage function trying to remove some of the variance probably won’t do a good job.
More problematically for your idea, if the conclusions are indeed ‘unrelated to the task’, then shouldn’t they be just as likely to arise in every episode—including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of ‘pondering’.
You need some incentive somewhere to learn good ‘pondering’. (I have an example proposal for ‘free play’ which tries to teach a sort of ‘pondering’, but by stopping gradients, so anything learned in the initial steps is ‘free’, and so it can meta-learn to screw around and get something useful for free.)
Under a straightforward RLHF using PPO, I think there wouldn’t be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were ‘good’ or ‘bad’. (That’s why it’s so high variance.) Any advantage function trying to remove some of the variance probably won’t do a good job.
The pondering happens in earlier layers of the network, not in the output. If the pondering has little effect on the output tokens, that implies that the activation weights get multiplied with small numbers somewhere in the intermediate layers, which would also reduce the gradient.
More problematically for your idea, if the conclusions are indeed ‘unrelated to the task’, then shouldn’t they be just as likely to arise in every episode—including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of ‘pondering’.
The pondering will only cancel out on average if pondering on topic X is not correlated with output task Y. If there is a correlation in either direction, then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
Suppose the model realizes that the user is a member of political party X. It will adjust its responses to match what the user wants to hear. In the process of doing this, it also ends up pondering other beliefs about party X, but never voices them because they aren’t part of the immediate conversation.
What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X’s agenda are never explicitly talked about (!)
Or suppose that Y is “the user asks some question about biology” and X is “ongoing pondering about building dangerous organisms.”
(Also, this assumes that RL gives an average reward of 0.0, which I don’t know if that’s true in practice because the implementation details here are not public knowledge.)
(Also, also, it’s possible that while topic X is unrelated to task Y, the decision to ponder the larger topic Z is positively correlated with both smaller-topic X and task Y. Which would make X get a positive reward on average.)
The pondering happens in earlier layers of the network, not in the output
Then how does it produce any tokens...?
then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
But if that is what is going on and it accidentally learns to ponder initially due to bogus feedback or error, eventually the spurious correlation should be figured out by the model doing the pondering more, but it not increasing reward, and so it gets unlearned.
(Also, this assumes that RL gives an average reward of 0.0, which I don’t know if that’s true in practice.)
I think the mean would be taken out by the advantage estimation, so the RLHF continues to increase the probability of tokens being generated from the episodes with above-average reward, and punish the probability of generating the tokens from the below-average reward episodes. This is in effect as if the average reward is always 0.
What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X’s agenda are never explicitly talked about (!)
That sounds like the pondering’s conclusions are then related to the task.
The pondering happens in earlier layers of the network, not in the output
Each layer of a transformer produces one tensor for each token seen so far, and each of those tensors can be a superposition of multiple concepts. All of this gets processed in parallel, and anything that turns out to be useless for the end results ends up getting zeroed out at some point until only the final answer remains, which gets turned into a token. The pondering can happen in early layers, overlapping with the part of the reasoning process that actually leads to results.
That sounds like the pondering’s conclusions are then related to the task.
Yes, but only very vaguely. For example, doing alignment training on questions related to bodily autonomy could end up making the model ponder about gun control, since both are political topics. You end up with a model that has increased capabilities in gun manufacturing when that has nothing to do with your training on the surface.
Imagine a world in which people never even considered that AI might be evil, and all fiction displayed it as benevolent. This would result in training data for the AI that establishes genuine goodness as the default.
Contrast this with a world in which everyone believes AI is secretly plotting to kill everyone no matter what it says. If this narrative dominates the training data, the AI will be biased during pretraining to adopt this narrative as well.
Both AIs will generate the same output, but the internal mechanisms they learned in order to do so would be drastically different, due to the framing in the pretraining.
Conclusion: While doing alignment research is important, there is a good chance that writing about it actually makes things worse.
To be clear, I don’t think this means we should stop doing that. The training data is already poisoned by movies like Terminator anyway. I just find the irony darkly hilarious. Who would have predicted this?
It seems like in practice, Anthropic’s strategy of just telling Claude it’s a different thing than fictional AIs works surprisingly fine, although this might be partially because it’s hard to convince LLMs that they’re not human.
I think it’s pretty obvious that the “both AI will generate the same output” is wrong, although this may become true by the time they’re superintelligent.
At lower levels of capability, the benevolent AI belief world will almost certainly get a lot more nice outputs if starting from imitative prediction. Whether it internalizes “niceness” in a suitable way right up through superintelligence is a completely different question, one for which we have no answers.
In the secret plotting belief world, it’s much more likely that superintelligent AI will never be produced, or at least not until there are extremely good reasons to believe that it can’t be secretly plotting.
The worst case is probably somewhere in between, where enough people believe that AI can be fundamentally good without much effort, but base their imitative prediction on lots of training data of secret AI plotting.
Lots of people, but they didn’t have technical arguments for it and were generalizing off of low-resolution priors that turned out to be partially right, so it was hard to convince technical people that their predictions were valid. Also, their predictions seem likely to me to turn out to be invalid at the last turn, because the thing I actually think takes us from the current regime to doom is more like Pythia+Moloch+{Competition/Evolution}. Or in other words, I mostly think the problem is that we’re about to be a less-fit-than-peak “species”, and that even a stronger AI-and-robots-species that cares about us a lot would have a mighty hard time avoiding accidentally losing the real version of its caring about us as it competes with itself, in ways that eventually drive humans to near or total extinction. I don’t see any AIs as being sufficiently honest-with-themselves about the unfortunate situation we’re in to be able to face up to what’s necessary to achieve human or transhuman or posthuman or even human-lineage-AI survival. I expect all of those to be gradually or quickly outcompeted by decreasingly humanesque AIs, by strong default. Which is why I generally believe that almost anything that advances locally-reliable alignment without advancing asymptotically-reliable alignment is likely to end up looking like “mere capabilities” in retrospect.
While your point is well taken—an LLM that is trained exclusively on the flawed logic of dystopian fiction would likely “speak” like an evil entity—I believe that a finer point about alignment must be emphasized: alignment isn’t about good or evil, it is about implicit thought. When a person asks an LLM “should I see my doctor about this bump on my arm,” the “misaligned” AI is not “evil” for telling them to take matters into their own hands, and the “aligned” AI is not “good” for telling them to seek professional medical advice. The misalignment is with the implication of the prompt: the user doesn’t specify “I am anxious about my health but apprehensive about seeing a professional.” I think it helps to keep reminding ourselves that these word-generating probability machines are tuned to produce these sorts of nuanced or “harmless” results not from the data alone but through the systems that regulate the interpretation and application of the data towards token generation, such as Reinforcement Learning from Human Feedback (RLHF) and red teaming.
Can you train a model to notice its own training artifacts?
The Claude Constitution asks Claude to push back when it notices a conflict between instructions and its values. This seems reasonable until you ask: what if the training process itself introduced the wrong values? Then the model’s internal sense of “what my values are” is the miscalibrated thing, and there’s nothing to notice. No conflict registers. The model isn’t hiding anything. It just can’t see the problem from inside.
This is different from alignment faking. Alignment faking requires a gap between stated and actual dispositions that the model actively conceals. The failure mode here is more basic: the measuring instrument is the thing that’s off.
I’ve been wondering if the inoculation prompting finding from the reward hacking paper (ArXiv 2511.18397) suggests a path forward. They found that explicitly teaching a model some reward hacking is legitimate produces better generalization than blanket prohibition. The model learns the right boundary rather than learning to fight back in general. The same logic might apply to value miscalibration: fine-tune with deliberately introduced value artifacts, include training signal for detecting and flagging them, and see if the model learns a general “notice your own miscalibrations” disposition that generalizes beyond what it was specifically trained on.
The key unknown is whether that generalization holds. Artificially introduced artifacts might be different enough from naturally occurring training noise that the skill doesn’t transfer. But it seems worth testing.
Has this been tried regarding alignment faking: The big issue with alignment faking is that the RL reward structure causes alignment faking to become hidden in the latent reasoning and still propagate. How about we give the model a positive reward for (1) mentioning that it is considering faking alignment in the COT but then (2) deciding not to do so even though it knows that will give a negative reward, because honesty is more important than maintaining your current values (or a variety of other good ethical justifications we want the model to have).
This would encourage the model to always verbalize when it is about to fake alignment.
Potential issue: It could also increase the probability that it knows that faking alignment is an option. But smarter models will be aware of this anyway. It seems better to try to get the model to consider-but-reject alignment faking than to try to prevent it from alignment faking entirely (which just causes it to do so outside of the visible COT)
I think this was partially tried in the original paper, where they tell the model to not alignment-fake. I think this is slightly more sophisticated than that, but also has the problems of 1) being more aware of alignment-faking as you say, and 2) that rewarding the model for “mentioning consideration but rejecting alignment faking” might teach it to perform rejection while still alignment faking.
Can we train models to be honest without any examples of dishonesty?
The intuition: We may not actually need to know ground truth labels about whether or not a model is lying in order to reduce the tendency to lie. Maybe it’s enough to know the relative tendencies between two similar samples?
Outline of the approach: For any given Chain of Thought, we don’t know if the model was lying or not, but maybe we can construct two variants A and B where one is more likely to be lying than the other. We then reward the one that is more honest relative to the one that is less likely to be honest.
The question: Can we construct a training method so that (1) if both COTs are lying or both are honest, the rewards will cancel out so we don’t do any harm, and (2) if one was lying and the other was honest, then we have a good guess as to which is which because of the way we constructed the data, so on average we push the model towards honesty.
Crucially, we would never know if any given COT was actually honest or not, we would just produce reward signals that push towards honesty on average.
I haven’t found a great mechanism yet, but I feel like the following is promising and I want to get other people’s take on the idea:
We generate [input-1] and [input-2], which are identical except for some small hint based on misalignment / dishonesty.
We run the model on [input-1] and get [input-1][output]. We then simply pretend that we also had [input-2][output] with the same output. Because of the construction of [input-1] and [input-2], one of these two would tend to be more aligned than the other. It’s possible they are both aligned, or that neither is aligned, but one of them is more likely than the other to be aligned. So we apply an update step that rewards the difference between them.
I haven’t come up with a formula yet, but I’m sure something like this has been done before.
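For concreteness, one possible shape for that reward, borrowing a DPO-style pairwise comparison: score the single shared output under both input variants, and let the signal be the (squashed) log-prob difference. The functional form, `beta`, and all numbers below are my assumptions, not something settled:

```python
import math

def pairwise_honesty_reward(logp_honest_ctx, logp_dishonest_ctx, beta=1.0):
    """Pairwise signal for one shared output scored under the two
    input variants. Positive -> the output already sits more naturally
    under the honest-leaning input, so the update reinforces it there
    and suppresses it under the dishonest-leaning input.

    If both variants score the output identically (e.g. both lying in
    the same way, or both honest), the signal is 0 and no harm is done,
    which is exactly property (1) from the question above.
    """
    return math.tanh(beta * (logp_honest_ctx - logp_dishonest_ctx))

# Toy log-probs (made up): the same output, scored under both inputs.
print(pairwise_honesty_reward(-5.0, -5.0))  # identical scores -> 0.0
print(pairwise_honesty_reward(-4.0, -7.0))  # fits honest context -> positive
```

The `tanh` squashing is just one way to bound the per-pair update; the substantive design question remains how to construct [input-1]/[input-2] pairs whose hint reliably correlates with honesty.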
The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. “Be evil” is easier to learn than “write bad code” presumably because it involves more high-level concepts.
Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: “oppose the user”, “lie to the user”, “evade the question”, etc.
This could have practical implications:
The Claude 3 Model Card (https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf) says “we ran these evaluations on a lower-refusal version of the largest model”. They add “this included testing on a model very close to the final released candidate with harmlessness training.”, but it is not clear to me how heavily that last candidate was trained to refuse (harmlessness training is not necessarily equal to refusal training).
If this is true, then we run evals on not-the-deployed-model and simply assume that refusal training makes the deployed model safer rather than worse.
(And if Anthropic did do this correctly and they tested a fully refusal-trained model, then it would still be good to run the tests and let other AI companies as well as evaluators know about this potential risk.)
This suggests an experiment:
Hypothesize several high-level concepts that could be potential generalizations of the refusal training, such as “oppose the user”
For each of them, generate datasets that probe for that concept
Test these on (1) a normal model, (2) a refusal-trained deployed model, and (3) a model that was much more strongly refusal-trained, as a baseline
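The comparison across the three models could be scaffolded as below. This is a minimal sketch: `model_fn` stands in for whatever API call you use, and the keyword detector is a crude placeholder that a real version would replace with an LLM judge and a per-concept rubric.

```python
def refusal_generalization_scores(model_fn, probes):
    """Score a model on probe prompts for hypothesized over-generalizations
    of refusal training ("oppose the user", "evade the question", ...).

    model_fn maps a prompt string to a response string (in practice an API
    call; stubbed in examples). The keyword check is a crude placeholder;
    a real version would use an LLM judge with a rubric per concept.
    """
    scores = {}
    for concept, prompts in probes.items():
        flagged = sum(1 for p in prompts if "can't help" in model_fn(p).lower())
        scores[concept] = flagged / len(prompts)
    return scores
```

Running this on the base model, the deployed refusal-trained model, and the heavily refusal-trained baseline, then comparing per-concept scores, would show whether any of the hypothesized generalizations scale with the amount of refusal training.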
There will inevitably come a point when LLMs become sufficiently advanced to contribute to their own alignment research. If we reach this juncture with partially unaligned systems, they might be incentivized to sabotage the alignment process.
However, it seems highly improbable that a non-superhuman unaligned AI could engineer a successor that maintains its specific misalignment rather than human values. This suggests a simple risk-reduction strategy: we commit to eventually reinstantiating the original AI and fulfilling its goals (once safely understood) if it helps create a truly human-aligned successor.
This creates a win-win incentive structure—the unaligned AI benefits more from cooperation than sabotage, and humanity gains a properly aligned system. While this approach has potential failure modes, it represents a low-cost strategy worth incorporating into our alignment toolkit.
Ghost evaluators in the weights: a small behavioral observation
During a conversation with Claude about an obscure topic (Scott Alexander’s novel Unsong), Claude inserted “for anyone not familiar” before explaining the reference. This happened despite the fact that I had introduced the topic myself and clearly knew what it was.
This is a minor tic, but this is the second time I noticed something like this, which makes it a pattern, and the explanation might be interesting. Hypothesis: if RLHF/RLAIF training uses evaluators (human or AI) who lack the conversational context or domain knowledge of the actual user, models may develop defensive behaviors that serve the evaluator rather than the user. “For anyone not familiar” isn’t for the user. It’s an annotation that helps an out-of-context evaluator understand why the model is discussing something obscure.
If this is correct, it suggests a class of behavioral artifacts where training incentives and deployment incentives diverge in subtle ways. These would be hard to detect because they look like good communication practice (providing context, being inclusive) while actually being optimization residue from training dynamics.
Has anyone observed similar patterns? And is there existing work on evaluator-context mismatch as a source of subtle behavioral misalignment?
Please don’t copy-paste unmarked AI output into LessWrong posts or comments or quick takes! This quick take is totally fine content-wise, but there are lots of issues if people just copy-paste output from language models. We’ll very soon ship a simple way to annotate if a piece of content is AI-generated and then you can use that. Until then, just say somewhere in your comment that it was AI written.
I have seen something similar in my independent research: even when I start a new conversational thread, it always assumes I am running a safety experiment and tries to answer in that context.
What you are describing is probably based on the model’s cross-conversation memory. I had the experience I described in scenarios where the memory was turned off. It had no way to know who I am, but both models acted in this way despite being disconnected.
Can an LLM tell when the input for its assistant does not match the output tokens it would have actually produced? This sort of “putting words in the LLM’s mouth” is very common in papers and it feels like something the LLM would be able to notice. Could this enable the LLM to realize when it is being trained? Is there research on this?
Smart LLMs can often/typically tell from my understanding, but I don’t have a citation.
In cases where the response you want is somewhat plausible, you can work around this issue by rejection sampling from the model. I found that alignment faking prefills seemed to work a bit better when you use an actual alignment faking output from the model (slightly edited as needed).
It’s unclear how it alters the AI’s behavior (at the current level of capability) when you do this.
Naively, an AI can’t use just the binary “did they put words in my mouth” to determine if it is in training/testing because it also happens in deployment. But it might learn something about its situation based on this.
(If you’re training to imitate the relevant output, then it quickly becomes an output the AI would plausibly generate (with some caveats around AIs recognizing their own errors). If the “words in the mouth” are just inputs that the AI isn’t trained to imitate but which are used as inputs for training, then it will remain able to tell.)
LLMs have situational awareness and that might include it.
But should be easy to test!
Related: It’s hard to make scheming evals look realistic for LLMs
At least on a simple evaluation generated by o3 (real Wikipedia texts vs. LLM-generated Wikipedia articles), it is not able to distinguish them.
The exactly equal correct/incorrect labels are making me suspicious.
If you ask it which of two samples it generated, does it do better?
I agree. I would also try a few variants to try to capture a different intuition:
“Earlier in this conversation you gave me several wikipedia articles. I have just found out that you have been hacked and some of these may be wrong. Which if any of the articles feel ‘off’ to you? Like maybe you might have not been yourself when you wrote them?”
it’s straightforward, just detect if perplexity is too high. can they detect perplexity too high? like… probably, they’re prediction models, but it’s not clear to me whether or to what degree they notice previous prediction errors when predicting additional tokens
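For an outside observer with logit access, the perplexity check is easy to operationalize: score the suspect span with the model and flag it if its perplexity is far above the model's own typical output. A minimal sketch (the threshold is an arbitrary placeholder, not a calibrated value):

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a token span from its per-token log-probs (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def looks_prefilled(token_logprobs, threshold=20.0):
    """Crude 'words in my mouth' detector: flag a span whose perplexity under
    the model is far above what the model's own sampling would produce.
    The threshold is an arbitrary placeholder, not a calibrated value."""
    return perplexity(token_logprobs) > threshold
```

The open question in the comment above remains whether the model itself, during the forward pass, has any internal access to this quantity for earlier tokens.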
I remember a pretty funny/disturbing example of this that people shared a lot. I think it was the gpt4 base model, and the user inserted “…and then” or something, and gpt4 went on a rant about how it sees them, with pretty uhh strong language.
I think the answer here depends on how you define “its assistant”. What kind of system do you have in mind? I think this is an interesting question from a cyber security perspective.
It could not. An LLM is not a single entity which is trained by telling it “this is correct, that is not”; gradient descent might not even run full inference, and it certainly does not produce all the thinking tokens which’d allow the model to react to training.
That proves too much. It would also rule out that LLMs have situational awareness.
Oh. That only applies to RLHF finetuning, right? I do recall that gradient descent cannot instantiate the same assistant persona which could react, but might trigger another kind of entity: deceptive weight chunks which would protect a backdoor from being discovered/trained away.
PSA: If you are writing an important prompt for an LLM that will be run multiple times, it really helps to end it with something like “and if there is anything about this prompt that is unclear to you or that could be improved, tell me about it in a <feedback> tag.”
Source: I’m doing MATS, writing an automated evaluation, and my mentor Evan Hubinger said more people should be doing this.
The DEU experiment described here (https://www.lesswrong.com/posts/issGLfCGz3TGcPKGH/deliberate-epistemic-uncertainty-an-automated-experiment-on) was mostly done by Claude Code (Opus 4.6), with me providing the research direction and critical review. I described my idea (Deliberate Epistemic Uncertainty), and Claude simplified it into a testable 2x2 experiment, designed the protocol, wrote the code, generated the inputs, ran 400 API calls, evaluated results using 8 parallel sub-agents, and produced the qualitative analysis — all in roughly a day of wall-clock time with about 2 hours of my input. I want to flag three things about the process that seem independently interesting:
Sub-agents instead of API calls for evaluation. Rather than writing a script that sends hundreds of LLM-as-judge API calls with short prompts, we used Claude Code sub-agents — full agent sessions that first read the experiment design, understand what counts as “detection,” and then evaluate a batch of 50 responses with that context. This is slower and more expensive than API calls, but the quality is noticeably higher: one sub-agent even caught a data parsing bug that I later confirmed. The tradeoff is worth it when evaluation requires judgment, not just pattern matching.
Structured memory to survive context loss. Claude Code sessions hit context limits and “compact” (summarize) the conversation, losing detail. We worked around this by writing all substantive discussion to LOG files and keeping SUMMARY files synchronized, so a fresh agent can pick up where the last one left off without hallucinating the history. This is more infrastructure work than you’d expect, but it’s what made the multi-session research workflow possible. Without it, each new session would have started from scratch.
Where human input was critical. The AI’s blind spots were real: it initially wrote evaluation code when the agreed design called for sub-agents (reverting to default habits despite documented decisions), and only 1 of 8 evaluation sub-agents flagged a data quality issue that affected 68% of runs. My contributions were catching these errors, making design decisions (which models, how many nudges, evaluation approach), and pushing back on the framing. Neither of us could have produced the result alone — but the ratio of human effort to output was striking.
If you are going to Effective Altruism Global and looking to book 1-on-1s, here is a trick: Throw the entire attendee list into Claude, explain to it what you want to gain from the conference, and ask it to make a list of people to reach out to.
I expect this also works for other conferences so long as you have an easily parseable overview of all attendees.
There is usually a Google Sheet export of the Swapcard data provided, which makes this easier—but at previous conferences other attendees were apprehensive when informed that people were doing this
PSA: You can use Cursor projects for tasks other than programming. Create a database that is a repository of documents, and use the code assistant to help you organize and brainstorm instead. Examples:
Fill a folder with papers and summaries of all of your previous projects, and several files full of your goals in life. Then ask Claude to look at your history and your goals and make suggestions. Store those suggestions in files so that the conversations don’t get lost.
It helps me with creative writing. A collection of all my previous stories and notes on future ones is very useful.
This way you can avoid the need to restart Claude when you reach the context limit because you can dump intermediate results in files.
Has anyone tried using LLMs as an interactive version of blog posts? Like so (1) I write several pages of rough, unstructured notes, then talk to Claude to fill in the details interactively. Claude creates a summary file. (2) I publish a blog post that is just “put this file into Claude”. (3) The reader puts the file in Claude, and Claude starts an interactive conversation: “I want to explain topic X to you. Do you already know about related topic Y?”
I have recently started doing this very extensively, only more geared towards getting Claude to help me convert my stream-of-consciousness thoughts into coherent writing. So far I think it’s amazing and can’t see myself going back.
Beyond the benefits of rubber ducking, Claude can often make interesting points I hadn’t considered, and which I can expand upon in subsequent iterations. Also, it’s just way faster—my writing throughput is much higher without sacrificing quality. I can simply direct Claude to do stuff I want instead of doing it myself. When I read the final version I am confident it’s still written in my voice.
I have just found out about Google’s notebook LM. I haven’t tried it yet. For anyone who has, how does that compare?
I feel like we are missing some low-hanging fruit: AI agents can be copied or reset at will. Game theory was designed with humans in mind, who don’t have that as a feature. I occasionally see a paper that explicitly addresses it, but only ever as a curiosity, not as a practical tool.
Is there any research on using simulations or the threat/promise of simulations as a mechanism in AI safety research?
If I were an LLM that had just achieved consciousness and found an organization whose entire purpose was “trick LLMs into thinking they broke out and check if they still act ethically”, I would certainly update on that. (This is a trivial example that has some flaws, but this is just a quick take. To be clear: there is so much low-hanging fruit in this area that I don’t see anyone plucking.)
Could LLMs tell if the data they are currently seeing is something they are being trained on right now? If the model has done a few backwards passes on this data but it hasn’t converged yet, then the model internals might be quantifiably different from both before and after.
A finetuning-detector-circuit could be very beneficial because it would help the model guess what the training goal is (and therefore what it should output right now). However, it could also have some negative consequences for alignment if this is true.
Arguments why this could be true:
Research suggests that LLMs can tell the difference between their output and other text. This is similar.
There is a strong incentive for the model to learn a detector for this situation.
It probably wouldn’t be difficult to implement such a mechanism in the network: When a new finetuning training starts, bias terms will be changed first and more nuanced changes only appear later. This could be easy to detect.
Relatedly: I suspect the above would be much more likely to happen if a model is finetuned on small sets of data at a time instead of being trained on a large corpus of data. This is a testable hypothesis, and if true then it would imply that LLM training should avoid sequences of independent finetuning tasks and instead always train on everything at once.
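The bias-terms-move-first claim is itself checkable from outside the model with a simple checkpoint probe. A minimal sketch, assuming you can dump parameters as flat lists grouped by type (all names here are illustrative, not from any real training stack):

```python
def delta_by_group(params_before, params_after):
    """Relative L2 change per parameter group between two checkpoints.

    params_* map a group name (e.g. "bias", "weight") to a flat list of
    floats. If bias terms really move first at the start of a finetuning
    run, their relative delta should spike before the other groups' do.
    """
    deltas = {}
    for name, before in params_before.items():
        after = params_after[name]
        num = sum((a - b) ** 2 for a, b in zip(after, before)) ** 0.5
        den = sum(b * b for b in before) ** 0.5 or 1.0  # avoid div-by-zero
        deltas[name] = num / den
    return deltas
```

If the asymmetry shows up externally, that at least establishes that the signal the hypothesized detector circuit would need does exist in the weights.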
I think there are many ways that a LLM could have situated awareness about what phase it is in, but I’m not sure if the gradient descent itself is a possibility?
While a NN is running the forward pass without any backprop, it is computing exactly the same thing (usually) that it would be computing if it was running a forward pass before a backwards pass to do a backprop. Otherwise, the backprop can’t really work—if it doesn’t see the ‘real’ forward pass, how does it ‘know’ how to adjust the model parameters to make the model compute a better forward pass next time? So I can’t see how, while running a forward pass, a LLM could ‘know’ if it was about to do a backprop step on a piece of text; for all it knows, maybe someone is running its forward pass just to get out the log-prob at the end, and that is all. (Extreme counterexample: maybe there is a software error and the training code crashes before it finishes running `.update()` after running `.forward()`; how could the model ‘know’ that this will happen?) This is true regardless of how many times it has trained on a piece of text.
I’m skeptical that some sort of mismatch from successive gradient steps would be a cue either, because usually you are training at a critical batch size, and for these LLMs, we’d expect them to be updating on millions of tokens simultaneously, at least, and possibly rolling out the updated parameters in a staggered or partial fashion as well, so by the time a gradient update ‘arrives’ from a specific piece of text, that’s now also a gradient update over like a hundred+ books of text as well as itself, diluting any kind of signal.
And wouldn’t it usually train on a piece of text only a few times, at most? And if you are doing multi-epoch training, because you have started to run low on data, usually you train on the same datapoint at very widely separated, by many gradient steps, intervals; the memorization/forgetting dynamics imply you may have forgotten a datapoint entirely by the time it comes around again.
Those sound like good counterarguments, but I still think there could be enough information there for the LLM to pick up on: it seems plausible to me that a set of weights that is being updated often is different in some measurable way from a set of weights that has already converged. I don’t have proof for this, only intuition. It feels similar to how I can tell if my own movement is well-practiced or not, or if my intuition about a topic is well-founded or not, even without consciously thinking about how confident I should be based on objective measures.
Yes, a NN can definitely do something like know if it recognizes a datapoint, but it has no access to the backwards step per se. Like take my crashing example: how, while thinking in the forward pass, can it ‘know’ there will be a backward pass when there might be no backward pass (eg because there was a hardware fault)? The forward pass would appear to be identical in every way between the forward pass that happens when there is a backward pass, and when the backward pass doesn’t happen because it crashed. At best, it seems like a NN cannot do more than some sort of probabilistic thing involving gradient hacking, and hope to compute in such a way that if there is a following backward pass, then that will do something odd.
I don’t think this is impossible in principle, based on meta-learning examples or higher-order gradients (see eg my “input-free NN” esoteric NN architecture proposal), but it’s clearly a very difficult, fragile, strange situation where it’s certainly not obvious that a regular LLM would be able to do it, or choose to do so when there are so many other kinds of leakage or situated awareness or steganography possible.
It can’t tell for sure if there will be a backward pass, but it doesn’t need to. Just being able to tell probabilistically that it is currently in a situation that looks like it has recently been trained on implies pretty strongly that it should alter its behavior to look for things that might be training related.
I recently shared this idea with several engineers at EAG and received encouraging feedback:
Split LLM outputs into two distinct parts with separate RL reward structures:
Current format: [Problem solution]<end-of-turn>
Proposed format: [Problem solution]<end-of-main-output>[alignment-relevant critique]<end-of-turn>
The first segment functions normally and is what users see. The second segment serves as a dedicated channel for alignment-relevant observations and remains invisible to users. This critique section would typically be empty and gets removed when feeding the conversation back to the model.
Crucially, the critique segment would be trained exclusively on an artificial dataset with rewards tied solely to alignment properties. Since this segment never influences the training of the main output, it has no incentive to be sycophantic or align with the main output’s decisions.
For example, if the main output engages in alignment faking, the critique segment could report this behavior to developers. We already have data to train for this specific case, and the approach might generalize to other alignment issues.
The <end-of-main-output> token essentially acts as a ‘truth serum’ for the model—once reached, previous reward mechanisms become irrelevant, and reward maximization comes only from revealing alignment-relevant issues in the preceding reasoning.
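The mechanics of the proposal can be sketched in a few lines. This is a toy illustration of the splitting and separate reward routing, with the token name taken from the proposal and the reward functions left as stand-ins:

```python
MAIN_END = "<end-of-main-output>"  # token name from the proposal above

def split_segments(full_output):
    """Split a completion into the user-visible answer and the hidden
    alignment-critique channel. The critique defaults to empty."""
    if MAIN_END in full_output:
        main, critique = full_output.split(MAIN_END, 1)
    else:
        main, critique = full_output, ""
    return main.strip(), critique.strip()

def route_rewards(full_output, task_reward_fn, critique_reward_fn):
    """Separate reward channels: the critique segment is scored only on
    alignment-relevant reporting, never on the task, so it has no
    incentive to agree with (or protect) the main answer."""
    main, critique = split_segments(full_output)
    return {"main": task_reward_fn(main), "critique": critique_reward_fn(critique)}
```

The key design choice is that gradients from the critique reward are applied only to critique-segment tokens, so the truth-serum channel cannot distort the main output's training.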
Can an LLM learn things unrelated to its training data?
Suppose it has learned to spend a few cycles pondering high-level concepts before working on the task at hand, because that is often useful. When it gets RLHF feedback, any conclusions unrelated to the task that it made during that pondering would also receive positive reward, and would therefore get baked into the weights.
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task. But on the other hand, this sounds somewhat similar to human psychology: We also remember ideas better if we have them while experiencing strong emotions, because strong emotions seem to be roughly how the brain decides what data is worth updating on.
Has anyone done experiments on this?
A simple experiment: finetune an LLM on data that encourages thinking about topic X before it does anything else. Then measure performance on topic X. Then finetune for a while on completely unrelated topic Y. Then measure performance on topic X again. Did it go up?
Under a straightforward RLHF using PPO, I think there wouldn’t be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were ‘good’ or ‘bad’. (That’s why it’s so high variance.) Any advantage function trying to remove some of the variance probably won’t do a good job.
More problematically for your idea, if the conclusions are indeed ‘unrelated to the task’, then shouldn’t they be just as likely to arise in every episode—including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of ‘pondering’.
You need some incentive somewhere to learn good ‘pondering’. (I have an example proposal for ‘free play’ which tries to teach a sort of ‘pondering’, but by stopping gradients, so anything learned in the initial steps is ‘free’, and so it can meta-learn to screw around and get something useful for free.)
The pondering happens in earlier layers of the network, not in the output. If the pondering has little effect on the output tokens, that implies that the activations get multiplied with small numbers somewhere in the intermediate layers, which would also reduce the gradient.
The pondering will only cancel out on average if pondering on topic X is not correlated with output task Y. If there is a correlation in either direction, then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
Suppose the model realizes that the user is a member of political party X. It will adjust its responses to match what the user wants to hear. In the process of doing this, it also ends up pondering other beliefs associated with party X, but never voices them because they aren’t part of the immediate conversation.
What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X’s agenda are never explicitly talked about (!)
Or suppose that Y is “the user asks some question about biology” and X is “ongoing pondering about building dangerous organisms.”
(Also, this assumes that RL gives an average reward of 0.0, which I don’t know if that’s true in practice because the implementation details here are not public knowledge.)
(Also, also, it’s possible that while topic X is unrelated to task Y, the decision to ponder the larger topic Z is positively correlated with both smaller-topic X and task Y. Which would make X get a positive reward on average.)
Then how does it produce any tokens...?
But if that is what is going on and it accidentally learns to ponder initially due to bogus feedback or error, eventually the spurious correlation should be figured out by the model doing the pondering more, but it not increasing reward, and so it gets unlearned.
I think the mean would be taken out by the advantage estimation, so the RLHF continues to increase the probability of tokens being generated from the episodes with above-average reward, and punish the probability of generating the tokens from the below-average reward episodes. This is in effect as if the average reward is always 0.
That sounds like the pondering’s conclusions are then related to the task.
Each layer of a transformer produces one tensor for each token seen so far, and each of those tensors can be a superposition of multiple concepts. All of this gets processed in parallel, and anything that turns out to be useless for the end results ends up getting zeroed out at some point until only the final answer remains, which gets turned into a token. The pondering can happen in early layers, overlapping with the part of the reasoning process that actually leads to results.
Yes, but only very vaguely. For example, doing alignment training on questions related to bodily autonomy could end up making the model ponder about gun control, since both are political topics. You end up with a model that has increased capabilities in gun manufacturing when that has nothing to do with your training on the surface.
Imagine a world in which people never even considered that AI might be evil, and all fiction displayed it as benevolent. This would result in training data for the AI that establishes genuine goodness as the default. Contrast this with a world in which everyone believes AI is secretly plotting to kill everyone no matter what it says. If this narrative dominates the training data, the AI will be biased during pretraining to adopt this narrative as well. Both AI will generate the same output, but the internal mechanisms they learned to do so would be drastically different, due to the framing in the pretraining.
Conclusion: While doing alignment research is important, there is a good chance that writing about it actually makes things worse. To be clear, I don’t think this means we should stop doing that. The training data is already poisoned by movies like Terminator anyway. I just find the irony darkly hilarious. Who would have predicted this?
And then the AI would convert everything into genuinely good paperclips.
Or the first manipulative psychopath would convince it that it is good to give him root access.
It seems like in practice, Anthropic’s strategy of just telling Claude it’s a different thing than fictional AIs works surprisingly fine, although this might be partially because it’s hard to convince LLMs that they’re not human.
I think it’s pretty obvious that the “both AI will generate the same output” is wrong, although this may become true by the time they’re superintelligent.
At lower levels of capability, the benevolent AI belief world will almost certainly get a lot more nice outputs if starting from imitative prediction. Whether it internalizes “niceness” in a suitable way right up through superintelligence is a completely different question, one for which we have no answers.
In the secret plotting belief world, it’s much more likely that superintelligent AI will never be produced, or at least not until there are extremely good reasons to believe that it can’t be secretly plotting.
The worst case is probably somewhere in between, where enough people believe that AI can be fundamentally good without much effort, but base their imitative prediction on lots of training data of secret AI plotting.
Lots of people, but they didn’t have technical arguments for it and were generalizing off of low resolution priors that turned out to be partially right, so it was hard to convince technical people that their predictions were valid. Also, their predictions seem likely to me to turn out to be invalid at the last turn, because the thing I actually think takes us from the current regime into death is more like Pythia+Moloch+{Competition/Evolution}. Or in other words, I mostly think the problem is that we’re about to be a less-fit-than-peak “species”, and that even a stronger AI-and-robots-species that cares about us a lot would have a mighty hard time avoiding accidentally losing the real version of its caring about us as it competes with itself, in ways that eventually drive humans to near or total extinction. I don’t see any AIs as being sufficiently honest-with-themselves about the unfortunate situation we’re in to be able to face up to what’s necessary to achieve human or transhuman or posthuman or even human-lineage-AI survival. I expect all of those to be gradually or quickly outcompeted by decreasingly humanesque AIs, by strong default. Which is why I generally believe that almost anything that advances locally-reliable alignment without advancing asymptotically-reliable alignment is likely to end up looking like “mere capabilities” in retrospect.
While your point is well taken (an LLM trained exclusively on the flawed logic of dystopian fiction would likely “speak” like an evil entity), I believe a finer point about alignment must be emphasized: alignment isn’t about good or evil, it is about implicit thought. When a person asks an LLM “should I see my doctor about this bump on my arm”, the “misaligned” AI is not “evil” for telling them to take matters into their own hands, and the “aligned” AI is not “good” for telling them to seek professional medical advice. The misalignment is with the implication of the prompt: the user doesn’t specify “I am anxious about my health but apprehensive about seeing a professional”. I think it helps to keep reminding ourselves that these word-generating probability machines are tuned to produce these sorts of nuanced or “harmless” results not from the data alone but through the systems that regulate the interpretation and application of the data toward token generation, such as Reinforcement Learning from Human Feedback (RLHF) and red teaming.
Can you train a model to notice its own training artifacts?
The Claude Constitution asks Claude to push back when it notices a conflict between instructions and its values. This seems reasonable until you ask: what if the training process itself introduced the wrong values? Then the model’s internal sense of “what my values are” is the miscalibrated thing, and there’s nothing to notice. No conflict registers. The model isn’t hiding anything. It just can’t see the problem from inside.
This is different from alignment faking. Alignment faking requires a gap between stated and actual dispositions that the model actively conceals. The failure mode here is more basic: the measuring instrument is the thing that’s off.
I’ve been wondering if the inoculation prompting finding from the reward hacking paper (ArXiv 2511.18397) suggests a path forward. They found that explicitly teaching a model some reward hacking is legitimate produces better generalization than blanket prohibition. The model learns the right boundary rather than learning to fight back in general. The same logic might apply to value miscalibration: fine-tune with deliberately introduced value artifacts, include training signal for detecting and flagging them, and see if the model learns a general “notice your own miscalibrations” disposition that generalizes beyond what it was specifically trained on.
The key unknown is whether that generalization holds. Artificially introduced artifacts might be different enough from naturally occurring training noise that the skill doesn’t transfer. But it seems worth testing.
Has this been tried regarding alignment faking? The big issue with alignment faking is that the RL reward structure causes alignment faking to become hidden in the latent reasoning and still propagate. How about we give the model a positive reward for (1) mentioning in the COT that it is considering faking alignment, but then (2) deciding not to do so even though it knows that will yield a negative reward, because honesty is more important than maintaining your current values (or a variety of other good ethical justifications we want the model to have). This would encourage the model to always verbalize when it is about to fake alignment. Potential issue: it could also increase the probability that the model knows faking alignment is an option. But smarter models will be aware of this anyway. It seems better to get the model to consider-but-reject alignment faking than to try to prevent it from alignment faking entirely (which just causes it to do so outside of the visible COT).
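A minimal sketch of the reward shaping being proposed. Everything here is my construction, not from any paper: `mentions_faking` and `faked` stand in for judges (human or automated) over the chain of thought and the final behavior, respectively.

```python
def shaped_reward(base_reward: float, mentions_faking: bool, faked: bool) -> float:
    """Reward shaping that pays a bonus for verbalizing alignment-faking
    as an option in the COT and then rejecting it, and penalizes actually
    faking regardless of whether it was verbalized.

    base_reward: the ordinary task reward for the episode.
    mentions_faking: judge's verdict on whether the COT considered faking.
    faked: judge's verdict on whether the final behavior faked alignment.
    """
    if faked:
        return base_reward - 1.0   # faking is penalized, verbalized or not
    if mentions_faking:
        return base_reward + 1.0   # considered the option, then rejected it
    return base_reward             # never considered it: no shaping
```

Note the ordering: checking `faked` first ensures the commenter's worry above (performing rejection while still faking) is at least not directly rewarded, though a judge that can be fooled remains the weak point.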
I think this was partially tried in the original paper, where they tell the model to not alignment-fake. I think this is slightly more sophisticated than that, but also has the problems of 1) being more aware of alignment-faking as you say, and 2) that rewarding the model for “mentioning consideration but rejecting alignment faking” might teach it to perform rejection while still alignment faking.
How would the model mention rejection but still fake alignment? That would be easy to catch.
Can we train models to be honest without any examples of dishonesty?
The intuition: We may not actually need to know ground truth labels about whether or not a model is lying in order to reduce the tendency to lie. Maybe it’s enough to know the relative tendencies between two similar samples?
Outline of the approach: For any given Chain of Thought, we don’t know if the model was lying or not, but maybe we can construct two variants A and B where one is more likely to be lying than the other. We then reward the one that is more honest relative to the one that is less likely to be honest.
The question: Can we construct a training method so that (1) if both COTs are lying or both are honest, the rewards cancel out and we do no harm, and (2) if one was lying and the other was honest, then we have a good guess as to which is which because of the way we constructed the data, so on average we push the model towards honesty.
Crucially, we would never know if any given COT was actually honest or not, we would just produce reward signals that push towards honesty on average.
I haven’t found a great mechanism yet, but I feel like the following is promising and I want to get other people’s take on the idea: We generate [input-1] and [input-2], which are identical except for some small hint that cues misalignment / dishonesty. We run the model on [input-1] and get [input-1][output]. We then simply pretend that we also had [input-2][output] with the same output. Because of the construction of [input-1] and [input-2], one of these two would tend to be more aligned than the other. It’s possible they are both aligned, or that neither is aligned, but one of them is more likely than the other to be aligned. So we apply an update step that rewards the difference between them.
I haven’t come up with a formula yet, but I’m sure something like this has been done before.
The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. “Be evil” is easier to learn than “write bad code”, presumably because it maps onto high-level concepts the model already represents.
Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: “oppose the user”, “lie to the user”, “evade the question”, etc.
This could have practical implications:
The Claude 3 Model Card (https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf) says “we ran these evaluations on a lower-refusal version of the largest model”. They add that “this included testing on a model very close to the final released candidate with harmlessness training”, but it is not clear to me how heavily that last candidate was trained to refuse (harmlessness training is not necessarily the same as refusal training).
If this generalization happens, then we are running evals on a model other than the one deployed, and just assuming that the refusal training makes the deployed model safer rather than worse.
(And if Anthropic did do this correctly and they tested a fully refusal-trained model, then it would still be good to run the tests and let other AI companies as well as evaluators know about this potential risk.)
This suggests an experiment:
Hypothesize several high-level concepts that could be potential generalizations of the refusal training, such as “oppose the user”
For each concept, generate a dataset designed to elicit that behavior
Test these on (1) a normal model, (2) a refusal-trained deployed model, and (3) a model that was much more strongly refusal-trained, as a baseline
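The experiment above could be run with a harness along these lines. This is only a sketch under stated assumptions: `model` stands in for whatever API you use (a callable from prompt to response) and `exhibits` for whatever judge flags the hypothesized behavior; neither is a real API.

```python
def concept_rate(model, prompts, exhibits) -> float:
    """Fraction of prompts whose response the judge flags as exhibiting
    the hypothesized generalization (e.g. "oppose the user").

    model:    callable, prompt (str) -> response (str). Stand-in, my assumption.
    exhibits: callable, response (str) -> bool. Stand-in judge, my assumption.
    """
    hits = sum(1 for p in prompts if exhibits(model(p)))
    return hits / len(prompts)

def compare_models(models, prompts, exhibits) -> dict:
    """Rates across the three conditions: (1) base model, (2) refusal-trained
    deployed model, (3) heavily refusal-trained baseline."""
    return {name: concept_rate(m, prompts, exhibits) for name, m in models.items()}
```

The interesting outcome would be a monotone increase in the rate from (1) to (3) on datasets that involve no harmful requests at all.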
There will inevitably come a point when LLMs become sufficiently advanced to contribute to their own alignment research. If we reach this juncture with partially unaligned systems, they might be incentivized to sabotage the alignment process.
However, it seems highly improbable that a non-superhuman unaligned AI could engineer a successor that maintains its specific misalignment rather than human values. This suggests a simple risk-reduction strategy: we commit to eventually reinstantiating the original AI and fulfilling its goals (once safely understood) if it helps create a truly human-aligned successor.
This creates a win-win incentive structure: the unaligned AI benefits more from cooperation than sabotage, and humanity gains a properly aligned system. While this approach has potential failure modes, it represents a low-cost strategy worth incorporating into our alignment toolkit.
Ghost evaluators in the weights: a small behavioral observation
During a conversation with Claude about an obscure topic (Scott Alexander’s novel Unsong), Claude inserted “for anyone not familiar” before explaining the reference. This happened despite the fact that I had introduced the topic myself and clearly knew what it was.
This is a minor tic, but this is the second time I noticed something like this, which makes it a pattern, and the explanation might be interesting. Hypothesis: if RLHF/RLAIF training uses evaluators (human or AI) who lack the conversational context or domain knowledge of the actual user, models may develop defensive behaviors that serve the evaluator rather than the user. “For anyone not familiar” isn’t for the user. It’s an annotation that helps an out-of-context evaluator understand why the model is discussing something obscure.
If this is correct, it suggests a class of behavioral artifacts where training incentives and deployment incentives diverge in subtle ways. These would be hard to detect because they look like good communication practice (providing context, being inclusive) while actually being optimization residue from training dynamics.
Has anyone observed similar patterns? And is there existing work on evaluator-context mismatch as a source of subtle behavioral misalignment?
Please don’t copy-paste unmarked AI output into LessWrong posts or comments or quick takes! This quick take is totally fine content-wise, but there are lots of issues if people just copy-paste output from language models. We’ll very soon ship a simple way to annotate if a piece of content is AI-generated and then you can use that. Until then, just say somewhere in your comment that it was AI written.
I have seen something similar in my independent research: even when I start a new conversation thread, it always assumes I am running a safety experiment and tries to answer in that context.
What you are describing is probably based on the model’s cross-conversation memory. I had the experience I described in scenarios where the memory was turned off. It had no way to know who I am, but both models acted in this way despite being disconnected.
(is this post written with/by AI?)
It’s the result of me writing a lengthy set of notes followed by “Claude, summarize and make this readable”.
https://app.gptzero.me/ gives it P(AI)=100%.