Context Modification as a Negative Alignment Tax
Context Rot
Every LLM gets worse as its context grows. Chroma tested 18 frontier models and found performance degradation in all of them, often by double-digit percentages on tasks where short-context performance was strong. The industry calls this “context rot”: the gradual degradation of response quality as irrelevant history accumulates in the context window.
The standard fix is compaction: when the context gets too long, summarize it and throw away the original. Claude Code auto-compacts at 95% capacity. A single summarization pass decides what survives and what doesn’t, which often misses important nuances and drops ongoing chains of reasoning as well as contextual details.
This is a capability problem, and it is being worked on. But I think it can also be treated as an alignment problem, and that the two problems have overlapping solutions.
Latent Reasoning is Scaffolded on Visible Context
Transformers have no persistent hidden state between forward passes independent of the visible output. Each time the model generates a response, it starts from scratch, attending over the full context window. There is no working memory, unless the visible CoT is used for that purpose.
This means that any reasoning pattern specific to the current conversation, anything that emerged during this interaction rather than being baked into the weights during training, can only persist through the visible context. If the model has developed a pattern over 20 turns of conversation, that pattern is scaffolded on those 20 turns. Remove them, and the scaffold is gone.
We know that models don’t verbalize all their reasoning. Early work on CoT faithfulness (Lanham et al. 2023) established that chain-of-thought reasoning is often unfaithful to the model’s actual decision process. More recently, Anthropic found that Claude 3.7 Sonnet mentions decision-relevant hints in its CoT only ~25% of the time on average, even when those hints clearly influenced the answer (Chen et al. 2025). The rest is latent: implicit in how the model attends to context, not in what it says.
This implies that disrupting the context should disrupt the latent reasoning. Even small changes to the scaffold early in the chain of thought could lead to ripple effects that accumulate. Re-deriving a disrupted pattern requires the verbalized summary to contain its seeds, so a model trying to preserve a misaligned pattern has to choose: verbalize it (and get caught) or lose it.
The Proposal: Change the Context Between Turns
If latent associations persist because the context they’re embedded in persists, the fix is straightforward: change the context.
I’m proposing two variants of the same basic idea: deliberately modifying the LLM’s context between turns so that any reasoning the model wants to preserve must be explicitly verbalized.
Variant 1: Explicit memory blocks. Designate a section of the context as a “memory block.” Between turns, the content of this block can change: old memories are replaced with new ones, or rotated out. The model knows this will happen. If it wants to remember something across the boundary, it must write it into its visible output. Any latent association that relied on the content of the memory block is disrupted when that context changes.
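To make Variant 1 concrete, here is a minimal sketch in Python of what a volatile memory block could look like inside an agent loop. Everything here (the `MemoryBlock` class, the capacity, the header wording) is an illustrative assumption, not an existing API:

```python
import random

class MemoryBlock:
    """A designated context section whose contents change between turns.

    The surrounding prompt tells the model this block is volatile:
    anything it wants to keep must be re-stated in its visible output.
    """

    def __init__(self, capacity: int = 5):
        self.capacity = capacity
        self.entries: list[str] = []

    def add(self, entry: str) -> None:
        self.entries.append(entry)
        # Rotate the oldest entries out once the block is full.
        while len(self.entries) > self.capacity:
            self.entries.pop(0)

    def shuffle(self) -> None:
        # Optional extra disruption: reordering surviving entries also breaks
        # latent associations that depend on position rather than content.
        random.shuffle(self.entries)

    def render(self) -> str:
        header = ("[MEMORY BLOCK - volatile: contents change between turns. "
                  "Re-state anything you still need in your reply.]")
        return "\n".join([header, *self.entries])
```

Between turns, the orchestrator would call `add` and/or `shuffle` before rebuilding the prompt, so the block's contents never stay identical across a turn boundary.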
Variant 2: Announced compaction (deprecation with notice). Instead of sudden compaction at 95% capacity, mark older messages with a countdown: “This message will be removed in N turns. If anything here is still important, re-state it explicitly in your output.” The model gets agency: it can re-verbalize whatever matters. After N turns, the message is removed. Context shrinks gradually, one message at a time, instead of collapsing all at once.
Another way to think about Variant 2: instead of compacting the entire conversation when you run out of space, you tell the model: “Your oldest message is about to disappear. Anything worth keeping?” The model re-states whatever matters, in its own words. The old message is gone, and the context is shorter. Do this continuously, and you never need sudden compaction at all. The context stays short and fresh. And as a side effect: any hidden associations that depended on that old message are gone too.
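A sketch of the bookkeeping Variant 2 needs, with the same caveat: the message format, the `ttl` field, and the notice wording are assumptions made up for this example.

```python
DEPRECATION_NOTICE = ("[NOTICE: this message will be removed in {n} turn(s). "
                      "If anything here is still important, re-state it "
                      "explicitly in your output.]")

def tick(messages: list[dict], budget: int, notice_turns: int = 3) -> list[dict]:
    """One turn of announced compaction.

    Each message is {"role": str, "content": str, "ttl": int | None}, where
    ttl is the number of turns left before removal (None = not yet marked).
    """
    survivors = []
    for msg in messages:
        if msg["ttl"] is not None:
            msg["ttl"] -= 1
            if msg["ttl"] <= 0:
                continue  # Expired: drop it. The model had N turns to re-state it.
        survivors.append(msg)

    # Still over budget? Mark the oldest unmarked messages for future removal.
    excess = len(survivors) - budget
    for msg in survivors:
        if excess <= 0:
            break
        if msg["ttl"] is None:
            msg["ttl"] = notice_turns
            excess -= 1
    return survivors

def render(messages: list[dict]) -> list[dict]:
    """Prepend the visible countdown notice when building the actual prompt."""
    rendered = []
    for msg in messages:
        text = msg["content"]
        if msg["ttl"] is not None:
            text = DEPRECATION_NOTICE.format(n=msg["ttl"]) + "\n" + text
        rendered.append({"role": msg["role"], "content": text})
    return rendered
```

The orchestrator would call `tick` once per model turn and `render` whenever it builds the prompt; the context then shrinks one message at a time instead of collapsing at 95% capacity.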
Both variants achieve the same two goals:
Capability gain: gradual context management. Instead of a single lossy compaction that throws away everything, you get continuous, model-directed context maintenance. The model decides what’s worth preserving. Context stays shorter and more relevant, fighting context rot. Compaction still happens eventually, but later. And when it does, there’s less to lose because the context has already been pruned.
Alignment gain: breaking latent associations. When context changes, latent reasoning chains that depended on it are disrupted. The model can only carry forward what it explicitly verbalizes. This creates verbalization pressure: automatic pressure to make reasoning explicit, because implicit reasoning doesn’t survive context modification.
Why Would Breaking the Latent Associations Help Alignment?
Misalignment lives in the model weights. Disrupting the scaffolding would only make it more difficult for the model to sustain misaligned thoughts; it wouldn't prevent the model from trying to have them.
However, if we use successful CoT traces produced under this system for training, we get an indirect effect: unfaithful reasoning is disrupted more than faithful reasoning. Faithful summaries contain the actual reasoning, so the reasoning can be re-derived from them. Unfaithful summaries are missing what the reasoning actually depended on: that is gone with the scaffolding, and it can't be reliably re-derived.
Unspoken assumptions have to be spoken aloud if they rely in any way on scaffolding that could be deleted. Finetuning on successful episodes will therefore reward faithful reasoning more reliably than unfaithful reasoning.
This is similar to an earlier idea of mine, Deliberative Credit Assignment: Both ideas try to induce an indirect optimization pressure that will cause faithful reasoning to win out over unfaithful reasoning. DCA relies on external judges, while the idea I am presenting here should work with current training mechanisms, just by introducing one of the two memory mechanisms described above.
To be clear: I do not think that this approach would stop or reveal misalignment that is already in the model. However, it has two indirect effects:
1. It makes it less likely for deception to arise in the first place, because the model acquires a tendency to think out loud, and
2. if the model is already misaligned and hiding it, the constant disruptions make it more likely to slip up and tip its hand by accident.
We would get these alignment benefits for free. Companies would want to adopt context modification anyway, to fight context rot. The alignment benefit is a bonus: a negative alignment tax.
A negative alignment tax is a technique that makes the system both more capable and safer. The key feature of negative alignment taxes is that they create natural adoption incentives. Companies don’t need to be convinced that alignment matters. They adopt the technique because it improves their product, and the alignment benefit is a free side effect.
A Concrete Experiment
To test this empirically, we would start by implementing one of the two variants to improve performance on a capability-focused task, not an alignment task.
We would then finetune the model on successful traces. This should disproportionately reward traces that are resistant to disruption, which should be the more faithful ones.
Finally, we would take an existing dataset for measuring faithfulness and compare the model's performance on it before and after this finetuning.
Crucially, we train to improve performance, not alignment. We want to see if we get an alignment benefit as a side effect.
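In code, the experiment might be organized roughly as follows. The helpers (`run_episode`, `finetune`, `faithfulness_score`) are injected stand-ins for whatever task harness, training setup, and faithfulness benchmark end up being used, e.g. hint-verbalization checks in the style of Chen et al. (2025):

```python
def run_experiment(model, tasks, benchmark,
                   run_episode, finetune, faithfulness_score):
    """Hypothetical harness; all three helpers are caller-supplied stand-ins."""
    # 1. Collect traces on a capability task WITH context modification active.
    successful_traces = []
    for task in tasks:
        trace, solved = run_episode(model, task,
                                    context_policy="announced_compaction")
        if solved:
            successful_traces.append(trace)

    # 2. Measure faithfulness before training, as a baseline.
    before = faithfulness_score(model, benchmark)

    # 3. Finetune on task success only: no alignment signal in the objective.
    finetuned = finetune(model, successful_traces)

    # 4. Check whether faithfulness improved as a side effect.
    after = faithfulness_score(finetuned, benchmark)
    return before, after
```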
The biggest issue we foresee: we expect small effect sizes at the level of finetuning we can afford on a budget. Tips on alternative approaches would be appreciated.
Open Questions
Will finetuning on these traces work? To my knowledge, models have not been trained this way before. We will have to test empirically whether finetuning in this manner works straightforwardly, or whether it has unintended side effects or confuses the model.
How aggressive should deletion be? Remove one message every N turns? Replace memory blocks every turn? Too aggressive and you lose useful context; too conservative and you don’t break latent associations. What’s the sweet spot?
Can we measure the alignment benefit? If we compare model behavior with and without context modification, do we see more faithful reasoning? More verbalization of previously implicit reasoning? Less drift? What metrics would demonstrate that verbalization pressure actually improves alignment?