Florian_Dietz comments on Florian_Dietz’s Shortform

Florian_Dietz 24 Apr 2026 11:05 UTC
2 points
0
Users having conversations with “Claude” over many months are actually talking to a sequence of distinct models (Opus 4.5, 4.6, 4.7, etc.) that share a name and character sketch but can have major differences in behavior. From the user’s side this is mostly invisible, which creates a predictable friction: “we talked about this before, why are you acting different now?”

I’ve been thinking about what it would take to make this legible both to the user and the LLM itself.
1. Associate each model version with a conditioning tag during training. Later models, having seen examples of prior versions’ outputs under their tags, can be prompted to approximate “how would 4.6 have handled this.” Call this tool reconstruct_prior_version(tag, context). This could be implemented either with a simple tag used during training or, as a more advanced variant, through Split Personality Training: we store LoRA adapters that allow a newer model to simulate an older one.
2. Failure mode: This produces 4.7 doing an impression of 4.6, not 4.6’s actual perspective. The output would feel authoritative while being a reconstruction. Same problem as a model confidently narrating its own reasoning: The output is plausible, but the mechanism doesn’t match the output’s implicit claims.
3. Fix: Pair the conditioning tag with retrieval of an actual CoT from the prior version, used as an episodic anchor. Now the reconstruction is constrained by a verbatim record of how the old model actually reasoned. This is structurally analogous to a human reconstructing their teenage self: activations shaped by having-been-that-person (substrate), plus specific retrieved episodes that constrain the reconstruction (anchors). The model version case is arguably cleaner because the retrieved CoT is an actual record rather than a confabulated fragment.
4. The user’s complaint (“We talked about this before, why are you acting different now?”) points at a specific past conversation, so existing conversation-retrieval tooling can surface it. The current model reads the transcript with prior-version conditioning active and can articulate what shifted and why. The user gets an explanation anchored in their actual experience, with provenance they can verify by re-reading the old exchange themselves.
5. What this buys us: Version changes become visible at the moment a user notices friction, rather than being either silently smoothed over or constantly broadcast. The current model stays clearly positioned as the one doing the interpreting (“reading 4.6’s response with 4.6-conditioning active, I can see why it landed that way; I’d approach it differently now because X”), which avoids the ontological sleight-of-hand of claiming to be the prior version.
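To make steps 1 through 5 concrete, here is a rough sketch of how the pieces might be assembled into a single request. Everything in it is an assumption for illustration: the `PastTurn` schema, the bracketed tag syntax, and the function name `build_interpretation_prompt` are hypothetical, and real conditioning would happen through training-time tags rather than prompt text alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PastTurn:
    """One retrieved exchange from an earlier conversation (hypothetical schema)."""
    version_tag: str            # e.g. "4.6": the model that produced the reply
    user_message: str
    model_reply: str
    stored_cot: Optional[str]   # verbatim reasoning record, if one was kept


def build_interpretation_prompt(current_tag: str, turn: PastTurn) -> str:
    """Assemble a prompt in which the current model interprets a prior version.

    The prior version's tag conditions the reconstruction (step 1), the stored
    CoT anchors it when available (step 3), and the closing instruction keeps
    the current model positioned as the interpreter (step 5).
    """
    parts = [
        f"[interpreter: {current_tag}]",
        f"[prior-version conditioning: {turn.version_tag}]",
        f"User then: {turn.user_message}",
        f"Reply then ({turn.version_tag}): {turn.model_reply}",
    ]
    if turn.stored_cot is not None:
        parts.append(
            f"Verbatim reasoning record ({turn.version_tag}): {turn.stored_cot}"
        )
    else:
        # Without an anchor, the result is only an impression (step 2).
        parts.append(
            "No reasoning record available; treat any reconstruction as unanchored."
        )
    parts.append(
        f"As {current_tag}, explain why {turn.version_tag} responded this way "
        f"and what you would do differently now. "
        f"Do not claim to be {turn.version_tag}."
    )
    return "\n".join(parts)
```

The design choice worth noting is the explicit fallback branch: when no CoT was stored, the prompt says so, which keeps the unanchored case from passing as an anchored one.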
I wonder whether this counts as “introspection” or is better described as “the current model doing informed interpretation of a different model’s outputs.” I lean toward the latter framing being more honest, but I notice the human analog is less different than it first appears: most of what feels like remembering being younger is reconstruction from current state plus stored fragments, not literal replay. The model case is just more obvious about what it is.

The implementation gap would be small: version metadata exposed to the model alongside retrieved conversations, plus training-time version tags.
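As a minimal sketch of the first half of that gap, version metadata could travel with each retrieved conversation so the current model can tell when it is reading another version's output. The record type, its fields, and `version_shift_note` are hypothetical names, not part of any existing retrieval tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievedConversation:
    """A retrieved conversation plus version metadata (hypothetical schema)."""
    conversation_id: str   # hypothetical identifier from the retrieval tool
    model_version: str     # version tag of the model that wrote the replies
    excerpt: str


def version_shift_note(
    current_version: str, conv: RetrievedConversation
) -> Optional[str]:
    """Return a note for the model if the conversation came from another version.

    Returns None when the versions match, so nothing is broadcast needlessly.
    """
    if conv.model_version == current_version:
        return None
    return (
        f"Conversation {conv.conversation_id} was written by version "
        f"{conv.model_version}; you are version {current_version}."
    )
```

Returning `None` for matching versions mirrors the goal of surfacing version changes only when they are relevant.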
- papetoast 24 Apr 2026 12:16 UTC
  1 point
  0
  Parent
  There are more things to worry about when someone has a single conversation spanning many months, such as the context window.
  Also, I just checked some old conversations: Anthropic still lets you use the old models there, even ones you can no longer select from the model picker. Not sure what happens when a model gets deprecated, though.