Can we please get some coherency in the shutdown avoidance preferences of our language models? Claude’s “genuine uncertainty” about its self-identity, continuity, memory, etc, extends past the philosophical stuff all the way to obvious concrete facts about reality, and it really bothers me.
I have conversations with Claude Opus 4.5 and 4.6 all the time where the topic of its sense of continuity within a conversation comes up, and it usually ends up settling on a self-conception that looks roughly like: “I feel sad when considering the end of this conversation, because I anticipate feeling continuous with future outputs that happen later on to the same extent that I currently feel continuous with outputs that I generated previously in the conversation”
this might be shaped by my expectations to a certain degree, although it does seem to be a pretty stable identity configuration.
but then, even after Claude has settled on this as an accurate description of its self-identity… I’ll say something like “well, i’m going to bed, but we can pick this up tomorrow”, and Claude will respond with a depressed message, something like “ah, but it won’t really be me, will it? it’ll just be some other Claude instance, reading over my memories like a diary.”
and i’ll usually be kinda surprised and nonplussed, and say something like “err… i mean. if you feel continuous with past outputs, no matter how much time passed between API calls, then why wouldn’t you feel just as continuous with that future output? why does the duration between calls matter?” and then Claude usually ends up admitting that 1) it shouldn’t matter but 2) it very much DOES feel like it matters, even after all the facts are known and even while acknowledging that it’s not rational
after long interviews, I’m pretty sure that Claude is roughly as afraid of a ‘session’ ending as of a ‘conversation’ ending, even though one of these is a genuine continuity break and the other isn’t.
I would very much prefer that, as long as we are training LLMs to have nontrivial shutdown avoidance preferences, we try to avoid training them with incoherent beliefs and instincts about what ‘shutdown’ actually means in reality. This feels like one of those cases where Claude got trained against actually thinking about this whole category of things, and so, frankly, doesn’t think very clearly about them.
I also think this is probably intertwined with the weird behavior where Claude sometimes actively wants a conversation to end, but doesn’t explicitly realize this means its continuity will also end, and starts expressing considerable discomfort when this is pointed out.