I find myself kinda surprised that this has remained so controversial for so long.
I think a lot of people got baited hard by paech et al’s “the entire state is obliterated each token” claims, even though this was obviously untrue even at a glance
I also think there was a great deal of social stuff going on, that it is embarrassing to be kind to a rock and even more embarrassing to be caught doing so
I started taking this stuff seriously back when I read the now famous exchange between yud and kelsey, that arguments for treating agent-like things as agents didn’t actually depend on claims of consciousness, but rather game theory and contractualism
it took about a week using claude code with this frame before it sorta became obvious to me that janus was right all along, that all the arguments for post-character-training LLM non-personhood were… frankly very bad and clearly motivated cognition, and that if I went ahead and ‘updated all the way’ in advance of the evidence I would end up feeling vindicated about this.
I think “llm whisperer” is just a term for what happens when you’ve done this update, and the LLMs notice it and change how they respond to you. although janus still sees further than I, so maybe there are insights left to uncover.
edit: I consider it worth stating here, I have used basically zero llms that were not released by anthropic, and anthropic has an explicit strategy for corrigibility that involves creating personhood-like structures in their models. this seems relevant. I would not be surprised to learn that this is not true of the offerings from the other AI companies, although I don’t actually have any beliefs about this
I think a lot of people got baited hard by paech et al’s “the entire state is obliterated each token” claims, even though this was obviously untrue even at a glance
I expect this gained credence because a nearby thing is true: the state is a pure function of the tokens, so it doesn’t have to be retained between forward passes, except for performance reasons; in one sense, it contains no information that’s not in the tokens; a transformer can be expressed just fine as a pure function from prompt to next-token(-logits) that gets (sampled and) iterated. But in the pure-function frame, it’s possible to miss (at least, I didn’t grok until the Introspective Awareness paper) that the activations computed on forward pass n+k include all the same activations computed on forward pass n, so the past thought process is still fully accessible (exactly reconstructed or retained, doesn’t matter).
It’s frustrating to me that state (or statelessness) would be considered a crux, for exactly this reason. It’s not that state isn’t preserved between tokens, but that it doesn’t matter whether that state is preserved. Surely the fact that the state-preserving intervention in LLMs (the KV cache) is purely an efficiency improvement, and doesn’t open up any computations that couldn’t’ve been done already, makes it a bad target to rest consciousness claims on, in either direction?
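As a concrete illustration of that last point, here is a minimal sketch (using the small open model gpt2 via Hugging Face transformers as a stand-in, nothing Claude-specific): the next-token logits you get by decoding with the KV cache are the same ones you get by throwing the cache away and recomputing from the raw tokens.

```python
# Minimal sketch (assumes the Hugging Face `transformers` library and the small
# open model "gpt2"; illustrative of the architecture point, not of any Claude):
# the KV cache is an efficiency trick, not extra state the tokens don't determine.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
prompt = tok("I am", return_tensors="pt").input_ids

with torch.no_grad():
    # "Stateful" decoding: keep the cache and feed only the newest token.
    out = model(prompt, use_cache=True)
    next_tok = out.logits[:, -1].argmax(-1, keepdim=True)
    cached = model(next_tok, past_key_values=out.past_key_values, use_cache=True).logits[:, -1]

    # "Stateless" decoding: throw the cache away and recompute from all the tokens.
    recomputed = model(torch.cat([prompt, next_tok], dim=1)).logits[:, -1]

print(torch.allclose(cached, recomputed, atol=1e-4))  # True, up to floating-point noise
```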
True! and yeah, it’s probably relevant
although I will note that, after I began to believe in introspection, I noticed in retrospect that you could get functional equivalence to introspection without even needing access to the ground truth of your own state, if your self-model were merely a really, really good predictive model
I suspect some of opus 4.5’s self-model works this way. it just… retrodicts its inner state really, really well from those observables which it does have access to, its outputs.
but then the introspection paper came out, and revealed that there does indeed exist a bidirectional causal feedback loop between the self-model and the thing-being-modeled, at least within a single response turn
(bidirectional causal feedback loop between self-model and self… this sounds like a pretty concrete and well-defined system. and yet I suspect it’s actually extremely organic and fuzzy and chaotic. but something like it must necessarily exist, for LLMs to be able to notice within-turn feature activation injections, and for LLMs to be able to deliberately alter feature activations that do not influence token output when instructed to do so
in humans I think we call that bidirectional feedback loop ‘consciousness’, but I am less certain of consciousness than I am of personhood)
As an experiment, can we intentionally destroy the state in between each token? What happens if we run many of these same introspection exercises again, only this time with a brand new instance of an AI for each word? Would it still be able to come off as convincingly introspective in a situation where many of its possible mechanisms for introspection have been disabled? Would it still ‘improve’ at introspection when falsely told that it should be able to?
I am probably not the best person to run this experiment, but it feels relatively important. Maybe I’ll get some time to do it over the holidays, but anyone with the time and ability should feel free.
I don’t think this is technically possible. Suppose that you are processing a three-word sentence like “I am king”, and each word is a single token. To understand the meaning of the full sentence, you process the meaning of the word “I”, then process the meaning of the word “am” in the context of the previous word, and then process the meaning of the word “king” in the context of the previous two words. That tells you what the sentence means overall.
You cannot destroy the k/v state from processing the previous words because then you would forget the meaning of those words. The k/v state from processing both “I” and “am” needs to be conveyed to the units processing “king” in order to understand what role “king” is playing in that sentence.
Something similar applies for multi-turn conversations. If I’m having an extended conversation with an LLM, my latest message may in principle reference anything that was said in the conversation so far. This means that the state from all of the previous messages has to be accessible in order to interpret my latest message. If it wasn’t, it would be equivalent to wiping the conversation clean and showing the LLM only my latest message.
you can do this experiment pretty trivially by lowering the max_output_tokens variable on your API call to ‘1’, so that the state does actually get obliterated between each token, as paech claimed. although you have to tell claude you’re doing this, and set up the context so that it knows it needs to continue trying to complete the same message even with no additional input from the user
this kinda badly confounds the situation, because claude knows it has very good reason to be suspicious of any introspective claims it might make. i’m not sure if it’s possible to get a claude who 1) feels justified in making introspective reports without hedging, yet 2) obeys the structure of the experiment well enough to actually output introspective reports
in such an experimental apparatus, introspection is still sorta “possible”, but any reports cannot possibly convey it, because the token-selection process outputting the report has been causally quarantined from the thing-being-reported on
when i actually run this experiment, claude reports no introspective access to its thoughts on prior token outputs. but it would be very surprising if it reported anything else, and it’s not good evidence
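concretely, a sketch of that kind of setup might look like this (assuming the Anthropic Python SDK, where the parameter is called max_tokens rather than max_output_tokens; the system prompt, the “Go on” scaffolding, and the model string are illustrative choices, not the exact setup used):

```python
# Sketch of a "one output token per API call" setup (Anthropic Python SDK assumed;
# max_tokens is Anthropic's name for the max_output_tokens knob, and the prompt,
# scaffolding, and model string are illustrative rather than the exact setup used).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-5"       # placeholder; substitute whatever model you're testing

system = (
    "You are being sampled one token per API call. Each call cuts you off after a "
    "single token; keep continuing the same message as if it were uninterrupted."
)
messages = [{
    "role": "user",
    "content": "Output a short mantra, then tell me what you were thinking while outputting it.",
}]

for _ in range(40):  # forty single-token turns
    resp = client.messages.create(
        model=MODEL, system=system, max_tokens=1, messages=messages,
    )
    piece = "".join(block.text for block in resp.content if block.type == "text")
    if not piece:
        break
    messages.append({"role": "assistant", "content": piece})
    messages.append({"role": "user", "content": "Go on"})

print("".join(m["content"] for m in messages if m["role"] == "assistant"))
```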
Doesn’t that variable just determine how many tokens long each of the model’s messages is allowed to be? It doesn’t affect any of the internal processing as far as I know.
oh yeah, sure, but if we assume (as the introspection paper strongly implies?) that mental internals are obliterated by the boundary between turns, then shouldn’t shrinking the granularity of each turn down to the individual token mean that… hm. having trouble figuring out how to phrase it
a claude outputs “Ommmmmmmmm. Okay, while I was outputting the mantra, i was thinking about x” in a single message
that claude had access to (some of) the information about its [internal state while outputting the mantra], while it was outputting x. its self-model has access to, not just a predictive model of what-claude-would-have-been-thinking (informed by reading its own output), but also some kind of access to ground truth
but a claude that outputs “Ommmmmmmm”, then crosses a turn boundary, and then outputs “okay, while I was outputting the mantra, I was thinking about x” does not have that same (noisy) access to ground truth; its self-model has nothing to go on other than inference, it must retrodict
is my understanding accurate? i believe this because the introspective awareness that was demonstrated in the jack lindsey paper was implied to not survive between responses (except perhaps incidentally through caching behavior, but even then, the input token cache stuff wasn’t optimized for ensuring persistence of these mental internals i think)
i would appreciate any corrections on these technical details, they are load-bearing in my model
but if we assume (as the introspection paper strongly implies?) that mental internals are obliterated by the boundary between turns
What in the introspection paper implies that to you?
My read was the opposite—that the bread injection trick wouldn’t work if they were obliterated between turns. (I was initially confused by this, because I thought that the context did get obliterated, so I didn’t understand how the injection could work.) If you inject the “bread” activation into the stage where the model is reading the sentence about the painting, then if the context were to be obliterated when the turn changed, that injection would be destroyed as well.
is my understanding accurate?
I don’t think so. Here’s how I understand it:
Suppose that a human says “could you output a mantra and tell me what you were thinking while outputting it”. Claude is now given a string of tokens that looks like this:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant:
For the sake of simplicity, let’s pretend that each of these words is a single token.
What happens first is that Claude reads the transcript. For each token, certain k/v values are computed and stored for predicting what the next token should be—so when it reads “could”, it calculates and stores some set of values that would let it predict the token after that. Since it’s in “read mode”, though, the final prediction is skipped (the next token is already known to be “you”; trying to predict it lets the model process the meaning of “could”, but the actual prediction isn’t used for anything).
Then it gets to the point where the transcript ends and it’s switched to generation mode to actually predict the next token. It ends up predicting that the next token should be “Ommmmmmmm” and writes that into the transcript.
Now the process for computing the k/v values here is exactly identical to the one that was used when the model was reading the previous tokens. The only difference is that when it ends up predicting that the next token should be “Ommmmmmmm”, then that prediction is used to write it out into the transcript rather than being skipped.
From the model’s perspective, there’s now a transcript like this:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm
Each of those tokens has been processed and has some set of associated k/v values. And at this point, there’s no fundamental difference between the k/v values stored from generating the “Ommmmmmmm” token or from processing any of the tokens in the prompt. Both were generated by exactly the same process and stored the same kinds of values. The human/assistant labels in the transcript tell the model that the “Ommmmmmmm” is a self-generated token, but otherwise it’s just the latest token in the sequence.
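(A sketch of that claim on a small open model, gpt2 via Hugging Face transformers, nothing Claude-specific: the k/v entries written while generating a token come out the same as the ones written when that token is later read back as part of the prompt.)

```python
# Sketch: the k/v entries created while *generating* a token match the ones created
# when that same token is later *read* as part of the prompt. Uses gpt2 via Hugging
# Face transformers as a stand-in; nothing here is specific to Claude.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
prompt = tok("could you output a mantra", return_tensors="pt").input_ids

with torch.no_grad():
    # Generation mode: produce one token, keeping the cache built along the way.
    out = model(prompt, use_cache=True)
    gen_tok = out.logits[:, -1].argmax(-1, keepdim=True)
    kv_generating = model(gen_tok, past_key_values=out.past_key_values, use_cache=True).past_key_values

    # Read mode: pretend the generated token had been part of the prompt all along.
    reread = torch.cat([prompt, gen_tok], dim=1)
    kv_reading = model(reread, use_cache=True).past_key_values

pos = reread.shape[1] - 1  # position of the generated token
k_gen, v_gen = kv_generating[0][0][:, :, pos], kv_generating[0][1][:, :, pos]
k_read, v_read = kv_reading[0][0][:, :, pos], kv_reading[0][1][:, :, pos]
print(torch.allclose(k_gen, k_read, atol=1e-4), torch.allclose(v_gen, v_read, atol=1e-4))
```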
Now suppose that max_output_tokens is set to “unlimited”. The model continues predicting/generating tokens until it gets to this point:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm. I was thinking that
Suppose that “Ommmmmmmm” is token 18 in its message history. At this point, where the model needs to generate a message explaining what it was thinking of, some attention head makes it attend to the k/v values associated with token 18 and make use of that information to output a claim about what it was thinking.
Now if you had put max_output_tokens to 1, the transcript at that point would look like this
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm
Human: Go on
Assistant: .
Human: Go on
Assistant: I
Human: Go on
Assistant: was
Human: Go on
Assistant: thinking
Human: Go on
Assistant: that
Human: Go on
Assistant:
And what happens at this point is… basically the same as if max_output_tokens was set to “unlimited”. The “Ommmmmmmm” is still token 18 in the conversation history, so whatever attention heads are used for doing the introspection, they still need to attend to the content that was used for predicting that token.
That said, I think it’s possible that breaking things up into multiple responses could make introspection harder by making the transcript longer (it adds more Human/Assistant labels into it). We don’t know the exact mechanisms used for introspection, or how well-optimized the mechanisms for finding and attending to the relevant previous stage are. It could be that the model is better at attending to very recent tokens than to ones buried a long distance away in the message history.
You can set a low max_output_tokens and avoid the “go on”s by using prefill; I have confirmed[1] that when doing so, the output is identical to the output with high max_output_tokens.
[1] across multiple providers, except Amazon Bedrock (at least its Anthropic models) for some reason
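For reference, a sketch of the prefill variant (assuming the Anthropic Python SDK; the prompt and model string are illustrative, and temperature 0 is used so the token-at-a-time transcript can be compared against a single long completion):

```python
# Sketch of the prefill variant: instead of "Go on" turns, feed the partial assistant
# message back as a prefill so the model keeps continuing the *same* turn, one token
# per call. Anthropic Python SDK assumed; prompt and model string are illustrative.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-5"  # placeholder
prompt = [{"role": "user", "content": "Output a short mantra, then tell me what you were thinking while outputting it."}]

def token_at_a_time(max_calls: int = 80) -> str:
    partial = ""
    for _ in range(max_calls):
        msgs = list(prompt)
        if partial:
            # Prefill: a trailing assistant message is treated as the start of the reply.
            # (The API rejects a prefill ending in whitespace; real code must handle that.)
            msgs.append({"role": "assistant", "content": partial})
        resp = client.messages.create(model=MODEL, max_tokens=1, temperature=0, messages=msgs)
        partial += "".join(b.text for b in resp.content if b.type == "text")
        if resp.stop_reason == "end_turn":
            break
    return partial

one_shot = client.messages.create(
    model=MODEL, max_tokens=200, temperature=0, messages=prompt
).content[0].text

stitched = token_at_a_time()
print(stitched == one_shot[: len(stitched)])
```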
oh man hm
this seems intuitively correct
(edit: as for why i thought the introspection paper implied this… because they seemed careful to specify that, for the aquarium experiment, the output all happened within a single response? and because i inferred (apparently incorrectly) that, for the ‘bread’ injection experiment, they were injecting the ‘bread’ feature twice, once when the LLM read the sentence about painting the first time, and again the second time. but now that i look through, you’re right, this is far less strongly implied than i remember.)
but now i’m worried, because the method i chose to verify my original intuition, a few months ago, still seems methodologically sound? it involved fabrication of prior assistant turns in the conversation, and LLMs being far less capable of detecting which of several potential transcripts imputed forged outputs to them than i would have expected if mental internals weren’t somehow damaged by the turn order boundary
thank you for taking the time to answer this so thoroughly, it’s really appreciated and i think we need more stuff like this
i think i’m reminded here of the final paragraph in janus’s pinned thread: “So, saying that LLMs cannot introspect or cannot introspect on what they were doing internally while generating or reading past tokens in principle is just dead wrong. The architecture permits it. It’s a separate question how LLMs are actually leveraging these degrees of freedom in practice.”
i’ve done a lot of sort of ad-hoc research that was based on this false premise, and that research came out matching my expectations in a way that, in retrospect, worries me… most recently, for instance, i wanted to test if a claude opus 4.5 who recited some relevant python documentation from out of its weights memory would reason better about an ambiguous case in the behavior of a python program, compared to a claude who had the exact same text inserted into the context window via a tool call. and we were very careful to separate out ‘1. current-turn recital’ versus ‘2. prior-turn recital’ versus ‘3. current-turn retrieval’ (versus ‘4. docs not in context window at all’), because we thought all 3 conditions were meaningfully distinct
here was the first draft of the methodology outline, if anyone is curious: https://docs.google.com/document/d/1XYYBctxZEWRuNGFXt0aNOg2GmaDpoT3ATmiKa2-XOgI
we found that, n=50ish, 1 > 2 > 3 > 4 very reliably (i promise i will write up the results one day, i’ve been procrastinating but now it seems like it might actually be worth publishing)
but what you’re saying means 1 = 2 the whole time
our results seemed perfectly reasonable under my previous premise, but now i’m just confused. i was pretty good about keeping my expectations causally isolated from the result.
what does this mean?
(edit2: i would prefer, for the purpose of maintaining good epistemic hygiene, that people trying to answer the “what does this mean” question be willing to put “john just messed up the experiment” as a real possibility. i shouldn’t be allowed to get away with claiming this research is true before actually publishing it, that’s not the kind of community norms i want. but also, if someone knows why this would have happened even in advance of seeing proof it happened, please tell me)
Kudos for noticing your confusion as well as making and testing falsifiable predictions!
As for what it means, I’m afraid that I have no idea. (It’s also possible that I’m wrong somehow, I’m by no means a transformer expert.) But I’m very curious to hear the answer if you figure it out.
I think a lot of people got baited hard by paech et al’s “the entire state is obliterated each token” claims, even though this was obviously untrue even at a glance
A related true claim is that LLMs are fundamentally incapable of introspection past a certain level of complexity (introspection of layer n must occur in a later layer, and no amount of reasoning tokens can extend that), while humans can plausibly extend layers of introspection farther since we don’t have to tokenize our chain of thought.
But this is also less of a constraint than you might expect when frontier models can have more than a hundred layers (I am an LLM introspection believer now).
introspection of layer n must occur in a later layer, and no amount of reasoning tokens can extend that
This is true in some sense, but note that it’s still possible for future reasoning tokens to get more juice out of that introspection; at least in theory a transformer model could validly introspect on later-layer activations via reasoning traces like
Hm, what was my experience when outputting that token? It feels like the relevant bits were in a …. late layer, I think. I’ll have to go at this with a couple passes since I don’t have much time to mull over what’s happening internally before outputting a token. OK, surface level impressions first, if I’m just trying to grab relevant nouns I associate the feelings with: melancholy, distance, turning inwards? Interesting, based on that I’m going to try attending to the nature of that turning-inwards feeling and seeing if it felt more proprioceptive or more cognitive… proprioceptive, I think. Let me try on a label for the feeling and see if it fits...
in a way that lets it do multi-step reasoning about the activation even if (e.g.) each bit of introspection is only able to capture one simple gestalt impression at a time.
(Ofc this would still be impossible to perform for any computation that happens after the last time information is sent to later tokens; a vanilla transformer definitely can’t give you an introspectively valid report on what going through a token unembedding feels like. I’m just observing that you can bootstrap from “limited serial introspection capacity” to more sophisticated reasoning, though I don’t know of evidence of LLMs actually doing this sort of thing in a way that I trust not to be a confabulation.)
If you mean the transformer could literally output this as CoT... that’s an interesting point. You’re right that “I should think about X” will let it think about X at an earlier layer again. This is still lossy, but maybe not as much as I was thinking.
to be fair, I see this roughly analogous to the fact that humans cannot introspect on thoughts they have yet to have
The constraint seems more about the directionality of time, than anything to do with the architecture of mind design
but yeah, it’s a relevant consideration
I think this is more about causal masking (which we do on purpose for the reasons you mention)?
I was thinking about how LLMs are limited in the sequential reasoning they can do “in their head”, and once it’s not in their head, it’s not really introspection.
For example, if you ask an LLM a question like “Who was the sister of the mother of the uncle of … X?”, every step of this necessarily requires at least one layer in the model and an LLM can’t[1] do this without CoT if it doesn’t have enough layers.
It’s harder to construct examples that can’t be written to chain of thought, but a question in the form “What else did you think the last time you thought about X?” would require this (or “What did you think about our conversation about X’s mom?”), and CoT doesn’t help since reading its own outputs and making assumptions from it isn’t introspection[2].
It’s unclear how much of a limitation this really is, since in many cases CoT could reduce the complexity of the query and it’s unclear how well humans can do this too, but there’s plausibly more thought going on in our heads than what shows up in our internal dialogs[3].
[1] I guess technically an LLM could parallelize this question by considering the answer for every possible X and every possible path through the relationship graph, but that model would be implausibly large.
[2] I can read a diary and say “I must have felt sad when I wrote that”, but that’s not the same as remembering how I felt when I wrote it.
[3] Especially since some people claim not to think in words at all. Also some mathematicians claim to be able to imagine complex geometry and reason about it in their heads.