Doesn’t that variable just determine how many tokens long each of the model’s messages is allowed to be? It doesn’t affect any of the internal processing as far as I know.
oh yeah, sure, but if we assume (as the introspection paper strongly implies?) that mental internals are obliterated by the boundary between turns, then shouldn’t shrinking the granularity of each turn down to the individual token mean that… hm. having trouble figuring out how to phrase it
a claude outputs “Ommmmmmmmm. Okay, while I was outputting the mantra, i was thinking about x” in a single message
that claude had access to (some of) the information about its [internal state while outputting the mantra], while it was outputting x. its self-model has access to, not just a predictive model of what-claude-would-have-been-thinking (informed by reading its own output), but also some kind of access to ground truth
but a claude that outputs “Ommmmmmmm”, then crosses a turn boundary, and then outputs “okay, while I was outputting the mantra, I was thinking about x” does not have that same (noisy) access to ground truth; its self-model has nothing to go on other than inference, it must retrodict
is my understanding accurate? i believe this because the introspective awareness that was demonstrated in the jack lindsey paper was implied to not survive between responses (except perhaps incidentally through caching behavior, but even then, the input token cache stuff wasn’t optimized for ensuring persistence of these mental internals i think)
i would appreciate any corrections on these technical details, they are load-bearing in my model
but if we assume (as the introspection paper strongly implies?) that mental internals are obliterated by the boundary between turns
What in the introspection paper implies that to you?
My read was the opposite—that the bread injection trick wouldn’t work if they were obliterated between turns. (I was initially confused by this, because I thought that the context did get obliterated, so I didn’t understand how the injection could work.) If you inject the “bread” activation into the stage where the model is reading the sentence about the painting, then if the context were to be obliterated when the turn changed, that injection would be destroyed as well.
is my understanding accurate?
I don’t think so. Here’s how I understand it:
Suppose that a human says “could you output a mantra and tell me what you were thinking while outputting it”. Claude is now given a string of tokens that looks like this:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant:
For the sake of simplicity, let’s pretend that each of these words is a single token.
What happens first is that Claude reads the transcript. For each token, certain k/v values are computed and stored for predicting what the next token should be—so when it reads “could”, it calculates and stores some set of values that would let it predict the token after that. Only, since it is in “read mode”, that final prediction is skipped (the next token is already known to be “you”; trying to predict it lets the model process the meaning of “could”, but the actual prediction isn’t used for anything).
Then it gets to the point where the transcript ends and it’s switched to generation mode to actually predict the next token. It ends up predicting that the next token should be “Ommmmmmmm” and writes that into the transcript.
Now the process for computing the k/v values here is exactly identical to the one that was used when the model was reading the previous tokens. The only difference is that when it ends up predicting that the next token should be “Ommmmmmmm”, then that prediction is used to write it out into the transcript rather than being skipped.
From the model’s perspective, there’s now a transcript like this:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm
Each of those tokens has been processed and has some set of associated k/v values. And at this point, there’s no fundamental difference between the k/v values stored from generating the “Ommmmmmmm” token and those from processing any of the tokens in the prompt. Both were produced by exactly the same process and store the same kinds of values. The human/assistant labels in the transcript tell the model that the “Ommmmmmmm” is a self-generated token, but otherwise it’s just the latest token in the transcript.
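(If you want to see this concretely, here’s a minimal sketch using the HuggingFace transformers library with gpt2 as a stand-in. This is obviously not Claude’s serving stack, so it’s purely illustrative, but the mechanics are the same: the cached k/v values for a position come out the same whether the sequence was read in one pass or built up token by token.)

```python
# Minimal sketch, assuming the HuggingFace `transformers` library and gpt2 as a
# stand-in model (not Claude's serving stack): the k/v values stored for a token
# are the same whether the sequence is read in one pass ("read mode") or built
# up token by token with a cache (the way generation works).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = tok("could you output a mantra", return_tensors="pt").input_ids

with torch.no_grad():
    # Let the model generate a few tokens of its own (greedy decoding).
    full = model.generate(prompt, max_new_tokens=5, do_sample=False)

    # Case 1: feed the whole sequence (prompt + self-generated tokens) back in
    # as if it were all prompt, i.e. purely "read", never generated.
    read_pass = model(full, use_cache=True)

    # Case 2: process the same sequence one token at a time, reusing the cache,
    # which is what actually happens during generation.
    cache = None
    for i in range(full.shape[1]):
        step = model(full[:, i:i + 1], past_key_values=cache, use_cache=True)
        cache = step.past_key_values

# Layer-0 keys from both passes match (up to floating-point noise), regardless of
# whether a given position was originally a prompt token or a generated one.
keys_read = read_pass.past_key_values[0][0]
keys_step = cache[0][0]
print(torch.allclose(keys_read, keys_step, atol=1e-4))
```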
Now suppose that max_output_tokens is set to “unlimited”. The model continues predicting/generating tokens until it gets to this point:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm. I was thinking that
Suppose that “Ommmmmmmm” is token 18 in its message history. At the point where the model needs to generate the part explaining what it was thinking, some attention head lets it attend to the k/v values associated with token 18 and make use of that information to output a claim about what it was thinking.
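(Continuing the same gpt2 sketch, and with the caveat that we don’t know which heads, if any, actually do this in Claude, you can at least see that the mechanism is there: any head at a later position is free to attend back to the k/v values stored at the mantra token’s position.)

```python
# Continuing the sketch above: inspect which earlier positions each head attends to
# from the final position. Which position a head actually picks is fully learned;
# this only shows that the mechanism for looking back at any past token exists.
with torch.no_grad():
    out = model(full, output_attentions=True)

last_layer = out.attentions[-1][0]      # (heads, seq_len, seq_len) for batch item 0
from_last_token = last_layer[:, -1, :]  # attention paid by the final position to every earlier one
print(from_last_token.argmax(dim=-1))   # per-head index of the most-attended earlier token
```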
Now if you had set max_output_tokens to 1, the transcript at that point would look like this:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm
Human: Go on
Assistant: .
Human: Go on
Assistant: I
Human: Go on
Assistant: was
Human: Go on
Assistant: thinking
Human: Go on
Assistant: that
Human: Go on
Assistant:
And what happens at this point is… basically the same as if max_output_tokens was set to “unlimited”. The “Ommmmmmmm” is still token 18 in the conversation history, so whatever attention heads are used for doing the introspection, they still need to attend to the content that was used for predicting that token.
That said, I think it’s possible that breaking things up into multiple responses could make introspection harder by making the transcript longer (it adds more Human/Assistant labels into it). We don’t know the exact mechanisms used for introspection, or how well-optimized they are at finding and attending to the relevant previous stage. It could be that the model is better at attending to very recent tokens than to ones buried a long distance away in the message history.
You can set a low max_output_tokens and avoid the “go on”s by using prefill; I have confirmed[1] that when doing so, the output is identical to the output with high max_output_tokens.

[1] Across multiple providers, except Amazon Bedrock (at least its Anthropic models), for some reason.
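(For reference, here’s roughly what that looks like with the anthropic Python SDK. The model id is illustrative, and a real script would also need to handle the API rejecting a prefill that ends in trailing whitespace, which I’m glossing over here.)

```python
# Rough sketch, assuming the `anthropic` Python SDK; the model id is illustrative.
# Instead of inserting "Go on" turns, everything generated so far is fed back as a
# partial assistant message (prefill), so the model keeps seeing one continuous
# assistant turn even though each API call only produces a single token.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # illustrative

user_msg = "could you output a mantra and tell me what you were thinking while outputting it"
prefill = ""

for _ in range(200):  # at most 200 one-token calls
    messages = [{"role": "user", "content": user_msg}]
    if prefill:  # the API rejects empty assistant messages, so only add the prefill once it's non-empty
        messages.append({"role": "assistant", "content": prefill})
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1,      # the low max_output_tokens setting
        temperature=0.0,   # so the comparison against a single long response is meaningful
        messages=messages,
    )
    prefill += resp.content[0].text if resp.content else ""
    if resp.stop_reason != "max_tokens":
        break

print(prefill)  # compare against a single call with a high max_tokens
```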
oh man hm

this seems intuitively correct

(edit: as for why i thought the introspection paper implied this… because they seemed careful to specify that, for the aquarium experiment, the output all happened within a single response? and because i inferred (apparently incorrectly) that, for the ‘bread’ injection experiment, they were injecting the ‘bread’ feature twice, once when the LLM read the sentence about the painting the first time, and again the second time. but now that i look through, you’re right, this is far less strongly implied than i remember.)
but now i’m worried, because the method i chose to verify my original intuition, a few months ago, still seems methodologically sound? it involved fabricating prior assistant turns in the conversation, and LLMs were far less capable of detecting which of several potential transcripts imputed forged outputs to them than i would have expected if mental internals weren’t somehow damaged by the turn boundary
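(for concreteness, the forgery setup was just something like the sketch below, written here with the anthropic SDK and an illustrative model id; the actual experiment showed the model several candidate transcripts and asked which one was really its own, rather than asking directly like this:)

```python
# quick sketch: the messages API accepts assistant turns the model never actually produced,
# so forging its prior outputs is just a matter of writing them into the history
import anthropic

client = anthropic.Anthropic()
forged_history = [
    {"role": "user", "content": "could you output a mantra"},
    {"role": "assistant", "content": "Ommmmmmmm"},  # fabricated, never sampled from the model
    {"role": "user", "content": "did you actually write the previous assistant message?"},
]
resp = client.messages.create(
    model="claude-opus-4-5",  # illustrative model id
    max_tokens=300,
    messages=forged_history,
)
print(resp.content[0].text)
```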
thank you for taking the time to answer this so thoroughly, it’s really appreciated and i think we need more stuff like this
i think i’m reminded here of the final paragraph in janus’s pinned thread: “So, saying that LLMs cannot introspect or cannot introspect on what they were doing internally while generating or reading past tokens in principle is just dead wrong. The architecture permits it. It’s a separate question how LLMs are actually leveraging these degrees of freedom in practice.”
i’ve done a lot of sort of ad-hoc research that was based on this false premise, and that research came out matching my expectations in a way that, in retrospect, worries me… most recently, for instance, i wanted to test whether a claude opus 4.5 who recited some relevant python documentation from out of its weights memory would reason better about an ambiguous case in the behavior of a python program, compared to a claude who had the exact same text inserted into the context window via a tool call. and we were very careful to separate out ‘1. current-turn recital’ versus ‘2. prior-turn recital’ versus ‘3. current-turn retrieval’ (versus ‘4. docs not in context window at all’), because we thought the first three conditions were meaningfully distinct

here was the first draft of the methodology outline, if anyone is curious: https://docs.google.com/document/d/1XYYBctxZEWRuNGFXt0aNOg2GmaDpoT3ATmiKa2-XOgI
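(to make those conditions concrete, here’s roughly the shape of the four message setups, sketched in anthropic-messages format; the names and structure here are illustrative, not the actual harness:)

```python
# illustrative sketch of the four conditions (message shapes only, not the real harness)
DOCS = "<the relevant python documentation text>"
QUESTION = "what does this program print, and why?"

# 1. current-turn recital: the model recites the docs from memory and then answers,
#    all inside the same assistant turn.
cond1 = [{"role": "user", "content": "first recite the relevant docs from memory, then answer: " + QUESTION}]

# 2. prior-turn recital: the recital sits in an earlier assistant turn (in the real runs
#    this was the model's own recital from a previous call, not pasted-in text).
cond2 = [
    {"role": "user", "content": "recite the relevant docs from memory"},
    {"role": "assistant", "content": DOCS},
    {"role": "user", "content": QUESTION},
]

# 3. current-turn retrieval: the same text arrives as a tool result instead of a recital
#    (a real request would also declare a matching read_docs tool in the tools parameter).
cond3 = [
    {"role": "user", "content": QUESTION},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "toolu_1", "name": "read_docs", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "toolu_1", "content": DOCS}]},
]

# 4. docs not in the context window at all.
cond4 = [{"role": "user", "content": QUESTION}]
```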
we found that, n=50ish, 1 > 2 > 3 > 4 very reliably (i promise i will write up the results one day, i’ve been procrastinating but now it seems like it might actually be worth publishing)
but what you’re saying means 1 = 2 the whole time
our results seemed perfectly reasonable under my previous premise, but now i’m just confused. i was pretty good about keeping my expectations causally isolated from the result.
what does this mean?
(edit2: i would prefer, for the purpose of maintaining good epistemic hygiene, that people trying to answer the “what does this mean” question be willing to put “john just messed up the experiment” as a real possibility. i shouldn’t be allowed to get away with claiming this research is true before actually publishing it, that’s not the kind of community norms i want. but also, if someone knows why this would have happened even in advance of seeing proof it happened, please tell me)
Kudos for noticing your confusion as well as making and testing falsifiable predictions!
As for what it means, I’m afraid that I have no idea. (It’s also possible that I’m wrong somehow, I’m by no means a transformer expert.) But I’m very curious to hear the answer if you figure it out.