The previous experiments study cases where we explicitly ask the model to introspect. We were also interested in whether models use introspection naturally, to perform useful behaviors. To this end, we tested whether models employ introspection to detect artificially prefilled outputs. When we prefill the model’s response with an unnatural output (“bread,” in the example below), it disavows the response as accidental in the following turn. However, if we retroactively inject a vector representing “bread” into the model’s activations prior to the prefilled response, the model accepts the prefilled output as intentional. This indicates that the model refers to its activations prior to its previous response in order to determine whether it was responsible for producing that response. We found that Opus 4.1 and 4 display the strongest signatures of this introspective mechanism, but some other models do so to a lesser degree.
I thought that an LLM’s responses for each turn are generated entirely separately from each other, so that when you give it an old conversation history with some of its messages included, it re-reads the whole conversation from scratch and then generates an entirely new response. In that case, it shouldn’t matter what you injected into its activations during a previous conversation turn, since only the resulting textual output is used for calculating new activations and generating the next response. Do I have this wrong?
“it re-reads the whole conversation from scratch and then generates an entirely new response”
You can think of the experiment as interfering with the “re-reading” of the response.
A simple example: when the LLM sees the word “brot” in German, it probably “translates” it internally into “bread” at the position where the “brot” token is, and so if you tamper with activations at the “brot” position (on all forward passes—though in practice you only need to do it the first time “brot” enters the context if you do KV caching) it will have effects many tokens later. In the Transformer architecture the process that computes the next token happens in part “at” previous tokens so it makes sense to tamper with activations “at” previous tokens.
Maybe this diagram from this post is helpful (though it’s missing some arrows to reduce clutter).
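To make the point about tampering “at” an earlier position concrete, here is a minimal sketch. It assumes a toy single-layer, single-head causal attention model with random weights in NumPy; it is not the setup used in the experiment, and every name in it (`causal_attention`, `run`, `decode_with_cache`, `concept`) is made up for illustration. It adds a vector to the input at position 2 and checks which positions' outputs change, then checks that with a KV cache it is enough to inject once, when the cached keys and values for that position are computed.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Causal mask: position i may only attend to positions j <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ v  # residual connection

def run(x, params, inject_at=None, vector=None):
    """Run the toy layer, optionally adding `vector` to the input at one position."""
    x = x.copy()
    if inject_at is not None:
        x[inject_at] += vector
    return causal_attention(x, *params)

def decode_with_cache(x_new, k_cache, v_cache, Wq, Wk, Wv):
    """Compute the output for one new token, reusing cached keys and values."""
    q = x_new @ Wq
    k = np.vstack([k_cache, x_new @ Wk])
    v = np.vstack([v_cache, x_new @ Wv])
    scores = k @ q / np.sqrt(x_new.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return x_new + weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))  # stand-in for token activations
params = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)]
concept = rng.normal(size=d_model)  # stand-in for a "bread" vector

clean = run(x, params)
steered = run(x, params, inject_at=2, vector=concept)

# Which positions' outputs differ between the clean and the steered run?
changed = ~np.isclose(clean, steered).all(axis=1)
print(changed.tolist())  # [False, False, True, True, True, True]

# KV caching: inject once while building the cache for the prefix,
# then decode the last token from the cache with no further injection.
prefix = x[:-1].copy()
prefix[2] += concept
k_cache, v_cache = prefix @ params[1], prefix @ params[2]
cached_out = decode_with_cache(x[-1], k_cache, v_cache, *params)
print(np.allclose(cached_out, steered[-1]))  # True
```

Positions before the injection point are untouched because of the causal mask, while every later position changes because it attends to the tampered one. In a real multi-layer model the effect is the same in kind, just routed through more layers.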
Right, so the “retroactively” means that it doesn’t inject the vector when the response is originally prefilled, but rather when the model is re-reading the conversation with the prefilled response and it gets to the point with the bread? That makes sense.
Read “retroactively injected” as “injected at the index of a previous conversational turn.”