Kaj_Sotala comments on Emergent Introspective Awareness in Large Language Models

Kaj_Sotala 30 Oct 2025 7:26 UTC
4 points
Very interesting!
I’m confused by this section:
The previous experiments study cases where we explicitly ask the model to introspect. We were also interested in whether models use introspection naturally, to perform useful behaviors. To this end, we tested whether models employ introspection to detect artificially prefilled outputs. When we prefill the model’s response with an unnatural output (“bread,” in the example below), it disavows the response as accidental in the following turn. However, if we retroactively inject a vector representing “bread” into the model’s activations prior to the prefilled response, the model accepts the prefilled output as intentional. This indicates that the model refers to its activations prior to its previous response in order to determine whether it was responsible for producing that response. We found that Opus 4.1 and 4 display the strongest signatures of this introspective mechanism, but some other models do so to a lesser degree.
I thought that an LLM’s responses for each turn are generated entirely separately from each other, so that when you give it an old conversation history with some of its messages included, it re-reads the whole conversation from scratch and then generates an entirely new response. In that case, it shouldn’t matter what you injected into its activations during a previous conversation turn, since only the resulting textual output is used for calculating new activations and generating the next response. Do I have this wrong?
- Fabien Roger 30 Oct 2025 10:46 UTC
  12 points
  it re-reads the whole conversation from scratch and then generates an entirely new response
  You can think of the experiment as interfering with the “re-reading” of the response.
  A simple example: when the LLM sees the word “brot” in German, it probably “translates” it internally into “bread” at the position of the “brot” token. So if you tamper with activations at the “brot” position, the effects show up many tokens later. (In principle you tamper on every forward pass, though with KV caching you only need to do it the first time “brot” enters the context.) In the Transformer architecture, part of the computation of the next token happens “at” previous tokens, so it makes sense to tamper with activations “at” previous tokens.
  Maybe this diagram from this post is helpful (though it’s missing some arrows to reduce clutter).
  - Kaj_Sotala 30 Oct 2025 13:56 UTC
    4 points
    Right, so the “retroactively” means that it doesn’t inject the vector when the response is originally prefilled, but rather when the model is re-reading the conversation with the prefilled response and it gets to the point with the bread? That makes sense.
- the gears to ascension 30 Oct 2025 14:19 UTC
  6 points
  Read it as “injected at the index of a previous conversational turn”.
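
To make the point about tampering “at” a previous token concrete, here is a minimal sketch in NumPy. It is a toy single-head causal attention layer with random weights, not the setup from the paper; `causal_attention`, `concept_vector`, and `inject_pos` are hypothetical names chosen for illustration. It adds a vector to the activations at one earlier position and checks which positions' outputs change.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Toy single-head causal self-attention with a residual connection.

    x has shape (seq_len, d); row i is the activation at token position i.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ v

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))
# Random weights, scaled so attention scores stay O(1) (an arbitrary toy choice).
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

# Hypothetical "concept" vector injected at one earlier token position.
inject_pos = 2
concept_vector = rng.normal(size=d)
x_injected = x.copy()
x_injected[inject_pos] += concept_vector

clean = causal_attention(x, Wq, Wk, Wv)
tampered = causal_attention(x_injected, Wq, Wk, Wv)

# For each position, did the layer output change?
changed = ~np.isclose(clean, tampered).all(axis=1)
print(changed.tolist())
```

Because of the causal mask, positions before `inject_pos` cannot see the injected vector and should be unaffected, while the injected position and every later position attend to it through its keys and values. With KV caching, those keys and values are computed once and reused, which is why the tampering only has to happen when that token first enters the context.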