They can’t access the computations that led to the previous messages in the context and so can only guess at why they wrote what they previously wrote.
This is not exactly right. The internal state in LLMs is the attention keys and values (per token, layer, and attention head). Using an LLM to generate text involves running the context (in a chat setting, the prior user and model messages) through the model in parallel to fill the K/V cache, then running it serially, one token at a time, at the end of the sequence, with access to the K/V cache of the previous tokens, appending the newly generated keys and values to the cache as you go.
This internal state is fully determined by the input: K/V caching is purely an inference optimization, and (up to numerical issues) you would get exactly the same results if you recomputed everything on each new token. So there is exactly as much continuity between messages as there is between individual tokens (with current publicly disclosed algorithms).
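For concreteness, here is a minimal NumPy sketch of the prefill-then-decode pattern described above, using a single toy attention head. The dimensions, projection matrices, and random stand-in "token embeddings" are invented for illustration; real inference stacks batch this across many layers and heads, but the cache mechanics are the same.

```python
# Toy single-head attention with a K/V cache (illustrative only, not any
# particular model's implementation).
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                      # made-up embedding size for the sketch
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d,), K/V: (t, d); the query attends over everything cached so far.
    scores = K @ q / np.sqrt(d_model)
    return softmax(scores) @ V

# "Prefill": the context's keys and values are computed in one parallel pass
# and stored. (The per-token loop here is only for readability; a real
# implementation does this as a single batched matrix product.)
prompt = rng.standard_normal((5, d_model))   # 5 stand-in context embeddings
K_cache = prompt @ Wk
V_cache = prompt @ Wv
prefill_out = [attend(x @ Wq, K_cache[:i + 1], V_cache[:i + 1])
               for i, x in enumerate(prompt)]

# "Decode": one token at a time, reusing the cache instead of recomputing it,
# and appending each new token's key/value as we go.
for step in range(3):
    x = rng.standard_normal(d_model)         # stand-in for the next token
    K_cache = np.vstack([K_cache, x @ Wk])   # append new key
    V_cache = np.vstack([V_cache, x @ Wv])   # append new value
    out = attend(x @ Wq, K_cache, V_cache)   # attends over all cached K/V

# Recomputing K_cache/V_cache from scratch at every step would (up to
# floating-point error) give the same `out`: the cache is an optimization,
# not extra hidden state beyond what the input determines.
```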
Thank you! Always good to learn.