LLMs interact with the world only through the context window, so their observations, actions and thoughts all happen on the same interface[1]. (A suggestion on Bluesky was that this means LLMs are, in some sense, embodied, which is a neat framing.)
I think this could explain some kinds of failure modes (and predict ones that I personally haven’t seen yet), where the LLM confuses tokens of one type for another type:
| | →Obs | →Thought | →Act |
|---|---|---|---|
| Obs→ | Correct | System prompts/context stuffing | Prefill attack |
| Thought→ | Confabulation | Correct | Ghost tool call |
| Act→ | Misattributed action/“alien hand syndrome” | Fictionalized action/autosuggestion | Correct |
Obs→Thought: LLM sees something in its context, then believes it thought that itself (this covers system prompts and sometimes prefill attacks).
Obs→Act: Classic prefill attack territory.
Thought→Obs: Easy one. LLM thinks something, then believes it observed it. (There’s a trickier case where the LLM thinks something and then believes it remembered it.)
Thought→Act: LLM thinks about making a tool call, then treats that tool call as having been made (this could also be called a confabulated tool call). Kind of rare, I think.
Act→Obs: LLM takes an action, then believes the action was taken by someone else. In the context of a prefill attack, that’s the right move.
Act→Thought: LLM acts, then dismisses that action as a mere thought. Very rare; I haven’t heard people talking about this one yet.
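One way to make the grid above concrete is as a lookup keyed by (what the tokens actually were, what the LLM treats them as). This is only a sketch: the names `Kind`, `CONFUSIONS` and `classify` are made up for illustration, and the labels are copied from the table rather than taken from any established terminology.

```python
from enum import Enum


class Kind(Enum):
    OBS = "Obs"
    THOUGHT = "Thought"
    ACT = "Act"


# Key: (actual kind, kind the LLM treats the tokens as).
# Values are the off-diagonal labels from the table.
CONFUSIONS = {
    (Kind.OBS, Kind.THOUGHT): "System prompts/context stuffing",
    (Kind.OBS, Kind.ACT): "Prefill attack",
    (Kind.THOUGHT, Kind.OBS): "Confabulation",
    (Kind.THOUGHT, Kind.ACT): "Ghost tool call",
    (Kind.ACT, Kind.OBS): "Misattributed action/alien hand syndrome",
    (Kind.ACT, Kind.THOUGHT): "Fictionalized action/autosuggestion",
}


def classify(actual: Kind, perceived: Kind) -> str:
    """Return the table label for a given (actual, perceived) pair."""
    if actual is perceived:
        return "Correct"  # the diagonal of the table
    return CONFUSIONS[(actual, perceived)]


# Print the whole grid, row by row.
for actual in Kind:
    for perceived in Kind:
        print(f"{actual.value}→{perceived.value}: {classify(actual, perceived)}")
```

Writing it this way mostly just makes it obvious that there are six distinct confusions and three correct cases, and that every cell has a name.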
There’s some trickiness here with identifying misattributions “prospectively” vs. “retrospectively”, which has something to do with shifting blame. E.g. with Act→Obs it could either be “I’ll take that action, but I can’t admit/believe that, so let me say I merely saw it” or “Did I take that action? Nay, it’s merely an observation!”
Memory fits in too, somewhat, though it’s more another axis than another type (so Mem(Obs), Mem(Act), Mem(Thought), vaguely, though not definitely).
For clarity: you are thinking of Obs/Thought/Act as a property of tokens, but do you mean it to be a context-independent or context-dependent property? Like, can the same token (/ sequence of tokens) be Obs in one context and Thought in another?
Yes, if by “context” one means “the rest of the state of the world”, so the types here are two-place words.
It makes a difference whether I perform a prefill attack or the model output those tokens itself.
(I notice that I wanted to use the 1-place vs. 2-place distinction, but then realized that there are two different kinds of 1-place words, namely those that are a function of and those that are a function of . I think it’s the 2-place variant nonetheless, but I guess another distinction can be drawn between the case where we’re interested in what happened, and the case where we’re interested in both what happened and what the LLM thought happened.)
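A rough sketch of the “two-place” point: the type is a function of the tokens together with where they came from, so identical text can come out as Obs in one situation and Act in another. Everything here is an assumption made for illustration: the `Span` fields, the `"model"`/`"external"` and `"reasoning"`/`"output"` values, and the rule in `true_kind` are not a claim about how any real system tracks provenance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    author: str   # assumed provenance label: "model" or "external"
    channel: str  # assumed channel label: "reasoning" or "output"


def true_kind(span: Span) -> str:
    """Type of a span given its provenance (illustrative rule only)."""
    if span.author != "model":
        return "Obs"  # anything the model did not write is something it observes
    return "Thought" if span.channel == "reasoning" else "Act"


same_text = "call delete_files()"
prefilled = Span(same_text, author="external", channel="output")  # prefill attack
sampled = Span(same_text, author="model", channel="output")       # model's own output

print(true_kind(prefilled))  # Obs
print(true_kind(sampled))    # Act
```

The text alone can’t settle the type; the second argument (the rest of the state of the world) does the work, which is the sense in which a prefill differs from the model producing the same tokens itself.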
Argh do quick takes not allow for tables‽
Type
/table:this
is
a
table